StarCluster - Mailing List Archive

Re: grid engine not initialize on gpu hvm image

From: Justin Riley <no email>
Date: Thu, 13 Sep 2012 13:29:35 -0400

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Jesse,

Sorry for the delay in responding, but I'm glad you figured out that you
need Ubuntu-based AMIs for both the HVM and non-HVM nodes. That said, keep
in mind that only HVM nodes are on the high-speed network, IIRC, which
means traffic between a non-HVM master and the HVM nodes (e.g. NFS) will
be slower than it would be in an all-HVM cluster.

~Justin


On 08/27/2012 05:59 PM, Jesse Lu wrote:
> Okay, figured out that using ami-999d49f0 for non-HVM master and
> ami-4583572c for HVM nodes makes SGE work well. It's my fault for
> not looking at the available public starcluster images carefully
> enough.
>
>
>
> On Mon, Aug 27, 2012 at 2:26 PM, Jesse Lu <jesselu_at_stanford.edu
> <mailto:jesselu_at_stanford.edu>> wrote:
>
> Sorry for the spam, but here's another follow-up.
>
> I found that this only happens when I use a non-HVM-EBS AMI for
> the master, but an HVM-EBS AMI for the nodes.
>
> This is probably because StarCluster copies the sge install from
> the master to the nodes, and this doesn't play nice when the nodes
> are CentOS based but the master is Ubuntu based.
>
> Any ideas for a work-around?
>
>
> On Mon, Aug 27, 2012 at 2:07 PM, Jesse Lu <jesselu_at_stanford.edu
> <mailto:jesselu_at_stanford.edu>> wrote:
>
> Follow-up,
>
> Here are the contents of the installation log file (for grid
> engine)
>
> cat /opt/sge6/default/common/install_logs/execd_install_node001_2012-08-27_14:04:29.log
>
> Your $SGE_ROOT directory: /opt/sge6
>
> Using cell: >default<
>
> Using local execd spool directory [/opt/sge6/default/spool/exec_spool_local]
>
> Creating local configuration for host >node001<
> sgeadmin@node001 modified "node001" in configuration list
> Local configuration for host >node001< created.
>
> Host >master< already in submit host list!
> Host >node001< already in submit host list!
>
> starting sge_execd
>
> No modification because "node001" already exists in "hostlist" of "hostgroup"
> root@node001 modified "@allhosts" in host group list
> root@node001 modified "all.q" in cluster queue list
>
> got select error: Connection refused
> got select error: closing "node001/execd/1"
> Execd on host node001 is not started!
>
>
> On Mon, Aug 27, 2012 at 1:37 PM, Jesse Lu <jesselu_at_stanford.edu
> <mailto:jesselu_at_stanford.edu>> wrote:
>
> ami-12b6477b produces the following error on cluster startup:
>
> !!! ERROR - command 'cd /opt/sge6 && TERM=rxvt ./inst_sge -x
> -noremote -auto ./ec2_sge.conf' failed with status 1
>
> I'm guessing the sge6 installation is faulty? Can anyone help?
> Thanks!
>
> Jesse
>
>
>
>
>
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAlBSF/4ACgkQ4llAkMfDcrlSwwCbB5lJLmj4GY9rriY9jfxNdqO3
s2UAn13+cEYu9bCqx6jiAP/wuPdetm+D
=Dyis
-----END PGP SIGNATURE-----
Received on Thu Sep 13 2012 - 13:29:40 EDT
This archive was generated by hypermail 2.3.0.
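
The working combination reported in this thread (a non-HVM Ubuntu master with HVM Ubuntu nodes) could be written in a StarCluster cluster template roughly as follows. This is a sketch and has not been tested. The two AMI IDs are the ones quoted in the thread. The template name, key name, cluster size and both instance types are assumptions chosen for illustration.

```ini
# Hypothetical cluster template; only the two AMI IDs come from the thread.
[cluster gpucluster]
KEYNAME = mykey                    # assumed key name
CLUSTER_SIZE = 2                   # assumed size: one master, one node

# Non-HVM Ubuntu AMI for the master (from the thread)
MASTER_IMAGE_ID = ami-999d49f0
MASTER_INSTANCE_TYPE = m1.large    # assumed non-HVM instance type

# HVM Ubuntu AMI for the compute nodes (from the thread)
NODE_IMAGE_ID = ami-4583572c
NODE_INSTANCE_TYPE = cg1.4xlarge   # assumed HVM GPU instance type
```

Both images are Ubuntu-based, which matters because, as noted above, the SGE install is copied from the master to the nodes. As Justin points out, the non-HVM master is not on the high-speed network, so NFS traffic to the nodes will be slower than in an all-HVM cluster.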
