Re: [Starcluster] failed cluster / detached drive
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi Dan,
Glad the new cluster is working. The ssh 'error' messages you see are
harmless. I added some improved error handling to my ssh.Connection
class that automatically prints an error if the remote command being
executed does not return 0. However, not every command that returns a
non-zero status is a fatal error. Some commands are run just to be safe,
whether or not they fail, such as the mount and "rm" commands you see below.
I'm combing through the code and ignoring return values for commands that
aren't mission critical. For instance, the mount command below is
failing because /dev/pts is already mounted, which is not a big deal.
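To illustrate the idea, here is a minimal sketch of that pattern, not the actual ssh.Connection code. It runs commands locally with subprocess rather than over ssh, and the names run_command and ignore_exit_status are hypothetical, chosen only for this example:

```python
import logging
import subprocess

logging.basicConfig(format="%(filename)s:%(lineno)d - %(levelname)s - %(message)s")
log = logging.getLogger("ssh_sketch")


def run_command(command, ignore_exit_status=False):
    """Run a shell command and return (exit_status, output).

    Hypothetical helper: a non-zero exit status is logged as an error
    unless the caller marks the command as non-critical.
    """
    proc = subprocess.run(
        command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    if proc.returncode != 0 and not ignore_exit_status:
        log.error("command %s failed with status %d", command, proc.returncode)
    return proc.returncode, proc.stdout


# A "just to be safe" command: the status is still returned to the caller,
# but a failure is not reported as an error.
status, _ = run_command("exit 3", ignore_exit_status=True)
```

With a flag like this, the caller decides per command whether a failure matters, so cleanup-style commands can fail quietly while critical ones are still reported.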
~Justin
On 04/30/2010 08:33 PM, Dan Yamins wrote:
> OK I just started a new cluster. I saw some weird things during the
> start-up:
>
> [snip part that was normal]
>>>> Setting up the cluster...
>>>> Mounting EBS volume vol-c3d927aa on /home...
>>>> Using private key /Users/dyamins/amazon/id_rsa-gsg-keypair (rsa)
>>>> Creating cluster user: gotdata
>>>> Using private key /Users/dyamins/amazon/id_rsa-gsg-keypair (rsa)
>>>> Configuring scratch space for user: gotdata
>>>> Configuring /etc/hosts on each node
>>>> Configuring NFS...
> ssh.py:245 - ERROR - command mount -t devpts none /dev/pts failed with
> status 32
> ssh.py:245 - ERROR - command mount -t devpts none /dev/pts failed with
> status 32
>>>> Configuring passwordless ssh for root
> ssh.py:245 - ERROR - command rm /root/.ssh/id_rsa* failed with status 1
>>>> Configuring passwordless ssh for user: gotdata
>>>> Using existing RSA ssh keys found for user: gotdata
>>>> Installing Sun Grid Engine...
>>>> Done Configuring Sun Grid Engine
>>>> Running plugin govlovePlugin
>
> The resulting cluster seems to be functioning normally (e.g. the volume
> is mounted and I can log in to the nodes as both root and CLUSTER_USER),
> and my SGE jobs seem to be working. I'll let you know if anything
> weird occurs. Also I'd like to understand the error on line 245 of
> ssh.py above.
>
> Thanks!
>
> Dan
>
>
>
> On Fri, Apr 30, 2010 at 8:20 PM, Dan Yamins <dyamins_at_gmail.com
> <mailto:dyamins_at_gmail.com>> wrote:
>
> Justin, I just had a strange situation where suddenly my cluster
> failed. Here were the symptoms:
>
> 1) all my active ssh terminals timed out
> 2) I couldn't log back in as the CLUSTER_USER (I got the "permission
> denied (public key)" error), though I could ssh in as root
> 3) the mounted EBS volume appears to have disappeared -- e.g. when
> I tried to cd to it from /root, it was reported as not existing.
> 4) the SGE "qstat" command failed to be recognized (e.g. when I
> ran "qstat -xml" as root I got an error in finding the qstat command).
>
> It seems like my EBS drive might have detached ... but lots of
> things could have happened. Any thoughts?
>
> Anyway, I killed the cluster as I didn't want to keep paying for it.
> I'm starting another one now, and will let you know what the result
> is. If it happens again I'll keep the cluster up and let you know
> right away.
>
> Dan
>
>
>
>
> _______________________________________________
> Starcluster mailing list
> Starcluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
Comment: Using GnuPG with Mozilla -
http://enigmail.mozdev.org/
iEYEARECAAYFAkvcGHgACgkQ4llAkMfDcrkhiwCfVrrGCAxyzrKIPCyznIaoBd4R
ELAAn0Hh9JhCem48VHjzyPYIjrDEYoM3
=uy2K
-----END PGP SIGNATURE-----
Received on Sat May 01 2010 - 08:03:06 EDT