StarCluster - Mailing List Archive

Re: [Starcluster] failed cluster / detached drive

From: Justin Riley <no email>
Date: Sat, 01 May 2010 08:00:06 -0400



Wow, that's bizarre. I'm not sure why the EBS volume would have been
detached on you; I haven't had that happen to me yet (fingers crossed).
Since the volume was mounted on /home, its being randomly detached
would explain the other symptoms you were having. Except I would still
expect root to be able to run qstat... hmmm.

Unfortunately, I don't have any ideas on this one, other than that it
was some random AWS failure with EBS.

> If it happens again I'll keep the cluster up and let you know right away.

Yes, please do. That would be useful, thanks!
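
If it does happen again, one quick thing to check before tearing the
cluster down is whether /home is still listed as a mount. The snippet
below is only a rough sketch, not part of StarCluster: the function names
are made up for illustration, and it assumes a Linux node where the
kernel's mount table is readable at /proc/mounts. A missing /home would
be consistent with the "permission denied (public key)" symptom, since a
user's authorized_keys file normally lives under their home directory.

```python
def is_mounted(mount_point, mounts_text):
    """Return True if mount_point is a mount target in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line reads: device mount_point fs_type options dump pass
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


def read_mounts(path="/proc/mounts"):
    """Read the kernel's current mount table (Linux only)."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    # /home is where the EBS volume was mounted in this thread.
    print("/home mounted:", is_mounted("/home", read_mounts()))
```

Running this as root on the master node and including the result in a
report would help narrow down whether the volume really went away.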


On 04/30/2010 08:20 PM, Dan Yamins wrote:
> Justin, I just had a strange situation where my cluster suddenly
> failed. Here were the symptoms:
> 1) All my active ssh terminals timed out.
> 2) I couldn't log back in as the CLUSTER_USER (I got the "permission
> denied (public key)" error), though I could ssh in as root.
> 3) The mounted EBS volume appears to have disappeared, e.g. when I
> tried to cd to it from /root, it was reported as not existing.
> 4) The SGE "qstat" command failed to be recognized (e.g. when I ran
> "qstat -xml" as root, I got an error in finding the qstat command).
> It seems like my EBS drive might have detached ... but lots of things
> could have happened. Any thoughts?
> Anyway, I killed the cluster as I didn't want to keep paying for it. I'm
> starting another one now, and will let you know what the result is. If
> it happens again I'll keep the cluster up and let you know right away.
> Dan


Received on Sat May 01 2010 - 08:00:10 EDT
