StarCluster - Mailing List Archive

Re: Large cluster (125 nodes) launch failure

From: Justin Riley <no email>
Date: Tue, 19 Apr 2011 00:57:53 -0400

Hi Adam,

On Mar 23, 2011, at 4:40 PM, Adam adamnkraut_at_gmail.com wrote:

> 1) EC2 instances occasionally won't come up with ssh *ever*. In that case you have to reboot the instance and it should work. This could be something like 1 in 100 instances or just an anomaly but it's worth noting. The workaround I used was to manually verify all nodes are running and port 22 is open then run starcluster start -x.


In case I didn't address this previously, there is now a --show-ssh-status option to the 'listclusters' command in the latest development code. This will show you which nodes have a running SSH daemon (SSH: Up or Down). If you run into an issue with a node's SSH never coming up, please first try the 'restart' command. This will reboot all instances in the cluster and wait for SSH to come up again:

$ starcluster restart mycluster

Alternatively, you could use the new --show-ssh-status option to 'listclusters' to single out a faulty instance and restart just that one manually; however, this would need to be done outside of StarCluster using the Amazon web console (http://aws.amazon.com/console).
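
If you'd rather script the "is port 22 open" check from outside StarCluster, something along these lines might help. This is only a rough sketch using Python's standard socket module, not StarCluster code: the function names and the example hostnames are made up for illustration, and a successful TCP connect only tells you that something is listening on the port, not that sshd will actually let you log in.

```python
import socket


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def find_unreachable(hosts, port=22, timeout=5.0):
    """Return the hosts (in input order) whose port did not accept a connection."""
    return [h for h in hosts if not ssh_port_open(h, port, timeout)]


if __name__ == "__main__":
    # Hypothetical hostnames; substitute the public DNS names of your nodes.
    nodes = ["ec2-node001.example.com", "ec2-node002.example.com"]
    for host in find_unreachable(nodes):
        print("SSH down:", host)
```

Any host this reports would be a candidate for a reboot from the Amazon web console.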

HTH,

~Justin
Received on Tue Apr 19 2011 - 00:57:59 EDT