Re: Large cluster (125 nodes) launch failure
Hi Adam,
On Mar 23, 2011, at 4:40 PM, Adam adamnkraut_at_gmail.com wrote:
> 1) EC2 instances occasionally won't come up with ssh *ever*. In that case you have to reboot the instance and it should work. This could be something like 1 in 100 instances or just an anomaly but it's worth noting. The workaround I used was to manually verify all nodes are running and port 22 is open then run starcluster start -x.
In case I didn't address this previously: the latest development code adds a --show-ssh-status option to the 'listclusters' command. It shows which nodes have a running SSH daemon (SSH: Up or Down). If a node's SSH never comes up, please first try the 'restart' command, which reboots all instances and waits for SSH to come up again:
$ starcluster restart mycluster
Alternatively, you could use the new --show-ssh-status option to 'listclusters' to single out a faulty instance and restart it manually. However, this would need to be done outside of StarCluster using the Amazon web console (http://aws.amazon.com/console).
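If you'd rather script the "is port 22 open" check from your workaround than do it by hand, something like the following might help. This is only a rough sketch and not part of StarCluster: the function names ssh_port_open and find_down_nodes are made up for illustration, and it assumes you supply the node hostnames yourself. It only tests whether a TCP connection to port 22 succeeds, which does not guarantee that sshd will accept a login.

```python
import socket


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper, not a StarCluster API. A successful connect only
    shows that something is listening on the port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def find_down_nodes(hosts, port=22, timeout=5.0):
    """Return the hosts (in input order) whose SSH port is not reachable."""
    return [h for h in hosts if not ssh_port_open(h, port, timeout)]
```

You would call find_down_nodes() with the public DNS names of your instances and reboot whatever it returns from the console before running starcluster start -x. The checks run one after another, so with 125 nodes and several unreachable ones you may want a shorter timeout.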
HTH,
~Justin
Received on Tue Apr 19 2011 - 00:57:59 EDT