Re: trouble with starting a large cluster
In 0.92rc2, there's the addnode command, which would allow you to start from a small number of nodes and then grow the cluster.
"Adding and Removing Nodes from StarCluster"
Grid Engine / Open Grid Scheduler
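P.S. A rough sketch of the start-small-then-grow idea, in Python. It only builds the command lines and does not run anything. The helper name growth_commands and the tag "mycluster" are made up for illustration. I'm assuming `starcluster start -s <size> <tag>` sets the initial size and that each `starcluster addnode <tag>` call adds one node, so please check the help output of your version before relying on it.

```python
def growth_commands(cluster_tag, initial_size, target_size):
    """Build the commands to start a small cluster and grow it node by node.

    Hypothetical helper: it returns argument lists and never executes them.
    Assumes `starcluster start -s N TAG` and a one-node `starcluster addnode TAG`.
    """
    if not 1 <= initial_size <= target_size:
        raise ValueError("need 1 <= initial_size <= target_size")

    # One command to bring up the initial, small cluster.
    commands = [["starcluster", "start", "-s", str(initial_size), cluster_tag]]

    # One addnode call per extra node, so a single bad instance only
    # affects that one step instead of the whole launch.
    for _ in range(target_size - initial_size):
        commands.append(["starcluster", "addnode", cluster_tag])
    return commands


if __name__ == "__main__":
    for cmd in growth_commands("mycluster", 5, 30):
        print(" ".join(cmd))
```

Growing one node at a time is slower than a single 30-node launch, but a node that never answers on port 22 then only blocks its own step.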
--- On Thu, 9/1/11, Rayson Ho <raysonlogin_at_yahoo.com> wrote:
> > #cli.py:1079 - ERROR - failed to connect to host
> > ec2-50-19-64-123.compute-1.amazonaws.com on port 22
> > Looking at the AWS console, I could see all 30 instances
> > were up and running. I even checked a few boot logs
> > (right click on an instance and choose the "Get System Log"
> > menu item), which all looked OK to me, granted I didn't
> > check all 30 logs...
> Can you check if "ec2-50-19-64-123" is stuck?
> I believe once in a while, a VM on EC2 fails to start up...
> But rebooting the machine would work around the issue. (It may
> be hardware related or a bug in EC2 provisioning.)
> Grid Engine / Open Grid Scheduler
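One way to find a stuck instance without reading all 30 logs is to probe port 22 on each public hostname. This is only a minimal sketch using the Python standard library: the function names ssh_port_open and find_unreachable are made up, the retry count and delay are arbitrary, and it checks only that the TCP port accepts a connection, not that SSH login works.

```python
import socket
import time


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def find_unreachable(hosts, port=22, attempts=3, delay=10.0, timeout=5.0):
    """Return the hosts that never accepted a connection after all attempts."""
    pending = list(hosts)
    for attempt in range(attempts):
        # Keep only the hosts that are still not answering.
        pending = [h for h in pending if not ssh_port_open(h, port, timeout)]
        if not pending:
            break
        if attempt < attempts - 1:
            time.sleep(delay)  # give slow instances time to finish booting
    return pending


if __name__ == "__main__":
    # Hostname taken from the error message quoted in this thread.
    hosts = ["ec2-50-19-64-123.compute-1.amazonaws.com"]
    print(find_unreachable(hosts, attempts=1))
```

Any host this returns is a candidate for a reboot from the AWS console.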
> > Maybe there is one instance having
> > trouble starting, as the above message suggests... I'm
> > guessing this could simply be a timeout issue, but I don't
> > know if/where there's a place I can change this. Does
> > StarCluster skip any instances that fail to come up?
> > And I'm using 0.91.2. I was hoping not to have to upgrade
> > (yet), as I need results fast and don't want to risk
> > breaking something during the upgrade. AWS gave me approval
> > to run 400 instances, so I'm hoping this is an easily fixed
> > problem and I would be able to use that capacity...
> > Appreciate any help!
> > fei
> > _______________________________________________
> > StarCluster mailing list
> > StarCluster_at_mit.edu
> > http://mailman.mit.edu/mailman/listinfo/starcluster
Received on Thu Sep 01 2011 - 17:11:25 EDT