--- On Fri, 9/2/11, Chen, Fei [JRDUS] <FChen6_at_its.jnj.com> wrote:
> Thanks for getting back to me so quickly.
No problem -- I was looking for a reason to dig into the starcluster code... and this problem seems to be interesting enough to spend the time.
> Yes I was aware of the new addnode feature in 0.92, among
> other things, guess it's really time for me to upgrade!
Also, in 0.92 (rc2), a producer-consumer thread pool is used instead of SSHing into each node serially, so starcluster should be able to start large clusters faster as well.
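For the curious, here is a minimal sketch of the idea -- NOT the actual starcluster code; I'm using concurrent.futures, made-up node names, and a no-op ssh command purely for illustration. A fixed pool of worker threads consumes hostnames and runs the per-node setup concurrently instead of one node at a time:

# Minimal sketch of concurrent node setup -- not the real starcluster
# implementation; node names and the ssh no-op are illustrative only.
from concurrent.futures import ThreadPoolExecutor, as_completed
import subprocess

NODES = ["node001", "node002", "node003"]  # hypothetical node hostnames

def setup_node(host):
    # Placeholder for the real per-node work (NFS mounts, SGE install, ...)
    subprocess.check_call(["ssh", host, "true"])
    return host

with ThreadPoolExecutor(max_workers=20) as pool:
    futures = {pool.submit(setup_node, h): h for h in NODES}
    for fut in as_completed(futures):
        host = futures[fut]
        try:
            fut.result()
            print("%s: setup finished" % host)
        except Exception as exc:
            print("%s: setup FAILED: %s" % (host, exc))

With 30+ nodes the wall-clock win over a serial loop is roughly the per-node setup time times the number of nodes divided by the pool size.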
I believe it should be possible to install 2 or more starcluster versions on the same local machine -- in the end, starcluster does not modify the local machine itself. This way, it is much safer to test a new version while keeping the working version intact.
> Lastly, is it the case that one node failing to come up
> would prevent the entire cluster from booting up?
I think we still need Justin to provide the most correct answer...
(Warning: I've only spent a full day reading various parts of starcluster.)
From my understanding of the code, the DefaultClusterSetup process needs to SSH into each node to set things up (for example, _setup_sge() needs to install SGE by running "./inst_sge -m -x -auto" on each node). If the SSH connection fails by raising the SSHConnectionError exception, then I think it could interrupt the setup process for that node - but not to the point that the whole starcluster would become unusable.
I am not sure whether the SSHConnectionError exception is handled (I've never used Python exception handling before...), but I think we should catch it, add all the failed nodes to a list for retries after some pre-set delay, and then run reboot_instances() to restart the stuck nodes and redo the setup process for those nodes.
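Something along these lines is what I have in mind -- just a sketch of the retry logic, not starcluster's actual API; setup_node(), reboot_instances() and the SSHConnectionError class here are stand-ins for whatever the real hooks turn out to be:

# Sketch of the catch-and-retry idea -- hypothetical names, not the real API.
import time

class SSHConnectionError(Exception):
    """Stand-in for starcluster's SSH connection failure."""

def setup_with_retries(nodes, setup_node, reboot_instances,
                       max_attempts=3, delay=120):
    """Try to set up every node; reboot and retry the ones whose SSH failed."""
    pending = list(nodes)
    for _ in range(max_attempts):
        failed = []
        for node in pending:
            try:
                setup_node(node)
            except SSHConnectionError:
                failed.append(node)   # remember it instead of aborting the run
        if not failed:
            return []                 # everything came up
        reboot_instances(failed)      # restart the stuck instances...
        time.sleep(delay)             # ...give them time to boot...
        pending = failed              # ...then redo setup for just those nodes
    return pending                    # nodes that never came up

The key point is that a single bad node would only delay its own setup instead of killing the whole cluster start.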
> I was hoping that wouldn't be the
> case, but yesterday when I started with the 30 node
> cluster, I noticed
> SGE was not up and running even on nodes that came up
If it happens again, can you at least check (a small script that runs all three checks is sketched below):
1) the NFS mounts
2) whether the SGE daemons are running (ps -elf | grep sge)
3) whether qstat and qhost work?
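If it helps, here is a rough sketch of a script that runs those three checks in one go on a node -- just a convenience wrapper around the same commands, adjust as needed:

# Rough diagnostic sketch: NFS mounts, SGE daemons, qstat/qhost on one node.
import subprocess

def check(label, cmd):
    print("== %s: %s" % (label, cmd))
    rc = subprocess.call(cmd, shell=True)
    print("   exit status: %d" % rc)

check("NFS mounts", "mount -t nfs")
check("SGE daemons", "ps -elf | grep sge")
check("qstat", "qstat -f")
check("qhost", "qhost")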
Grid Engine / Open Grid Scheduler
> Thanks again,
> -----Original Message-----
> From: Rayson Ho [mailto:raysonlogin_at_yahoo.com]
> Sent: Thursday, September 01, 2011 5:04 PM
> To: starcluster_at_mit.edu;
> Chen, Fei [JRDUS]
> Subject: Re: [StarCluster] trouble with starting a large cluster
> --- On Thu, 9/1/11, Chen, Fei [JRDUS] <FChen6_at_its.jnj.com> wrote:
> > #cli.py:1079 - ERROR - failed to connect to host
> > ec2-50-19-64-123.compute-1.amazonaws.com on port 22
> > Looking at the AWS console, I could see all 30 instances
> > were up and running. I even checked a few boot logs
> > (right click on an instance and choose the "Get System
> > Log" menu item), which all looked OK to me, granted I didn't
> > check all 30 logs...,
> Can you check if "ec2-50-19-64-123" is stuck??
> I believe once in a while, a VM on EC2 fails to startup...
> But rebooting
> the machine would work-around the issue. (May be hardware
> related or a
> bug in the EC2 provisioning layer.)
> Grid Engine / Open Grid Scheduler
> > maybe there is one instance having
> > trouble starting, like the above message suggests... I'm
> > guessing this could simply be a timeout issue but I don't
> > know if/where there's a place I can change this. Does
> > StarCluster skip any instances that fail to come up?
> > And I'm using 0.91.2. I was hoping not to have to upgrade
> > (yet) as I'm needing results fast and don't want to risk
> > breaking something during the upgrade. AWS gave me permission
> > to run 400 instances, so I'm hoping this is an easily fixable
> > problem and I would be able to use that capacity...
> > Appreciate any help!
> > fei
> > _______________________________________________
> > StarCluster mailing list
> > StarCluster_at_mit.edu
> > http://mailman.mit.edu/mailman/listinfo/starcluster
Received on Tue Sep 06 2011 - 13:19:48 EDT