StarCluster - Mailing List Archive

Re: trouble with starting a large cluster

From: Rayson Ho <no email>
Date: Tue, 6 Sep 2011 10:19:45 -0700 (PDT)

--- On Fri, 9/2/11, Chen, Fei [JRDUS] <> wrote:
> Thanks for getting back to me so quickly.

No problem -- I was looking for a reason to dig into the starcluster code... and this problem seems to be interesting enough to spend the time.

(I was looking for a solution to run Open Grid Scheduler on EC2... instead of rolling my own, I found that starcluster is one of the best existing solutions out there! IMO, starcluster is very mature in terms of user features, but might need a bit more high availability & fault tolerance to scale to hundreds of instances, and I am hoping to contribute a patch or two in this area -- however I will need to brush up my Python coding a lot more :-D )

> Yes I was aware of the new addnode feature in 0.92, among
> other things, guess it's really time for me to upgrade!

Also, in 0.92 (rc2) a producer-consumer thread pool is used instead of SSHing into each node serially, so starcluster should be able to start large clusters faster as well.
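The producer-consumer idea is roughly the following (just a minimal sketch of the pattern using the stdlib, not starcluster's actual code -- the node names and setup_node() here are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def setup_node(node):
    # Placeholder for the per-node SSH work (mount NFS, install SGE, etc.)
    return node

# Hypothetical node names for a 30-node cluster
nodes = ["node%03d" % i for i in range(30)]

results = []
# A fixed pool of worker threads consumes the node list concurrently,
# instead of one SSH session at a time.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(setup_node, n) for n in nodes]
    for fut in as_completed(futures):
        results.append(fut.result())
```

With 8 workers, the wall-clock time for the setup phase is roughly the serial time divided by 8, which matters a lot at 30+ nodes.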

I believe it should be possible to install two or more starcluster versions on the same local machine -- after all, starcluster does not modify the local machine itself. This way it is much safer to test a new version while keeping the working version intact.

> Lastly, is it the case that one node failing to come up
> would prevent the entire cluster from booting up?

I think we still need Justin to provide the most correct answer...

(Warning: I've only spent a full day reading various parts of starcluster.)

From my understanding of the code, the setup process DefaultClusterSetup needs to ssh into each node to set something up (for example _setup_sge() needs to install SGE by running "./inst_sge -m -x -auto" on each node). If the ssh connection fails by throwing the SSHConnectionError exception, then I think it could interrupt the setup process of that node - but not to the point that the whole starcluster run would become unusable.

I am not sure if the SSHConnectionError exception is handled (I've never used Python exception handling before...), but I think we should catch the SSHConnectionError exception, add all the failed nodes to a list for retries after some pre-set time, and then run reboot_instances() to restart all stuck nodes & redo the setup process for those nodes.
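In Python the retry loop could look something like this rough sketch (SSHConnectionError, setup_node, and reboot_instances here stand in for the real starcluster names -- this is the idea, not an actual patch):

```python
import time

class SSHConnectionError(Exception):
    """Stand-in for starcluster's SSH failure exception."""

def run_setup(nodes, setup_node, reboot_instances,
              retry_delay=60, max_attempts=3):
    """Set up each node; collect SSH failures, reboot them, and retry.

    Returns the list of nodes that never came up (empty on success).
    """
    pending = list(nodes)
    for attempt in range(max_attempts):
        failed = []
        for node in pending:
            try:
                setup_node(node)
            except SSHConnectionError:
                failed.append(node)   # don't abort the whole cluster
        if not failed:
            return []
        reboot_instances(failed)      # restart the stuck instances...
        time.sleep(retry_delay)       # ...give them time to come back up
        pending = failed              # redo setup only for the failed nodes
    return pending
```

The key point is that one flaky instance only delays its own setup; the other 29 nodes finish normally, and the stragglers get rebooted and retried instead of killing the whole launch.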

> I was hoping that wouldn't be the
> case, but yesterday when I started with the 30 node
> cluster, I noticed
> SGE was not up and running even on nodes that came up
> properly...

If it happens again, can you at least check:

1) whether the NFS mounts are in place
2) whether the SGE daemons are running (ps -elf | grep sge)
3) whether qstat and qhost work?


Grid Engine / Open Grid Scheduler

> Thanks again,
> fei
> -----Original Message-----
> From: Rayson Ho []
> Sent: Thursday, September 01, 2011 5:04 PM
> To:;
> Chen, Fei [JRDUS]
> Subject: Re: [StarCluster] trouble with starting a large
> cluster
> --- On Thu, 9/1/11, Chen, Fei [JRDUS] <>
> wrote:
> > - ERROR - failed to connect to host
> > on port 22
> >
> > Looking at the AWS console, I could see all 30
> instances
> > were up and running. I even checked a few boot logs
> (e.g.
> > right click on an instance and choose the "Get System
> Log"
> > menu item), which all looked OK to me, granted I
> didn't
> > check all 30 logs...,
> Can you check if "ec2-50-19-64-123" is stuck??
> I believe once in a while, a VM on EC2 fails to startup...
> But rebooting
> the machine would work-around the issue. (May be hardware
> related or a
> bug in the EC2 provisioning layer.)
> Rayson
> =================================
> Grid Engine / Open Grid Scheduler
> > maybe there is one instance having
> > trouble starting, like the above message suggesting...
> I'm
> > guessing this could be simply a timing-out issue but I
> don't
> know if/where there's a place I can change this. Does
> StarCluster skip any instances that fail to come up?
> >
> > And I'm using 0.91.2. I was hoping not to have to
> upgrade
> > (yet) as I'm needing results fast and don't want to
> risk
> > breaking something during the upgrade. AWS gave me
> capacity
> > to run 400 instances, so I'm hoping this is an easily
> solved
> > problem and I would be able to use that capacity...
> >
> > Appreciate any help!
> >
> > fei
> >
> > _______________________________________________
> > StarCluster mailing list
> >
> >
> >
Received on Tue Sep 06 2011 - 13:19:48 EDT