StarCluster - Mailing List Archive

Re: trouble with starting a large cluster

From: Chen, Fei [JRDUS] <no email>
Date: Tue, 6 Sep 2011 13:52:33 -0400

Indeed I agree StarCluster is one of the best existing solutions out
there for setting up grid computing on EC2. This weekend we were able to
run more than 3000 hours of simulation on EC2. This would have been
impossible without StarCluster. I did end up installing 0.92rc2, without
a hitch, and I was able to use the addnode feature to great effect.
Awesome.

To answer Rayson's question
>> I was hoping that wouldn't be the
>> case, but yesterday when I started with the 30 node
>> cluster, I noticed
>> SGE was not up and running even on nodes that came up
>> properly...

>If it happens again, can you at least check:

>1) the NFS mounts
>2) the sge daemons are running? (ps -elf|grep sge)
>3) if qstat and qhost work?

Last time this happened, none of 1, 2, or 3 was working. After I ssh'd
into the master node, I was hoping SGE would still be operable on the
nodes that came up successfully, but it was not, nor were there any NFS
mounts. Hence I was a bit surprised. I was expecting StarCluster to skip
the nodes that failed to boot after some timeout and then proceed to
finish building the SGE layer... It seems a waste if a failure in one
node breaks the whole system, especially since I get charged for the
whole cluster every time I restart it... Of course, with 0.92rc2 I can
use the addnode command to grow my cluster one node at a time (or follow
the advice of rebooting the failed instances, but that's a lot of manual
work...). A side point: it would be cool if the addnode command took an
argument for how many additional nodes to start, rather than the single
node it assumes now.

Cheers,

fei




-----Original Message-----
From: Rayson Ho [mailto:raysonlogin_at_yahoo.com]
Sent: Tuesday, September 06, 2011 1:20 PM
To: starcluster_at_mit.edu; Chen, Fei [JRDUS]
Subject: RE: [StarCluster] trouble with starting a large cluster

--- On Fri, 9/2/11, Chen, Fei [JRDUS] <FChen6_at_its.jnj.com> wrote:
> Thanks for getting back to me so quickly.

No problem -- I was looking for a reason to dig into the starcluster
code... and this problem seems to be interesting enough to spend the
time.

(I was looking for a solution to run Open Grid Scheduler on EC2...
instead of rolling my own, I found that starcluster is one of the best
existing solutions out there! IMO, starcluster is very mature in terms
of user features, but might need a few more high-availability and
fault-tolerance features to scale to hundreds of instances, and I am
hoping to contribute a patch or two in this area -- however I will need
to brush up my Python coding a lot more :-D )


> Yes I was aware of the new addnode feature in 0.92, among
> other things, guess it's really time for me to upgrade!

Also, in 0.92 (rc2), instead of SSHing into each node serially, a
producer-consumer thread pool is used, so starcluster should be able to
start large clusters faster as well.
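
To illustrate the producer-consumer idea mentioned above, here is a
minimal sketch in Python. It is not StarCluster's actual code: run_pool,
setup_node, and the node aliases are hypothetical names made up for this
example, and the real per-node work would be the SSH setup steps.

```python
import queue
import threading


def setup_node(alias):
    """Hypothetical stand-in for the per-node SSH setup work."""
    return "configured %s" % alias


def run_pool(aliases, worker_fn, num_threads=4):
    """Run worker_fn over aliases using a fixed pool of consumer threads."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def consumer():
        while True:
            alias = tasks.get()
            if alias is None:  # sentinel: no more work for this thread
                tasks.task_done()
                return
            try:
                outcome = worker_fn(alias)
            except Exception as exc:  # record the failure, keep the pool alive
                outcome = exc
            with lock:
                results[alias] = outcome
            tasks.task_done()

    threads = [threading.Thread(target=consumer) for _ in range(num_threads)]
    for t in threads:
        t.start()
    # Producer side: enqueue one task per node, then one sentinel per thread.
    for alias in aliases:
        tasks.put(alias)
    for _ in threads:
        tasks.put(None)
    tasks.join()
    for t in threads:
        t.join()
    return results
```

With 30 nodes and, say, 10 threads, the slow nodes no longer hold up the
rest, and a failure on one node is stored in the results instead of
stopping the others.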

I believe it should be possible to install 2 or more starcluster
versions on the same local machine - after all, starcluster does not
change the setup of the local machine itself. This way, it is much safer
to test a new version while keeping the working version intact.


> Lastly, is it the case that one node failing to come up
> would prevent the entire cluster from booting up?

I think we still need Justin to provide the most correct answer...

(Warning: I've only spent a full day reading various parts of
starcluster.)

From my understanding of the code, the setup process (DefaultClusterSetup)
needs to ssh into each node to set things up (for example,
_setup_sge() needs to install SGE by running "./inst_sge -m -x -auto" on
each node). If the ssh connection fails by throwing the
SSHConnectionError exception, then I think it could interrupt the setup
process of that node - but not to the point that the whole cluster
would become unusable.

I am not sure if the SSHConnectionError exception is handled (I've never
used Python exception handling before...), but I think we should catch
the SSHConnectionError exception, add all the failed nodes to a list
for retries after some pre-set time, and then run reboot_instances() to
restart all stuck nodes & redo the setup process for those nodes.
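
A rough sketch of that retry idea, under loud assumptions: the
SSHConnectionError class below is a local stand-in so the example runs
on its own, and setup_with_retries, setup_fn, and reboot_fn are
hypothetical names (reboot_fn is where something like reboot_instances()
would be called). This is not how starcluster currently behaves.

```python
import time


class SSHConnectionError(Exception):
    """Local stand-in; the real exception lives in starcluster's own code."""


def setup_with_retries(nodes, setup_fn, reboot_fn, max_rounds=3,
                       wait_seconds=0):
    """Set up every node, retrying the ones whose SSH connection failed.

    Returns (ready, failed): nodes that were set up, and nodes that
    still could not be reached after max_rounds attempts.
    """
    ready = []
    pending = list(nodes)
    for round_no in range(max_rounds):
        failed = []
        for node in pending:
            try:
                setup_fn(node)
                ready.append(node)
            except SSHConnectionError:
                failed.append(node)  # remember it, keep going with the rest
        if not failed:
            return ready, []
        pending = failed
        if round_no < max_rounds - 1:
            reboot_fn(pending)        # restart the stuck nodes
            time.sleep(wait_seconds)  # give them time to come back up
    return ready, pending
```

The point of the design is that a single unreachable node ends up in the
failed list while the remaining nodes still get a working setup.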


> I was hoping that wouldn't be the
> case, but yesterday when I started with the 30 node
> cluster, I noticed
> SGE was not up and running even on nodes that came up
> properly...

If it happens again, can you at least check:

1) the NFS mounts
2) the sge daemons are running? (ps -elf|grep sge)
3) if qstat and qhost work?
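
The three checks above could be scripted roughly as follows, to be run
on a node. This is only a sketch: check_node and run are hypothetical
helper names, and it assumes Linux-style "mount" output (lines
containing " type nfs") and the usual Grid Engine daemon names
sge_qmaster and sge_execd.

```python
import subprocess


def run(cmd):
    """Run a shell command; return (exit_code, combined output)."""
    proc = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, universal_newlines=True)
    return proc.returncode, proc.stdout


def check_node(run_cmd=run):
    """Return a dict of check name -> bool for the three checks."""
    _, mounts = run_cmd("mount")
    _, procs = run_cmd("ps -elf")
    qstat_rc, _ = run_cmd("qstat")
    qhost_rc, _ = run_cmd("qhost")
    daemons = ("sge_qmaster", "sge_execd")
    return {
        # 1) at least one NFS filesystem is mounted
        "nfs_mounted": any(" type nfs" in line
                           for line in mounts.splitlines()),
        # 2) an SGE daemon shows up in the process list
        "sge_daemons": any(d in line for line in procs.splitlines()
                           for d in daemons),
        # 3) qstat and qhost exit cleanly
        "qstat_ok": qstat_rc == 0,
        "qhost_ok": qhost_rc == 0,
    }
```

Calling check_node() with no arguments runs the real commands; passing a
different run_cmd makes it easy to try the logic without a cluster.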

Rayson

=================================
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net


>
> Thanks again,
>
> fei
>
> -----Original Message-----
> From: Rayson Ho [mailto:raysonlogin_at_yahoo.com]
> Sent: Thursday, September 01, 2011 5:04 PM
> To: starcluster_at_mit.edu; Chen, Fei [JRDUS]
> Subject: Re: [StarCluster] trouble with starting a large cluster
>
> --- On Thu, 9/1/11, Chen, Fei [JRDUS] <FChen6_at_its.jnj.com> wrote:
> > #cli.py:1079 - ERROR - failed to connect to host
> > ec2-50-19-64-123.compute-1.amazonaws.com on port 22
> >
> > Looking at the AWS console, I could see all 30 instances
> > were up and running. I even checked a few boot logs (e.g.
> > right click on an instance and choose the "Get System Log"
> > menu item), which all looked OK to me, granted I didn't
> > check all 30 logs...,
>
> Can you check if "ec2-50-19-64-123" is stuck??
>
> I believe once in a while, a VM on EC2 fails to start up... But
> rebooting the machine would work around the issue. (May be hardware
> related or a bug in the EC2 provisioning layer.)
>
> http://mailman.mit.edu/pipermail/starcluster/2011-April/000703.html
>
> Rayson
>
> =================================
> Grid Engine / Open Grid Scheduler
> http://gridscheduler.sourceforge.net
>
> > maybe there is one instance having
> > trouble starting, like the above message suggests... I'm
> > guessing this could be simply a timing-out issue but I don't
> > know if/where there's a place I can change this. Does
> > StarCluster skip any instances that fail to come up?
> >
> > And I'm using 0.91.2. I was hoping not to have to upgrade
> > (yet) as I'm needing results fast and don't want to risk
> > breaking something during the upgrade. AWS gave me capacity
> > to run 400 instances, so I'm hoping this is an easily solved
> > problem and I would be able to use that capacity...
> >
> > Appreciate any help!
> >
> > fei
Received on Tue Sep 06 2011 - 13:52:35 EDT