Re: trouble with starting a large cluster

From: Chen, Fei [JRDUS] <no email>
Date: Fri, 2 Sep 2011 11:25:55 -0400

Hi Rayson,

Thanks for getting back to me so quickly. I had shut down the 30-node
grid, so I couldn't check the log to see whether ec2-50-19-64-123 was
stuck, but I did try to ssh into it at the time without success, so
presumably it did fail to boot.
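
For anyone hitting the same error, here is a minimal sketch of how the
"can I reach port 22?" check could be scripted across all nodes instead
of trying ssh by hand. It is not part of StarCluster; the helper names
port_open and unreachable are made up for illustration, and it only
tests TCP reachability, not whether sshd or SGE is actually healthy.

```python
import socket


def port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper (not a StarCluster function). Any failure to
    resolve, connect, or finish in time is treated as "not reachable".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup errors.
        return False


def unreachable(hosts, port=22, timeout=5.0):
    """Return the subset of hosts whose port could not be reached."""
    return [h for h in hosts if not port_open(h, port, timeout)]


# Example call (hostname taken from the error message in this thread):
#   unreachable(["ec2-50-19-64-123.compute-1.amazonaws.com"])
```

Any host returned by unreachable() would be a candidate for a reboot
from the AWS console before retrying.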

Yes, I was aware of the new addnode feature in 0.92, among other things.
I guess it's really time for me to upgrade!

Lastly, is it the case that one node failing to come up would prevent
the entire cluster from booting up? I was hoping that wouldn't be the
case, but yesterday when I started the 30-node cluster, I noticed
SGE was not up and running even on nodes that came up properly...

Thanks again,

fei

-----Original Message-----
From: Rayson Ho [mailto:raysonlogin_at_yahoo.com]
Sent: Thursday, September 01, 2011 5:04 PM
To: starcluster_at_mit.edu; Chen, Fei [JRDUS]
Subject: Re: [StarCluster] trouble with starting a large cluster

--- On Thu, 9/1/11, Chen, Fei [JRDUS] <FChen6_at_its.jnj.com> wrote:
> #cli.py:1079 - ERROR - failed to connect to host
> ec2-50-19-64-123.compute-1.amazonaws.com on port 22
>
> Looking at the AWS console, I could see all 30 instances
> were up and running. I even checked a few boot logs (e.g.
> right click on an instance and choose the "Get System Log"
> menu item), which all looked OK to me, granted I didn't
> check all 30 logs...

Can you check if "ec2-50-19-64-123" is stuck?

I believe that once in a while a VM on EC2 fails to start up, but
rebooting the machine works around the issue. (It may be hardware
related or a bug in the EC2 provisioning layer.)

http://mailman.mit.edu/pipermail/starcluster/2011-April/000703.html

Rayson

=================================
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net

> maybe there is one instance having
> trouble starting, as the above message suggests... I'm
> guessing this could simply be a timeout issue, but I don't
> know if/where there's a place I can change this. Does
> StarCluster skip any instances that fail to come up?
>
> And I'm using 0.91.2. I was hoping not to have to upgrade
> (yet) as I need results fast and don't want to risk
> breaking something during the upgrade. AWS gave me capacity
> to run 400 instances, so I'm hoping this is an easily solved
> problem and I will be able to use that capacity...
>
> Appreciate any help!
>
> fei
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
>
Received on Fri Sep 02 2011 - 11:25:57 EDT

