StarCluster - Mailing List Archive

Re: trouble with starting a large cluster

From: Rayson Ho <no email>
Date: Thu, 1 Sep 2011 14:04:27 -0700 (PDT)

--- On Thu, 9/1/11, Chen, Fei [JRDUS] <FChen6_at_its.jnj.com> wrote:
> #cli.py:1079 - ERROR - failed to connect to host
> ec2-50-19-64-123.compute-1.amazonaws.com on port 22
>
> Looking at the AWS console, I could see all 30 instances
> were up and running. I even checked a few boot logs (e.g.
> right click on an instance and choose the "Get System Log"
> menu item), which all looked OK to me, granted I didn't
> check all 30 logs...

Can you check if "ec2-50-19-64-123" is stuck?

I believe that once in a while a VM on EC2 fails to start up, but rebooting the instance works around the issue. (It may be hardware related or a bug in the EC2 provisioning layer.)

http://mailman.mit.edu/pipermail/starcluster/2011-April/000703.html
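
One way to check which nodes are stuck is to probe the SSH port on each instance's public hostname. The following is only a minimal sketch using the Python standard library; it is not part of StarCluster, and the function names, retry count, and delay are illustrative assumptions.

```python
import socket
import time


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def find_unreachable(hosts, port=22, attempts=3, delay=10.0, timeout=5.0):
    """Return the hosts that never accepted a connection after all attempts.

    The attempts and delay values are hypothetical defaults, not anything
    StarCluster itself uses.
    """
    pending = list(hosts)
    for attempt in range(attempts):
        # Keep only the hosts that are still not answering.
        pending = [h for h in pending if not ssh_port_open(h, port, timeout)]
        if not pending:
            break
        if attempt < attempts - 1:
            time.sleep(delay)
    return pending
```

Calling find_unreachable() with the public DNS names of the 30 instances would return the ones still refusing connections on port 22; those are the candidates to reboot from the AWS console.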

Rayson

=================================
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net

> maybe there is one instance having
> trouble starting, as the above message suggests... I'm
> guessing this could simply be a time-out issue but I don't
> know if/where there's a place I can change this. Does
> StarCluster skip any instances that fail to come up?
>
> And I'm using 0.91.2. I was hoping not to have to upgrade
> (yet) as I need results fast and don't want to risk
> breaking something during the upgrade. AWS gave me capacity
> to run 400 instances, so I'm hoping this is an easily solved
> problem and I would be able to use that capacity...
>
> Appreciate any help!
>
> fei
>
Received on Thu Sep 01 2011 - 17:04:29 EDT
This archive was generated by hypermail 2.3.0.
