StarCluster - Mailing List Archive

Re: trouble with starting a large cluster

From: Rayson Ho <no email>
Date: Thu, 1 Sep 2011 14:04:27 -0700 (PDT)

--- On Thu, 9/1/11, Chen, Fei [JRDUS] wrote:
> - ERROR - failed to connect to host
> on port 22
> Looking at the AWS console, I could see all 30 instances
> were up and running. I even checked a few boot logs (e.g.
> right click on an instance and choose the "Get System Log"
> menu item), which all looked OK to me, granted I didn't
> check all 30 logs...,

Can you check whether "ec2-50-19-64-123" is stuck?

I believe that once in a while a VM on EC2 fails to start up, but rebooting the machine usually works around the issue. (It may be hardware-related or a bug in the EC2 provisioning layer.)
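
If it helps to find which node is not answering, here is a minimal sketch that tries a TCP connection to port 22 on each host and lists the ones that do not respond. It is not part of StarCluster; the function names and the hostnames in the example are placeholders I made up, so substitute the public DNS names of your own instances.

```python
import socket


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def find_unreachable(hosts, port=22, timeout=5.0):
    """Return the hosts whose port did not accept a connection."""
    return [h for h in hosts if not ssh_port_open(h, port, timeout)]


if __name__ == "__main__":
    # Placeholder hostnames (assumption): replace with your nodes' DNS names.
    hosts = ["node001.example.com", "node002.example.com"]
    for h in find_unreachable(hosts):
        print("no SSH response from", h)
```

Any host this reports is a candidate for a reboot from the AWS console. Note that an open port only shows that something is listening there, not that the login itself will succeed.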


Grid Engine / Open Grid Scheduler

> maybe there is one instance having
> trouble starting, like the above message suggesting... I'm
> guessing this could be simply a timing-out issue but I don't
> know if/where there's a place I can change this. Does
> StarCluster skip any instances that fail to come up?
> And I'm using 0.91.2. I was hoping not to have to upgrade
> (yet) as I'm needing results fast and don't want to risk
> breaking something during the upgrade. AWS gave me capacity
> to run 400 instances, so I'm hoping this is an easily solved
> problem and I would be able to use that capacity...
> Appreciate any help!
> fei
Received on Thu Sep 01 2011 - 17:04:29 EDT

