StarCluster - Mailing List Archive

Re: Start timeout for spot instances

From: David Erickson <no email>
Date: Mon, 19 Mar 2012 12:18:48 -0700

On 3/18/2012 1:17 PM, David Erickson wrote:
> Hi, it would be great if there were some kind of timeout option for spot
> instances, i.e. if they aren't started by some deadline then shut down
> everything and return an error exit code. That way a script running
> starcluster could then retry with regular on-demand instances if there
> is a deadline for getting some work done.

I should follow this up with some more details:

My workload ideally requires 50 spot instances running SGE jobs. I have
50 jobs, so running them all in parallel is ideal, since this is
one step in a serial process. This weekend I ran my scripts, which use
StarCluster to set up a cluster, run jobs on it, and then tear it down.
However, it was never able to allocate the 50 machines and hung
waiting for the SIRs (spot instance requests) to become active, for
8 hours and 5 hours during two different sessions (primarily overnight).
I did some reading, and apparently AWS will not launch any of the nodes
in the group unless it is able to launch all of them (which I doubt,
because I tried a 25-node launch later and it launched 5, then hung on
the remaining 20 for an hour before I gave up). What would be ideal for
me would be for StarCluster to create a multi-zone cluster, possibly
using load balance as a base, since my key goal is 50 machines and the
network traffic between them is insignificant. Presumably you would
specify which zone houses the master, as it could have an EBS volume
attached that is then shared over NFS to the other machines. Has there
been any thought or code headed toward enabling something like this?
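
In the meantime, the timeout-and-retry behaviour requested at the top of
this message could probably be approximated from outside StarCluster with
a small wrapper script. The following is only a rough sketch, not anything
StarCluster provides: the function names are made up for illustration, the
commands are passed in by the caller, and it assumes that killing the
client process after the deadline and then running a cleanup command is
enough to cancel the outstanding spot requests.

```python
import subprocess


def run_with_deadline(cmd, deadline_seconds):
    """Run cmd and return its exit code, or None if the deadline passed.

    subprocess.run kills the child process when the timeout expires.
    """
    try:
        return subprocess.run(cmd, timeout=deadline_seconds).returncode
    except subprocess.TimeoutExpired:
        return None


def start_with_fallback(spot_cmd, cleanup_cmd, ondemand_cmd, deadline_seconds):
    """Try the spot launch first; fall back to on-demand if it does not
    finish successfully before the deadline.

    Returns "spot", "ondemand", or "failed".
    """
    if run_with_deadline(spot_cmd, deadline_seconds) == 0:
        return "spot"
    # Spot launch timed out or failed: tear down whatever was created
    # (assumption: the cleanup command also cancels pending requests).
    subprocess.run(cleanup_cmd)
    if subprocess.run(ondemand_cmd).returncode == 0:
        return "ondemand"
    return "failed"


# Hypothetical usage; the cluster name, bid, and deadline are placeholders:
#   start_with_fallback(
#       ["starcluster", "start", "-b", "0.10", "mycluster"],
#       ["starcluster", "terminate", "-c", "mycluster"],
#       ["starcluster", "start", "mycluster"],
#       deadline_seconds=2 * 60 * 60,
#   )
```

A calling script can then branch on the returned string, or exit non-zero
on "failed", which is roughly the error exit code asked for above.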

Thanks,
David
Received on Mon Mar 19 2012 - 15:20:05 EDT
This archive was generated by hypermail 2.3.0.
