Re: Start timeout for spot instances
Hi David,
This is useful, although it could end up wasting several instance hours
if, for example, all but one of the spot instances fails to come up. A
better approach would be for the start command to monitor the spot
instance requests until a deadline, at which point any requests that
haven't been fulfilled are cancelled and on-demand instances are
requested in their place. This would remove the need to write that logic
in your script entirely, and it also enables a mode where you 'get as
much spot as you can' up until a specified timeout. What do you think?
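To make that concrete, here is a rough sketch of what that monitoring
loop might look like using boto (which StarCluster already uses under
the hood). The deadline, bid price, AMI, and instance type below are
all placeholders, not real values:

    import time
    import boto.ec2

    # Illustrative values only -- the real ones would come from the
    # cluster config.
    DEADLINE = 15 * 60          # seconds to wait for spot fulfillment
    AMI, ITYPE, COUNT = 'ami-xxxxxxxx', 'm1.large', 50

    conn = boto.ec2.connect_to_region('us-east-1')
    reqs = conn.request_spot_instances(price='0.50', image_id=AMI,
                                       count=COUNT, instance_type=ITYPE)
    sir_ids = [r.id for r in reqs]

    # Poll the spot instance requests until the deadline passes or
    # none of them are still open.
    deadline = time.time() + DEADLINE
    while time.time() < deadline:
        open_reqs = [r for r in conn.get_all_spot_instance_requests(sir_ids)
                     if r.state == 'open']
        if not open_reqs:
            break
        time.sleep(30)

    # Cancel whatever is still pending and replace it with on-demand.
    open_reqs = [r for r in conn.get_all_spot_instance_requests(sir_ids)
                 if r.state == 'open']
    if open_reqs:
        conn.cancel_spot_instance_requests([r.id for r in open_reqs])
        conn.run_instances(AMI, min_count=len(open_reqs),
                           max_count=len(open_reqs), instance_type=ITYPE)

The 'get as much spot as you can' mode is the same loop with the
run_instances() fallback at the end skipped.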
I've created an issue to keep track of this:
http://web.mit.edu/star/cluster/issues/98
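In the meantime, assuming a timeout option existed that made 'starcluster
start' exit non-zero when the spot requests aren't fulfilled in time,
the fallback you described could live in a small wrapper, roughly like
this (the cluster name and bid are made up):

    import subprocess
    import sys

    # Hypothetical wrapper: try spot first, then fall back to on-demand.
    if subprocess.call(['starcluster', 'start', '--bid', '0.50',
                        'mycluster']) != 0:
        subprocess.call(['starcluster', 'terminate', '--confirm',
                         'mycluster'])
        if subprocess.call(['starcluster', 'start', 'mycluster']) != 0:
            sys.exit(1)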
~Justin
On Mon, Mar 19, 2012 at 12:18:48PM -0700, David Erickson wrote:
> On 3/18/2012 1:17 PM, David Erickson wrote:
> > Hi, it would be great if there were some kind of timeout option for spot
> > instances, i.e. if they aren't started by some deadline, then shut down
> > everything and return an error exit code. That way a script running
> > StarCluster could retry with regular on-demand instances if there is a
> > deadline for getting some work done.
>
> I should follow this up with some more details:
>
> My workload ideally requires 50 spot instances running SGE jobs. I have
> 50 jobs, so running them all in parallel at once is ideal, since this is
> one step in a serial process. This weekend I ran my scripts that use
> StarCluster to set up a cluster, run jobs on it, tear it down, etc.
> However, it was never able to allocate the 50 machines, and it hung
> waiting for the spot instance requests (SIRs) to become active for 8
> hours and 5 hours during two different sessions (primarily overnight).
> I did some reading, and apparently AWS will not launch any of the nodes
> in the group unless it is able to launch all of them (which seems wrong,
> because I later tried a 25-node launch and it launched 5, then hung on
> the remaining 20 for an hour before I gave up). What would be ideal for
> me would be for StarCluster to create a multi-zone cluster, possibly
> using the load balancer as a base, since my key goal is 50 machines and
> the network traffic between them is insignificant. Presumably you would
> specify which zone houses the master, since it could have EBS volumes
> attached that are then shared over NFS to the other machines. Has there
> been any thought or code headed toward enabling something like this?
>
> Thanks,
> David