Re: Start timeout for spot instances
Hi David,
This is useful, although it could end up wasting several instance hours
if, for example, all but one of the spot instances fails to come up. A
better approach would be for the start command to monitor the spot
instance requests until a deadline, at which point any requests that
haven't been fulfilled are cancelled and on-demand instances are
requested in their place. This would remove the need to write that logic
in your script entirely, and it also enables a mode where you 'get as
much spot as you can' up until a specified timeout. What do you think?
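To make that concrete, here is a rough sketch of what that monitoring
loop might look like using boto (which StarCluster already uses under
the hood). The deadline, bid price, AMI, and instance type below are
all placeholders, not real values:

    import time
    import boto.ec2

    # Illustrative values only -- the real ones would come from the
    # cluster config.
    DEADLINE = 15 * 60          # seconds to wait for spot fulfillment
    AMI, ITYPE, COUNT = 'ami-xxxxxxxx', 'm1.large', 50

    conn = boto.ec2.connect_to_region('us-east-1')
    reqs = conn.request_spot_instances(price='0.50', image_id=AMI,
                                       count=COUNT, instance_type=ITYPE)
    sir_ids = [r.id for r in reqs]

    # Poll the spot instance requests until the deadline passes or
    # none of them are still open.
    deadline = time.time() + DEADLINE
    while time.time() < deadline:
        open_reqs = [r for r in conn.get_all_spot_instance_requests(sir_ids)
                     if r.state == 'open']
        if not open_reqs:
            break
        time.sleep(30)

    # Cancel whatever is still pending and replace it with on-demand.
    open_reqs = [r for r in conn.get_all_spot_instance_requests(sir_ids)
                 if r.state == 'open']
    if open_reqs:
        conn.cancel_spot_instance_requests([r.id for r in open_reqs])
        conn.run_instances(AMI, min_count=len(open_reqs),
                           max_count=len(open_reqs), instance_type=ITYPE)

The 'get as much spot as you can' mode is the same loop with the
run_instances() fallback at the end skipped.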
I've created an issue to keep track of this:
http://web.mit.edu/star/cluster/issues/98
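In the meantime, assuming a timeout option existed that made 'starcluster
start' exit non-zero when the spot requests aren't fulfilled in time,
the fallback you described could live in a small wrapper, roughly like
this (the cluster name and bid are made up):

    import subprocess
    import sys

    # Hypothetical wrapper: try spot first, then fall back to on-demand.
    if subprocess.call(['starcluster', 'start', '--bid', '0.50',
                        'mycluster']) != 0:
        subprocess.call(['starcluster', 'terminate', '--confirm',
                         'mycluster'])
        if subprocess.call(['starcluster', 'start', 'mycluster']) != 0:
            sys.exit(1)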
~Justin
On Mon, Mar 19, 2012 at 12:18:48PM -0700, David Erickson wrote:
> On 3/18/2012 1:17 PM, David Erickson wrote:
> > Hi, it would be great if there were some kind of timeout option for spot
> > instances, i.e. if they aren't started by some deadline, then shut down
> > everything and return an error exit code. That way a script running
> > StarCluster could retry with regular on-demand instances if there is a
> > deadline for getting some work done.
>
> I should follow this up with some more details:
>
> My workload ideally requires 50 spot instances running SGE jobs. I have
> 50 jobs, so running them all in parallel at once is ideal, since this is
> one step in a serial process. This weekend I ran my scripts that use
> StarCluster to set up a cluster, run jobs on it, tear it down, etc.
> However, it was never able to allocate the 50 machines, and it hung
> waiting for the spot instance requests (SIRs) to become active for 8
> hours and 5 hours during two different sessions (primarily overnight).
> I did some reading, and apparently AWS will not launch any of the nodes
> in the group unless it is able to launch all of them (which seems wrong,
> because I later tried a 25-node launch and it launched 5, then hung on
> the remaining 20 for an hour before I gave up). What would be ideal for
> me would be for StarCluster to create a multi-zone cluster, possibly
> using the load balancer as a base, since my key goal is 50 machines and
> the network traffic between them is insignificant. Presumably you would
> specify which zone houses the master, since it could have EBS volumes
> attached that are then shared over NFS to the other machines. Has there
> been any thought or code headed toward enabling something like this?
>
> Thanks,
> David