I do remember that SGE notices when compute nodes go down and can put
them into the error ('E') state.
So, I played a bit with it:
I started a cluster, submitted some jobs and shut down random nodes.
However, SGE did not really notice the problem. The jobs simply
disappeared after a while, and I got no error messages or anything.
I will try to figure out how SGE can be configured to notice compute
nodes going down.
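For reference, one way this might be approached on a standard Grid Engine install (the timeout value here is an arbitrary example, and this is an untested sketch, not verified StarCluster behavior):

```shell
# Submit jobs as rerunnable (-r y) so SGE is allowed to restart them
# on another host if their execution host dies. "myjob.sh" is a
# hypothetical job script.
qsub -r y myjob.sh

# In the global cluster configuration (qconf -mconf opens an editor),
# reschedule_unknown controls whether jobs on hosts that stop
# responding are rescheduled automatically after the given hh:mm:ss
# interval (it defaults to 00:00:00, i.e. disabled):
#
#   reschedule_unknown  00:05:00
```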
Along the same lines, I noticed that nodes sometimes come up with
quite a bit of time in between: the first node has already been
running for a while before the second, third, ... come up.
When you implement the options to add/remove nodes, maybe it would be
possible to have the master and the nodes started independently.
Ideally, with spot instances as nodes, these would register themselves
with the master and become available on their own. (Or the master
would monitor instances coming up and going down.)
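One possible shape for such self-registration, sketched as a boot-time script on the node. The hostnames, paths, template file, and queue name below are assumptions for illustration, not tested StarCluster behavior:

```shell
#!/bin/sh
# Hypothetical sketch: a spot node announces itself to the SGE qmaster
# at boot. Assumes the SGE environment is sourced and that qconf can
# reach the qmaster from this host (or that these commands are run on
# the master on the node's behalf).

HOST=$(hostname)

# Register this machine as an execution host, using a prepared
# template file (path is an assumption).
qconf -Ae /opt/sge6/exec_host_template

# Add it to the host list of the all.q queue so it can receive jobs.
qconf -aattr queue hostlist "$HOST" all.q

# Finally, start the execution daemon on the node.
/etc/init.d/sgeexecd start
```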
Cheers, and once again: thanks for providing starcluster; it is really
saving a lot of work.
On Thu, Apr 15, 2010 at 23:36, Justin Riley <jtriley_at_mit.edu> wrote:
> Hi Thomas,
> This is indeed a great idea and I plan to incorporate spot instances
> into starcluster, however, not for this immediate release coming up.
> The first step would be to allow starcluster to launch spot instances
> via the start command given that I still haven't written an add/remove
> node command for starcluster. The add/remove node commands are planned
> but will take some time and significant testing before they're ready for
> public use.
> You're correct that jobs running on spot instances would be vulnerable
> to being arbitrarily terminated. I'm not sure if SGE would auto
> restart/migrate the jobs or not, but there may be a way to configure it
> to do so. I'd simply need to play around with this by trying it on a
> test cluster and then seeing what's involved to configure SGE to
> migrate/restart failed jobs. If you have ideas or experience in this
> space I'd certainly love to hear them.
> Hope that helps,
> On 04/15/2010 04:07 PM, Thomas Deselaers wrote:
>> As I am finding new features in AWS, I come up with new ideas.
>> I have read on the mailing list that adding and removing nodes to/from
>> a cluster is a planned feature.
>> Now I was wondering if this could be tied into the "spot instances".
>> So, I would start a normal instance as master and as many spot
>> instances as I get (for my price) as nodes.
>> This might lead to some jobs crashing when spot instances are killed
>> (without warning)... ideally, SGE would notice that the jobs crashed
>> and restart them.
Received on Fri Apr 16 2010 - 05:20:26 EDT