StarCluster - Mailing List Archive

when spot instances die

From: Cedar McKay <no email>
Date: Fri, 24 Jul 2015 19:02:11 -0700

I have a cluster of about 80 nodes running SGE jobs. Amazon killed 20 of my nodes, which were spot instances, when supply became tight. Can someone suggest a strategy for recovering? I don't really know what to do, because `qstat` still shows jobs running on the zombie nodes, and `starcluster lc` still lists the dead instances as part of the cluster. How do I recover gracefully from this situation, and in particular make sure the affected jobs are resubmitted to other nodes?


Thanks for any advice!

best,
Cedar
Received on Fri Jul 24 2015 - 22:02:14 EDT