I have a cluster of about 80 nodes running SGE jobs. Amazon killed 20 of my nodes when supply became tight. Can someone suggest a strategy for recovering? I don’t really know what to do, because `qstat` still shows jobs running on the zombie nodes, and `starcluster lc` still lists those dead instances as part of the cluster. How do I recover gracefully from this situation, and in particular make sure the affected jobs are resubmitted to other nodes?
Thanks for any advice!
Received on Fri Jul 24 2015 - 22:02:14 EDT