I have a cluster of about 80 nodes running SGE jobs. Amazon killed 20 of my nodes when supply became tight. Can someone suggest a strategy for recovering? I don’t really know what to do, because when I run `qstat` it still shows jobs running on the zombie nodes. In addition, `starcluster lc` shows those dead instances as still part of the cluster. How do I recover gracefully from this situation, and in particular make sure the affected jobs are resubmitted to other nodes?
Thanks for any advice!
best,
Cedar
Received on Fri Jul 24 2015 - 22:02:14 EDT