Hello Fellow Starclusterers!
We've been using StarCluster experimentally for a couple of months to train some models, and it has been great: easy to set up and easy to use.
But now we are considering making it a more permanent part of our tech stack, and I'm looking more closely at how it behaves when nodes fail, which is to be expected on Amazon. After some Googling, I found a few papers that describe large clusters (1000 nodes, even 10000 nodes), but very little discussion of possible failures and recovery. We also take advantage of spot instances, so we expect those to fail even more often than regular "retail" nodes on Amazon.
I would very much appreciate it if someone on this list could point me to any resources, documentation, or discussion about what can be expected from StarCluster in cases of failure. It would also be great to know what other features might be on the drawing board, as we might be able to help build them!
Specifically, I'm trying to find answers to the following questions. I would very much appreciate any experiences or resources that anyone can share!
- If a node fails, is SGE/Starcluster able to detect this properly?
- What happens to the jobs running on the failed node? Are they retried? Can they be configured to be retried? Does this work reliably?
- What happens to SGE jobs if the master node dies?
- Can the cluster be recovered if the master node is restarted? Is it a single point of failure?
- If yes, does SGE itself support more redundancy than what StarCluster configures? Some diagrams in this presentation seem to imply so:
http://beowulf.rutgers.edu/info-user/pdf/ge_presentation.pdf
- If one uses StarCluster without SGE, what happens when the master node dies? Can the cluster be recovered from this?
- What if we limit the use of NFS and instead use a separate data storage system that provides its own high availability? Does this improve StarCluster's ability to recover from failures of nodes and the master?
Thanks very much for any information or anecdotes along these lines!
Best regards!
-Dmitry
Received on Fri Feb 07 2014 - 14:01:08 EST