Re: Starcluster behavior when nodes fail, when master fails


From: Rayson Ho <no email>
Date: Fri, 7 Feb 2014 14:15:43 -0500

On Fri, Feb 7, 2014 at 2:01 PM, Dmitry Serenbrennikov
<dmitry_at_adchemy.com> wrote:
> - If a node fails, is SGE/Starcluster able to detect this properly?
>
> - What happens to the jobs running on the failed node? Are they retried? Can
> they be configured to be retried? Does this work reliably?

It depends on the "reschedule_unknown" & "max_unheard" parameters of
the SGE master:

http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html
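
As described in sge_conf(5), "max_unheard" is how long the qmaster waits without contact from an execution daemon before marking that host's queues as unknown, and "reschedule_unknown" is how long a host may stay unknown before its jobs are rescheduled. A value of 00:00:00 for "reschedule_unknown" means no rescheduling, and only jobs marked rerunnable (for example with qsub -r y) are eligible. Below is a minimal sketch for checking both values from the text printed by "qconf -sconf". The function names and the sample text in the usage note are hypothetical and not part of SGE.

```python
def parse_sge_conf(text):
    """Parse 'qconf -sconf'-style output (one 'key value' pair per line) into a dict.

    Simplification: lines continued with a trailing backslash are not joined.
    """
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the leading '#global:' style header line.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def hms_to_seconds(value):
    """Convert an 'HH:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def failure_handling_summary(conf):
    """Report the two parameters that govern how node failures are handled.

    Missing parameters are reported as None rather than guessed.
    """
    summary = {}
    for key in ("max_unheard", "reschedule_unknown"):
        raw = conf.get(key)
        summary[key + "_s"] = hms_to_seconds(raw) if raw else None
    resched = summary["reschedule_unknown_s"]
    # 00:00:00 disables rescheduling of jobs on hosts in unknown state.
    summary["rescheduling_enabled"] = bool(resched)
    return summary
```

To use it, capture the output of "qconf -sconf" on the master, pass it to parse_sge_conf(), and inspect failure_handling_summary(). With made-up input containing "max_unheard 00:05:00" and "reschedule_unknown 00:10:00", the summary reports 300 and 600 seconds with rescheduling enabled.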

> - What happens to SGE jobs if the master node dies?

They will continue to run to completion, but the status will not be
updated as the qmaster is the only host that can update the job
status.

> - Can the cluster be recovered if the master node is restarted? Is it a
> single point of failure?

Yes, as long as the job status (spool directory) is intact.

> - If yes, does SGE itself support more redundancy than what is available as
> configured in Starcluster? Some diagrams in this presentation seem to imply
> so http://beowulf.rutgers.edu/info-user/pdf/ge_presentation.pdf

Each SGE cluster can have one or more shadow masters, so SGE can fail
over to another instance:

http://gridscheduler.sourceforge.net/htmlman/htmlman8/sge_shadowd.html
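
Per sge_shadowd(8), a shadow master watches the heartbeat file that the qmaster keeps updating in its spool directory, and bids to take over once that file stops changing. The shadow host has to be listed in the shadow_masters file and needs access to the same spool directory. The sketch below is a simplified model of that stall check only, not sge_shadowd's actual implementation. The function name and the observation format are hypothetical, and the 240-second default is my reading of SGE_GET_ACTIVE_INTERVAL in that man page, so please verify it against your installation.

```python
def should_attempt_takeover(observations, get_active_interval=240):
    """Decide whether a shadow master would bid to take over (simplified model).

    observations: list of (timestamp_seconds, heartbeat_counter) tuples in
    time order, as a shadow host might record them on each check.
    Returns True when the counter has been unchanged for at least
    get_active_interval seconds up to the most recent observation.
    """
    if not observations:
        return False
    last_time, last_value = observations[-1]
    stalled_since = last_time
    # Walk backwards to find when the current counter value was first seen.
    for timestamp, value in reversed(observations):
        if value != last_value:
            break
        stalled_since = timestamp
    return last_time - stalled_since >= get_active_interval
```

For example, if the counter was last seen to change at t=60 and is still the same at t=300, the stall has lasted 240 seconds and the function returns True. A counter that changed on the latest check returns False.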

> - If one uses Starcluster without the SGE, what is the behavior when master
> node dies? Can the cluster be recovered from this?

Without SGE, a StarCluster cluster is just a group of instances. (You can
of course use other schedulers like Condor.)

Assuming that you just want a group of instances, then failure of one
does not affect other healthy instances.

> - What if we limit the use of NFS and instead use a separate system for data
> storage which provides its own high availability. Does this improve ability
> of the starcluster to recover from failure of nodes and the master?

I believe you will need to update the EC2 "master" tag (log onto the
AWS Web management console, you will see the "master" tag there if you
have a StarCluster running), or else some of the StarCluster commands
(like starcluster sshmaster) won't work.

Also see "Reducing and Eliminating NFS usage by Grid Engine":
http://gridscheduler.sourceforge.net/howto/nfsreduce.html

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html

>
>
> Thanks very much for any information or anecdotes along these lines!
>
> Best regards!
>
> -Dmitry
>
>
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
>
Received on Fri Feb 07 2014 - 14:15:46 EST

In reply to: Dmitry Serenbrennikov: "Starcluster behavior when nodes fail, when master fails"
