Re: Large cluster (125 nodes) launch failure


From: Rajat Banerjee <no email>
Date: Fri, 25 Mar 2011 13:28:00 -0400

See comments inline-

On Fri, Mar 25, 2011 at 11:40 AM, Kyeong Soo (Joseph) Kim <
kyeongsoo.kim_at_gmail.com> wrote:

> For instance, the implementation of load
> balancing would be much simpler and better and, if needed, it can
> completely terminate the whole instances.
>
> As for my own experience with 25-node clusters, I found out that the
> load balancer did not terminate the master node, even though it
> finished all assigned jobs; the master node is a single point of
> contact and had to wait for all those jobs running in other nodes to
> finish.
>
>
There is a variable in starcluster/balancers/sge/__init__.py
called:
#This would allow the master to be killed when the queue empties. UNTESTED.
allow_master_kill = False

That would kill the master once the job queue is empty. You can turn it to
True and test it if you'd like.
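
To illustrate what a flag like this controls, here is a minimal sketch in plain Python. It is not the actual StarCluster balancer code: the function name nodes_to_terminate, the node dictionaries, and the job-count arguments are all hypothetical, and only the name allow_master_kill comes from the source file mentioned above. The idea is that idle worker nodes are always candidates for termination, while the master is only added when the flag is on and nothing is queued or running anywhere.

```python
def nodes_to_terminate(nodes, queued_jobs, running_jobs, allow_master_kill=False):
    """Return aliases of nodes that could be shut down (illustrative only).

    nodes        -- list of dicts like {"alias": "node001", "is_master": False}
    queued_jobs  -- number of jobs still waiting in the queue
    running_jobs -- dict mapping node alias to number of jobs running on it
    """
    def is_idle(node):
        return running_jobs.get(node["alias"], 0) == 0

    # Idle workers can always be considered for termination.
    candidates = [n["alias"] for n in nodes if is_idle(n) and not n["is_master"]]

    # The master is only a candidate when the flag is set and the whole
    # cluster is drained: nothing queued and nothing running on any node.
    cluster_drained = queued_jobs == 0 and all(is_idle(n) for n in nodes)
    if allow_master_kill and cluster_drained:
        candidates += [n["alias"] for n in nodes if n["is_master"]]

    return candidates


cluster = [
    {"alias": "master", "is_master": True},
    {"alias": "node001", "is_master": False},
    {"alias": "node002", "is_master": False},
]

# Default behaviour: the master survives even when everything is idle.
print(nodes_to_terminate(cluster, queued_jobs=0, running_jobs={}))

# With the flag on and an empty queue, the master is included as well.
print(nodes_to_terminate(cluster, queued_jobs=0, running_jobs={},
                         allow_master_kill=True))
```

With the flag left at False, the first call lists only the two workers, which matches the behaviour Joseph describes: the master stays up even after its own jobs are done.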

This carries some risk: when the master is killed, the cluster is no longer
accessible, and your results may be lost (unless you were smart enough to
put them on EBS). I kept it semi-hidden because of these risks. Since you're
obviously interested, give it a try. I used it for a little while, and it
was able to terminate the master node when the jobs were finished. The
cluster tags, groups, etc. will still exist, but they won't incur any
charges. At some later date you'd still have to call
'starcluster stop <cluster_tag>'.

Best,
Rajat
Received on Fri Mar 25 2011 - 13:28:21 EDT

