Re: Large cluster (125 nodes) launch failure


From: Joseph <Kyeong>
Date: Sat, 26 Mar 2011 11:19:50 +0000

Hi Rajat,

First of all, thank you very much for your load balancer.
With it, I saved a lot of money and time in the experiments I mentioned,
which used five clusters in total, each with 25 instances. Within just three
days of running, the charge amounted to $4,650+ (out of a research
grant, thanks to Amazon).

As for the issue with the master node, it's very clear that the current
default behavior of the load balancer (i.e., not terminating it) is the right
one and, in fact, unavoidable in the current architecture of
StarCluster based on SGE; as I mentioned, the master node is the
single point of contact in this regard, and I have no intention of
testing the feature you suggested.
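
To make the behavior under discussion concrete, here is a minimal sketch of
the decision as I understand it. This is not StarCluster's actual code: the
Node class, the nodes_to_terminate function, and the job counts are all
hypothetical names made up for illustration. Only the allow_master_kill flag
comes from your message quoted below.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of one cluster node (not a StarCluster class)."""
    alias: str
    is_master: bool
    running_jobs: int


def nodes_to_terminate(nodes, queued_jobs, allow_master_kill=False):
    """Return the aliases of nodes that could be shut down right now.

    Assumed rules for this sketch:
    - while jobs are still queued, nothing is removed;
    - idle workers may be removed once the queue is empty;
    - the master is removed only if allow_master_kill is True and no
      node in the cluster is still running a job.
    """
    if queued_jobs > 0:
        return []

    idle_workers = [n.alias for n in nodes
                    if not n.is_master and n.running_jobs == 0]

    cluster_idle = all(n.running_jobs == 0 for n in nodes)
    if allow_master_kill and cluster_idle:
        masters = [n.alias for n in nodes if n.is_master]
        return idle_workers + masters

    return idle_workers
```

With the flag left at False, the master always survives, which matches what I
observed: it stays up as the single point of contact until I stop the cluster
myself.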

By the way, it's interesting to see what others are doing in this
direction (i.e., replacing SGE or extending it to cloud
computing). A bit of googling turns up lots of discussions and
ongoing research, a small sample of which follows:

* Thread on simple batch queue system on EC2:
http://www.mail-archive.com/debian-science_at_lists.debian.org/msg03318.html

* Thread on GRID extension to cloud computing and U of Wisconsin's Condor:
https://groups.google.com/d/topic/cloud-computing/8Wvkv3IA5zo/discussion

* Research paper on this direction (one from scratch?):
http://www.academypublisher.com/proc/isecs10w/papers/isecs10wp317.pdf

Regards,
Joseph

On Fri, Mar 25, 2011 at 5:28 PM, Rajat Banerjee <rbanerj_at_fas.harvard.edu> wrote:
> See comments inline-
>
> On Fri, Mar 25, 2011 at 11:40 AM, Kyeong Soo (Joseph) Kim
> <kyeongsoo.kim_at_gmail.com> wrote:
>>
>> For instance, the implementation of load
>> balancing would be much simpler and better and, if needed, it can
>> completely terminate the whole instances.
>>
>> As for my own experience with 25-node clusters, I found out that the
>> load balancer did not terminate the master node, even though it
>> finished all assigned jobs; the master node is a single point of
>> contact and had to wait for all those jobs running in other nodes to
>> finish.
>>
>
> There is a variable in starcluster/balancers/sge/__init__.py
> called:
> #This would allow the master to be killed when the queue empties. UNTESTED.
> allow_master_kill = False
> That would kill the master once the job queue is empty. You can turn it to
> True and test it if you'd like.
> This raises some risks - when the master is killed, the cluster is no longer
> accessible, and your results may be lost (unless you were smart enough to
> put them on EBS). I kept it semi-hidden because of these risks. Since you're
> obviously interested, give it a try. I used it for a little while, and it
> was able to terminate the master node when the jobs were finished. Though
> the cluster tags, groups, etc. still exist, they won't incur any charges. At
> some later date you'd still have to call 'starcluster stop <cluster_tag>'.
>
> Best,
> Rajat
>
Received on Sat Mar 26 2011 - 07:19:51 EDT
