StarCluster - Mailing List Archive

Re: 100 nodes cluster

From: Chi Chan <no email>
Date: Thu, 20 Oct 2011 12:46:27 -0400

On Thu, Oct 20, 2011 at 7:32 AM, Paolo Di Tommaso
<Paolo.DiTommaso_at_crg.eu> wrote:
> - Launching 100 nodes, the configuration requires ~ 30 minutes to complete;
> - Launching 200 nodes, it requires ~ 1 hour;
>
> Since our target is launching such a number of nodes to run jobs that may
> require around 1 hour to complete, it would be pointless to spend 50%
> or more of the time just configuring the system. The addnode command does
> not help because this process takes even longer, since for each added node
> StarCluster needs to update /etc/hosts on every node.
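
To put the quoted figures into numbers, here is a rough sketch. It assumes
the job itself takes exactly 60 minutes and ignores every other start-up
delay; the helper name config_share is made up for illustration.

```shell
# Share of total wall-clock time spent on configuration, as an integer
# percentage (shell arithmetic truncates toward zero).
# $1 = configuration minutes, $2 = job minutes (both assumed values)
config_share() {
    echo $(( 100 * $1 / ($1 + $2) ))
}

config_share 30 60   # 100 nodes: ~30 min of configuration, 1-hour job
config_share 60 60   # 200 nodes: ~1 hour of configuration, 1-hour job
```

Under these assumptions the 200-node case is where configuration reaches
half of the total time.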

I think the idea is to start an initial set of nodes and submit the
jobs to SGE (or Open Grid Scheduler, or whatever it is called these
days). While the initial nodes are running jobs, add more nodes in
batches (maybe 50 at a time?). If you start an initial set of nodes and
then add more nodes without submitting jobs to SGE first, the overall
start-up time will be even slower than starting all the nodes at once.
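
A minimal sketch of that batched growth, assuming a StarCluster version
whose addnode accepts -n (the trunk version mentioned further down the
thread). The cluster name, the node counts, the DRY_RUN switch and the
function names are all illustrative, not part of StarCluster:

```shell
#!/bin/sh
# Sketch: grow a running cluster in batches instead of launching every
# node at once. Only "starcluster addnode -n N <cluster>" comes from the
# thread; CLUSTER, DRY_RUN and the function names are assumptions.

CLUSTER=${CLUSTER:-mycluster}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = actually run them

# Print one addnode command per batch needed to grow from $1 nodes to
# $2 nodes, adding at most $3 nodes per batch.
plan_addnode() {
    current=$1; target=$2; batch=$3
    while [ "$current" -lt "$target" ]; do
        step=$(( target - current ))
        [ "$step" -gt "$batch" ] && step=$batch
        echo "starcluster addnode -n $step $CLUSTER"
        current=$(( current + step ))
    done
}

# Run (or just print) the planned commands one batch at a time.
# Jobs should already be queued in SGE on the initial nodes before this
# is called, so the existing nodes do useful work while new ones join.
grow_cluster() {
    plan_addnode "$1" "$2" "$3" | while read -r cmd; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "$cmd"
        else
            # stdin is redirected so the command cannot consume the plan
            $cmd < /dev/null || echo "failed: $cmd" >&2
        fi
    done
}
```

For example, "grow_cluster 20 200 50" would plan three batches of 50 and
a final batch of 30. With 0.92rc2, where addnode adds a single node per
call, the same loop would have to issue one plain "starcluster addnode"
per node instead.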

Ideally, StarCluster should do all that internally, but it is nice
that it can start 200 nodes now!

--Chi


>
> So the question is: would it not be possible to use pre-configured node images
> to shorten the configuration steps as much as possible (ideally to just the
> "/etc/hosts" files and the SGE update)?
>
> I'm thinking something similar to:
> 1) Launch a 2-node configuration.
> 2) Save the master and the node instances as two new separate AMI images.
> 3) Use these images as pre-configured machines to deploy a large cluster,
> updating the "hosts" files (and whatever else is needed).
> This would avoid configuring all the nodes from scratch and reduce the
> overall start-up time.
>
> Does it make sense? Is it possible in some way? Maybe using a custom plugin?
>
> Cheers,
> Paolo Di Tommaso
> Software Engineer
> Comparative Bioinformatics Group
> Centre de Regulacio Genomica (CRG)
> Dr. Aiguader, 88
> 08003 Barcelona, Spain
>
>
>
>
>
> On Oct 17, 2011, at 5:59 PM, Rayson Ho wrote:
>
> 1) I agree with Matt, also a 20-node cluster should be relatively error free
> to bootstrap.
>
>
> 2) EC2 occasionally fails to start a node or 2 when requested to start a
> large number of nodes (instances), and I believe it has to do with how busy
> it is handling other requests as well. The best way to not overload EC2 is
> to start a few nodes at a time rather than the whole cluster all at once.
> In 0.92rc2, there is the addnode command:
> $ starcluster addnode mynewcluster
> The latest trunk introduces the ability to add multiple nodes, e.g. 3 nodes:
> $ starcluster addnode -n 3 mycluster
> So instead of starting a 100-node cluster during start-up, try starting a 20
> or 30-node one first, and then grow the cluster. For 0.92rc2, you may want
> to script the addnode command unless you enjoy typing :-D
>
>
> 3) I will do more scalability testing and hope to contribute scalability
> related improvements to StarCluster in the near future. I am waiting for the
> EBS based AMI so that I can start a large number of instances without
> breaking the bank - I am going to use my own AWS account, so I am interested
> in minimizing cost by using t1.micro (which is slower when running real
> work, but I am interested in the launch speed of EC2 itself, so t1.micro
> seems to be perfect for my needs!).
>
> https://github.com/jtriley/StarCluster/issues/52
> http://mailman.mit.edu/pipermail/starcluster/2011-October/000818.html
> (To Justin: no pressure in getting the EBS AMI, I will be busy till mid
> Nov).
> Rayson
> =================================
> Grid Engine / Open Grid Scheduler
> http://gridscheduler.sourceforge.net
>
>
> ________________________________
> From: Matthew Summers <quantumsummers_at_gentoo.org>
> To: "starcluster_at_mit.edu" <starcluster_at_mit.edu>
> Sent: Monday, October 17, 2011 10:58 AM
> Subject: Re: [StarCluster] 100 nodes cluster
>
> Are you guys running a versioned release or the HEAD on git? I am more
> than fairly certain this has been optimized in the repo, iirc a few
> months ago.
>
> --
> Matthew W. Summers
> Gentoo Foundation Inc.
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
Received on Thu Oct 20 2011 - 12:46:29 EDT
This archive was generated by hypermail 2.3.0.
