StarCluster - Mailing List Archive

Re: 100 nodes cluster

From: Paolo Di Tommaso <no email>
Date: Thu, 20 Oct 2011 13:32:37 +0200

Dear all,

Thank you for your feedback, it has been very useful. The new StarCluster release 0.92 solves most of the problems.

It is much more stable, and now I no longer get any errors when launching large clusters (with 100 or more instances).

However, the overall process is still very slow and, above all, the time required seems to grow linearly with the number of instances.

For example:

- Launching 100 nodes: the configuration takes ~30 minutes to complete.
- Launching 200 nodes: it takes ~1 hour.

Since our goal is to launch that many nodes to run jobs that take around 1 hour to complete, it makes little sense to spend 50% or more of the time just configuring the system. The addnode command does not help because that process takes even longer: for each added node, StarCluster needs to update /etc/hosts on every node.
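To make the overhead concrete, here is a rough back-of-the-envelope model of the two measurements above. It assumes setup time is exactly linear in the node count (two data points only suggest this) and that the job itself takes 60 minutes; the function names are just for illustration.

```python
# Rough model of the figures quoted above. Assumption: setup time is
# exactly linear in node count (~30 minutes per 100 nodes).

def setup_minutes(nodes):
    """Estimated configuration time in minutes for a cluster of `nodes`."""
    return nodes * 30 / 100

def setup_fraction(nodes, job_minutes=60):
    """Share of total wall-clock time spent on configuration."""
    setup = setup_minutes(nodes)
    return setup / (setup + job_minutes)

for n in (100, 200):
    print(n, setup_minutes(n), round(setup_fraction(n), 2))
```

Under these assumptions, a 200-node cluster spends as long being configured as it does running a one-hour job.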


So the question is: would it be possible to use pre-configured node images to shorten the configuration steps as much as possible (ideally to just updating the "/etc/hosts" files and SGE)?


I'm thinking something similar to:

1) Launch a 2-node configuration.
2) Save the master and the node instances as two new separate AMI images.
3) Use these images as pre-configured machines to deploy a large cluster, updating the "hosts" files (and whatever else is needed).

This would avoid configuring all the nodes from scratch and reduce the overall start-up time.
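As an illustration of how small the remaining "hosts" update in step 3 could be, here is a minimal sketch that regenerates the file contents from a list of private IPs. This is not StarCluster's own code: the function name, the alias scheme (master, node001, ...) and the addresses are assumptions made for the example.

```python
def render_hosts(private_ips):
    """Return /etc/hosts contents for a cluster, master first.

    private_ips: list of IP strings; index 0 is taken to be the master.
    The alias scheme (master, node001, ...) is an assumption of this sketch.
    """
    lines = ["127.0.0.1 localhost"]
    for index, ip in enumerate(private_ips):
        alias = "master" if index == 0 else "node%03d" % index
        lines.append("%s %s" % (ip, alias))
    return "\n".join(lines) + "\n"

# Made-up addresses, for illustration only.
print(render_hosts(["10.0.0.10", "10.0.0.11", "10.0.0.12"]))
```

The same text could be generated once and copied to every node, instead of being edited node by node for each addition.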


Does this make sense? Is it possible in some way? Maybe using a custom plugin?


Cheers,

Paolo Di Tommaso
Software Engineer
Comparative Bioinformatics Group
Centre de Regulacio Genomica (CRG)
Dr. Aiguader, 88
08003 Barcelona, Spain






On Oct 17, 2011, at 5:59 PM, Rayson Ho wrote:

1) I agree with Matt. Also, a 20-node cluster should be relatively error-free to bootstrap.


2) EC2 occasionally fails to start a node or two when asked to start a large number of nodes (instances), and I believe this depends on how busy it is handling other requests. The best way to avoid overloading EC2 is to start a few nodes at a time rather than the whole cluster at once.

In 0.92rc2, there is the addnode command:

$ starcluster addnode mynewcluster

The latest trunk introduces the ability to add multiple nodes, e.g. 3 nodes:

$ starcluster addnode -n 3 mycluster

So instead of starting a 100-node cluster during start-up, try starting a 20 or 30-node one first, and then grow the cluster. For 0.92rc2, you may want to script the addnode command unless you enjoy typing :-D
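A minimal sketch of such a script, using only the single-node "starcluster addnode <cluster>" form shown above. The helper names and the dry-run default are assumptions of this example, not part of StarCluster; with dry_run=True it only prints the commands it would run.

```python
import subprocess

def addnode_commands(cluster, nodes_to_add):
    """Build one 'starcluster addnode <cluster>' command per node to add."""
    return [["starcluster", "addnode", cluster] for _ in range(nodes_to_add)]

def grow_cluster(cluster, nodes_to_add, dry_run=True):
    """Add nodes one at a time; stops at the first failing command."""
    for cmd in addnode_commands(cluster, nodes_to_add):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.check_call(cmd)  # raises CalledProcessError on failure

# Dry run: prints the three commands without contacting EC2.
grow_cluster("mycluster", 3)
```

Calling grow_cluster("mycluster", 3, dry_run=False) would actually invoke the commands, one after another.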


3) I will do more scalability testing and hope to contribute scalability-related improvements to StarCluster in the near future. I am waiting for the EBS-based AMI so that I can start a large number of instances without breaking the bank. I am going to use my own AWS account, so I am interested in minimizing cost by using t1.micro (which is slower when running real work, but I am interested in the launch speed of EC2 itself, so t1.micro seems to be perfect for my needs!).

https://github.com/jtriley/StarCluster/issues/52
http://mailman.mit.edu/pipermail/starcluster/2011-October/000818.html

(To Justin: no pressure in getting the EBS AMI, I will be busy till mid Nov).

Rayson

=================================
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net


________________________________
From: Matthew Summers <quantumsummers_at_gentoo.org>
To: "starcluster_at_mit.edu" <starcluster_at_mit.edu>
Sent: Monday, October 17, 2011 10:58 AM
Subject: Re: [StarCluster] 100 nodes cluster

Are you guys running a versioned release or the HEAD on git? I am more
than fairly certain this has been optimized in the repo, iirc a few
months ago.

--
Matthew W. Summers
Gentoo Foundation Inc.
_______________________________________________
StarCluster mailing list
StarCluster_at_mit.edu
http://mailman.mit.edu/mailman/listinfo/starcluster
Received on Thu Oct 20 2011 - 07:32:48 EDT
This archive was generated by hypermail 2.3.0.
