StarCluster - Mailing List Archive

Re: Large cluster (125 nodes) launch failure

From: Justin Riley <no email>
Date: Tue, 15 Mar 2011 18:29:49 -0400

Hi Joseph,

Have a look here for Adam Kraut's account of launching an 80 node
StarCluster:

http://mailman.mit.edu/pipermail/starcluster/2010-December/000552.html

It's *definitely* a matter of the long delay involved in setting up a
large number of nodes, so be patient. This is because StarCluster was
originally intended for 10-20 nodes and uses a fairly naive approach:
it waits for all nodes to come up (including ssh) and then sets up
each node one by one. I have several ideas on how to substantially speed
this process up in the future, but for now I'm concentrating on getting a
release out within the next couple of weeks.
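
To illustrate why one-by-one setup scales poorly, here is a minimal sketch that contrasts sequential setup with setup through a thread pool. It is not StarCluster's actual code: `setup_node`, `setup_sequential`, `setup_concurrent`, the node names, and the simulated delay are all hypothetical stand-ins for network-bound per-node work over SSH.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def setup_node(name, delay=0.01):
    """Hypothetical stand-in for per-node setup work done over SSH."""
    time.sleep(delay)  # simulates network-bound configuration
    return name


def setup_sequential(nodes):
    """Configure nodes one at a time; total time grows linearly with node count."""
    return [setup_node(n) for n in nodes]


def setup_concurrent(nodes, workers=20):
    """Configure nodes through a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(setup_node, nodes))


if __name__ == "__main__":
    # 125 hypothetical node names: one master plus node001..node124.
    nodes = ["master"] + ["node%03d" % i for i in range(1, 125)]
    for fn in (setup_sequential, setup_concurrent):
        start = time.perf_counter()
        done = fn(nodes)
        elapsed = time.perf_counter() - start
        print("%s: %d nodes in %.2fs" % (fn.__name__, len(done), elapsed))
```

Threads suit this kind of work because each task mostly waits on the network. Whether a real cluster setup would see a similar gain depends on which steps are truly independent per node, which this sketch does not model.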

However, after the release I'd be interested in profiling StarCluster
during such a large run and seeing in detail where the slowness comes
from. I'm definitely interested in getting StarCluster to a point
where it can launch large clusters in a reasonable amount of time. As it
stands, I would need to raise my own instance limit drastically to test
and debug this properly. Would you be interested in profiling one of
your large runs and sending me the output some time after the next release?
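
One possible way to collect such a profile is Python's standard `cProfile` and `pstats` modules. The sketch below is only an assumption about how this could be done: `launch_cluster` is a hypothetical placeholder for whatever entry point actually starts the cluster, and `profile_call` is an illustrative helper, not part of StarCluster.

```python
import cProfile
import io
import pstats
import time


def launch_cluster(num_nodes):
    """Hypothetical stand-in for a cluster launch; swap in the real entry point."""
    for _ in range(num_nodes):
        time.sleep(0.001)  # placeholder for per-node setup work


def profile_call(func, *args, top=10):
    """Run func under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


if __name__ == "__main__":
    print(profile_call(launch_cluster, 20))
```

Sorting by cumulative time puts the functions that account for most of the wall-clock wait near the top, which is usually the first thing to check when a run appears to hang.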

~Justin



On 03/15/2011 06:03 PM, Kyeong Soo (Joseph) Kim wrote:
> Hi Austin,
>
> Yes, I requested an increase of the limit to 300 (and got a confirmation,
> of course) and am now successfully running a 50-node cluster (it took 32
> mins, BTW).
> I wonder now whether it is simply a matter of the long delay involved in
> setting up such a large number of nodes.
>
> Regards,
> Joseph
>
>
> On Tue, Mar 15, 2011 at 9:53 PM, Austin Godber <godber_at_uberhip.com> wrote:
>> Does it work at 20 and fail at 21? I think Amazon still has a 20-instance
>> limit, which you can request that they raise. Have you done that?
>>
>> http://aws.amazon.com/ec2/faqs/#How_many_instances_can_I_run_in_Amazon_EC2
>>
>> Austin
>>
>> On 03/15/2011 05:29 PM, Kyeong Soo (Joseph) Kim wrote:
>>> Hi Justin and All,
>>>
>>> This is to report a failure in launching a large cluster with 125
>>> nodes (c1.xlarge).
>>>
>>> I tried to launch the said cluster twice, but StarCluster hung (for
>>> hours) at the following steps:
>>>
>>> .....
>>>
>>>>>> Launching node121 (ami: ami-2857a641, type: c1.xlarge)
>>>>>> Launching node122 (ami: ami-2857a641, type: c1.xlarge)
>>>>>> Launching node123 (ami: ami-2857a641, type: c1.xlarge)
>>>>>> Launching node124 (ami: ami-2857a641, type: c1.xlarge)
>>>>>> Creating security group @sc-hnrlcluster...
>>> Reservation:r-7c264911
>>>>>> Waiting for cluster to come up... (updating every 30s)
>>>>>> Waiting for all nodes to be in a 'running' state...
>>> 125/125 |||||||||||||||||||||||||||||||||||||||| 100%
>>>>>> Waiting for SSH to come up on all nodes...
>>> 125/125 |||||||||||||||||||||||||||||||||||||||| 100%
>>>>>> The master node is ec2-75-101-230-197.compute-1.amazonaws.com
>>>>>> Setting up the cluster...
>>>>>> Attaching volume vol-467ecc2e to master node on /dev/sdz ...
>>>>>> Configuring hostnames...
>>>>>> Mounting EBS volume vol-467ecc2e on /home...
>>>>>> Creating cluster user: kks (uid: 1001, gid: 1001)
>>>>>> Configuring scratch space for user: kks
>>>>>> Configuring /etc/hosts on each node
>>>
>>> So far I have succeeded with this configuration at up to 15 nodes.
>>>
>>> Any ideas?
>>>
>>> With Regards,
>>> Joseph
>>> --
>>> Kyeong Soo (Joseph) Kim, Ph.D.
>>> Senior Lecturer in Networking
>>> Room 112, Digital Technium
>>> Multidisciplinary Nanotechnology Centre, College of Engineering
>>> Swansea University, Singleton Park, Swansea SA2 8PP, Wales UK
>>> TEL: +44 (0)1792 602024
>>> EMAIL: k.s.kim_at_swansea.ac.uk
>>> HOME: http://iat-hnrl.swan.ac.uk/ (group)
>>> http://iat-hnrl.swan.ac.uk/~kks/ (personal)
>>>
>>> _______________________________________________
>>> StarCluster mailing list
>>> StarCluster_at_mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/starcluster

Received on Tue Mar 15 2011 - 18:30:12 EDT