StarCluster - Mailing List Archive

Re: Large cluster (125 nodes) launch failure

From: Kyeong Soo (Joseph) Kim <k.s.kim_at_swansea.ac.uk>
Date: Tue, 15 Mar 2011 22:36:59 +0000

Justin,

Thanks for the prompt and detailed response!

By the way, it's not just a matter of delay.
I just responded to Jeff's e-mail regarding the partial installation
of SGE; like Jeff, I found that 21 out of 50 nodes were not properly
set up during the launch procedure, and qhost shows only 29
execution nodes.

Now I am a bit at a loss: because I believed that all 50 nodes were
successfully running as SGE execution nodes, I submitted a huge array
job (1000 tasks in total). Its tasks are running on the 29 registered
nodes, but the other 21 are sitting idle.
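
For reference, a minimal sketch for finding which nodes SGE missed
(run on the master node; it assumes StarCluster's default host names
master, node001, node002, ... and that qconf is on the PATH):

import subprocess

# Execution hosts SGE actually knows about (qconf -sel prints one
# hostname per line).
registered = set(subprocess.check_output(["qconf", "-sel"]).decode().split())

# Hosts StarCluster should have registered: master plus node001..node049.
expected = set(["master"] + ["node%03d" % i for i in range(1, 50)])

print("missing from SGE:", sorted(expected - registered))

Any hosts it reports could, in principle, be added back by hand with
qconf (e.g. qconf -ae), though I don't know yet whether that is safer
than restarting.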

What should I do with those 21 nodes?
Should I terminate the cluster (along with the already-running jobs)
and start over with an even smaller configuration?

Regards,
Joseph


On Tue, Mar 15, 2011 at 10:29 PM, Justin Riley <jtriley_at_mit.edu> wrote:
>
> Hi Joseph,
>
> Have a look here for Adam Kraut's account of launching an 80 node
> StarCluster:
>
> http://mailman.mit.edu/pipermail/starcluster/2010-December/000552.html
>
> It's *definitely* a matter of a huge delay involved in setting up a
> large number of nodes, so be patient. This is because StarCluster was
> originally intended for 10-20 nodes and uses a fairly naive approach:
> it waits for all nodes to come up (including ssh) and then sets up
> each node one by one. I have several ideas on how to substantially
> speed this process up in the future, but for now I'm concentrating on
> getting a release out within the next couple of weeks.
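>
> To illustrate the idea (just a sketch, not StarCluster's actual code):
> instead of looping over the nodes serially, the per-node setup could be
> fanned out to threads, and since the work is almost entirely
> ssh/network-bound, plain Python threads should already help a lot:
>
> import threading
>
> def setup_node(host):
>     # stand-in for the real per-node work (hostnames, NFS mounts,
>     # SGE installation, /etc/hosts, ...)
>     print("setting up %s" % host)
>
> hosts = ["node%03d" % i for i in range(1, 125)]
> threads = [threading.Thread(target=setup_node, args=(h,)) for h in hosts]
> for t in threads:
>     t.start()
> for t in threads:
>     t.join()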
>
> However, after the release I'd be interested in profiling StarCluster
> during such a large run to see in detail where the slowness comes
> from. I'm definitely interested in getting StarCluster to a point
> where it can launch large clusters in a reasonable amount of time. As
> it stands, I'd need to raise my own instance limit drastically in
> order to test and debug this properly. Would you be interested in
> profiling one of your large runs and sending me the output sometime
> after the next release?
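>
> In case it helps, here is one way to capture and inspect a profile (a
> sketch; "mycluster" is a placeholder for your cluster template name,
> and paths will vary on your machine):
>
> # First run the start command under Python's profiler, e.g.:
> #   python -m cProfile -o starcluster.prof `which starcluster` start mycluster
> # Then look at where the time went:
> import pstats
> stats = pstats.Stats("starcluster.prof")
> stats.sort_stats("cumulative").print_stats(25)  # top 25 by cumulative time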
>
> ~Justin
>
>
>
> On 03/15/2011 06:03 PM, Kyeong Soo (Joseph) Kim wrote:
>> Hi Austin,
>>
>> Yes, I requested an increase of the limit to 300 (and got a
>> confirmation, of course) and am now successfully running a 50-node
>> cluster (it took 32 minutes, by the way).
>> I now wonder if it is simply a matter of the huge delay involved in
>> setting up such a large number of nodes.
>>
>> Regards,
>> Joseph
>>
>>
>> On Tue, Mar 15, 2011 at 9:53 PM, Austin Godber <godber_at_uberhip.com> wrote:
>>> Does it work at 20 and fail at 21?  I think Amazon still has a
>>> 20-instance limit, which you can ask them to raise.  Have you done that?
>>>
>>> http://aws.amazon.com/ec2/faqs/#How_many_instances_can_I_run_in_Amazon_EC2
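>>>
>>> A quick way to see how many instances are currently counted against
>>> that limit (a sketch using boto; it assumes your AWS credentials are
>>> in the environment or boto config):
>>>
>>> import boto
>>>
>>> conn = boto.connect_ec2()
>>> instances = [i for r in conn.get_all_instances() for i in r.instances]
>>> active = [i for i in instances if i.state in ("pending", "running")]
>>> print("%d instances in use (default limit is 20)" % len(active))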
>>>
>>> Austin
>>>
>>> On 03/15/2011 05:29 PM, Kyeong Soo (Joseph) Kim wrote:
>>>> Hi Justin and All,
>>>>
>>>> This is to report a failure when launching a large cluster of 125
>>>> nodes (c1.xlarge).
>>>>
>>>> I tried to launch said cluster twice, but starcluster hung (for
>>>> hours) at the following point:
>>>>
>>>> .....
>>>>
>>>>>>> Launching node121 (ami: ami-2857a641, type: c1.xlarge)
>>>>>>> Launching node122 (ami: ami-2857a641, type: c1.xlarge)
>>>>>>> Launching node123 (ami: ami-2857a641, type: c1.xlarge)
>>>>>>> Launching node124 (ami: ami-2857a641, type: c1.xlarge)
>>>>>>> Creating security group @sc-hnrlcluster...
>>>> Reservation:r-7c264911
>>>>>>> Waiting for cluster to come up... (updating every 30s)
>>>>>>> Waiting for all nodes to be in a 'running' state...
>>>> 125/125 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
>>>> 100%
>>>>>>> Waiting for SSH to come up on all nodes...
>>>> 125/125 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
>>>> 100%
>>>>>>> The master node is ec2-75-101-230-197.compute-1.amazonaws.com
>>>>>>> Setting up the cluster...
>>>>>>> Attaching volume vol-467ecc2e to master node on /dev/sdz ...
>>>>>>> Configuring hostnames...
>>>>>>> Mounting EBS volume vol-467ecc2e on /home...
>>>>>>> Creating cluster user: kks (uid: 1001, gid: 1001)
>>>>>>> Configuring scratch space for user: kks
>>>>>>> Configuring /etc/hosts on each node
>>>>
>>>> So far, I have succeeded with configurations of up to 15 nodes.
>>>>
>>>> Any ideas?
>>>>
>>>> With Regards,
>>>> Joseph
>>>> --
>>>> Kyeong Soo (Joseph) Kim, Ph.D.
>>>> Senior Lecturer in Networking
>>>> Room 112, Digital Technium
>>>> Multidisciplinary Nanotechnology Centre, College of Engineering
>>>> Swansea University, Singleton Park, Swansea SA2 8PP, Wales UK
>>>> TEL: +44 (0)1792 602024
>>>> EMAIL: k.s.kim_at_swansea.ac.uk
>>>> HOME: http://iat-hnrl.swan.ac.uk/ (group)
>>>>       http://iat-hnrl.swan.ac.uk/~kks/ (personal)
>>>>
>>>
>>>
>>
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
>
Received on Tue Mar 15 2011 - 18:37:00 EDT