StarCluster - Mailing List Archive

Re: Large cluster (125 nodes) launch failure

From: Kyeong Soo (Joseph) Kim
Date: Fri, 25 Mar 2011 15:40:29 +0000

Hi Adam,

Those are really valuable observations, and I appreciate them.

Regarding your last comment (... move away from the SGE installer ...), I
am wondering whether it's possible to replace the "SGE on the master node"
(on AWS) with something else running on our own PC (outside AWS).

I know this is well beyond the current scope of the StarCluster project,
but I've found that there is some overlap in function between SGE on the
master node and StarCluster on our own PC.

In this regard, it would be really great if we could implement a simpler
batch queueing system in StarCluster itself and do away with SGE
altogether. For instance, load balancing would be much simpler to
implement, and, when needed, it could terminate all of the instances
completely.
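As a very rough sketch of what such a client-side queue might look like,
assuming plain ssh access to the nodes and one job per node at a time (the
node names, job commands, and script name below are all hypothetical, not
part of StarCluster):

    # run_queue.py - hypothetical client-side job queue that dispatches
    # work to cluster nodes over plain ssh, one job per node at a time.
    import subprocess
    from Queue import Queue, Empty   # the module is named "queue" on Python 3
    from threading import Thread

    NODES = ["node001", "node002"]                     # ssh-reachable hostnames
    JOBS = ["./sim.sh 1", "./sim.sh 2", "./sim.sh 3"]  # commands to run remotely

    jobs = Queue()
    for j in JOBS:
        jobs.put(j)

    def worker(node):
        # Keep pulling jobs and running them on this node until none are left.
        while True:
            try:
                cmd = jobs.get_nowait()
            except Empty:
                return
            subprocess.call(["ssh", node, cmd])

    threads = [Thread(target=worker, args=(n,)) for n in NODES]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Every job has finished at this point, so the whole cluster could now
    # be shut down, e.g. with "starcluster terminate <clustername>".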

In my own experience with 25-node clusters, I found that the load
balancer did not terminate the master node even after it had finished all
of its assigned jobs; because the master node is the single point of
contact, it had to wait for all the jobs running on the other nodes to
finish.

Regards,
Joseph


On Wed, Mar 23, 2011 at 8:40 PM, Adam <adamnkraut_at_gmail.com> wrote:
> Hi Joseph/Justin,
> I did have similar problems starting larger clusters. There were actually
> two issues at play.
> 1) EC2 instances occasionally won't come up with ssh *ever*. In that case
> you have to reboot the instance and it should work. This could be something
> like 1 in 100 instances or just an anomaly, but it's worth noting. The
> workaround I used was to manually verify that all nodes are running and
> port 22 is open, then run starcluster start -x.
> 2) StarCluster runs all ssh commands in a single process. There's overhead
> for each ssh call, so this can end up taking a really long time (30-60 mins)
> in some cases. The solution, I think, is for starcluster to push the ssh
> commands into a queue and use multiple processes/threads to run them (a
> rough sketch of this idea, together with the port-22 check from point 1,
> follows after this message).
> I'm happy to invest some time profiling this or adding parallel ssh support
> if there's interest. The latter may be best left to the pythonistas though
> ;)
> I think that might explain "partial" SGE installs. The effort to move away
> from the SGE installer is a good one as well. That's always a bit of a
> "fingers crossed" installer for me.
> Best,
> Adam
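As a rough sketch of both ideas above, assuming nothing beyond the Python
standard library, a script along these lines could probe port 22 on every
node in parallel before re-running starcluster start -x (the hostnames are
placeholders):

    # check_ssh.py - verify that port 22 answers on every node before
    # re-running "starcluster start -x".  The checks run in parallel
    # threads so a large cluster is not probed one node at a time.
    import socket
    from threading import Thread

    # Placeholder hostnames; fill in the cluster's public DNS names.
    NODES = ["ec2-1-2-3-4.compute-1.amazonaws.com",
             "ec2-5-6-7-8.compute-1.amazonaws.com"]

    results = {}

    def check(host):
        s = socket.socket()
        s.settimeout(5)
        try:
            s.connect((host, 22))
            results[host] = True
        except socket.error:
            results[host] = False        # candidate for a reboot
        finally:
            s.close()

    threads = [Thread(target=check, args=(h,)) for h in NODES]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    for host in sorted(results):
        status = "ssh up" if results[host] else "no ssh - reboot this instance"
        print("%s: %s" % (host, status))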
> On Tue, Mar 15, 2011 at 6:29 PM, Justin Riley <jtriley_at_mit.edu> wrote:
>>
>> Hi Joseph,
>>
>> Have a look here for Adam Kraut's account of launching an 80 node
>> StarCluster:
>>
>> http://mailman.mit.edu/pipermail/starcluster/2010-December/000552.html
>>
>> It's *definitely* a matter of the huge delay involved in setting up a
>> large number of nodes, so be patient. This is because StarCluster was
>> originally intended for 10-20 nodes and uses a fairly naive approach:
>> it waits for all nodes to come up (including ssh) and then sets up
>> each node one by one. I have several ideas on how to substantially speed
>> this process up in the future, but for now I'm concentrating on getting a
>> release out within the next couple of weeks.
>>
>> However, after the release I'd be interested in profiling StarCluster
>> during such a large run and seeing where the slowness comes from in
>> detail. I'm definitely interested in getting StarCluster to a point
>> where it can launch large clusters in a reasonable amount of time. As it
>> stands now I need to drastically up my instance limit in order to test
>> and debug this properly. Would you be interested in profiling one of
>> your large runs and sending me the output some time after the next
>> release?
>>
>> ~Justin
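One low-effort way to capture such a profile, assuming the standard-library
cProfile module and that StarCluster's command-line entry point is
starcluster.cli.main (worth double-checking against the installed version),
might look like this:

    # profile_start.py - rough sketch of profiling a full "starcluster start"
    # run with the standard-library profiler; the entry point and cluster
    # name below are assumptions, not documented API.
    import cProfile
    import pstats
    import sys

    from starcluster import cli

    # Pretend we were invoked as "starcluster start mycluster"
    # ("mycluster" is a placeholder for the real cluster template name).
    sys.argv = ["starcluster", "start", "mycluster"]

    # cProfile.run() swallows the SystemExit raised when main() finishes
    # and still writes the stats file.
    cProfile.run("cli.main()", "start_mycluster.prof")

    # Show the 20 calls with the largest cumulative time; the .prof file
    # can also be sent along for closer inspection.
    pstats.Stats("start_mycluster.prof").sort_stats("cumulative").print_stats(20)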
>>
>>
>>
>> On 03/15/2011 06:03 PM, Kyeong Soo (Joseph) Kim wrote:
>> > Hi Austin,
>> >
>> > Yes, I requested to increase the limit to 300 (got a confirmation, of
>> > course) and am now successfully running a 50-node cluster (it took 32
>> > mins, BTW).
>> > I now wonder if it simply is a matter of the huge delay involved in
>> > setting up such a large number of nodes.
>> >
>> > Regards,
>> > Joseph
>> >
>> >
>> > On Tue, Mar 15, 2011 at 9:53 PM, Austin Godber <godber_at_uberhip.com>
>> > wrote:
>> >> Does it work at 20 and fail at 21?  I think Amazon still has a
>> >> 20-instance limit, which you can request that they raise.  Have you
>> >> done that?
>> >>
>> >>
>> >> http://aws.amazon.com/ec2/faqs/#How_many_instances_can_I_run_in_Amazon_EC2
>> >>
>> >> Austin
>> >>
>> >> On 03/15/2011 05:29 PM, Kyeong Soo (Joseph) Kim wrote:
>> >>> Hi Justin and All,
>> >>>
>> >>> This is to report a failure in launching a large cluster with 125
>> >>> nodes (c1.xlarge).
>> >>>
>> >>> I tried to launch the cluster twice, but starcluster hung (for
>> >>> hours) at the following steps:
>> >>>
>> >>> .....
>> >>>
>> >>>>>> Launching node121 (ami: ami-2857a641, type: c1.xlarge)
>> >>>>>> Launching node122 (ami: ami-2857a641, type: c1.xlarge)
>> >>>>>> Launching node123 (ami: ami-2857a641, type: c1.xlarge)
>> >>>>>> Launching node124 (ami: ami-2857a641, type: c1.xlarge)
>> >>>>>> Creating security group @sc-hnrlcluster...
>> >>> Reservation:r-7c264911
>> >>>>>> Waiting for cluster to come up... (updating every 30s)
>> >>>>>> Waiting for all nodes to be in a 'running' state...
>> >>> 125/125
>> >>> |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
>> >>> 100%
>> >>>>>> Waiting for SSH to come up on all nodes...
>> >>> 125/125
>> >>> |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
>> >>> 100%
>> >>>>>> The master node is ec2-75-101-230-197.compute-1.amazonaws.com
>> >>>>>> Setting up the cluster...
>> >>>>>> Attaching volume vol-467ecc2e to master node on /dev/sdz ...
>> >>>>>> Configuring hostnames...
>> >>>>>> Mounting EBS volume vol-467ecc2e on /home...
>> >>>>>> Creating cluster user: kks (uid: 1001, gid: 1001)
>> >>>>>> Configuring scratch space for user: kks
>> >>>>>> Configuring /etc/hosts on each node
>> >>>
>> >>> I have succeeded with this configuration for clusters of up to 15
>> >>> nodes so far.
>> >>>
>> >>> Any idea?
>> >>>
>> >>> With Regards,
>> >>> Joseph
>> >>> --
>> >>> Kyeong Soo (Joseph) Kim, Ph.D.
>> >>> Senior Lecturer in Networking
>> >>> Room 112, Digital Technium
>> >>> Multidisciplinary Nanotechnology Centre, College of Engineering
>> >>> Swansea University, Singleton Park, Swansea SA2 8PP, Wales UK
>> >>> TEL: +44 (0)1792 602024
>> >>> EMAIL: k.s.kim_at_swansea.ac.uk
>> >>> HOME: http://iat-hnrl.swan.ac.uk/ (group)
>> >>>              http://iat-hnrl.swan.ac.uk/~kks/ (personal)
>> >>>
>> >>
>> >>
>> >
>>
>
Received on Fri Mar 25 2011 - 11:40:30 EDT