StarCluster - Mailing List Archive

Re: Large cluster (125 nodes) launch failure

From: Adam <no email>
Date: Wed, 23 Mar 2011 16:40:51 -0400

Hi Joseph/Justin,

I did have similar problems starting larger clusters. There were actually
two issues at play.

1) Occasionally an EC2 instance *never* comes up with ssh. In that case
you have to reboot the instance, after which it should work. It could be
something like 1 in 100 instances or just an anomaly, but it's worth
noting. The workaround I used was to manually verify that all nodes are
running and that port 22 is open, then run starcluster start -x (see the
first sketch after this list).

2) StarCluster runs all ssh commands serially in a single process.
There's overhead for each ssh call, so setup can end up taking a really
long time (30-60 mins) in some cases. The solution, I think, is for
starcluster to push the ssh commands onto a queue and use multiple
processes/threads to run them (see the second sketch below).
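
Here's a rough sketch of the port-22 check from point 1. It assumes you
already have the public hostnames of your running nodes (from the EC2
console or the API); the hostname below is just a placeholder:

    import socket

    def ssh_is_open(host, port=22, timeout=5):
        # A plain TCP connect is a good-enough signal that sshd is
        # reachable on the node.
        try:
            socket.create_connection((host, port), timeout).close()
            return True
        except OSError:
            return False

    # Placeholder: fill in the public hostnames of your running nodes.
    nodes = ["ec2-75-101-230-197.compute-1.amazonaws.com"]
    stuck = [h for h in nodes if not ssh_is_open(h)]
    if stuck:
        print("Reboot these nodes, then re-run 'starcluster start -x':")
        for h in stuck:
            print("  " + h)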

I'm happy to invest some time profiling this or adding parallel ssh support
if there's interest. The latter may be best left to the pythonistas though
;)
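
And a rough sketch of the queued/parallel ssh idea from point 2. This
isn't StarCluster code; it just shells out to the ssh binary from a pool
of worker threads, and the hostnames and command are placeholders:

    import queue
    import subprocess
    import threading

    def worker(jobs):
        # Each worker pulls (host, command) pairs off the queue until
        # it's drained, so the per-call ssh overhead overlaps across
        # nodes instead of adding up serially.
        while True:
            try:
                host, command = jobs.get_nowait()
            except queue.Empty:
                return
            subprocess.call(["ssh", "-o", "StrictHostKeyChecking=no",
                             host, command])

    jobs = queue.Queue()
    for host in ["node001", "node002", "node003"]:  # placeholder names
        jobs.put((host, "hostname"))

    threads = [threading.Thread(target=worker, args=(jobs,))
               for _ in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()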

I think that might explain the "partial" SGE installs. The effort to move
away from the SGE installer is a good one as well; it has always been a
bit of a "fingers crossed" step for me.

Best,
Adam

On Tue, Mar 15, 2011 at 6:29 PM, Justin Riley <jtriley_at_mit.edu> wrote:

>
> Hi Joseph,
>
> Have a look here for Adam Kraut's account of launching an 80 node
> StarCluster:
>
> http://mailman.mit.edu/pipermail/starcluster/2010-December/000552.html
>
> It's *definitely* a matter of the huge delay involved in setting up a
> large number of nodes, so be patient. This is because StarCluster was
> originally intended for 10-20 nodes and uses a fairly naive approach:
> it waits for all nodes to come up (including ssh) and then sets up
> each node one by one. I have several ideas on how to substantially
> speed this process up in the future, but for now I'm concentrating on
> getting a release out within the next couple of weeks.
>
> However, after the release I'd be interested in profiling StarCluster
> during such a large run and seeing where the slowness comes from in
> detail. I'm definitely interested in getting StarCluster to a point
> where it can launch large clusters in a reasonable amount of time. As it
> stands now I need to drastically up my instance limit in order to test
> and debug this properly. Would you be interested in profiling one of
> your large runs and sending me the output some time after the next release?
>
> ~Justin
>
>
>
> On 03/15/2011 06:03 PM, Kyeong Soo (Joseph) Kim wrote:
> > Hi Austin,
> >
> > Yes, I requested an increase of the limit to 300 (and got a
> > confirmation, of course) and am now successfully running a 50-node
> > cluster (it took 32 minutes, BTW).
> > I wonder now if it is simply a matter of the huge delay involved in
> > setting up such a large number of nodes.
> >
> > Regards,
> > Joseph
> >
> >
> > On Tue, Mar 15, 2011 at 9:53 PM, Austin Godber <godber_at_uberhip.com>
> wrote:
> >> Does it work at 20 and fail at 21? I think Amazon still has a
> >> 20-instance limit, which you can request that they raise. Have you
> >> done that?
> >>
> >>
> http://aws.amazon.com/ec2/faqs/#How_many_instances_can_I_run_in_Amazon_EC2
> >>
> >> Austin
> >>
> >> On 03/15/2011 05:29 PM, Kyeong Soo (Joseph) Kim wrote:
> >>> Hi Justin and All,
> >>>
> >>> This is to report a failure in launching a large cluster with 125
> >>> nodes (c1.xlarge).
> >>>
> >>> I tried to launch said cluster twice, but starcluster hung (for
> >>> hours) at the following steps:
> >>>
> >>> .....
> >>>
> >>>>>> Launching node121 (ami: ami-2857a641, type: c1.xlarge)
> >>>>>> Launching node122 (ami: ami-2857a641, type: c1.xlarge)
> >>>>>> Launching node123 (ami: ami-2857a641, type: c1.xlarge)
> >>>>>> Launching node124 (ami: ami-2857a641, type: c1.xlarge)
> >>>>>> Creating security group @sc-hnrlcluster...
> >>> Reservation:r-7c264911
> >>>>>> Waiting for cluster to come up... (updating every 30s)
> >>>>>> Waiting for all nodes to be in a 'running' state...
> >>> 125/125
> |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> >>> 100%
> >>>>>> Waiting for SSH to come up on all nodes...
> >>> 125/125
> |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> >>> 100%
> >>>>>> The master node is ec2-75-101-230-197.compute-1.amazonaws.com
> >>>>>> Setting up the cluster...
> >>>>>> Attaching volume vol-467ecc2e to master node on /dev/sdz ...
> >>>>>> Configuring hostnames...
> >>>>>> Mounting EBS volume vol-467ecc2e on /home...
> >>>>>> Creating cluster user: kks (uid: 1001, gid: 1001)
> >>>>>> Configuring scratch space for user: kks
> >>>>>> Configuring /etc/hosts on each node
> >>>
> >>> I have succeeded with the same configuration at up to 15 nodes so far.
> >>>
> >>> Any idea?
> >>>
> >>> With Regards,
> >>> Joseph
> >>> --
> >>> Kyeong Soo (Joseph) Kim, Ph.D.
> >>> Senior Lecturer in Networking
> >>> Room 112, Digital Technium
> >>> Multidisciplinary Nanotechnology Centre, College of Engineering
> >>> Swansea University, Singleton Park, Swansea SA2 8PP, Wales UK
> >>> TEL: +44 (0)1792 602024
> >>> EMAIL: k.s.kim_at_swansea.ac.uk
> >>> HOME: http://iat-hnrl.swan.ac.uk/ (group)
> >>> http://iat-hnrl.swan.ac.uk/~kks/ (personal)
> >>>
>
Received on Wed Mar 23 2011 - 16:41:12 EDT