StarCluster - Mailing List Archive

Re: Starcluster stuck during setup

From: Niklas Krumm <no email>
Date: Tue, 25 Mar 2014 20:38:54 -0700

Hi Cory,

I believe the reason all the /etc/hosts files are the same is that each node uses its /etc/hosts file to resolve the hostnames of every other node in your cluster. For large clusters this poses a bit of a problem, as adding a single node requires updating the /etc/hosts file on every other node (although this is not strictly needed if the nodes don't need to communicate with each other). Furthermore, when adding several nodes this process is repeated for each added node, so adding N nodes ends up being roughly an O(N**2/2) operation (I think).
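
To put rough numbers on that, here is a quick Python sketch (back-of-the-envelope only, not StarCluster code) showing how the number of /etc/hosts rewrites grows when a cluster is grown one node at a time and every node's file is rewritten on each add:

    # Rough illustration only -- not StarCluster code.
    def hosts_file_updates(initial_nodes, nodes_to_add):
        """Count /etc/hosts rewrites when growing a cluster one node at a
        time, assuming every node's file is rewritten for each new node."""
        updates = 0
        size = initial_nodes
        for _ in range(nodes_to_add):
            size += 1
            updates += size  # every node, including the new one, gets an update
        return updates

    print(hosts_file_updates(1, 30))   # grow 1 -> 31 nodes:    495 file updates
    print(hosts_file_updates(1, 300))  # grow 1 -> 301 nodes: 45450 file updates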

In any case, there are two modifications to the SC code that can help alleviate this. The first is _at_FinchPowers' PR #347 (https://github.com/jtriley/StarCluster/pull/347), which favors file copy over streaming and dramatically improves the speed of updating files. The second is one I proposed, which uses a DNS server running on the master to resolve hostnames. This removes the requirement to update /etc/hosts on all nodes when a new node is added; instead, the entry is added only to the master's /etc/hosts (and served up by dnsmasq). That PR is #321 (https://github.com/jtriley/StarCluster/pull/321). I use both of these and can add nodes to even very large (300+ node) clusters in a constant amount of time. I should mention there are some other N or N**2/2 operations in updating the SGE environment; these are not strictly needed and can be turned off. I would be happy to discuss this or propose a PR with those modifications in the future...
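
The rough idea behind the DNS approach, sketched in Python (an illustration only, not the code from PR #321; the helper name and example values here are made up):

    # dnsmasq answers DNS queries from the master's /etc/hosts and re-reads
    # that file when it receives SIGHUP, so adding a node only touches the master.
    import subprocess

    def register_node(ip, hostname):
        """Hypothetical helper: add one node to the master's /etc/hosts
        and tell dnsmasq to pick it up."""
        with open("/etc/hosts", "a") as hosts:
            hosts.write("%s %s\n" % (ip, hostname))
        subprocess.call(["pkill", "-HUP", "dnsmasq"])  # dnsmasq re-reads /etc/hosts on SIGHUP

    # Example: register_node("10.0.0.42", "node042")
    # Worker nodes then only need the master's IP as their nameserver in
    # /etc/resolv.conf rather than a full local copy of /etc/hosts.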

Good luck,
Nik


On Mar 25, 2014, at 6:03 PM, Cory Dolphin wrote:

> To follow up, after using the hack in the link, I still find that the cluster takes a LONG time to configure /etc/hosts. Any idea why this might be happening?
> Weirder yet, all of the nodes have an identical /etc/hosts file.
> I ran a loop ssh'ing into all 31 nodes (30 + master) and cat'd the /etc/hosts file.
>
>
> On Tue, Mar 25, 2014 at 8:11 PM, Cory Dolphin <wcdolphin_at_gmail.com> wrote:
>
> Whenever I try to add a node to a spot instance cluster, StarCluster does not properly wait for the spot request to be fulfilled, and instead errors out:
>
> starcluster addnode mycluster
> StarCluster - (http://star.mit.edu/cluster) (v. 0.95.3)
> Software Tools for Academics and Researchers (STAR)
> Please submit bug reports to starcluster_at_mit.edu
>
> >>> Launching node(s): node030
> SpotInstanceRequest:sir-85f44249
> >>> Waiting for spot requests to propagate...
> >>> Waiting for node(s) to come up... (updating every 30s)
> >>> Waiting for all nodes to be in a 'running' state...
> 30/30 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Waiting for SSH to come up on all nodes...
> 30/30 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Waiting for cluster to come up took 1.179 mins
> !!! ERROR - node 'node030' does not exist
>
>
> Once the spot instance request is fulfilled, the instance does not have a name. Looks like someone else had this problem quite recently. I wonder what the difference between our setup and yours is?
>
>
> On Tue, Mar 25, 2014 at 7:42 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
> If you really have a slow connection, you may consider bootstrapping
> StarCluster on AWS - i.e., configure an m1.small (or even t1.micro) and
> install StarCluster on that node. In fact, there's a CloudFormation
> template for that:
> http://aws.typepad.com/aws/2012/06/ec2-spot-instance-updates-auto-scaling-and-cloudformation-integration-new-sample-app-1.html
> On the other hand, it's way easier to do it by hand: just launch
> an instance from the standard Ubuntu AMI, and then install StarCluster
> on that instance.
>
> And like others mentioned, most large StarClusters are launched by
> first starting a small cluster and then growing it dynamically. You
> should be able to run the addnode command from your qmaster node
> provided that you have StarCluster set up there (note that your AWS key
> will be on the EC2 instance, so it is slightly more risky if security
> is the main concern).
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>
>
> On Tue, Mar 25, 2014 at 8:04 AM, Butson, Christopher <cbutson_at_mcw.edu> wrote:
> > Interesting: I let it go and it eventually continued, but it took over an hour to get past "Configuring passwordless ssh for root". Still waiting for the cluster to finish startup...
> >
> > Christopher R. Butson, Ph.D.
> > Associate Professor
> > Biotechnology & Bioengineering Center
> > Departments of Neurology, Neurosurgery, Psychiatry & Behavioral Medicine
> > Medical College of Wisconsin
> > (414) 955-2678
> > cbutson_at_mcw.edu
> >
> >
> > From: Christopher Butson <cbutson_at_mcw.edu>
> > Date: Tuesday, March 25, 2014 12:13 PM
> > To: starcluster_at_mit.edu
> > Subject: Starcluster stuck during setup
> >
> > I'm on a slow internet connection overseas, trying to initiate a cluster using StarCluster. Once I type "starcluster start mycluster" everything seems to go OK, but it gets stuck at the following point and never gets past it:
> > >>> Mounting all NFS export path(s) on 79 worker node(s)
> > 79/79 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> > >>> Setting up NFS took 2.777 mins
> > >>> Configuring passwordless ssh for root
> >
> > Any idea why this might occur? Thanks,
> > Chris
> >
> > Christopher R. Butson, Ph.D.
> > Associate Professor
> > Biotechnology & Bioengineering Center
> > Departments of Neurology, Neurosurgery, Psychiatry & Behavioral Medicine
> > Medical College of Wisconsin
> > (414) 955-2678
> > cbutson_at_mcw.edu
> >
> >
>
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
Received on Tue Mar 25 2014 - 23:38:59 EDT
