StarCluster - Mailing List Archive

Re: Starcluster stuck during setup

From: Cory Dolphin <no email>
Date: Tue, 25 Mar 2014 21:03:18 -0400

To follow up: after applying the hack in the link, I still find that the
cluster takes a long time to configure /etc/hosts. Any idea why this might
be happening?
Weirder yet, all of the nodes have an identical /etc/hosts file.
I ran a loop that ssh'd into all 31 nodes (30 + master) and cat'd the
/etc/hosts file on each.
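
For anyone who wants to repeat that check, here is a minimal sketch in Python. It is not taken from StarCluster itself. It assumes the nodes are reachable over plain ssh under the aliases master and node001..node030, and that key-based login works from wherever it is run. All of the function names are hypothetical.

```python
import hashlib
import subprocess
from collections import defaultdict


def node_aliases(num_workers):
    """Return 'master' plus node001..nodeNNN (assumed alias scheme)."""
    return ["master"] + ["node%03d" % i for i in range(1, num_workers + 1)]


def fetch_hosts_file(alias, timeout=30):
    """Read /etc/hosts from one node over plain ssh.

    Assumes the alias resolves and that key-based login is already set up.
    """
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", alias, "cat", "/etc/hosts"],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout


def group_by_content(contents):
    """Group node aliases by the SHA-256 digest of their /etc/hosts text."""
    groups = defaultdict(list)
    for alias, text in contents.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        groups[digest].append(alias)
    return dict(groups)


def missing_aliases(hosts_text, expected):
    """Return the expected aliases that never appear as a hostname."""
    seen = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        seen.update(fields[1:])  # the first field is the IP address
    return [alias for alias in expected if alias not in seen]


def check_cluster(num_workers, fetch=fetch_hosts_file):
    """Map each distinct /etc/hosts digest to (nodes, aliases missing from it)."""
    aliases = node_aliases(num_workers)
    contents = {alias: fetch(alias) for alias in aliases}
    report = {}
    for digest, members in group_by_content(contents).items():
        report[digest] = (members, missing_aliases(contents[members[0]], aliases))
    return report
```

Calling check_cluster(30) would cover the 31 nodes mentioned above. A single digest in the result means every node has the same file, and the second element of each entry lists any aliases that the file does not mention.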


On Tue, Mar 25, 2014 at 8:11 PM, Cory Dolphin <wcdolphin_at_gmail.com> wrote:

>
> Whenever I try to add a node to a spot instance cluster, StarCluster does
> not properly wait for the spot request to be fulfilled, and instead errors
> out:
>
> starcluster addnode mycluster
> StarCluster - (http://star.mit.edu/cluster) (v. 0.95.3)
> Software Tools for Academics and Researchers (STAR)
> Please submit bug reports to starcluster_at_mit.edu
>
> >>> Launching node(s): node030
> SpotInstanceRequest:sir-85f44249
> >>> Waiting for spot requests to propagate...
> >>> Waiting for node(s) to come up... (updating every 30s)
> >>> Waiting for all nodes to be in a 'running' state...
> 30/30 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Waiting for SSH to come up on all nodes...
> 30/30 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Waiting for cluster to come up took 1.179 mins
> !!! ERROR - node 'node030' does not exist
>
>
> Once the spot instance request is fulfilled, the instance does not have a
> name. Looks like someone else had this problem quite recently <http://star.mit.edu/cluster/mlarchives/2058.html>.
> I wonder what the difference between our setup and yours is?
>
>
> On Tue, Mar 25, 2014 at 7:42 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
>
>> If you really have a slow connection, you may want to consider bootstrapping
>> StarCluster on AWS, i.e., configure an m1.small (or even a t1.micro) and
>> install StarCluster on that node. In fact, there's a CloudFormation
>> template for that:
>>
>> http://aws.typepad.com/aws/2012/06/ec2-spot-instance-updates-auto-scaling-and-cloudformation-integration-new-sample-app-1.html
>>
>> On the other hand, it's way easier to do it by hand: just launch
>> an instance from the standard Ubuntu AMI, and then install StarCluster
>> on that instance.
>>
>> And as others mentioned, most large StarClusters are launched by
>> first starting a small cluster and then growing it dynamically. You
>> should be able to run the addnode command from your qmaster node,
>> provided that you have StarCluster set up there (note that your AWS key
>> will be on the EC2 instance, so it is slightly more risky if security
>> is the main concern).
>>
>> Rayson
>>
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>>
>>
>> On Tue, Mar 25, 2014 at 8:04 AM, Butson, Christopher <cbutson_at_mcw.edu> wrote:
>> > Interesting: I let it go and it eventually continued, but it spent over
>> > an hour at "Configuring passwordless ssh for root". Still waiting for the
>> > cluster to finish startup...
>> >
>> > Christopher R. Butson, Ph.D.
>> > Associate Professor
>> > Biotechnology & Bioengineering Center
>> > Departments of Neurology, Neurosurgery, Psychiatry & Behavioral Medicine
>> > Medical College of Wisconsin
>> > (414) 955-2678
>> > cbutson_at_mcw.edu
>> >
>> >
>> > From: Christopher Butson <cbutson_at_mcw.edu>
>> > Date: Tuesday, March 25, 2014 12:13 PM
>> > To: "starcluster_at_mit.edu" <starcluster_at_mit.edu>
>> > Subject: Starcluster stuck during setup
>> >
>> > I'm on a slow internet connection overseas, trying to start a
>> > cluster using StarCluster. Once I type "starcluster start mycluster",
>> > everything seems to go OK, but it gets stuck at the following point and
>> > never seems to get past it:
>> > >>> Mounting all NFS export path(s) on 79 worker node(s)
>> > 79/79 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>> > >>> Setting up NFS took 2.777 mins
>> > >>> Configuring passwordless ssh for root
>> >
>> > Any idea why this might occur? Thanks,
>> > Chris
>> >
>> > Christopher R. Butson, Ph.D.
>> > Associate Professor
>> > Biotechnology & Bioengineering Center
>> > Departments of Neurology, Neurosurgery, Psychiatry & Behavioral Medicine
>> > Medical College of Wisconsin
>> > (414) 955-2678
>> > cbutson_at_mcw.edu
>> >
>> >
>> > _______________________________________________
>> > StarCluster mailing list
>> > StarCluster_at_mit.edu
>> > http://mailman.mit.edu/mailman/listinfo/starcluster
>>
>
>
Received on Tue Mar 25 2014 - 21:03:19 EDT