StarCluster - Mailing List Archive

Re: addnode fails on bid

From: Yugarshi Mondal <no email>
Date: Fri, 14 Feb 2014 14:55:00 -0800

To Starcluster Mailing Archives and future self:

I couldn't diagnose the source of the problem, but there's a hacky
workaround.

Find cluster.py in the installation. I found it here:
/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py
If it's not there, check the installation path that's printed to the terminal
during installation (if you can).
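
If the install output is long gone, Python itself can usually report where a module lives. The helper below is a minimal sketch of that idea; module_source_path is a made-up name for illustration, not part of StarCluster, and it assumes the package is importable by the same interpreter that runs the starcluster command.

```python
import importlib


def module_source_path(module_name):
    """Return the path of the source file backing an importable module."""
    module = importlib.import_module(module_name)
    path = module.__file__
    # Python 2 may report the compiled .pyc/.pyo file; strip the last
    # character to point at the .py source next to it.
    if path.endswith(('.pyc', '.pyo')):
        path = path[:-1]
    return path
```

Calling module_source_path('starcluster.cluster') should then print the location of cluster.py on a machine where StarCluster is installed.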

Look for the function wait_for_active_spots (it was around line 1360; the
line numbers differ between 0.9999 and 0.95).
At the top of the function, after the comments, add:
hackStop = raw_input('Hit Return When all nodes are up...')
Save, exit, and recompile cluster.py (better to make a backup copy of the
original first, just in case).
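
To show where the line goes, here is a rough sketch of the placement. The function below is only a stand-in: its name, signature and body are assumptions for illustration and do not reproduce the real wait_for_active_spots in cluster.py. The read_line parameter is also hypothetical; it exists only so the sketch runs on both Python 2.7 (raw_input) and Python 3 (input).

```python
def wait_for_active_spots_sketch(read_line=None):
    """Stand-in for the original function's leading comments/docstring."""
    if read_line is None:
        try:
            read_line = raw_input  # Python 2.7, as used by this install
        except NameError:
            read_line = input  # Python 3 fallback for running the sketch
    # The added pause: block here until the operator presses Return.
    hackStop = read_line('Hit Return When all nodes are up...')
    # ... the original body of the function would continue unchanged here ...
    return hackStop
```

In the real file only the single raw_input line is added; everything else in the function stays as it was.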

This change makes the program pause right after starcluster sends the spot
requests. Watch the EC2 console and wait for the spot requests to be filled.
When they are ALL in a running state, hit enter. Addnode should then
proceed as usual. If you're not adding a node (if it's your initial spot
cluster start), you can hit enter right away and let the program do the
waiting; only addnode needs to be manually overseen.
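
The manual "watch the console, then hit enter" step could in principle be replaced by a small polling loop. The following is only a sketch under assumptions, not something StarCluster provides: wait_until_all_running and fetch_states are hypothetical names, and fetch_states is a callable you would have to write yourself to return the current state string of every instance you are waiting on (for example by querying EC2 with whatever client you use).

```python
import time


def wait_until_all_running(fetch_states, poll_seconds=30,
                           sleep=time.sleep, max_polls=None):
    """Poll until every reported state is 'running'.

    Returns the number of unsuccessful polls before success.
    Raises RuntimeError if max_polls is set and reached.
    """
    polls = 0
    while True:
        states = list(fetch_states())
        # An empty list means nothing has been launched yet, so keep waiting.
        if states and all(state == 'running' for state in states):
            return polls
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise RuntimeError('nodes not running after %d polls' % polls)
        sleep(poll_seconds)
```

The 30 second default mirrors the "updating every 30s" interval in the starcluster output; passing max_polls keeps the loop from waiting forever if a bid is never filled.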

Yoshi


On Thu, Feb 13, 2014 at 8:32 PM, Yugarshi Mondal <ymondal_at_berkeley.edu> wrote:

> Hey Starcluster,
>
> I'm getting the same error as this guy:
> http://star.mit.edu/cluster/mlarchives/1592.html
>
> Briefly:
> When I go to use addnode, a spot request opens on Amazon (I'm starting a
> spot cluster, so addnode bids). But starcluster proceeds to try to install
> ssh without waiting for the node to come up.
>
> >>> Launching node(s): node002
> SpotInstanceRequest:sir-b35acc5e
> >>> Waiting for spot requests to propagate...
> >>> Waiting for node(s) to come up... (updating every 30s)
> >>> Waiting for all nodes to be in a 'running' state...
> 2/2 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> 100%
> >>> Waiting for SSH to come up on all nodes...
> 2/2 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> 100%
> >>> Waiting for cluster to come up took 0.020 mins
> !!! ERROR - node 'node002' does not exist
>
> Moreover, this only happens when addnode tries to bid (either by default
> because I'm running a spot cluster or by inline directive).
>
> I don't know what to try next, though. Do you guys have any ideas where to
> start?
>
> thanks
> Yoshi
>
Received on Fri Feb 14 2014 - 17:55:03 EST
This archive was generated by hypermail 2.3.0.
