Re: issues with adding multiple nodes to a running cluster
Ugh, this is definitely a bug. Another user reported it on the GitHub issue
tracker, and it has already been fixed on GitHub. I'm releasing 0.93
today (skipping 0.92.2, given the amount of new stuff in this
release), which should fix this.
Will send an announcement once it's released. Stay tuned...
~Justin
On Tue 03 Jan 2012 03:53:39 PM EST, Wei Tao wrote:
> Hi all,
>
> From time to time, when I tried to add nodes to a running StarCluster
> cluster using either loadbalance or addnodes, StarCluster would misfire.
> For example, I set "-a 5" in loadbalance,
>
> command:
> starcluster loadbalance -m 20 -a 5 -n 1 <mycluster>
>
> here is what I got:
>
> >>> Loading full job history
> Cluster size: 10
> Queued jobs: 361
> Oldest queued job: 2012-01-03 20:13:56
> Avg job duration: 256 secs
> Avg job wait time: 167 secs
> Last cluster modification time: 2012-01-03 20:17:07
> >>> A job has been waiting for 963 sec, longer than max 900
> >>> *** ADDING 5 NODES at 2012-01-03 20:29:59.623917
> >>> Launching node(s): node010, node011, node012, node013, node014
> SpotInstanceRequest:sir-29586e14
> SpotInstanceRequest:sir-46e90414
> SpotInstanceRequest:sir-314a9814
> SpotInstanceRequest:sir-99387e14
> SpotInstanceRequest:sir-9ad72a14
> SpotInstanceRequest:sir-089dcc11
> SpotInstanceRequest:sir-09d28011
> SpotInstanceRequest:sir-64d4dc11
> SpotInstanceRequest:sir-45516411
> SpotInstanceRequest:sir-f2b31a11
> SpotInstanceRequest:sir-0198f214
> SpotInstanceRequest:sir-1db0a014
> SpotInstanceRequest:sir-49c97814
> SpotInstanceRequest:sir-94fdd414
> SpotInstanceRequest:sir-69db0014
> SpotInstanceRequest:sir-6f410612
> SpotInstanceRequest:sir-93c1c012
> SpotInstanceRequest:sir-e44c7c12
> SpotInstanceRequest:sir-dbc51012
> SpotInstanceRequest:sir-aa52dc12
> SpotInstanceRequest:sir-9f9e6811
> SpotInstanceRequest:sir-50053011
> SpotInstanceRequest:sir-33455211
> SpotInstanceRequest:sir-ffcdd011
> SpotInstanceRequest:sir-c1d7ee11
> >>> Waiting for node(s) to come up... (updating every 30s)
> >>> Waiting for open spot requests to become active...
> 34/34
> |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Waiting for all nodes to be in a 'running' state...
> 35/35
> |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Waiting for SSH to come up on all nodes...
> ^C/35 |||||||||||||||||||||||||||||||||||||||||||||||||||||||
> | 85%
>
> Instead of 5 nodes, 25 nodes were fired up. Has anyone experienced a
> similar issue? Is this a bug in the code, or did I miss something in my
> command?
>
> Thanks!
>
>
>
> --
> Wei Tao, Ph.D.
> TSI Biocomputing LLC
> 617-564-0934
>
Received on Tue Jan 03 2012 - 15:55:43 EST