StarCluster - Mailing List Archive

loadbalance

From: Ryan Golhar <no email>
Date: Tue, 2 Jul 2013 09:59:22 -0400

Hi all - I'm running the latest version of StarCluster from GitHub and
using the loadbalance feature. I have 10 jobs in the queue, with 1 running.

starcluster just tried adding a node and failed as follows:

>>> Loading full job history
Execution hosts: 1
Queued jobs: 10
Oldest queued job: 2013-07-02 13:33:29
Avg job duration: 179 secs
Avg job wait time: 119 secs
Last cluster modification time: 2013-07-02 13:36:42
>>> A job has been waiting for 923 sec, longer than max 900
*** WARNING - Adding 1 nodes at 2013-07-02 13:48:52.123504
>>> Launching node(s): node001
SpotInstanceRequest:sir-0c581634
>>> Waiting for node(s) to come up... (updating every 30s)
>>> Waiting for all nodes to be in a 'running' state...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for SSH to come up on all nodes...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for cluster to come up took 0.021 mins
!!! ERROR - Failed to add new host
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/balancers/sge/__init__.py", line 666, in _eval_add_node
    self._cluster.add_nodes(need_to_add)
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/cluster.py", line 888, in add_nodes
    node = self.get_node_by_alias(alias)
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/cluster.py", line 732, in get_node_by_alias
    raise exception.InstanceDoesNotExist(alias, label='node')
InstanceDoesNotExist: node 'node001' does not exist
>>> Sleeping...(looping again in 60 secs)
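
The "waiting for 923 sec, longer than max 900" line in the log above can be checked by hand from the timestamps it prints. The following is only a minimal sketch of that arithmetic, not StarCluster's actual balancer code; the function name `should_add_node` and the constant `MAX_WAIT_SECS` are made up for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the balancer log above (fractional seconds dropped).
oldest_queued = datetime.strptime("2013-07-02 13:33:29", FMT)
decision_time = datetime.strptime("2013-07-02 13:48:52", FMT)

# Threshold taken from the log line "longer than max 900".
MAX_WAIT_SECS = 900


def should_add_node(oldest, now, max_wait=MAX_WAIT_SECS):
    """Hypothetical helper: return (seconds waited, whether the limit is exceeded)."""
    waited = int((now - oldest).total_seconds())
    return waited, waited > max_wait


waited, add = should_add_node(oldest_queued, decision_time)
print(waited, add)  # 923 True
```

So the decision to add a node matches the queue state that was logged; the failure happens later, in the add_nodes call shown in the traceback.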


It looks like the node never came up:

[ec2-user@ip-10-28-206-211 ~]$ starcluster listclusters
StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster_at_mit.edu

-------------------------------------------
ngscluster (security group: @sc-ngscluster)
-------------------------------------------
Launch time: 2013-07-02 13:05:52
Uptime: 0 days, 00:45:07
Zone: us-east-1a
Keypair: aws_starcluster_keypair
Spot requests: 1 open
Cluster nodes:
     master running i-65a6f305 ec2-50-19-10-231.compute-1.amazonaws.com
Total nodes: 1


I thought this might be a spot pricing problem, but my max price is
higher than the average price in the spot history. Now when I try to rerun
loadbalance, I get this error:

[ec2-user@ip-10-28-206-211 ~]$ starcluster loadbalance -m 20 ngscluster
StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster_at_mit.edu

!!! ERROR - cluster ngscluster is not running

However, listclusters says it's running (and, surprisingly, node001 is there
too):

[ec2-user@ip-10-28-206-211 ~]$ starcluster listclusters
StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster_at_mit.edu

-------------------------------------------
ngscluster (security group: @sc-ngscluster)
-------------------------------------------
Launch time: 2013-07-02 13:05:52
Uptime: 0 days, 00:50:33
Zone: us-east-1a
Keypair: aws_starcluster_keypair
EBS volumes:
    vol-b46254c9 on master:/dev/sdz (status: attached)
Spot requests: 1 active
Cluster nodes:
     master running i-65a6f305 ec2-50-19-10-231.compute-1.amazonaws.com
    node001 running i-c5886faa ec2-107-21-176-10.compute-1.amazonaws.com (spot sir-0c581634)
Total nodes: 2

qhost on the cluster doesn't see node001, so I tried to remove the node
with removenode:

[ec2-user@ip-10-28-206-211 ~]$ starcluster removenode ngscluster node001
StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster_at_mit.edu

>>> Running plugin setupuserenv.SetupUserEnvironment
>>> Running plugin starcluster.plugins.users.CreateUsers
>>> Running plugin starcluster.plugins.sge.SGEPlugin
>>> Removing node001 from SGE
!!! ERROR - Error occured while running plugin
'starcluster.plugins.sge.SGEPlugin':
!!! ERROR - remote command 'source /etc/profile && qconf -dconf node001'
!!! ERROR - failed with status 1:
!!! ERROR - can't resolve hostname "node001"
!!! ERROR - can't delete configuration "node001" from list:
!!! ERROR - configuration does not exist
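
The mismatch described above (listclusters shows node001, but SGE does not know it) can be spotted mechanically by comparing the two host lists. This is a rough sketch only: it parses pasted listclusters text rather than calling any StarCluster API, the helper names `cluster_aliases` and `unregistered` are invented here, and the `sge_hosts` list is an assumption standing in for whatever qhost actually printed.

```python
import re


def cluster_aliases(listclusters_text):
    """Collect node aliases from the 'Cluster nodes:' section of pasted output."""
    aliases = []
    in_nodes = False
    for line in listclusters_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Cluster nodes:"):
            in_nodes = True
            continue
        if stripped.startswith("Total nodes:"):
            in_nodes = False
            continue
        if in_nodes:
            # Expected shape: <alias> <state> <instance-id> ...
            m = re.match(r"(\S+)\s+(\S+)\s+(i-[0-9a-f]+)", stripped)
            if m:
                aliases.append(m.group(1))
    return aliases


def unregistered(aliases, sge_hosts):
    """Return aliases that EC2 reports but the scheduler does not list."""
    return sorted(set(aliases) - set(sge_hosts))


# Node section copied from the second listclusters output in this message.
sample = """Cluster nodes:
     master running i-65a6f305 ec2-50-19-10-231.compute-1.amazonaws.com
    node001 running i-c5886faa ec2-107-21-176-10.compute-1.amazonaws.com (spot sir-0c581634)
Total nodes: 2
"""

sge_hosts = ["master"]  # assumption: the only host qhost listed
missing = unregistered(cluster_aliases(sample), sge_hosts)
print(missing)  # ['node001']
```

Any alias in `missing` is an instance that is running in EC2 but was never added to SGE, which is consistent with qconf reporting that the configuration for node001 does not exist.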


How do I get starcluster back in a working state? I *just* started this
cluster...
Received on Tue Jul 02 2013 - 09:59:23 EDT