StarCluster - Mailing List Archive

Re: Loadbalancer error - node does not exist

From: Amanda Joy Kedaigle <no email>
Date: Sat, 4 Oct 2014 18:07:46 +0000

Update: It seems the failure may start whenever the cluster reaches its maximum capacity of 16 nodes. Any ideas about what to look for would be appreciated, since this is getting expensive.

Amanda

________________________________
From: Amanda Joy Kedaigle
Sent: Thursday, October 02, 2014 12:36 PM
To: starcluster_at_mit.edu
Subject: Loadbalancer error - node does not exist

Hi, I'm running the Elastic LoadBalancer to keep our cluster down to one node when we're not using it and to ramp up as needed. Generally (i.e. when I run tests and watch it), it works just fine. But twice now it has failed to remove nodes overnight and given the following error, leaving the cluster at full blast with no jobs to run. It says the nodes don't exist, but they show up both on the AWS EC2 console and when I run qhost on the cluster. Any ideas as to the cause? Thanks!


>>> Removing node013 from SGE
!!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':
!!! ERROR - Failed to remove node node013
Traceback (most recent call last):
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 754, in _eval_remove_node
    self._cluster.remove_node(node)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1050, in remove_node
    force=force)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1076, in remove_nodes
    reverse=True)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1690, in run_plugins
    self.run_plugin(plug, method_name=method_name, node=node)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1715, in run_plugin
    func(*args)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/plugins/sge.py", line 204, in on_remove_node
    self._remove_from_sge(node)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/plugins/sge.py", line 166, in _remove_from_sge
    master.ssh.execute('qconf -de %s' % node.alias)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/sshutils.py", line 579, in execute
    msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && qconf -de node013' failed with status 1:
denied: execution host "node013" does not exist
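
The traceback ends with `qconf -de node013` exiting with status 1 because SGE reports no execution host by that name. As a minimal sketch of a guard against that case (this is not StarCluster's own code), the check below looks the alias up in the output of `qconf -sel`, which lists the execution hosts SGE knows about, and only builds a `qconf -de` command when a match is found. The function names and the short-name matching rule are assumptions made for illustration; whether a hostname mismatch is the actual cause here is not established by the thread.

```python
def find_exec_host(alias, qconf_sel_output):
    """Return the host name registered with SGE that matches `alias`, or None.

    `qconf_sel_output` is the text printed by `qconf -sel` (one execution
    host per line). Matching on the part before the first dot is an
    assumption, in case SGE registered a fully qualified name.
    """
    for line in qconf_sel_output.splitlines():
        host = line.strip()
        if not host:
            continue
        if host == alias or host.split(".")[0] == alias:
            return host
    return None


def removal_command(alias, qconf_sel_output):
    """Build the `qconf -de` command for `alias`, or None if SGE has no such host."""
    host = find_exec_host(alias, qconf_sel_output)
    if host is None:
        # Nothing to delete: skip instead of letting `qconf -de` fail.
        return None
    return "qconf -de %s" % host
```

Running `qconf -sel` on the master and passing its output to `removal_command("node013", ...)` would show whether SGE still lists the node under that exact name before a removal is attempted.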
Received on Sat Oct 04 2014 - 14:07:49 EDT