Re: Loadbalancer error - node does not exist
 
Update: It looks like this may start happening whenever the cluster reaches its maximum capacity of 16 nodes. Any ideas about what to look for would be appreciated; this is getting expensive.
Amanda
________________________________
From: Amanda Joy Kedaigle
Sent: Thursday, October 02, 2014 12:36 PM
To: starcluster_at_mit.edu
Subject: Loadbalancer error - node does not exist
Hi, I'm running the Elastic LoadBalancer to keep our cluster down to one node when we're not using it, and then ramp up as needed. Generally (i.e. when I run tests and watch it), it works just fine. But twice now it has failed to remove nodes overnight with the following error, leaving the cluster running at full size with no jobs to run. The error says the nodes don't exist, but they show up both in the AWS EC2 console and in the output of qhost on the cluster. Any ideas as to the cause? Thanks!
>>> Removing node013 from SGE
!!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':
!!! ERROR - Failed to remove node node013
Traceback (most recent call last):
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 754, in _eval_remove_node
    self._cluster.remove_node(node)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1050, in remove_node
    force=force)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1076, in remove_nodes
    reverse=True)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1690, in run_plugins
    self.run_plugin(plug, method_name=method_name, node=node)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1715, in run_plugin
    func(*args)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/plugins/sge.py", line 204, in on_remove_node
    self._remove_from_sge(node)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/plugins/sge.py", line 166, in _remove_from_sge
    master.ssh.execute('qconf -de %s' % node.alias)
  File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/sshutils.py", line 579, in execute
    msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && qconf -de node013' failed with status 1:
denied: execution host "node013" does not exist
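
One way to narrow this down might be to compare the alias that StarCluster passes to `qconf -de` against the execution hosts SGE itself lists with `qconf -sel`. The following is only a rough diagnostic sketch, not part of StarCluster: the helper names `parse_exec_hosts`, `is_registered`, and `list_exec_hosts` are made up for illustration, and it assumes it runs on the master with `qconf` on the PATH.

```python
import subprocess


def parse_exec_hosts(qconf_sel_output):
    """Return the set of host names printed by `qconf -sel` (one per line)."""
    return {line.strip() for line in qconf_sel_output.splitlines() if line.strip()}


def is_registered(alias, exec_hosts):
    """True if alias matches an execution host by full name or by short name."""
    # Assumption: a host may be listed with a domain suffix, so also
    # compare against the part before the first dot.
    short_names = {host.split(".")[0] for host in exec_hosts}
    return alias in exec_hosts or alias in short_names


def list_exec_hosts():
    """Run `qconf -sel` locally; assumes this is executed on the SGE master."""
    out = subprocess.check_output(["qconf", "-sel"])
    return parse_exec_hosts(out.decode())
```

Calling `is_registered("node013", list_exec_hosts())` on the master just before a removal is attempted would show whether SGE still has an execution host entry under that name at that moment.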
Received on Sat Oct 04 2014 - 14:07:49 EDT