starcluster: "Failed to add new host" but loadbalance continues adding hosts
Hi all - I'm running starcluster loadbalance and noticed that when it runs
into a problem adding a node, it ignores the problem and continues adding
nodes, which also fail; i.e., each node gets started but is never added to
the SGE grid. Perhaps a check that the previous add succeeded would be in
order before adding more nodes:
/opt/sge6
>>> Mounting all NFS export path(s) on 1 worker node(s)
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Setting up NFS took 0.018 mins
!!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':
!!! ERROR - Failed to add new host
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/balancers/sge/__init__.py", line 685, in _eval_add_node
    self._cluster.add_nodes(need_to_add)
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/cluster.py", line 892, in add_nodes
    self.run_plugins(method_name="on_add_node", node=node)
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/cluster.py", line 1527, in run_plugins
    self.run_plugin(plug, method_name=method_name, node=node)
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/cluster.py", line 1552, in run_plugin
    func(*args)
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/plugins/sge.py", line 145, in on_add_node
    self._add_to_sge(node)
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/plugins/sge.py", line 30, in _add_to_sge
    self._inst_sge(node, exec_host=True)
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/plugins/sge.py", line 69, in _inst_sge
    node.ssh.execute(inst_sge, silent=True, only_printable=True)
  File "/usr/lib/python2.6/site-packages/StarCluster-0.9999-py2.6.egg/starcluster/sshutils/__init__.py", line 538, in execute
    msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && cd /opt/sge6 && TERM=rxvt ./inst_sge -x -noremote -auto ./ec2_sge.conf' failed with status 1:
Reading configuration from file ./ec2_sge.conf
>>> Sleeping...(looping again in 60 secs)
Execution hosts: 20
Queued jobs: 37
Oldest queued job: 2013-07-04 13:46:50
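
To illustrate the kind of check suggested above, here is a minimal sketch. It is not StarCluster's actual code: AddNodeError, add_nodes_until_failure, and the add_node callable are hypothetical names standing in for whatever launches one node and registers it with SGE. The idea is only that the loop stops at the first failed add instead of launching more nodes.

```python
class AddNodeError(Exception):
    """Hypothetical error raised when a node cannot be added to the grid."""


def add_nodes_until_failure(add_node, count):
    """Add up to `count` nodes one at a time, stopping at the first failure.

    `add_node` is a hypothetical zero-argument callable that launches one
    node and registers it with the scheduler, raising AddNodeError if the
    registration fails.

    Returns a tuple (number_added, error), where error is None on success.
    """
    added = 0
    for _ in range(count):
        try:
            add_node()
        except AddNodeError as exc:
            # Stop here: do not launch more nodes while adds are failing.
            return added, exc
        added += 1
    return added, None
```

A caller could then log the returned error and skip further additions until the underlying problem is fixed, rather than retrying on every 60-second loop.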
Received on Thu Jul 18 2013 - 10:13:23 EDT