Orphaned nodes (addnode failure) and ELB going over max cluster size when adding more than one node
Hi,
Two issues here, as reported earlier. First, running with the new logging
turned on, I see an intermittent failure of
'starcluster addnode <clustername>'. The error trace from the log file is
below.
Second, ELB adds too many nodes when adding more than one node per
iteration. The code at StarCluster/starcluster/plugins/sge/__init__.py,
line 637, reads:
    if need_to_add > 0:
        need_to_add = min(self.add_nodes_per_iteration, need_to_add)
The fix could be as simple as:
    if need_to_add > 0:
        head_room = self.max_nodes - self.stat.hosts
        need_to_add = min(self.add_nodes_per_iteration, need_to_add,
                          head_room)
depending upon what you know about self.max_nodes and self.stat.hosts.
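To make the intent concrete, here is a minimal standalone sketch of the
clamp. The function name and its parameters are hypothetical stand-ins, not
StarCluster code; current_hosts is assumed to be an integer count, so if
self.stat.hosts turns out to be a list, len(self.stat.hosts) would be
needed instead. The sketch also floors the result at zero in case the
cluster is already at or over the maximum.

```python
def nodes_to_add(need_to_add, add_nodes_per_iteration, max_nodes,
                 current_hosts):
    """Return how many nodes to request in this iteration.

    Hypothetical helper: all four arguments are plain integers standing in
    for the values the load balancer would hold on self.
    """
    if need_to_add <= 0:
        return 0
    # Room left before the cluster reaches its configured maximum;
    # never negative, even if the cluster is already over the limit.
    head_room = max(max_nodes - current_hosts, 0)
    # Respect the per-iteration limit, the actual need, and the head room.
    return min(add_nodes_per_iteration, need_to_add, head_room)
```

For example, with a per-iteration limit of 3, a need of 5, a maximum of 10
and 9 hosts already running, this requests only 1 node rather than 3.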
Regards,
Don
PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]
PID: 12630 cluster.py:670 - DEBUG - existing nodes: {u'i-13d0707d': <Node: master (i-13d0707d)>, u'i-11d0707f': <Node: node001 (i-11d0707f)>, u'i-efd07081': <Node: node002 (i-efd07081)>, u'i-edd07083': <Node: node003 (i-edd07083)>}
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-13d0707d in self._nodes
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-11d0707f in self._nodes
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-efd07081 in self._nodes
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-edd07083 in self._nodes
PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]
PID: 12630 clustersetup.py:96 - INFO - Configuring hostnames...
PID: 12630 cli.py:182 - DEBUG - Traceback (most recent call last):
  File "build/bdist.linux-i686/egg/starcluster/cli.py", line 160, in main
    sc.execute(args)
  File "build/bdist.linux-i686/egg/starcluster/commands/addnode.py", line 37, in execute
    self.cm.add_node(tag, aliases)
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 119, in add_node
    cl.add_node(alias)
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 770, in add_node
    self.add_nodes(1, aliases=aliases)
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 805, in add_nodes
    self.volumes)
  File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 510, in on_add_node
    self._setup_hostnames(nodes=[node])
  File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 98, in _setup_hostnames
    self.pool.simple_job(node.set_hostname, (), jobid=node.alias)
AttributeError: 'NoneType' object has no attribute 'set_hostname'
PID: 12630 cli.py:129 - ERROR - Oops! Looks like you've found a bug in StarCluster
PID: 12630 cli.py:130 - ERROR - Debug file written to: /tmp/starcluster-debug-staruser.log
PID: 12630 cli.py:131 - ERROR - Look for lines starting with PID: 12630
PID: 12630 cli.py:132 - ERROR - Please submit this file, minus any private information,
PID: 12630 cli.py:133 - ERROR - to starcluster_at_mit.edu
PID: 12630 ssh.py:536 - DEBUG - __del__ called
PID: 12630 ssh.py:536 - DEBUG - __del__ called
Received on Sun May 29 2011 - 16:19:11 EDT