Update on addnode failure
My last post was sent from a new email address (my company is changing
names) and hasn't made it to the list yet. I have included it below,
along with an update on the intermittent failure of add_nodes.
In the 'add_nodes' method of the 'Cluster' class, the call to
get_node_by_alias() at line 801 can sometimes return None for the node,
as a conditional breakpoint in pdb (b cluster:802, node==None) will
show. Calling self.get_node_by_alias(alias) again returns the valid
node, so we have a timing problem.
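To illustrate the symptom (a second lookup succeeds where the first returned None), here is a minimal sketch of a retry wrapper. It is not StarCluster code: lookup_with_retry, its parameters, and the RuntimeError are hypothetical names of mine, and `lookup` merely stands in for a bound method such as cluster.get_node_by_alias, assumed to return None while the node is not yet visible. A retry would only mask the race, not fix its cause.

```python
import time


def lookup_with_retry(lookup, alias, attempts=5, delay=2.0, sleep=time.sleep):
    """Call lookup(alias) until it returns something other than None.

    `lookup` is a stand-in for a method like cluster.get_node_by_alias
    (assumption: it returns None when the node is not visible yet).
    `sleep` is injectable so the helper can be exercised without waiting.
    """
    for attempt in range(attempts):
        node = lookup(alias)
        if node is not None:
            return node
        # Pause between attempts, but not after the final one.
        if attempt < attempts - 1:
            sleep(delay)
    raise RuntimeError(
        "node %r still not visible after %d attempts" % (alias, attempts))
```

In add_nodes this would wrap the existing lookup, for example lookup_with_retry(self.get_node_by_alias, alias), so that on_add_node never receives None.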
My guess (and it is only a guess) is that the problem lies in the logic
of 'wait_for_cluster': one of the waits in that method may not be
waiting correctly for the entire new size of the cluster, so this timing
problem would only show up during an add_node after an initial cluster
spin-up. In any event, I will let the experts take it from here.
BTW, I do not have a good handle on the failure rate, but it looks to
be below 10%.
Regards,
Don
Hi,
Two issues here, as reported earlier. First, running with the new
logging turned on, I see an intermittent failure of
'starcluster addnode <clustername>'. The error trace from the log file
is below.
Second, ELB adds too many nodes when it is adding more than one node
per iteration. The code in
StarCluster/starcluster/plugins/sge/__init__.py at line 637 reads:
    if need_to_add > 0:
        need_to_add = min(self.add_nodes_per_iteration, need_to_add)
The fix could be as simple as:
    if need_to_add > 0:
        head_room = self.max_nodes - self.stat.hosts
        need_to_add = min(self.add_nodes_per_iteration, need_to_add,
                          head_room)
depending upon what you know about self.max_nodes and self.stat.hosts.
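As a standalone sketch of the proposed clamp, the function below reproduces the same min() logic outside the balancer class so it can be checked by hand. The function name and its parameters are hypothetical. It assumes the current host count is an integer; if self.stat.hosts turns out to be a collection, len() of it would be needed instead. The max(..., 0) guard is my addition for the case where the cluster is already over its limit.

```python
def nodes_to_add(need_to_add, add_nodes_per_iteration, max_nodes,
                 current_hosts):
    """Return how many nodes to request in this iteration.

    Hypothetical helper mirroring the proposed fix: the request is capped
    by the per-iteration limit and by the remaining head room under
    max_nodes. current_hosts is assumed to be an integer count.
    """
    if need_to_add <= 0:
        return 0
    # Never report negative head room if the cluster is already over max.
    head_room = max(max_nodes - current_hosts, 0)
    return min(add_nodes_per_iteration, need_to_add, head_room)
```

For example, with 9 hosts running, a limit of 10, and 3 nodes allowed per iteration, a demand for 5 more nodes is clamped to 1 rather than 3.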
Regards,
Don
PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]
PID: 12630 cluster.py:670 - DEBUG - existing nodes: {u'i-13d0707d': <Node: master (i-13d0707d)>, u'i-11d0707f': <Node: node001 (i-11d0707f)>, u'i-efd07081': <Node: node002 (i-efd07081)>, u'i-edd07083': <Node: node003 (i-edd07083)>}
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-13d0707d in self._nodes
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-11d0707f in self._nodes
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-efd07081 in self._nodes
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-edd07083 in self._nodes
PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]
PID: 12630 clustersetup.py:96 - INFO - Configuring hostnames...
PID: 12630 cli.py:182 - DEBUG - Traceback (most recent call last):
  File "build/bdist.linux-i686/egg/starcluster/cli.py", line 160, in main
    sc.execute(args)
  File "build/bdist.linux-i686/egg/starcluster/commands/addnode.py", line 37, in execute
    self.cm.add_node(tag, aliases)
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 119, in add_node
    cl.add_node(alias)
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 770, in add_node
    self.add_nodes(1, aliases=aliases)
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 805, in add_nodes
    self.volumes)
  File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 510, in on_add_node
    self._setup_hostnames(nodes=[node])
  File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 98, in _setup_hostnames
    self.pool.simple_job(node.set_hostname, (), jobid=node.alias)
AttributeError: 'NoneType' object has no attribute 'set_hostname'
PID: 12630 cli.py:129 - ERROR - Oops! Looks like you've found a bug in StarCluster
PID: 12630 cli.py:130 - ERROR - Debug file written to: /tmp/starcluster-debug-staruser.log
PID: 12630 cli.py:131 - ERROR - Look for lines starting with PID: 12630
PID: 12630 cli.py:132 - ERROR - Please submit this file, minus any private information,
PID: 12630 cli.py:133 - ERROR - to starcluster_at_mit.edu
PID: 12630 ssh.py:536 - DEBUG - __del__ called
PID: 12630 ssh.py:536 - DEBUG - __del__ called
Received on Sun May 29 2011 - 19:02:11 EDT