StarCluster - Mailing List Archive

Update on addnode failure

From: Don MacMillen <no email>
Date: Sun, 29 May 2011 16:02:10 -0700

My last post was from a new email address (my company is changing names)
and hasn't made it to the list yet. I have included it below, along with an
update on the intermittent failure of add_nodes.

In the method 'add_nodes' of the class 'Cluster', the call to the method
get_node_by_alias() at line 801 can sometimes return None for the node,
as a conditional breakpoint in pdb (b cluster:802, node==None) will
show. Calling self.get_node_by_alias(alias) again returns the valid
node, so we have a timing problem.
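
One possible stopgap for the timing problem described above is to retry
the lookup a few times before giving up. The sketch below is not
StarCluster code: fetch_with_retry and flaky_lookup are hypothetical
names, and the callable passed in stands in for something like
lambda: self.get_node_by_alias(alias). It only hides the race; it does
not explain why the first lookup returns None.

```python
import time


def fetch_with_retry(fetch, attempts=5, delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns something other than None.

    fetch is a hypothetical zero-argument callable, for example
    lambda: cluster.get_node_by_alias(alias).
    """
    for attempt in range(attempts):
        result = fetch()
        if result is not None:
            return result
        # Pause between attempts, but not after the last one.
        if attempt < attempts - 1:
            sleep(delay)
    raise RuntimeError("no result after %d attempts" % attempts)


# Stand-in for a lookup that returns None on the first call only.
calls = []


def flaky_lookup():
    calls.append(1)
    return None if len(calls) == 1 else "node004"


node = fetch_with_retry(flaky_lookup, attempts=3, delay=0.0)
```

Raising after the last attempt keeps a persistent failure visible rather
than passing None further down to the hostname setup.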

My guess (and this is only a guess) is that there is a problem with
the logic in 'wait_for_cluster': one of the waits in that method may not
be waiting for the entire new size of the cluster, so this timing problem
would only manifest during an add_node after the initial cluster
spin-up. In any event, I will let the experts take it from here.

BTW, I do not have a good handle on the failure rates, but they look to
be below 10%.


Regards,

Don



Hi,

Two issues here, as reported earlier. First, running with the new logging
turned on, I see an intermittent failure of 'starcluster addnode
<clustername>'. The error trace from the log file is below.

Second, ELB adds too many nodes when adding more than one node per
iteration. The code in StarCluster/starcluster/plugins/sge/__init__.py
at line 637 reads:

        if need_to_add > 0:
            need_to_add = min(self.add_nodes_per_iteration, need_to_add)

The fix could be as simple as:

        if need_to_add > 0:
            head_room = self.max_nodes - self.stat.hosts
            need_to_add = min(self.add_nodes_per_iteration, need_to_add,
                              head_room)

depending upon what you know about self.max_nodes and self.stat.hosts.
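
The head-room clamp can be checked in isolation with a small standalone
function. This is only a sketch: nodes_to_add is a hypothetical name, its
parameters stand in for self.add_nodes_per_iteration, self.max_nodes and
self.stat.hosts, and it assumes the host count includes every node that
counts against the maximum. The extra max(0, ...) guard is an addition
beyond the fix proposed above, for the case where the cluster is already
over the limit.

```python
def nodes_to_add(need_to_add, add_nodes_per_iteration, max_nodes,
                 current_hosts):
    """Return how many nodes to request in this iteration."""
    if need_to_add <= 0:
        return 0
    # Remaining capacity before the cluster reaches max_nodes.
    head_room = max_nodes - current_hosts
    # Never request more than the per-iteration limit, the demand,
    # or the remaining capacity; never return a negative count.
    return max(0, min(add_nodes_per_iteration, need_to_add, head_room))
```

For example, with a limit of 10 nodes, 9 hosts already running, and 3
nodes allowed per iteration, a demand for 5 more nodes yields a request
for only 1.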

Regards,

Don

PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]
PID: 12630 cluster.py:670 - DEBUG - existing nodes: {u'i-13d0707d': <Node: master (i-13d0707d)>, u'i-11d0707f': <Node: node001 (i-11d0707f)>, u'i-efd07081': <Node: node002 (i-efd07081)>, u'i-edd07083': <Node: node003 (i-edd07083)>}
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-13d0707d in self._nodes
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-11d0707f in self._nodes
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-efd07081 in self._nodes
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-edd07083 in self._nodes
PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]
PID: 12630 clustersetup.py:96 - INFO - Configuring hostnames...
PID: 12630 cli.py:182 - DEBUG - Traceback (most recent call last):
  File "build/bdist.linux-i686/egg/starcluster/cli.py", line 160, in main
    sc.execute(args)
  File "build/bdist.linux-i686/egg/starcluster/commands/addnode.py", line 37, in execute
    self.cm.add_node(tag, aliases)
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 119, in add_node
    cl.add_node(alias)
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 770, in add_node
    self.add_nodes(1, aliases=aliases)
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 805, in add_nodes
    self.volumes)
  File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 510, in on_add_node
    self._setup_hostnames(nodes=[node])
  File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 98, in _setup_hostnames
    self.pool.simple_job(node.set_hostname, (), jobid=node.alias)
AttributeError: 'NoneType' object has no attribute 'set_hostname'

PID: 12630 cli.py:129 - ERROR - Oops! Looks like you've found a bug in StarCluster
PID: 12630 cli.py:130 - ERROR - Debug file written to: /tmp/starcluster-debug-staruser.log
PID: 12630 cli.py:131 - ERROR - Look for lines starting with PID: 12630
PID: 12630 cli.py:132 - ERROR - Please submit this file, minus any private information,
PID: 12630 cli.py:133 - ERROR - to starcluster_at_mit.edu
PID: 12630 ssh.py:536 - DEBUG - __del__ called
PID: 12630 ssh.py:536 - DEBUG - __del__ called
Received on Sun May 29 2011 - 19:02:11 EDT
This archive was generated by hypermail 2.3.0.
