StarCluster - Mailing List Archive

Orphaned nodes (addnode failure) and ELB going over max cluster size when adding more than one node

From: Don MacMillen <no email>
Date: Sun, 29 May 2011 13:19:03 -0700

Hi,

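Two issues here, as reported earlier. On the first one, running with the new
logging turned on, I see an intermittent failure of
'starcluster addnode <clustername>'. It fails with an AttributeError in
_setup_hostnames because the node object is None. The error trace from the
log file is below.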
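
For illustration only, here is a minimal sketch of a guard that would turn
the AttributeError in the trace into a more descriptive failure. The names
setup_hostnames and FakeNode are hypothetical stand-ins, not StarCluster's
actual code, and the sketch does not address why the node lookup returns
None in the first place.

```python
class FakeNode(object):
    """Hypothetical stand-in for a StarCluster node (not the real class)."""

    def __init__(self, alias):
        self.alias = alias
        self.hostname = None

    def set_hostname(self):
        # Assumption: the real method configures the remote host; here we
        # only record the alias so the sketch stays self-contained.
        self.hostname = self.alias


def setup_hostnames(nodes):
    """Set hostnames, failing early if any node lookup returned None."""
    missing = [i for i, node in enumerate(nodes) if node is None]
    if missing:
        raise ValueError(
            "node lookup returned None at positions %s" % missing)
    for node in nodes:
        node.set_hostname()
    return [node.hostname for node in nodes]
```

A check like this would only make the failure easier to diagnose; the new
instance would presumably still be left orphaned unless the caller retries
the lookup or cleans up.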

Second, ELB adds too many nodes, going over the maximum cluster size, when it
adds more than one node per iteration. The code at
StarCluster/starcluster/plugins/sge/__init__.py, line 637, reads:

        if need_to_add > 0:
            need_to_add = min(self.add_nodes_per_iteration, need_to_add)

The fix could be as simple as:

        if need_to_add > 0:
            head_room = self.max_nodes - self.stat.hosts
            need_to_add = min(self.add_nodes_per_iteration, need_to_add,
                              head_room)

depending upon what you know about self.max_nodes and self.stat.hosts.
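
As a standalone sketch of the same clamping idea, assuming that the host
count and max_nodes are measured in the same units (for example, both
including or both excluding the master), the calculation could look like
this. The function name nodes_to_add and its parameters are hypothetical;
they only mirror the attributes used in the snippet above.

```python
def nodes_to_add(need_to_add, add_nodes_per_iteration, max_nodes,
                 current_hosts):
    """Return how many nodes to add without exceeding max_nodes.

    Assumption: current_hosts and max_nodes count the same kind of host.
    """
    # Head room is clamped at zero in case the cluster is already at or
    # above the configured maximum.
    head_room = max(0, max_nodes - current_hosts)
    # Never return a negative count, even if need_to_add is negative.
    return max(0, min(add_nodes_per_iteration, need_to_add, head_room))
```

For example, with a maximum of 10 nodes, 9 hosts already running, and 3
nodes allowed per iteration, a request for 5 more nodes is reduced to 1.
Clamping head room at zero also avoids a negative result if the cluster is
already over the limit.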

Regards,

Don

PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]
PID: 12630 cluster.py:670 - DEBUG - existing nodes: {u'i-13d0707d': <Node: master (i-13d0707d)>, u'i-11d0707f': <Node: node001 (i-11d0707f)>, u'i-efd07081': <Node: node002 (i-efd07081)>, u'i-edd07083': <Node: node003 (i-edd07083)>}
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-13d0707d in self._nodes
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-11d0707f in self._nodes
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-efd07081 in self._nodes
PID: 12630 cluster.py:673 - DEBUG - updating existing node i-edd07083 in self._nodes
PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]
PID: 12630 clustersetup.py:96 - INFO - Configuring hostnames...
PID: 12630 cli.py:182 - DEBUG - Traceback (most recent call last):
  File "build/bdist.linux-i686/egg/starcluster/cli.py", line 160, in main
    sc.execute(args)
  File "build/bdist.linux-i686/egg/starcluster/commands/addnode.py", line 37, in execute
    self.cm.add_node(tag, aliases)
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 119, in add_node
    cl.add_node(alias)
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 770, in add_node
    self.add_nodes(1, aliases=aliases)
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 805, in add_nodes
    self.volumes)
  File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 510, in on_add_node
    self._setup_hostnames(nodes=[node])
  File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 98, in _setup_hostnames
    self.pool.simple_job(node.set_hostname, (), jobid=node.alias)
AttributeError: 'NoneType' object has no attribute 'set_hostname'

PID: 12630 cli.py:129 - ERROR - Oops! Looks like you've found a bug in StarCluster
PID: 12630 cli.py:130 - ERROR - Debug file written to: /tmp/starcluster-debug-staruser.log
PID: 12630 cli.py:131 - ERROR - Look for lines starting with PID: 12630
PID: 12630 cli.py:132 - ERROR - Please submit this file, minus any private information,
PID: 12630 cli.py:133 - ERROR - to starcluster_at_mit.edu
PID: 12630 ssh.py:536 - DEBUG - __del__ called
PID: 12630 ssh.py:536 - DEBUG - __del__ called
Received on Sun May 29 2011 - 16:19:11 EDT
This archive was generated by hypermail 2.3.0.
