Re: Orphaned nodes (addnode failure) and ELB going over max cluster size when adding more than one node
Hi Don,
Thanks for your suggestions. I will test out the earlier suggestion as
soon as I can and issue a pull request so that Justin can put it into
the master branch.
Regarding the latter suggestion, I think the add_node code needs to be
robust enough to detect and correct timing errors. Will work on it...
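
One possible shape for that robustness, as a rough sketch only: poll for the
new node a few times before using it, assuming the failure in the trace comes
from a lookup that returns None before the instance is visible. The names
wait_for_node and lookup are hypothetical and are not part of StarCluster.

```python
import time


def wait_for_node(lookup, alias, attempts=5, delay=2.0, sleep=time.sleep):
    """Poll lookup(alias) until it returns a node or the attempts run out.

    `lookup` is a hypothetical callable standing in for whatever resolves
    an alias to a node object; it is assumed to return None while the
    node is not yet visible.
    """
    for attempt in range(attempts):
        node = lookup(alias)
        if node is not None:
            return node
        # Only sleep if another attempt is still to come.
        if attempt < attempts - 1:
            sleep(delay)
    raise RuntimeError(
        "node %r did not appear after %d attempts" % (alias, attempts))
```

With something like this in place, the caller would get a clear error instead
of an AttributeError on None when the node never shows up.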
Raj
On Sun, May 29, 2011 at 4:19 PM, Don MacMillen <macd_at_nimbic.com> wrote:
> Hi,
>
> Two issues here, as reported earlier. On the first one, running with the
> new logging turned on, I see an intermittent failure of
> 'starcluster addnode <clustername>'.
> The error trace from the log file is below.
>
> Second, the ELB adds too many nodes when adding more than one node
> per iteration. The code at StarCluster/starcluster/plugins/sge/__init__.py
> at line 637 reads:
>
>     if need_to_add > 0:
>         need_to_add = min(self.add_nodes_per_iteration, need_to_add)
>
> The fix could be as simple as:
>
>     if need_to_add > 0:
>         head_room = self.max_nodes - self.stat.hosts
>         need_to_add = min(self.add_nodes_per_iteration, need_to_add,
>                           head_room)
>
> depending upon what you know about self.max_nodes and self.stat.hosts.
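
The clamp suggested above can be sketched as a standalone function. This is
only an illustration: the parameter names are stand-ins, and it assumes the
current host count is available as an integer (if self.stat.hosts turns out
to be a list, its length would be needed instead). It also floors the head
room at zero, which the snippet above does not do, so a cluster already over
its maximum never yields a negative count.

```python
def nodes_to_add(need_to_add, add_nodes_per_iteration, max_nodes,
                 current_hosts):
    """Return how many nodes to add this iteration without exceeding max_nodes.

    All arguments are plain integers; the names are illustrative and do
    not mirror the actual attributes of the balancer class.
    """
    if need_to_add <= 0:
        return 0
    # Remaining capacity before the cluster hits its configured maximum.
    head_room = max(0, max_nodes - current_hosts)
    return min(add_nodes_per_iteration, need_to_add, head_room)
```

For example, with a maximum of 10 nodes, 9 hosts already up, and 3 nodes
allowed per iteration, a request for 5 more nodes would be cut down to 1.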
>
> Regards,
>
> Don
>
> PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]
> PID: 12630 cluster.py:670 - DEBUG - existing nodes: {u'i-13d0707d': <Node: master (i-13d0707d)>, u'i-11d0707f': <Node: node001 (i-11d0707f)>, u'i-efd07081': <Node: node002 (i-efd07081)>, u'i-edd07083': <Node: node003 (i-edd07083)>}
> PID: 12630 cluster.py:673 - DEBUG - updating existing node i-13d0707d in self._nodes
> PID: 12630 cluster.py:673 - DEBUG - updating existing node i-11d0707f in self._nodes
> PID: 12630 cluster.py:673 - DEBUG - updating existing node i-efd07081 in self._nodes
> PID: 12630 cluster.py:673 - DEBUG - updating existing node i-edd07083 in self._nodes
> PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]
> PID: 12630 clustersetup.py:96 - INFO - Configuring hostnames...
> PID: 12630 cli.py:182 - DEBUG - Traceback (most recent call last):
>   File "build/bdist.linux-i686/egg/starcluster/cli.py", line 160, in main
>     sc.execute(args)
>   File "build/bdist.linux-i686/egg/starcluster/commands/addnode.py", line 37, in execute
>     self.cm.add_node(tag, aliases)
>   File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 119, in add_node
>     cl.add_node(alias)
>   File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 770, in add_node
>     self.add_nodes(1, aliases=aliases)
>   File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 805, in add_nodes
>     self.volumes)
>   File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 510, in on_add_node
>     self._setup_hostnames(nodes=[node])
>   File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 98, in _setup_hostnames
>     self.pool.simple_job(node.set_hostname, (), jobid=node.alias)
> AttributeError: 'NoneType' object has no attribute 'set_hostname'
>
> PID: 12630 cli.py:129 - ERROR - Oops! Looks like you've found a bug in StarCluster
> PID: 12630 cli.py:130 - ERROR - Debug file written to: /tmp/starcluster-debug-staruser.log
> PID: 12630 cli.py:131 - ERROR - Look for lines starting with PID: 12630
> PID: 12630 cli.py:132 - ERROR - Please submit this file, minus any private information,
> PID: 12630 cli.py:133 - ERROR - to starcluster_at_mit.edu
> PID: 12630 ssh.py:536 - DEBUG - __del__ called
> PID: 12630 ssh.py:536 - DEBUG - __del__ called
Received on Tue May 31 2011 - 14:47:15 EDT