StarCluster - Mailing List Archive

Re: Orphaned nodes (addnode failure) and ELB going over max cluster size when adding more than one node

From: Rajat Banerjee <no email>
Date: Tue, 31 May 2011 14:46:54 -0400

Hi Don,
Thanks for your suggestions. I will test out the earlier suggestion as
soon as I can and issue a pull request so that Justin can put it into
the master branch.

Regarding the latter suggestion, I think the add_node code needs to be
more robust to detect and correct timing errors. Will work on it...

Raj
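
[Note] A minimal sketch of the kind of robustness discussed above, assuming the AttributeError in the quoted traceback comes from a node lookup that returns None because the new instance is not yet visible. That cause is not confirmed in this thread. The names wait_for_node and lookup are hypothetical and are not part of StarCluster; the idea is only to poll a lookup a few times and fail with a clear error instead of passing None along.

```python
import time


def wait_for_node(lookup, alias, attempts=5, delay=2.0, sleep=time.sleep):
    """Poll lookup(alias) until it returns a node object.

    `lookup` is a hypothetical callable that returns the node for an alias,
    or None if the node is not visible yet. Raises RuntimeError rather than
    returning None, so callers never dereference a missing node.
    """
    for attempt in range(attempts):
        node = lookup(alias)
        if node is not None:
            return node
        # Only sleep if another attempt will follow.
        if attempt < attempts - 1:
            sleep(delay)
    raise RuntimeError("node %r not visible after %d attempts"
                       % (alias, attempts))
```

The sleep function is a parameter only so the helper can be exercised without real delays.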

On Sun, May 29, 2011 at 4:19 PM, Don MacMillen <macd_at_nimbic.com> wrote:
> Hi,
>
> Two issues here, as reported earlier.  First, running with the new logging
> turned on, I see an intermittent failure of
> 'starcluster addnode <clustername>'.  The error trace from the log file is
> below.
>
> Second, ELB adds too many nodes when adding more than one node per
> iteration.  The code in StarCluster/starcluster/plugins/sge/__init__.py
> at line 637 reads:
>
>         if need_to_add > 0:
>             need_to_add = min(self.add_nodes_per_iteration, need_to_add)
>
> The fix could be as simple as:
>
>         if need_to_add > 0:
>             head_room = self.max_nodes - self.stat.hosts
>             need_to_add = min(self.add_nodes_per_iteration, need_to_add,
>                               head_room)
>
> depending upon what you know about self.max_nodes and self.stat.hosts.
>
> Regards,
>
> Don
>
> PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]
> PID: 12630 cluster.py:670 - DEBUG - existing nodes: {u'i-13d0707d': <Node: master (i-13d0707d)>, u'i-11d0707f': <Node: node001 (i-11d0707f)>, u'i-efd07081': <Node: node002 (i-efd07081)>, u'i-edd07083': <Node: node003 (i-edd07083)>}
> PID: 12630 cluster.py:673 - DEBUG - updating existing node i-13d0707d in self._nodes
> PID: 12630 cluster.py:673 - DEBUG - updating existing node i-11d0707f in self._nodes
> PID: 12630 cluster.py:673 - DEBUG - updating existing node i-efd07081 in self._nodes
> PID: 12630 cluster.py:673 - DEBUG - updating existing node i-edd07083 in self._nodes
> PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]
> PID: 12630 clustersetup.py:96 - INFO - Configuring hostnames...
> PID: 12630 cli.py:182 - DEBUG - Traceback (most recent call last):
>   File "build/bdist.linux-i686/egg/starcluster/cli.py", line 160, in main
>     sc.execute(args)
>   File "build/bdist.linux-i686/egg/starcluster/commands/addnode.py", line 37, in execute
>     self.cm.add_node(tag, aliases)
>   File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 119, in add_node
>     cl.add_node(alias)
>   File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 770, in add_node
>     self.add_nodes(1, aliases=aliases)
>   File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 805, in add_nodes
>     self.volumes)
>   File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 510, in on_add_node
>     self._setup_hostnames(nodes=[node])
>   File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 98, in _setup_hostnames
>     self.pool.simple_job(node.set_hostname, (), jobid=node.alias)
> AttributeError: 'NoneType' object has no attribute 'set_hostname'
>
> PID: 12630 cli.py:129 - ERROR - Oops! Looks like you've found a bug in StarCluster
> PID: 12630 cli.py:130 - ERROR - Debug file written to: /tmp/starcluster-debug-staruser.log
> PID: 12630 cli.py:131 - ERROR - Look for lines starting with PID: 12630
> PID: 12630 cli.py:132 - ERROR - Please submit this file, minus any private information,
> PID: 12630 cli.py:133 - ERROR - to starcluster_at_mit.edu
> PID: 12630 ssh.py:536 - DEBUG - __del__ called
> PID: 12630 ssh.py:536 - DEBUG - __del__ called
>
>
>
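
[Note] The head-room fix proposed in the quoted message can be sketched as a standalone function. This is an illustration under stated assumptions, not the StarCluster code: the function name and its plain integer arguments are hypothetical stand-ins for self.add_nodes_per_iteration, self.max_nodes and self.stat.hosts, and it assumes the host count is measured the same way as max_nodes (for example, whether the master is counted), which the thread leaves open.

```python
def clamp_nodes_to_add(need_to_add, add_nodes_per_iteration,
                       max_nodes, current_hosts):
    """Return how many nodes to request this iteration.

    The result never exceeds the per-iteration limit, the number of nodes
    actually needed, or the remaining head room below max_nodes.
    """
    if need_to_add <= 0:
        return 0
    # Guard against negative head room if the cluster is already
    # at or above max_nodes (an addition to the quoted suggestion).
    head_room = max(0, max_nodes - current_hosts)
    return min(add_nodes_per_iteration, need_to_add, head_room)
```

For example, with 9 hosts running, a maximum of 10, and 3 nodes allowed per iteration, a request for 5 more nodes is clamped to 1.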
Received on Tue May 31 2011 - 14:47:15 EDT