StarCluster - Mailing List Archive

Re: StarCluster Digest, Vol 22, Issue 1

From: Don MacMillen <no email>
Date: Thu, 2 Jun 2011 06:21:08 -0700

Thanks Raj. Some additional information: we are using
slightly modified EBS-backed StarCluster images. (The only modification
was adding our executables.) The EBS-backed images probably have
different timing characteristics than the instance-store ones.

Another quick question: Does 'starcluster terminate <clustername>'
call the 'on_remove_node' method of the plugin? It looks like
it does not but apologies if this is documented already. From our
point of view, it would be useful for the terminate cluster command
to call this method.
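
To make the request concrete, here is a minimal sketch of the behaviour being asked for. It is not StarCluster code: only the hook name on_remove_node comes from the plugin interface, while RecordingPlugin, terminate_cluster, terminate_node, and the one-argument hook signature are hypothetical and simplified for illustration.

```python
class RecordingPlugin(object):
    """Hypothetical plugin; only the hook name comes from this thread."""

    def __init__(self):
        self.removed = []

    def on_remove_node(self, node):
        # Simplified signature (assumption); a real plugin hook may take more arguments.
        self.removed.append(node)


def terminate_cluster(nodes, plugins, terminate_node):
    """Run every plugin's on_remove_node hook for a node, then terminate that node."""
    for node in list(nodes):
        for plugin in plugins:
            plugin.on_remove_node(node)
        terminate_node(node)
```

With this shape, a plugin that deregisters nodes from an external service would get its cleanup call on a full terminate as well as on a single-node removal.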

Many thanks.

Don


On Wed, Jun 1, 2011 at 9:15 AM, <starcluster-request_at_mit.edu> wrote:

> Today's Topics:
>
> 1. Orphaned nodes (addnode failure) and ELB going over max
> cluster size when adding more than one node (Don MacMillen)
> 2. Re: Orphaned nodes (addnode failure) and ELB going over max
> cluster size when adding more than one node (Rajat Banerjee)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sun, 29 May 2011 13:19:03 -0700
> From: Don MacMillen <macd_at_nimbic.com>
> Subject: [StarCluster] Orphaned nodes (addnode failure) and ELB going
> over max cluster size when adding more than one node
> To: starcluster_at_mit.edu
> Message-ID: <BANLkTimKLfR-RMGV9OB==W_H_daXvhzxWg_at_mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi,
>
> Two issues here, as reported earlier. On the first one, running with the
> new logging turned on, I see an intermittent failure of
> 'starcluster addnode <clustername>'. Error trace from log file below.
>
> Second, on ELB adding too many nodes when adding more than one node
> per iteration. The code in StarCluster/starcluster/plugins/sge/__init__.py
> at line 637 reads:
>
>     if need_to_add > 0:
>         need_to_add = min(self.add_nodes_per_iteration, need_to_add)
>
> The fix could be as simple as:
>
>     if need_to_add > 0:
>         head_room = self.max_nodes - self.stat.hosts
>         need_to_add = min(self.add_nodes_per_iteration, need_to_add,
>                           head_room)
>
> depending upon what you know about self.max_nodes and self.stat.hosts.
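
The head-room idea can be checked in isolation with a small standalone function. This is a sketch under assumptions, not the balancer's actual code: nodes_to_add and its parameter names are hypothetical, and current_hosts is assumed to be an integer count (if self.stat.hosts turns out to be a collection, its length would be passed instead).

```python
def nodes_to_add(need_to_add, add_nodes_per_iteration, max_nodes, current_hosts):
    """Return how many nodes to request this iteration without exceeding max_nodes."""
    if need_to_add <= 0:
        return 0
    # Never let head room go negative if the cluster is already over the limit.
    head_room = max(max_nodes - current_hosts, 0)
    return min(add_nodes_per_iteration, need_to_add, head_room)
```

For example, with a limit of 10 nodes, 9 hosts running, and 3 nodes allowed per iteration, the function caps the request at 1 even when 5 more nodes are wanted.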
>
> Regards,
>
> Don
>
> PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]
> PID: 12630 cluster.py:670 - DEBUG - existing nodes: {u'i-13d0707d': <Node: master (i-13d0707d)>, u'i-11d0707f': <Node: node001 (i-11d0707f)>, u'i-efd07081': <Node: node002 (i-efd07081)>, u'i-edd07083': <Node: node003 (i-edd07083)>}
> PID: 12630 cluster.py:673 - DEBUG - updating existing node i-13d0707d in self._nodes
> PID: 12630 cluster.py:673 - DEBUG - updating existing node i-11d0707f in self._nodes
> PID: 12630 cluster.py:673 - DEBUG - updating existing node i-efd07081 in self._nodes
> PID: 12630 cluster.py:673 - DEBUG - updating existing node i-edd07083 in self._nodes
> PID: 12630 cluster.py:686 - DEBUG - returning self._nodes = [<Node: master (i-13d0707d)>, <Node: node001 (i-11d0707f)>, <Node: node002 (i-efd07081)>, <Node: node003 (i-edd07083)>]
> PID: 12630 clustersetup.py:96 - INFO - Configuring hostnames...
> PID: 12630 cli.py:182 - DEBUG - Traceback (most recent call last):
>   File "build/bdist.linux-i686/egg/starcluster/cli.py", line 160, in main
>     sc.execute(args)
>   File "build/bdist.linux-i686/egg/starcluster/commands/addnode.py", line 37, in execute
>     self.cm.add_node(tag, aliases)
>   File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 119, in add_node
>     cl.add_node(alias)
>   File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 770, in add_node
>     self.add_nodes(1, aliases=aliases)
>   File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 805, in add_nodes
>     self.volumes)
>   File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 510, in on_add_node
>     self._setup_hostnames(nodes=[node])
>   File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 98, in _setup_hostnames
>     self.pool.simple_job(node.set_hostname, (), jobid=node.alias)
> AttributeError: 'NoneType' object has no attribute 'set_hostname'
>
> PID: 12630 cli.py:129 - ERROR - Oops! Looks like you've found a bug in StarCluster
> PID: 12630 cli.py:130 - ERROR - Debug file written to: /tmp/starcluster-debug-staruser.log
> PID: 12630 cli.py:131 - ERROR - Look for lines starting with PID: 12630
> PID: 12630 cli.py:132 - ERROR - Please submit this file, minus any private information,
> PID: 12630 cli.py:133 - ERROR - to starcluster_at_mit.edu
> PID: 12630 ssh.py:536 - DEBUG - __del__ called
> PID: 12630 ssh.py:536 - DEBUG - __del__ called
>
> ------------------------------
>
> Message: 2
> Date: Tue, 31 May 2011 14:46:54 -0400
> From: Rajat Banerjee <rbanerj_at_fas.harvard.edu>
> Subject: Re: [StarCluster] Orphaned nodes (addnode failure) and ELB
> going over max cluster size when adding more than one node
> To: Don MacMillen <macd_at_nimbic.com>
> Cc: starcluster_at_mit.edu
> Message-ID: <BANLkTi=a0ZFQJ1B78-bBi_Zpw7A0PXSZZg_at_mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hi Don,
> Thanks for your suggestions. I will test out the earlier suggestion as
> soon as I can and issue a pull request so that Justin can put it into
> the master branch.
>
> Regarding the latter suggestion, I think the add_node code needs to be
> more robust to detect and correct timing errors. Will work on it...
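
One way to tolerate the timing problem behind the AttributeError in Message 1 (a node lookup that returns None because the new instance is not visible yet) is to retry the lookup before configuring hostnames. The following is only a sketch of that idea, not the fix that went into StarCluster: wait_for_node, lookup, and the retry parameters are all hypothetical names.

```python
import time


def wait_for_node(lookup, alias, retries=5, delay=2.0, sleep=time.sleep):
    """Call lookup(alias) until it returns a non-None node or retries run out."""
    for attempt in range(retries):
        node = lookup(alias)
        if node is not None:
            return node
        # Pause between attempts, but not after the final one.
        if attempt < retries - 1:
            sleep(delay)
    raise RuntimeError("node %r not visible after %d attempts" % (alias, retries))
```

Raising a clear error after the last attempt, instead of passing None along, would at least turn the AttributeError into a message that names the missing node.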
>
> Raj
>
> ------------------------------
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
>
>
> End of StarCluster Digest, Vol 22, Issue 1
> ******************************************
>
Received on Thu Jun 02 2011 - 09:21:12 EDT
This archive was generated by hypermail 2.3.0.
