Hi,
It turns out that there is another error when using ELB that can result in orphaned nodes. The current experiment was to start a cluster with an initial size of 3 and let it sit until ELB had reduced it to only the master, then submit 500 3-minute jobs to SGE. As the cluster size ramps up, some of the nodes get spun up but fail to be added to SGE. Our plugin's 'on_add_node' code also fails; since that code attaches a tag to the instance, it is easy to see which nodes failed. Logging into the master and checking qhost confirms that these nodes are not in the SGE grid and are therefore orphaned.
In this experiment, the cluster max given to ELB was 10. The error below happened 3 times, while the error I reported previously happened once, so there were a total of 4 orphaned nodes and a total 'cluster' size of 14, of which only 10 are usable.
Could the plugin code be the culprit here? It only ssh's a single command on the added node (to start an upstart daemon) and then attaches the tag to the instance. That's it.
I won't be able to dig into this issue for at least a week, so it would be great if someone already knows exactly what the issue is. Many thanks.
Best Regards,
Don MacMillen
BTW, this is code cloned from the GitHub repo last Friday.
PID: 1677 cluster.py:651 - DEBUG - updating existing node i-c99d50a7 in self._nodes
PID: 1677 cluster.py:651 - DEBUG - updating existing node i-39ec2157 in self._nodes
PID: 1677 cluster.py:664 - DEBUG - returning self._nodes = [<Node: master (i-c99d50a7)>, <Node: node001 (i-39ec2157)>]
PID: 1677 cluster.py:1045 - INFO - Waiting for all nodes to be in a 'running' state...
PID: 1677 cluster.py:1056 - INFO - Waiting for SSH to come up on all nodes...
PID: 1677 cluster.py:648 - DEBUG - existing nodes: {u'i-c99d50a7': <Node: master (i-c99d50a7)>, u'i-39ec2157': <Node: node001 (i-39ec2157)>}
PID: 1677 cluster.py:651 - DEBUG - updating existing node i-c99d50a7 in self._nodes
PID: 1677 cluster.py:651 - DEBUG - updating existing node i-39ec2157 in self._nodes
PID: 1677 cluster.py:664 - DEBUG - returning self._nodes = [<Node: master (i-c99d50a7)>, <Node: node001 (i-39ec2157)>]
PID: 1677 cluster.py:648 - DEBUG - existing nodes: {u'i-c99d50a7': <Node: master (i-c99d50a7)>, u'i-39ec2157': <Node: node001 (i-39ec2157)>}
PID: 1677 cluster.py:651 - DEBUG - updating existing node i-c99d50a7 in self._nodes
PID: 1677 cluster.py:651 - DEBUG - updating existing node i-39ec2157 in self._nodes
PID: 1677 cluster.py:664 - DEBUG - returning self._nodes = [<Node: master (i-c99d50a7)>, <Node: node001 (i-39ec2157)>]
PID: 1677 clustersetup.py:96 - INFO - Configuring hostnames...
PID: 1677 __init__.py:644 - ERROR - Failed to add new host.
PID: 1677 __init__.py:645 - DEBUG - Traceback (most recent call last):
  File "build/bdist.linux-i686/egg/starcluster/balancers/sge/__init__.py", line 642, in _eval_add_node
    self._cluster.add_nodes(need_to_add)
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 783, in add_nodes
    self.volumes)
  File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 512, in on_add_node
    self._setup_hostnames(nodes=[node])
  File "build/bdist.linux-i686/egg/starcluster/clustersetup.py", line 98, in _setup_hostnames
    self.pool.simple_job(node.set_hostname, (), jobid=node.alias)
AttributeError: 'NoneType' object has no attribute 'set_hostname'
PID: 1677 __init__.py:592 - INFO - Sleeping, looping again in 60 seconds.
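From the traceback it looks like add_nodes is handing on_add_node a node that is None, so _setup_hostnames dies on node.set_hostname. I have not tested this, but a guard along these lines (just a sketch of the idea against the method shown in the traceback, not a verified patch for the current clustersetup.py) would at least keep one bad node from aborting the whole add:

    # clustersetup.py, _setup_hostnames -- untested sketch only
    def _setup_hostnames(self, nodes=None):
        nodes = nodes or self._nodes
        # skip entries that never materialized instead of crashing on None
        nodes = [n for n in nodes if n is not None]
        for node in nodes:
            self.pool.simple_job(node.set_hostname, (), jobid=node.alias)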