ELB exceeding cluster size limits.
Hi,
The ELB exceeds the cluster size limit when the dreaded "Instance ID
'blah' does not exist" error occurs.
As most of you know, there can be a timing issue between creating an
instance, getting its ID, and then trying to access it. Everyone must
wrestle with this. In my code there is a simple back-off and retry with a
timeout attached. It usually works, but I am not happy with it.
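For anyone who has not had to write one, the back-off and retry pattern
mentioned above looks roughly like the following. This is a minimal sketch
and not the actual code referred to in this message: InstanceNotFound,
retry_until_visible and fetch are hypothetical names made up for
illustration. With boto, I would assume you catch EC2ResponseError and check
that its error_code is 'InvalidInstanceID.NotFound' instead.

```python
import time


class InstanceNotFound(Exception):
    """Hypothetical stand-in for EC2's InvalidInstanceID.NotFound error."""


def retry_until_visible(fetch, timeout=60.0, first_delay=1.0, max_delay=16.0,
                        sleep=time.sleep, clock=time.time):
    """Call fetch() until it stops raising InstanceNotFound.

    The wait doubles after each failure, capped at max_delay. The last
    InstanceNotFound is re-raised once another wait would run past the
    timeout. sleep and clock are injectable so the loop can be tested.
    """
    deadline = clock() + timeout
    delay = first_delay
    while True:
        try:
            return fetch()
        except InstanceNotFound:
            # Give up if the next wait would take us past the deadline.
            if clock() + delay > deadline:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

The caller would wrap whatever lookup is racing against EC2, for example
the user data fetch for a freshly launched node, in a function passed as
fetch. Note that a timeout here still leaves the instance running, so the
caller has to decide whether to terminate it or keep waiting.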
This happened twice while the SC ELB was ramping from 1 to 10, so I wound
up with a cluster size of 12. Of course, a qstat on the master did not show
the two now-orphaned nodes, and I was able to kill them. But this is not a
robust solution.
So the question is, just how prevalent is this problem? What do others do
to prevent this? I imagine that ELB must use the same addnode code as the
other parts of StarCluster, so this is not a problem specific to ELB.
Any thoughts and comments appreciated.
Regards,
Don
Don MacMillen
PhysWare
PID: 3368 __init__.py:645 - DEBUG - Traceback (most recent call last):
  File "build/bdist.linux-i686/egg/starcluster/balancers/sge/__init__.py", line 642, in _eval_add_node
    self._cluster.add_nodes(need_to_add)
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 775, in add_nodes
    self.wait_for_cluster(msg="Waiting for node(s) to come up...")
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 1038, in wait_for_cluster
    nodes = self.nodes
  File "build/bdist.linux-i686/egg/starcluster/cluster.py", line 658, in nodes
    if n.is_master():
  File "build/bdist.linux-i686/egg/starcluster/node.py", line 690, in is_master
    return self.alias == "master"
  File "build/bdist.linux-i686/egg/starcluster/node.py", line 89, in alias
    user_data = self.ec2.get_instance_user_data(self.id)
  File "build/bdist.linux-i686/egg/starcluster/awsutils.py", line 389, in get_instance_user_data
    attributes = self.conn.get_instance_attribute(i.id, 'userData')
  File "build/bdist.linux-i686/egg/boto/ec2/connection.py", line 685, in get_instance_attribute
    InstanceAttribute, verb='POST')
  File "build/bdist.linux-i686/egg/boto/connection.py", line 611, in get_object
    raise self.ResponseError(response.status, response.reason, body)
EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-c9c931a7' does not exist</Message></Error></Errors><RequestID>05ab9b6e-66bf-4453-b7bc-0d5effaa23af</RequestID></Response>
Received on Wed May 18 2011 - 03:12:27 EDT