Thanks for your detailed response.
On Tue, Jul 26, 2011 at 9:35 AM, Justin Riley <justin.t.riley_at_gmail.com>wrote:
> Hi Don,
> On 07/08/2011 07:51 AM, Don MacMillen wrote:
> > First, apologies for the time lag in my response and thanks for your
> > fixes in
> > ELB, they are appreciated.
> My apologies as well for my lag in response. I've been on vacation and
> am now just getting back to catching up with emails and this one is long
> over due.
> > Also, the on_shutdown is exactly what I wanted. I just make that method,
> > as well as the on_remove_node method, call the same code to shut down
> > a node for our application, which just involves stopping an upstart
> > and deleting a file.
> Great, glad that's working out for you.
> > There are still a couple of issues that I'd like your thoughts on. First
> > that we are still seeing occasional failures due to timing / eventual
> > consistency
> > of adding a node. Here are the relevant lines from the log file:
> > PID: 7860 cluster.py:678 - DEBUG - adding node i-eb030185 to self._nodes
> > list
> > PID: 7860 cli.py:157 - ERROR - InvalidInstanceID.NotFound: The instance
> > ID 'i-eb030185' does not exist
> > Does StarCluster return an error code when this happens? I have looked
> > the code, but not studied it enough to know for sure. When we see
> > starcluster
> > return a non zero, we terminate and then restart the cluster. Is this
> > what you
> > would recommend?
> This is a tricky bug that's hard to debug and fix. The root cause is
> that StarCluster creates the instances and then immediately attempts to
> perform an operation that retrieves the instance's user_data
> (ec2-get-instance-attribute API call). In some cases EC2's
> infrastructure isn't quite 'ready' yet most for the call given that it's
> performed almost instantaneously by StarCluster in order fetch metadata
> assigned to the instance at launch time (such as the node's alias). This
> metadata is then used to tag the instances using the new tags & filters
> API which allows faster and more advanced querying of the nodes. The
> majority of the time the call works without issue, especially on smaller
> clusters, however, every now and again, as you've seen, the call is made
> too soon and this error is produced.
> I'm working on a fix for this that will simply try/catch the call, sleep
> if unsuccessful, and keep trying until it's successful. That should
> reliably fix the issue once and for all (fingers crossed). I've created
> an issue on github to track this:
Yes, I thought it must be an 'eventual consistency' issue with the EC2
database. Your fix sounds great and exactly what we need here.
> > We are also seeing another kind of failure in provisioning the cluster.
> > We have
> > been experimenting with large cluster sizes (130 instances, sometimes
> > the m2.4xlarge machine type). What has happened is that in two of the 9
> > spin
> > ups of these large clusters, a single node does not have the nfs volume
> > mounted
> > correctly. It is, however, inserted into the SGE configuration, so jobs
> > get submitted
> > to the node that can never run. You might argue that 130 is beyond the
> > useful
> > limit of nfs, but our use of it is small and controlled. In any event,
> > we will not be
> > running these large clusters in production, but are rather looking at
> > characterization
> > and stress testing.
> > Since the starcluster documentation recommend checking that nfs is
> > configured
> > correctly on all nodes, can I assume that you have also seen this kind
> > of error?
> > If so, any thoughts on its frequency and root cause?
> > One thing we can think about doing is to check the nfs configuration in
> > the 'add'
> > method of the plugin. Easy enough. But when a failure occurs, we would
> > like
> > to correct it and here is where it gets interesting. What I would like
> > it to just
> > have access to the current Cluster instance and then to call its
> > add_node and
> > remove_node methods, but I have not found a way to accomplish that.
> > it looks like we have to create a new cluster instance and before that a
> > new config
> > instance so something like the following code can be made to work:
> > <code>
> > from starcluster.config import StarClusterConfig
> > from starcluster.cluster import ClusterManager
> > ...
> > cfg = StarClusterConfig(MY_STARCLUSTER_CONFIG_FILE)
> > cfg.load()
> > cm = ClusterManager(cfg)
> > cluster = cm.get_cluster(cluster_name)
> > ...
> > for node in nodes:
> > if not nfs_ok(node):
> > alias = node.alias
> > cluster.remove_node(alias)
> > cluster.add_node(alias)
> > </code>
> > Would you recommend this way of correcting these failures? It seems
> > like a cumbersome way to go about it.
> Did you check the logs to see if there were errors related to
> configuring NFS on the new nodes? Any errors related to mounting NFS
> *should* be printed to the screen as well. Ideally you shouldn't need to
> fiddle with NFS so it'd be good to see what the issue is and try to
> resolve it properly in the core code rather than having it be a concern
> of your plugin. Perhaps we just need a sanity check after mounting the
> NFS-share and retry the mount automatically if that fails. However,
> before going that route I'd like to know if there are specific error
> messages in the log before attempting to blindly remount and hoping it
> fixes the issue.
I did check the logs and they do show the nfs errors. However, I don't
think (can't recall) that the errors were also printed to the screen, but
were definitely in the log file in /tmp
> Also, on the nodes that failed to be configured with NFS properly did
> you check if they were listed in /etc/exports on the master node and
> that /etc/fstab on their local filesystems pointed to the corresponding
> NFS export on the master? If they were I would recommend in the mean
> time to have your plugin retry mounting the drive, possibly creating the
> mount directory if needed:
> mtpt = '/mnt/mymountpoint'
> if not node.ssh.isdir(mtpt):
> node.ssh.execute('mount %s' % mtpt)
I did not think to check /etc/exports on the master nor the /etc/fstab
on the slave nodes. I will be sure to check next time the issue occurs.
I did code the 'error recovery' into our plugin. However, I know that it
still has problems for it failed on the one time a new cluster showed the
nfs problems. However, that was awhile ago and we just terminated the
cluster and spun up a new one. I still need to debug this error handling
code but that will be tricky (but not impossible).
Next time I am able to catch a cluster with a bad nfs, I will do the checks
you suggest and also look to move the plugin code to do the remount as
in your code snippet.
> > A final suggestion / request, can you include timestamps on all the
> > We have seen a great deal of variability in the times needed for startup
> > and it would be great to characterize more closely. For instance, the
> > config time for a cluster size of 4 is usually around 30 seconds, but we
> > have seen it as high as over 3 minutes.
> > As I am sure you know, we could accomplish this by changing some format
> > strings in the logger.py module. Perhaps something like the following:
> > INFO_FORMAT = " ".join(['>>>', "%(asctime)s", "%(message)s\n"])
> > DEBUG_FORMAT = "%(asctime)s %(filename)s:%(lineno)d - %(levelname)s -
> > %(message)s\n"
> > DEFAULT_CONSOLE_FORMAT = "%(asctime) %(levelname)s - %(message)s\n"
> > But perhaps many would find this too ugly?
> I'm game for putting %(asctime)s in the DEBUG_FORMAT string but not in
> the console (INFO) level because it clutters the output. Having it in
> DEBUG_FORMAT will make sure that all messages in the debug file are
> timestamped. This should meet your needs given that *all* log messages
> go to the debug file. Does that work for you? I created an issue to
> track this:
That works great for me. Many thanks.
> > In any event, many thanks for your help and for your great work with
> > StarCluster.
> Thank you, my pleasure. Thanks for testing and reporting issues :D
Received on Wed Jul 27 2011 - 01:56:51 EDT