Hi Justin,
First, apologies for the time lag in my response, and thanks for your fixes
in ELB; they are appreciated.
Also, the on_shutdown is exactly what I wanted. I just make that method,
as well as the on_remove_node method, call the same code to shut down
a node for our application, which just involves stopping an upstart daemon
and deleting a file.
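For illustration only, a minimal sketch of that arrangement, with both hooks delegating to one helper. The hook signatures mirror what the ClusterSetup base class in clustersetup.py is assumed to use, and `node.ssh.execute`, the daemon name `myapp`, and the marker file path are hypothetical placeholders rather than the actual plugin. The class is left standalone so it runs without StarCluster installed; a real plugin would subclass ClusterSetup.

```python
class AppShutdownPlugin(object):
    """Standalone sketch; a real plugin would subclass
    starcluster.clustersetup.ClusterSetup (hook signatures assumed)."""

    DAEMON = "myapp"                # hypothetical upstart job name
    MARKER = "/var/run/myapp.node"  # hypothetical file to delete

    def _shutdown_node(self, node):
        # Stop the daemon, then delete the marker file.
        # node.ssh.execute is assumed to run a shell command on the node.
        node.ssh.execute("stop %s" % self.DAEMON)
        node.ssh.execute("rm -f %s" % self.MARKER)

    def on_remove_node(self, node, nodes, master, user, user_shell, volumes):
        # Called for the single node that is being removed.
        self._shutdown_node(node)

    def on_shutdown(self, nodes, master, user, user_shell, volumes):
        # Called once with all nodes, so apply the helper to each one.
        for node in nodes:
            self._shutdown_node(node)
```

Keeping the per-node work in one helper means the two hooks cannot drift apart.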
There are still a couple of issues that I'd like your thoughts on. First is
that we are still seeing occasional failures due to timing / eventual
consistency when adding a node. Here are the relevant lines from the log file:
PID: 7860 cluster.py:678 - DEBUG - adding node i-eb030185 to self._nodes list
PID: 7860 cli.py:157 - ERROR - InvalidInstanceID.NotFound: The instance ID 'i-eb030185' does not exist
Does StarCluster return an error code when this happens? I have looked at
the code, but not studied it enough to know for sure. When we see starcluster
return a nonzero exit code, we terminate and then restart the cluster. Is this
what you would recommend?
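The terminate-and-restart approach could be sketched roughly as below. This assumes the start command really does exit nonzero on failure, which is exactly the open question above. The function name and parameters are hypothetical, and the actual command lines are left to the caller rather than spelled out here.

```python
import subprocess


def start_with_retry(start_cmd, terminate_cmd, attempts=3, run=subprocess.call):
    """Run start_cmd; on a nonzero exit code run terminate_cmd and retry.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed. `run` takes an argument list and returns an exit code.
    """
    for attempt in range(1, attempts + 1):
        if run(start_cmd) == 0:
            return attempt
        # Start failed: tear down whatever was provisioned before retrying.
        run(terminate_cmd)
    return None
```

Passing `run` as a parameter keeps the retry logic testable without touching EC2.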
We are also seeing another kind of failure in provisioning the cluster. We
have been experimenting with large cluster sizes (130 instances, sometimes
with the m2.4xlarge machine type). In two of the 9 spin-ups of these large
clusters, a single node did not have the NFS volume mounted correctly. It was,
however, inserted into the SGE configuration, so jobs get submitted to a node
where they can never run. You might argue that 130 is beyond the useful limit
of NFS, but our use of it is small and controlled. In any event, we will not be
running these large clusters in production, but are rather looking at
characterization and stress testing.
Since the StarCluster documentation recommends checking that NFS is configured
correctly on all nodes, can I assume that you have also seen this kind of
error? If so, any thoughts on its frequency and root cause?
One thing we can think about doing is to check the NFS configuration in the
'add' method of the plugin. Easy enough. But when a failure occurs, we would
like to correct it, and here is where it gets interesting. What I would like is
to just have access to the current Cluster instance and then call its add_node
and remove_node methods, but I have not found a way to accomplish that. Instead
it looks like we have to create a new config instance and from that a new
cluster instance, so that something like the following code can be made to
work:
<code>
from starcluster.config import StarClusterConfig
from starcluster.cluster import ClusterManager
...
# Load the config, then look up the already-running cluster by name.
cfg = StarClusterConfig(MY_STARCLUSTER_CONFIG_FILE)
cfg.load()
cm = ClusterManager(cfg)
cluster = cm.get_cluster(cluster_name)
...
for node in nodes:
    # Replace any node whose NFS mount did not come up correctly.
    if not nfs_ok(node):
        alias = node.alias
        cluster.remove_node(alias)
        cluster.add_node(alias)
</code>
Would you recommend this way of correcting these failures? It seems
like a cumbersome way to go about it.
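One possible shape for the nfs_ok check used above, as a sketch only: parse /proc/mounts-style text from the node and confirm that the expected path is mounted with an NFS filesystem type. How the text is fetched from the node is left out, and the function name and the `/home` mount point in the usage lines are assumptions.

```python
def nfs_mounted(proc_mounts_text, mount_point):
    """Return True if mount_point appears in /proc/mounts-style text
    with filesystem type nfs or nfs4."""
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        # Expected layout: device mount_point fstype options dump pass
        if len(fields) < 3:
            continue
        if fields[1] == mount_point and fields[2] in ("nfs", "nfs4"):
            return True
    return False


# Usage with sample text standing in for a node's /proc/mounts:
sample = "master:/home /home nfs rw,relatime 0 0\n/dev/sda1 / ext3 rw 0 0\n"
print(nfs_mounted(sample, "/home"))
```

Checking the filesystem type as well as the path catches the case where the directory exists locally but the NFS mount never happened.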
A final suggestion / request: can you include timestamps on all the loggers?
We have seen a great deal of variability in the times needed for startup
and it would be great to characterize more closely. For instance, the nfs
config time for a cluster size of 4 is usually around 30 seconds, but we
have seen it as high as over 3 minutes.
As I am sure you know, we could accomplish this by changing some format
strings in the logger.py module. Perhaps something like the following:
INFO_FORMAT = " ".join(['>>>', "%(asctime)s", "%(message)s\n"])
DEBUG_FORMAT = "%(asctime)s %(filename)s:%(lineno)d - %(levelname)s - %(message)s\n"
DEFAULT_CONSOLE_FORMAT = "%(asctime)s %(levelname)s - %(message)s\n"
But perhaps many would find this too ugly?
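To illustrate the effect of adding %(asctime)s, here is a small sketch that uses only the standard logging module. The logger name and helper function are illustrative and do not reflect how StarCluster's logger.py is actually wired up.

```python
import io
import logging


def make_timestamped_logger(stream, name="timing-demo"):
    """Build a logger whose records are prefixed with a timestamp."""
    fmt = "%(asctime)s %(levelname)s - %(message)s"
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(fmt))
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers = [handler]  # replace handlers so reruns do not duplicate output
    logger.propagate = False
    return logger


buf = io.StringIO()
log = make_timestamped_logger(buf)
log.info("Configuring NFS...")
print(buf.getvalue(), end="")
```

With the default date format, each line starts with a date and a time with milliseconds, so step durations can be read off by subtracting adjacent timestamps.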
In any event, many thanks for your help and for your great work with
StarCluster.
Best Regards,
Don
On Fri, Jun 3, 2011 at 9:05 AM, Justin Riley <justin.t.riley_at_gmail.com> wrote:
> Hi Don/Raj,
>
> I've merged your pull request with minor changes so you should be able to
> test that the latest load balancer code doesn't add more nodes than it
> should (ie beyond max size). Don't forget to grab the latest code before
> testing. I'm still working on the addnode failures you encountered which I
> don't believe has anything to do with EBS vs instance-store timing. I'll
> post updates when I have new code to test.
>
> On Jun 2, 2011, at 9:21 AM, Don MacMillen wrote:
>
> Another quick question: Does 'starcluster terminate <clustername>'
> call the 'on_remove_node' method of the plugin? It looks like
> it does not but apologies if this is documented already. From our
> point of view, it would be useful for the terminate cluster command
> to call this method.
>
>
> Stop/Terminate doesn't call on_remove_node but instead calls on_shutdown.
> The main difference is that on_shutdown receives *all* the nodes and is not
> called for each individual node to be removed. Will this work for you? You
> can browse the available plugin methods called by StarCluster in
> clustersetup.py:
>
>
> https://github.com/jtriley/StarCluster/blob/master/starcluster/clustersetup.py
>
> Specifically look at the ClusterSetup base class.
>
> HTH,
>
> ~Justin
>
>
Received on Fri Jul 08 2011 - 07:51:18 EDT