StarCluster - Mailing List Archive

Re: Large cluster (125 nodes) launch failure

From: Justin Riley <no email>
Date: Wed, 16 Mar 2011 12:18:42 -0400


> Now I am a bit at a loss; because I believed that all 50 nodes were
> successfully running as SGE execution nodes, I submitted a huge array
> jobs (total 1000). They are running on 29 nodes, but 21 others are
> sitting idle.

Sorry for this, apparently the installer script that comes with SGE is
partially failing for some reason. I've responded to you and Jeff on the
other thread 'starcluster starts but not all nodes added as exec nodes'
asking for some details.

It's obvious now that the SGE setup code in StarCluster should verify
that SGE has actually started and that every node shows up in qhost,
which shouldn't be too difficult to do. I already have code that adds
nodes to SGE one by one; it is used by the new 'addnode' command. I just
need to change the setup procedure to run this code instead of SGE's
installer script, which will also give me more control. Unfortunately
this will have to wait until after the release, but it's definitely on
the TODO list.
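
For illustration, a check along those lines might look like the sketch
below. This is not StarCluster's actual code: the function names are
hypothetical, it assumes plain `qhost` output (a header row, a dashed
separator, and a 'global' pseudo-host line before the real hosts), and
it assumes the names you expect match what qhost prints (short names
vs. fully qualified names depends on the SGE configuration).

```python
import subprocess


def parse_qhost(output):
    """Return the set of host names listed in plain `qhost` output."""
    hosts = set()
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, the column header, the dashed separator,
        # and the "global" pseudo-host line.
        if not fields:
            continue
        name = fields[0]
        if name in ("HOSTNAME", "global") or name.startswith("-"):
            continue
        hosts.add(name)
    return hosts


def missing_exec_hosts(expected, qhost_output):
    """Return the expected node names that qhost does not list, sorted."""
    return sorted(set(expected) - parse_qhost(qhost_output))


def run_qhost():
    """Run `qhost` on the master node and return its output as text."""
    return subprocess.check_output(["qhost"]).decode()
```

On the master you could then call something like
missing_exec_hosts(expected_names, run_qhost()), where expected_names is
your own list of node names, to see which nodes never registered with
SGE.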

> What should I do with those 21 nodes?
> Should I terminate the cluster (together with already running jobs)
> and start with an even smaller configuration?

Well, depending on how important the currently running jobs are to you,
your options are:

1. wait for your jobs to finish and manually terminate the idle nodes to
stop paying for them in the meantime (tedious)

2. use the new 'restart' command to reboot the instances and completely
reconfigure the cluster (kills running jobs)

3. completely terminate the cluster and launch a smaller one (kills
instances and running jobs)

Hope that helps,


Received on Wed Mar 16 2011 - 12:19:06 EDT

