Hi,
I've read several emails on this list about the delay in starting a large
cluster.
As someone who has personally worked on StarCluster development, I can say
that most, if not all, of the delays are with EC2. It is difficult to start
virtual machines on demand, guarantee that they're up, wait for SSH and the
filesystem shares to become available, and present the machines as usable
to the end user. There are probably some improvements that can be made to
StarCluster, but the problem lies mainly with EC2.
One suggestion that may work for some would be to use the Elastic Load
Balancer in StarCluster to gradually increase the size of your cluster
until the desired size is attained. You could start the cluster with 50
nodes, submit all of your jobs to the SGE queue, then launch the ELB and
let it scale up the cluster. This would have several benefits:
- Potential time savings
  - Jobs would start as soon as they're queued instead of waiting for all
    200 nodes to be ready
  - Jobs are already running while the cluster adds nodes 51 through 200
  - Less frustrating upfront waiting
- Potential $ savings
  - Not all nodes are started at the same time, so you would save $ on the
    nodes that start later
  - ELB would shut down idle nodes when the job queue empties out
Here is a sample command to start the ELB once the cluster of 50 (or even
fewer) nodes is up:
$ starcluster loadbalance -n 50 -m 200 -a 10 mycluster
This would start the load balancer with:
'-n 50' = a minimum of 50 nodes
'-m 200' = a maximum of 200 nodes
'-a 10' = add 10 new nodes whenever ELB detects that jobs are waiting and
nodes need to be added to the cluster
'mycluster' = your cluster name
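
To get a feel for how these numbers interact, here is a rough Python sketch
of the scale-up steps. It is only arithmetic, not StarCluster code: the
function name scale_up_schedule is made up for illustration, and it assumes
the queue stays backlogged so that every round adds the full batch of
nodes. How long each round takes depends on the polling interval and on how
fast EC2 brings nodes up, which is not modeled here.

```python
def scale_up_schedule(start_nodes, max_nodes, add_per_round):
    """Return the cluster size after each round of adding nodes.

    Assumes (for illustration only) that jobs are always waiting, so each
    round adds `add_per_round` nodes until `max_nodes` is reached.
    """
    if add_per_round <= 0:
        raise ValueError("add_per_round must be positive")
    sizes = []
    nodes = start_nodes
    while nodes < max_nodes:
        # Never grow past the configured maximum (-m).
        nodes = min(nodes + add_per_round, max_nodes)
        sizes.append(nodes)
    return sizes


if __name__ == "__main__":
    # Values from the sample command: start at 50, -m 200, -a 10
    schedule = scale_up_schedule(50, 200, 10)
    print("rounds needed:", len(schedule))
    print("cluster size after each round:", schedule)
```

With the values from the sample command this gives 15 rounds of 10 nodes
each to go from 50 to 200. A larger '-a' value means fewer rounds, at the
cost of starting more nodes at once.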
A few other options that can help, such as changing the stabilization time
and polling interval, are described on this page:
http://web.mit.edu/stardev/cluster/docs/latest/manual/load_balancer.html
Best,
Rajat
Received on Wed Dec 21 2011 - 12:26:34 EST