Star cluster issues when running MIST over the cloud

From: Jacob Barhak <no email>
Date: Sun, 21 Jul 2013 00:40:19 -0500


First, thank you for releasing such a useful tool as StarCluster.

I started using it recently and have run into several issues. I would appreciate any help you can find the time to give.

1. Sometimes StarCluster is unable to properly connect the instances during the start command and cannot mount /home. This happened once when I asked for 5 m1.small machines; when I terminated that cluster and started it again, things went fine. Is this an intermittent failure due to cloud traffic, or is it a bug? Is there a way for me to check why it happened?

2. After launching 20 c1.xlarge machines and running about 2500 jobs, each about 5 minutes long, I encountered a problem after an hour or so. It seems that SGE stopped sending jobs from the queue to the instances. No error was reported and the queue showed about 850 pending jobs. This did not change for a while, and I could not find any failure with qstat or qhost. No jobs were running on any node, and I waited a while for them to start, without success. I tried the same thing again after a few hours, and it seems that the cluster stops sending jobs from the queue after about 1600 jobs have been submitted. This does not happen when SGE is installed on a single Ubuntu machine I have at home. I am trying to figure out what is wrong. Did you impose some limit on the number of jobs? Can this be fixed? I really need to submit many jobs, tens of thousands in my full runs. This run was relatively small and still did not complete.

3. I tried to start StarCluster with 50 nodes and got an error about exceeding a quota of 20. Is this your quota or Amazon's? Are there any other restrictions I should be aware of before I begin? Also, after the system fails to start the cluster, it thinks the cluster is still running, and a terminate command is needed before another start can be issued, even though nothing got started.
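
For item 2, a minimal sketch of how the pending and running counts could be tallied from qstat output, in case the breakdown by state helps with diagnosis. It assumes the default plain-text qstat layout, in which each job row starts with a numeric job ID and the state code is the fifth whitespace-separated column; the function name tally_job_states is only illustrative.

```python
from collections import Counter


def tally_job_states(qstat_output: str) -> Counter:
    """Count jobs per state code in plain-text qstat output.

    Assumes the default column order: job-ID, prior, name, user, state, ...
    """
    counts = Counter()
    for line in qstat_output.splitlines():
        fields = line.split()
        # Job rows begin with a numeric job ID; the header and the
        # dashed separator line do not, so they are skipped.
        if len(fields) >= 5 and fields[0].isdigit():
            counts[fields[4]] += 1
    return counts


if __name__ == "__main__":
    import subprocess

    # `qstat -u "*"` lists jobs of all users; run this on the master node.
    result = subprocess.run(
        ["qstat", "-u", "*"], capture_output=True, text=True, check=True
    )
    for state, count in sorted(tally_job_states(result.stdout).items()):
        print(f"{state}: {count}")
```

A count made up only of qw would match what I saw (jobs waiting but never dispatched), whereas jobs in an error state such as Eqw would point to a different kind of failure.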

This all happened on us-west-2 with StarCluster 0.93.3 and the Anaconda AMI (ami-a4d64194).

Here is some more information on what I am doing to help you answer the above.

I am running Monte Carlo simulations to simulate chronic disease progression. I am using MIST to run over the cloud:

The Reference Model is what I am running using MIST:

I am launching many simulations in parallel, and a full run takes me days on a single 8-core machine. The cloud allows me to cut this time down to hours, which is why StarCluster is so useful. In the past I ran these on other clusters, but the cloud is still new to me.

I would appreciate any recommendations you can give me for resolving the behaviors I am experiencing.

Jacob Barhak Ph.D.
