Re: Star cluster issues when running MIST over the cloud
On Sun, Jul 21, 2013 at 1:40 AM, Jacob Barhak <jacob.barhak_at_gmail.com> wrote:
> 1. Sometimes StarCluster is unable to properly connect the instances on the
> start command and cannot mount /home. It happened once when I asked for 5
> m1.small machines and when I terminated this cluster and started again
> things went fine. Is this intermittent due to cloud traffic or is this a
> bug? Is there a way for me to check why?
This could be a problem with the actual hardware. The next time you
encounter this issue, can you ssh into the node and mount /home by hand,
to see whether you can reproduce it when run interactively?
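Before attempting the mount by hand, it can help to confirm whether /home is actually missing on the node. The following is a minimal sketch, not part of StarCluster; `find_mount` is a hypothetical helper name, and it only assumes the standard Linux /proc/mounts layout (device, mount point, filesystem type, options).

```python
import os


def find_mount(mounts_text, mount_point):
    """Return (device, fstype) for mount_point, or None if it is not mounted.

    mounts_text is expected to be in /proc/mounts format:
    device  mount_point  fstype  options  dump  pass
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            # Keep the last match: later mounts shadow earlier ones.
            found = (fields[0], fields[2])
    return found


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as f:
        result = find_mount(f.read(), "/home")
    if result is None:
        print("/home is not mounted on this node")
    else:
        print("/home is mounted from %s (type %s)" % result)
```

Run on the affected node, a result of "not mounted" would confirm that the share never came up, at which point the manual mount can be retried interactively to see the actual error message.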
> 2. After launching 20 c1.xlarge machines and running about 2500 jobs, each
> about 5 minutes long, I encountered a problem after an hour or so. It seems
> that SGE stopped sending jobs from the queue to the instances. No error
> was found and the queue showed about 850 pending jobs. This did not change
> for a while and I could not find any failure with qstat or qhost. No jobs
> were running on any nodes and I waited a while for these to start without
> success. I tried the same thing again after a few hours and it seems that
> the cluster stops sending jobs from the queue after about 1600 jobs have
> been submitted. This does not happen when SGE is installed on a single
> Ubuntu machine I have at home. I am trying to figure out what is wrong. Did
> you impose some limit on the number of jobs? Can this be fixed? I really
> need to submit many jobs - tens of thousands of jobs in my full runs. This one
> was relatively small and still did not pass.
It looks like SGE thinks that the nodes are down, or in an alarm or error
state. To find out why SGE thinks there are no nodes available, run:
qstat -j
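To check the node side of the same question, the states column of `qstat -f` can be scanned for queue instances flagged as alarm (a), unknown (u) or error (E). This is only a sketch: `problem_queues` is a hypothetical helper, and it assumes the usual six-column `qstat -f` layout in which the states column is empty for a healthy queue instance.

```python
import shutil
import subprocess


def problem_queues(qstat_f_text, bad_states="auE"):
    """Map queue instance name -> state string for instances in a bad state.

    Assumes `qstat -f` lines of the form:
    queuename  qtype  resv/used/tot.  load_avg  arch  [states]
    """
    problems = {}
    for line in qstat_f_text.splitlines():
        fields = line.split()
        # Queue instance lines look like "all.q@node001 ..."; skip the
        # header, separator lines and job lines.
        if len(fields) < 5 or "@" not in fields[0]:
            continue
        states = fields[5] if len(fields) > 5 else ""
        if any(s in bad_states for s in states):
            problems[fields[0]] = states
    return problems


if __name__ == "__main__" and shutil.which("qstat"):
    out = subprocess.run(["qstat", "-f"], capture_output=True, text=True).stdout
    for queue, states in sorted(problem_queues(out).items()):
        print(queue, states)
```

If every queue instance is listed here, the scheduler has nowhere to dispatch the pending jobs, which would match a queue that stays full while no node is running anything.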
> 3. I tried to start Star Cluster with 50 nodes and got an error about
> exceeding a quota of 20. Is it your quota or Amazon quota? Are there any
> other restrictions I should be aware of at the beginning? Also after the
> system is unable to start the cluster it thinks it is still running and a
> terminate command is needed before another start can be issued - even though
> nothing got started.
It's Amazon's quota. 50 is considered small by AWS standards, and they
can grant it almost right away. You need to ask AWS to give you a
higher limit:
https://aws.amazon.com/contact-us/ec2-request/
Note that last year we requested 10,000 nodes and the whole
process took less than 1 day:
http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html
Rayson
==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
>
> This all happened on us-west-2 with the help of star cluster 0.93.3 and the
> Anaconda AMI - ami-a4d64194
>
> Here is some more information on what I am doing to help you answer the
> above.
>
> I am running Monte Carlo simulations to simulate chronic disease
> progression. I am using MIST to run over the cloud:
>
> https://github.com/scipy/scipy2013_talks/blob/master/talks/jacob_barhak/readme.md
>
> The Reference Model is what I am running using MIST:
>
> http://youtu.be/7qxPSgINaD8
>
> I am launching many simulations in parallel and it takes me days on a single
> 8 core machine. The cloud allows me to cut down this time to hours. This is
> why star cluster is so useful. In the past I did this over other clusters
> yet the cloud is still new to me.
>
> I will appreciate any recommendations I can get from you to improve the
> behaviors I am experiencing.
>
> --
> Jacob Barhak Ph.D.
> http://sites.google.com/site/jacobbarhak/
>
Received on Mon Jul 22 2013 - 15:51:58 EDT