StarCluster - Mailing List Archive

Re: Star cluster issues when running MIST over the cloud

From: Jacob Barhak <no email>
Date: Wed, 24 Jul 2013 05:05:37 -0500

Thanks Again Rayson,

Yet even with your generous help I am still stuck. Perhaps you can look at
what I am doing and correct me.

I tried running the commands you suggested to reconfigure the scheduler and
it seems the system hangs on me.

Here is what I did:

1. Created a 1 node cluster.
2. Copied the configuration file to my machine using:
starcluster get mycluster /opt/sge6/default/common/sched_configuration .
3. Modified line 12 in sched_configuration to read
schedd_job_info TRUE
4. Copied the modified sched_configuration file to the default local
directory on the cluster using:
starcluster put mycluster sched_configuration .
5. Ran the configuration command you suggested:
starcluster sshmaster mycluster "qconf -msconf sched_configuration"
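
The edit in step 3 could also be scripted instead of changing a numbered
line by hand. Below is a minimal Python sketch, assuming the file holds one
whitespace-separated "name value" pair per line; set_sched_param is a
hypothetical helper name, not part of StarCluster or Grid Engine, and the
sample values are made up for illustration.

```python
import re


def set_sched_param(config_text, name, value):
    """Return config_text with the line for `name` set to `value`.

    Assumes one whitespace-separated "name value" pair per line.
    Raises KeyError if the parameter is not present in the text.
    """
    # Match the parameter name at the start of a line, keep the original
    # spacing, and replace everything up to the end of that line.
    pattern = re.compile(
        r"^(%s)([ \t]+)[^\r\n]*" % re.escape(name), re.MULTILINE
    )
    new_text, count = pattern.subn(
        lambda m: m.group(1) + m.group(2) + value, config_text
    )
    if count == 0:
        raise KeyError(name)
    return new_text


# Made-up sample content, not the real file from the cluster.
sample = "schedule_interval 0:0:15\nschedd_job_info false\n"
print(set_sched_param(sample, "schedd_job_info", "true"))
```

The edited text would still need to be loaded into the scheduler. The qconf
man page lists a "-Msconf fname" form that reads the configuration from a
file, as opposed to "-msconf", which opens an editor. Whether that resolves
the hang described here is not confirmed in this thread.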

The last step hangs and does not return until I press Ctrl+Break; even
Ctrl+C does not work. I waited a few minutes, then terminated the cluster.
I saw the same behavior when I repeated the commands with the -u root
argument to ensure root privileges.

I am using a Windows 7 machine to issue these commands. I use the PythonXY
distribution and installed starcluster using easy_install. I mention this
in case there is a compatibility problem with my setup.

I am attaching the configuration file and the transcript.

I am getting garbled characters in the output, and it seems the qconf
command I issued does not pick up the change to schedd_job_info.

Is there any other way I can find out what the problem is? If you recall, I
am trying to figure out why jobs in the queue are not dispatched to run and
keep waiting forever. This happens after I have sent a few hundred jobs.
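
One way to check whether scheduler job information is being collected is to
look for the message quoted later in this thread in the output of
"qstat -j". A minimal Python sketch, assuming that output has already been
captured as text; scheduler_info_disabled is a hypothetical helper name.

```python
def scheduler_info_disabled(qstat_output):
    """Return True if the qstat -j text says job info collection is off."""
    # Collapse all whitespace so the check survives line wrapping.
    normalized = " ".join(qstat_output.split())
    message = "Collecting of scheduler job information is turned off"
    return message in normalized


# Sample text copied from the message quoted in this thread.
sample = """scheduling info: (Collecting of scheduler job information is
turned off)"""
print(scheduler_info_disabled(sample))
```

If the function still returns True after the configuration change, the
change probably did not reach the scheduler.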

I hope you could find time to look at this once more, or perhaps someone
else can help.


                   Jacob




On Tue, Jul 23, 2013 at 6:44 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:

> On Tue, Jul 23, 2013 at 2:08 AM, Jacob Barhak <jacob.barhak_at_gmail.com>
> wrote:
> > The second issue, however, is reproducible. I just tried again a 20 node
> > cluster.
>
> Yes, it will always be reproducible: you can only start 20 on-demand
> instances in a region. Again, you need to fill out the "Request to
> Increase Amazon EC2 Instance Limit" form if you want more than 20:
>
> https://aws.amazon.com/contact-us/ec2-request/
>
>
> > This time I posted 2497 jobs to the queue - each about 1 minute long. The
> > system stopped sending jobs from the queue at about the halfway point.
> > There were 1310 jobs in the queue when the system stopped sending more
> > jobs.
> > When running "qstat -j" the system provided the following answer:
> >
> > scheduling info: (Collecting of scheduler job information is
> > turned off)
>
> That's my fault -- I forgot to point out that the scheduler info is
> off by default, and you need to run "qconf -msconf", and change the
> parameter "schedd_job_info" to true. See:
>
> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
>
>
>
> >
> > I am not familiar with the error messages, yet it seems I need to enable
> > something that is turned off. If there is quick obvious solution for this
> > please let me know what to do, otherwise are there any other diagnostics
> > tools I can use?
> >
> > Again, thanks for the quick reply and I hope this is an easy fix.
> >
> > Jacob
> >
> >
> > On Mon, Jul 22, 2013 at 2:51 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
> >>
> >> On Sun, Jul 21, 2013 at 1:40 AM, Jacob Barhak <jacob.barhak_at_gmail.com>
> >> wrote:
> >> > 1. Sometimes starcluster is unable to properly connect the instances on
> >> > the start command and cannot mount /home. It happened once when I asked
> >> > for 5 m1.small machines, and when I terminated this cluster and started
> >> > again things went fine. Is this intermittent due to cloud traffic or is
> >> > this a bug? Is there a way for me to check why?
> >>
> >> Can be a problem with the actual hardware - can you ssh into the node
> >> and manually mount /home by hand next time you encounter this issue
> >> and see if you can reproduce it when run interactively?
> >>
> >>
> >> > 2. After launching 20 c1.xlarge machines and running about 2500 jobs,
> >> > each about 5 minutes long, I encountered a problem after an hour or so.
> >> > It seems that SGE stopped sending jobs from the queue to the instances.
> >> > No error was found and the queue showed about 850 pending jobs. This did
> >> > not change for a while and I could not find any failure with qstat or
> >> > qhost. No jobs were running on any nodes and I waited a while for these
> >> > to start without success. I tried the same thing again after a few hours
> >> > and it seems that the cluster stops sending jobs from the queue after
> >> > about 1600 jobs have been submitted. This does not happen when SGE is
> >> > installed on a single Ubuntu machine I have at home. I am trying to
> >> > figure out what is wrong. Did you impose some limit on the number of
> >> > jobs? Can this be fixed? I really need to submit many jobs - tens of
> >> > thousands of jobs in my full runs. This one was relatively small and
> >> > still did not pass.
> >>
> >> Looks like SGE thinks that the nodes are down or in alarm or error
> >> state? To find out why SGE thinks there are no nodes available, run:
> >> qstat -j
> >>
> >>
> >>
> >> > 3. I tried to start Star Cluster with 50 nodes and got an error about
> >> > exceeding a quota of 20. Is it your quota or Amazon's quota? Are there
> >> > any other restrictions I should be aware of at the beginning? Also,
> >> > after the system is unable to start the cluster it thinks it is still
> >> > running and a terminate command is needed before another start can be
> >> > issued - even though nothing got started.
> >>
> >> It's Amazon's quota. 50 is considered small by AWS standard, and they
> >> can give it to you almost right away... You need to request AWS to
> >> give you a higher limit:
> >> https://aws.amazon.com/contact-us/ec2-request/
> >>
> >> Note that last year we requested 10,000 nodes and the whole
> >> process took less than 1 day:
> >>
> >> http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html
> >>
> >> Rayson
> >>
> >> ==================================================
> >> Open Grid Scheduler - The Official Open Source Grid Engine
> >> http://gridscheduler.sourceforge.net/
> >>
> >>
> >> >
> >> > This all happened on us-west-2 with the help of star cluster 0.93.3 and
> >> > the Anaconda AMI - ami-a4d64194
> >> >
> >> > Here is some more information on what I am doing to help you answer the
> >> > above.
> >> >
> >> > I am running Monte Carlo simulations to simulate chronic disease
> >> > progression. I am using MIST to run over the cloud:
> >> >
> >> > https://github.com/scipy/scipy2013_talks/blob/master/talks/jacob_barhak/readme.md
> >> >
> >> > The Reference Model is what I am running using MIST:
> >> >
> >> > http://youtu.be/7qxPSgINaD8
> >> >
> >> > I am launching many simulations in parallel and it takes me days on a
> >> > single 8 core machine. The cloud allows me to cut down this time to
> >> > hours. This is why star cluster is so useful. In the past I did this
> >> > over other clusters yet the cloud is still new to me.
> >> >
> >> > I will appreciate any recommendations I can get from you to improve the
> >> > behaviors I am experiencing.
>
> http://commons.wikimedia.org/wiki/User:Raysonho
>



Received on Wed Jul 24 2013 - 06:05:42 EDT
This archive was generated by hypermail 2.3.0.
