StarCluster - Mailing List Archive

Re: Star cluster issues when running MIST over the cloud

From: Jacob Barhak <no email>
Date: Wed, 24 Jul 2013 17:47:35 -0500

Thanks again, Rayson,

Your explanation allowed me to continue. I had to jump through several
hoops, since the Windows cmd terminal does not handle the special characters
well, and working with vi without seeing what you type is terrible - perhaps
there is a simple fix for this. Nevertheless, once the SGE configuration
file was fixed I launched the simulations again.
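
In case it helps anyone else who hits the same vi-over-sshmaster problem,
the workaround I am considering for next time (a sketch only - I have not
verified the file-based qconf options on the StarCluster AMI, so please
correct me if they are not supported there) is to avoid the interactive
editor completely:

# dump the scheduler configuration, flip schedd_job_info, load it back
starcluster sshmaster mycluster "qconf -ssconf > /tmp/sched.conf"
starcluster sshmaster mycluster "sed -i 's/^schedd_job_info.*/schedd_job_info true/' /tmp/sched.conf"
starcluster sshmaster mycluster "qconf -Msconf /tmp/sched.conf"

This keeps everything on the Linux side, so the Windows cmd quoting issues
should matter much less.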

This time the simulations stopped with 1331 jobs waiting in the queue. I ran
qstat -j and it produced the attached output. Basically, it seems that the
system does not handle the dependencies I have in the simulation for some
reason, or it simply gets overloaded. It seems to stop all queues at some
point. At least this is what I understand from the output.
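
Next time it gets stuck I also intend to look at one of the waiting jobs
directly instead of the global dump, along these lines (the job number below
is just a placeholder, not a real one from my run):

starcluster sshmaster mycluster "qstat -j 1234"

As far as I understand, this should print that job's scheduling info together
with the predecessor jobs it is still waiting on, which would tell me whether
the dependencies are the problem.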

This may explain why this happens with around 1300 jobs left. My simulation
is basically composed of 3 parts; the last 2 parts consist of 1248 + 1 jobs
that depend on the first 1248 jobs. My assumption, based on the output, is
that the dependencies get lost or perhaps overload the system - yet I may be
wrong, and this is why I am asking the experts.
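
To make the dependency structure concrete, the shape of what gets submitted
is roughly the following (this is only an illustration - MIST generates the
actual scripts and job names, so the names below are made up):

# 1248 independent first-stage jobs
for i in $(seq 1 1248); do
    qsub -N stage1_$i run_stage1.sh $i
done
# 1248 second-stage jobs, each held until its first-stage job finishes
for i in $(seq 1 1248); do
    qsub -N stage2_$i -hold_jid stage1_$i run_stage2.sh $i
done
# one final collection job held until all second-stage jobs finish
qsub -N collect -hold_jid "stage2_*" run_collect.sh

So the scheduler has to track on the order of 1250 -hold_jid relationships at
once (assuming wildcard job names are accepted by -hold_jid in this SGE
version - otherwise it would be a long comma-separated list). I do not know
whether that by itself can overload it.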

Note that this does not happen with smaller simulations such as my test
suite. It also does not happen on an SGE cluster I installed on a single
machine. So is this an SGE issue, or should I report it to this StarCluster
group?
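
One thing I plan to rule out, in case it is relevant (this is only a guess on
my part), is a global job limit in the cluster configuration - as far as I
can tell from sge_conf(5), max_jobs and max_u_jobs can cap the number of jobs
in the system, with 0 meaning unlimited:

starcluster sshmaster mycluster "qconf -sconf | egrep 'max_jobs|max_u_jobs'"

If someone knows whether StarCluster changes these defaults, that would
already help.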

Also, and this is a different issue, I tried a larger simulation of about
25K jobs. It seems that the system overloads at some point and gives me the
following error:
Unable to run job: rule "default rule (spool dir)" in spooling context
"flatfile spooling" failed writing an object
job 11618 was rejected cause it can't be written: No space left on device.
Exiting.

In the above simulations the cluster was created using:
starcluster start -s 20 -i c1.xlarge mycluster

I assume that a master with 7 GB of memory and 4x420 GB of disk space is
sufficient to support the operations I request. Outside the cloud I have a
single physical machine with 16 GB of memory and less than 0.5 TB of
allocated disk space that handles this entire simulation on its own without
going to the cloud - yet it takes much more time, and therefore the ability
to run on the cloud is important.
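
If I understand the spooling error correctly, the thing to check is probably
not the total disk space but the specific partition that holds the SGE spool
directory - I am guessing it is /opt/sge6/default/spool on this AMI, and it
may live on the small root volume rather than on the 4x420 GB ephemeral
disks. Next time this happens I intend to run something like:

starcluster sshmaster mycluster "df -h"
starcluster sshmaster mycluster "du -sh /opt/sge6/default/spool"

to see which filesystem actually fills up. Please correct me if the spool
directory is somewhere else on the StarCluster images.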

If anyone can help me fix these issues I would be grateful, since I wish to
release a new version of my code that can run simulations on the cloud at
larger scales. So far I can run only small-scale simulations on the cloud
due to the above issues.

Again Rayson, thanks for your guidance and hopefully there is a simple
explanation/fix for this.


                 Jacob




On Wed, Jul 24, 2013 at 7:29 AM, Rayson Ho <raysonlogin_at_gmail.com> wrote:

> "qconf -msconf" will try to open the editor, which is vi. And going
> through starcluster sshmaster to launch vi may not be the best
> approach.
>
> This is what I just tested and worked for me:
>
> 1) ssh into the master by running:
> starcluster sshmaster mycluster
>
> 2) launch qconf -msconf and change schedd_job_info to true:
> qconf -msconf
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
>
>
> On Wed, Jul 24, 2013 at 6:05 AM, Jacob Barhak <jacob.barhak_at_gmail.com>
> wrote:
> > Thanks Again Rayson,
> >
> > Yet even with your generous help I am still stuck. Perhaps you can look
> > at what I am doing and correct me.
> >
> > I tried running the commands you suggested to reconfigure the scheduler
> > and it seems the system hangs on me.
> >
> > Here is what I did:
> >
> > 1. Created a 1 node cluster.
> > 2. Copied the configuration file to my machine using:
> > starcluster get mycluster /opt/sge6/default/common/sched_configuration .
> > 3. Modified line 12 in sched_configuration to read
> > schedd_job_info TRUE
> > 4. Copied the modified sched_configuration file to the default local
> > directory on the cluster using:
> > starcluster put mycluster sched_configuration .
> > 5. Ran the configuration command you suggested:
> > starcluster sshmaster mycluster "qconf -msconf sched_configuration"
> >
> > The system hangs in the last stage and does not return unless I press
> > control break - even control C does not work. I waited a few minutes then
> > terminated the cluster. I double checked this behavior using the -u root
> > argument when running the commands to ensure root privileges.
> >
> > I am using a Windows 7 machine to issue those commands. I use the PythonXY
> > distribution and installed starcluster using easy_install. I am providing
> > this information to see if there is anything wrong with my system
> > compatibility-wise.
> >
> > I am attaching the configuration file and the transcript.
> >
> > I am getting funny characters and it seems the qconf command I issued
> > does not recognize the change in schedd_job_info.
> >
> > Is there any other way I can find out what the problem is - if you recall,
> > I am trying to figure out why jobs in the queue are not dispatched to run
> > and keep waiting forever. This happens after a few hundred jobs I am sending.
> >
> > I hope you could find time to look at this once more, or perhaps someone
> > else can help.
> >
> >
> > Jacob
> >
> >
> >
> >
> > On Tue, Jul 23, 2013 at 6:44 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
> >>
> >> On Tue, Jul 23, 2013 at 2:08 AM, Jacob Barhak <jacob.barhak_at_gmail.com>
> >> wrote:
> >> > The second issue, however, is reproducible. I just tried again a 20
> >> > node cluster.
> >>
> >> Yes, it will always be reproducible - you can only start 20 on-demand
> >> instances in a region. Again, you need to fill out the "Request to
> >> Increase Amazon EC2 Instance Limit" form if you want more than 20:
> >>
> >> https://aws.amazon.com/contact-us/ec2-request/
> >>
> >>
> >> > This time I posted 2497 jobs to the queue - each about 1 minute long.
> >> > The system stopped sending jobs to the queue at about the halfway point.
> >> > There were 1310 jobs in the queue when the system stopped sending more
> >> > jobs. When running "qstat -j" the system provided the following answer:
> >> >
> >> > scheduling info: (Collecting of scheduler job information is
> >> > turned off)
> >>
> >> That's my fault -- I forgot to point out that the scheduler info is
> >> off by default, and you need to run "qconf -msconf", and change the
> >> parameter "schedd_job_info" to true. See:
> >>
> >> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
> >>
> >> Rayson
> >>
> >> ==================================================
> >> Open Grid Scheduler - The Official Open Source Grid Engine
> >> http://gridscheduler.sourceforge.net/
> >>
> >>
> >>
> >> >
> >> > I am not familiar with the error messages, yet it seems I need to
> >> > enable something that is turned off. If there is a quick, obvious
> >> > solution for this please let me know what to do; otherwise, are there
> >> > any other diagnostic tools I can use?
> >> >
> >> > Again, thanks for the quick reply and I hope this is an easy fix.
> >> >
> >> > Jacob
> >> >
> >> >
> >> > On Mon, Jul 22, 2013 at 2:51 PM, Rayson Ho <raysonlogin_at_gmail.com>
> >> > wrote:
> >> >>
> >> >> On Sun, Jul 21, 2013 at 1:40 AM, Jacob Barhak <jacob.barhak_at_gmail.com>
> >> >> wrote:
> >> >> > 1. Sometimes starcluster is unable to properly connect the instances
> >> >> > on the start command and cannot mount /home. It happened once when I
> >> >> > asked for 5 m1.small machines, and when I terminated this cluster and
> >> >> > started again things went fine. Is this intermittent due to cloud
> >> >> > traffic or is this a bug? Is there a way for me to check why?
> >> >>
> >> >> It can be a problem with the actual hardware - can you ssh into the
> >> >> node and mount /home by hand the next time you encounter this issue and
> >> >> see if you can reproduce it when run interactively?
> >> >>
> >> >>
> >> >> > 2. After launching 20 c1.xlarge machines and running about 2500 jobs,
> >> >> > each about 5 minutes long, I encountered a problem after an hour or so.
> >> >> > It seems that SGE stopped sending jobs from the queue to the instances.
> >> >> > No error was found and the queue showed about 850 pending jobs. This
> >> >> > did not change for a while and I could not find any failure with qstat
> >> >> > or qhost. No jobs were running on any nodes and I waited a while for
> >> >> > these to start without success. I tried the same thing again after a
> >> >> > few hours and it seems that the cluster stops sending jobs from the
> >> >> > queue after about 1600 jobs have been submitted. This does not happen
> >> >> > when SGE is installed on a single Ubuntu machine I have at home. I am
> >> >> > trying to figure out what is wrong. Did you impose some limit on the
> >> >> > number of jobs? Can this be fixed? I really need to submit many jobs -
> >> >> > tens of thousands of jobs in my full runs. This one was relatively
> >> >> > small and still did not pass.
> >> >>
> >> >> Looks like SGE thinks that the nodes are down or in alarm or error
> >> >> state? To find out why SGE thinks there are no nodes available, run:
> >> >> qstat -j
> >> >>
> >> >>
> >> >>
> >> >> > 3. I tried to start StarCluster with 50 nodes and got an error about
> >> >> > exceeding a quota of 20. Is it your quota or Amazon's quota? Are there
> >> >> > any other restrictions I should be aware of at the beginning? Also,
> >> >> > after the system is unable to start the cluster it thinks it is still
> >> >> > running and a terminate command is needed before another start can be
> >> >> > issued - even though nothing got started.
> >> >>
> >> >> It's Amazon's quota. 50 is considered small by AWS standards, and they
> >> >> can give it to you almost right away... You need to request AWS to
> >> >> give you a higher limit:
> >> >> https://aws.amazon.com/contact-us/ec2-request/
> >> >>
> >> >> Note that last year we requested 10,000 nodes and the whole
> >> >> process took less than 1 day:
> >> >>
> >> >> http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html
> >> >>
> >> >> Rayson
> >> >>
> >> >> ==================================================
> >> >> Open Grid Scheduler - The Official Open Source Grid Engine
> >> >> http://gridscheduler.sourceforge.net/
> >> >>
> >> >>
> >> >> >
> >> >> > This all happened on us-west-2 with the help of StarCluster 0.93.3
> >> >> > and the Anaconda AMI - ami-a4d64194
> >> >> >
> >> >> > Here is some more information on what I am doing to help you answer
> >> >> > the above.
> >> >> >
> >> >> > I am running Monte Carlo simulations to simulate chronic disease
> >> >> > progression. I am using MIST to run over the cloud:
> >> >> >
> >> >> > https://github.com/scipy/scipy2013_talks/blob/master/talks/jacob_barhak/readme.md
> >> >> >
> >> >> > The Reference Model is what I am running using MIST:
> >> >> >
> >> >> > http://youtu.be/7qxPSgINaD8
> >> >> >
> >> >> > I am launching many simulations in parallel and it takes me days on a
> >> >> > single 8-core machine. The cloud allows me to cut down this time to
> >> >> > hours. This is why StarCluster is so useful. In the past I did this
> >> >> > over other clusters, yet the cloud is still new to me.
> >> >> >
> >> >> > I would appreciate any recommendations I can get from you to improve
> >> >> > the behaviors I am experiencing.
> >>
> >> http://commons.wikimedia.org/wiki/User:Raysonho
> >
> >
>



Received on Wed Jul 24 2013 - 18:47:38 EDT
This archive was generated by hypermail 2.3.0.
