StarCluster - Mailing List Archive

Re: Star cluster issues when running MIST over the cloud

From: Rayson Ho <no email>
Date: Wed, 24 Jul 2013 22:11:39 -0400

On Wed, Jul 24, 2013 at 6:47 PM, Jacob Barhak <jacob.barhak_at_gmail.com> wrote:
> This time the simulations stopped with 1331 jobs waiting in the queue. I ran
> qstat -j and it provided me with the attached output. Basically, it seems that
> the system does not handle the dependencies I have in the simulation for some
> reason, or it just overloads. It seems to stop all queues at some point. At
> least this is what I understand from the output.

The message "queue instance XX dropped because it is temporarily not
available" is usually a problem with the execution host. So if XX =
"all.q_at_node013", then ssh into node013, and then check the status of
the machine (free disk space, sge_execd process running, etc), and
also read the execd log file at:

/opt/sge6/default/spool/exec_spool_local/<host name of the node you
want to debug>/messages

And also read the cluster master's log file:

/opt/sge6/default/spool/qmaster/messages
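
For example (a rough sketch - node013 is just the placeholder hostname
from above, and mycluster is whatever you named your cluster), you could
check a suspect node and its logs like this:

# log into the suspect node from your workstation
starcluster sshnode mycluster node013

# on the node: check free disk space and that sge_execd is running
df -h /
pgrep -l sge_execd

# still on the node: read the last entries of its execd log
tail -50 /opt/sge6/default/spool/exec_spool_local/node013/messages

# back on the master: read the last entries of the qmaster log
tail -50 /opt/sge6/default/spool/qmaster/messages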

Finally, check the SGE status of the node with "qstat -f"; the "states"
column will tell you what is going on:

root@master:/opt/sge6/default/spool/exec_spool_local/master# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@master                   BIP   0/0/8          0.08     linux-x64
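
For reference (general SGE behavior, not something read off the output
above): in the "states" column, "au" usually means the execd on that host
is unreachable (alarm/unknown), "E" is an error state, and "d" means the
queue instance was disabled. Once the underlying problem is fixed (e.g.
disk space freed), a lingering error state can be cleared - a minimal
sketch, with all.q@node013 again as the placeholder queue instance:

qmod -cq all.q@node013    # clear the error state of one queue instance
qmod -cq all.q            # or clear it for every instance of the queue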



> This may explain why this happens when around 1300 jobs are left. My simulation
> is basically composed of 3 parts; the last 2 parts consist of 1248 + 1 jobs
> that depend on the first 1248 jobs. My assumption - based on the output - is
> that dependencies get lost or perhaps overload the system - yet I may be
> wrong, and this is why I am asking the experts.

Did you specify the job dependencies when you submitted the jobs?

See the "-hold_jid" parameter:
http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html
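
For illustration (just a sketch - the job and script names below are made
up, not taken from your setup), the usual pattern is to give the first
batch of jobs a common name with -N and have the later jobs wait on that
name with -hold_jid:

# part 1: submit each of the 1248 independent jobs under the same name
qsub -N part1 run_part1.sh

# parts 2 and 3: held until every job named "part1" has finished
qsub -N part2 -hold_jid part1 run_part2.sh
qsub -N part3 -hold_jid part1 run_part3.sh

-hold_jid also accepts a comma-separated list of job IDs if you keep
track of them at submission time.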




> Note that this does not happen with smaller simulations such as my test
> suite. This also does not happen on an SGE cluster I installed on a single
> machine. So is this problem an SGE issue, or should I report it to this
> StarCluster group?

I have a feeling that smaller simulations generate less output data,
and that's why the disk-full issue is not affecting you there! :-D


> Also - and this is a different issue - I tried a larger simulation of about
> 25K jobs. It seems that the system overloads at some point and gives me the
> following error:
> Unable to run job: rule "default rule (spool dir)" in spooling context
> "flatfile
> spooling" failed writing an object
> job 11618 was rejected cause it can't be written: No space left on device.
> Exiting.
>
> In the above simulations the cluster was created using:
> starcluster start -s 20 -i c1.xlarge mycluster

I think that is a very important hint!

The c1.xlarge has 4 disks, but the OS is still sitting on the smaller
disk partition. For the EBS AMIs, you should have this mapping:

# mount
...
/dev/xvda1 on / type ext4 (rw)
/dev/xvdb1 on /mnt type ext3 (rw,_netdev)

On my c1.xlarge instance running ami-765b3e1f, the OS partition (/)
has 6.6GB free, which is enough for the basic stuff, but if your jobs
write output data there, 6.6GB these days is just too small!

On the other hand, /mnt has 401GB free, so that's where you should be
storing your data.

In case you didn't know, run "df -m /" and "df -m /mnt" to find the
amount of free disk space on your master node.
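
A minimal sketch of one way to keep job output on /mnt instead of /
(the /mnt/simout directory name is made up; note also that /mnt is
ephemeral instance storage local to each node and, unless you have
configured it otherwise, is not NFS-shared, so copy anything you want
to keep off it before terminating the cluster):

# create an output directory on the big partition
mkdir -p /mnt/simout

# point the jobs' stdout/stderr (and their data files) at /mnt
# instead of the small root partition
qsub -N part1 -o /mnt/simout -e /mnt/simout run_part1.sh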


> I assume that a master with 7GB memory and 4x420GB disk space is
> sufficient to support the operations I request. Outside the cloud I have a
> single physical machine with 16GB memory and less than 0.5TB disk space
> allocated that handles this entire simulation on its own without going to
> the cloud - yet it takes much more time, and therefore the ability to run on
> the cloud is important.
>
> If anyone can help me fix those issues I would be grateful, since I wish to
> release a new version of my code that can run simulations on the cloud at
> larger scales. So far I can run simulations only on small-scale clusters on
> the cloud due to the above issues.

Let me know how it goes, and maybe I can IM you and/or fix your
cluster remotely for free if you are working for a non-profit
organization. (Sorry, our clients would complain if we just gave away
free support to a select few users...)

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/

>
> Again, Rayson, thanks for your guidance, and hopefully there is a simple
> explanation/fix for this.
>
>
> Jacob
>
>
>
>
> On Wed, Jul 24, 2013 at 7:29 AM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
>>
>> "qconf -msconf" will try to open the editor, which is vi. And going
>> through starcluster sshmaster to launch vi may not be the best
>> approach.
>>
>> This is what I just tested and worked for me:
>>
>> 1) ssh into the master by running:
>> starcluster sshmaster myscluster
>>
>> 2) launch qconf -msconf and change schedd_job_info to true:
>> qconf -msconf
>>
>> Rayson
>>
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>>
>>
>> On Wed, Jul 24, 2013 at 6:05 AM, Jacob Barhak <jacob.barhak_at_gmail.com>
>> wrote:
>> > Thanks again, Rayson,
>> >
>> > Yet even with your generous help I am still stuck. Perhaps you can
>> > look at what I am doing and correct me.
>> >
>> > I tried running the commands you suggested to reconfigure the scheduler
>> > and it seems the system hangs on me.
>> >
>> > Here is what I did:
>> >
>> > 1. Created a 1 node cluster.
>> > 2. Copied the configuration file to my machine using:
>> > starcluster get myscluster /opt/sge6/default/common/sched_configuration .
>> > 3. Modified line 12 in sched_configuration to read
>> > schedd_job_info TRUE
>> > 4. Copied the modified sched_configuration file to the default local
>> > directory on the cluster using:
>> > starcluster put mycluster sched_configuration .
>> > 5. Ran the configuration command you suggested:
>> > starcluster sshmaster mycluster "qconf -msconf sched_configuration"
>> >
>> > The system hangs at the last step and does not return unless I press
>> > Ctrl+Break - even Ctrl+C does not work. I waited a few minutes and then
>> > terminated the cluster. I double-checked this behavior using the -u root
>> > argument when running the commands, to ensure root privileges.
>> >
>> > I am using a Windows 7 machine to issue those commands. I use the
>> > PythonXY distribution and installed starcluster using easy_install. I am
>> > providing this information to see if there is anything wrong with my
>> > system compatibility-wise.
>> >
>> > I am attaching the configuration file and the transcript.
>> >
>> > I am getting funny characters and it seems the qconf command I issued
>> > does not recognize the change in schedd_job_info.
>> >
>> > Is there any other way I can find out what the problem is? If you
>> > recall, I am trying to figure out why jobs in the queue are not
>> > dispatched to run and keep waiting forever. This happens after I have
>> > submitted a few hundred jobs.
>> >
>> > I hope you can find time to look at this once more, or perhaps someone
>> > else can help.
>> >
>> >
>> > Jacob
>> >
>> >
>> >
>> >
>> > On Tue, Jul 23, 2013 at 6:44 PM, Rayson Ho <raysonlogin_at_gmail.com>
>> > wrote:
>> >>
>> >> On Tue, Jul 23, 2013 at 2:08 AM, Jacob Barhak <jacob.barhak_at_gmail.com>
>> >> wrote:
>> >> > The second issue, however, is reproducible. I just tried a 20-node
>> >> > cluster again.
>> >>
>> >> Yes, it will always be reproducible - you can only start 20 on-demand
>> >> instances in a region. Again, you need to fill out the "Request to
>> >> Increase Amazon EC2 Instance Limit" form if you want more than 20:
>> >>
>> >> https://aws.amazon.com/contact-us/ec2-request/
>> >>
>> >>
>> >> > This time I posted 2497 jobs to the queue - each about 1 minute long.
>> >> > The system stopped sending jobs from the queue at about the halfway
>> >> > point. There were 1310 jobs in the queue when the system stopped
>> >> > sending more jobs.
>> >> > When running "qstat -j" the system provided the following answer:
>> >> >
>> >> > scheduling info:    (Collecting of scheduler job information is turned off)
>> >>
>> >> That's my fault -- I forgot to point out that the scheduler info is
>> >> off by default, and you need to run "qconf -msconf", and change the
>> >> parameter "schedd_job_info" to true. See:
>> >>
>> >> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
>> >>
>> >> Rayson
>> >>
>> >> ==================================================
>> >> Open Grid Scheduler - The Official Open Source Grid Engine
>> >> http://gridscheduler.sourceforge.net/
>> >>
>> >>
>> >>
>> >> >
>> >> > I am not familiar with the error messages, yet it seems I need to
>> >> > enable something that is turned off. If there is a quick, obvious
>> >> > solution for this, please let me know what to do; otherwise, are there
>> >> > any other diagnostic tools I can use?
>> >> >
>> >> > Again, thanks for the quick reply and I hope this is an easy fix.
>> >> >
>> >> > Jacob
>> >> >
>> >> >
>> >> > On Mon, Jul 22, 2013 at 2:51 PM, Rayson Ho <raysonlogin_at_gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> On Sun, Jul 21, 2013 at 1:40 AM, Jacob Barhak
>> >> >> <jacob.barhak_at_gmail.com>
>> >> >> wrote:
>> >> >> > 1. Sometimes starcluster is unable to properly connect the
>> >> >> > instances on the start command and cannot mount /home. It happened
>> >> >> > once when I asked for 5 m1.small machines, and when I terminated
>> >> >> > this cluster and started again things went fine. Is this
>> >> >> > intermittent due to cloud traffic, or is this a bug? Is there a way
>> >> >> > for me to check why?
>> >> >>
>> >> >> It can be a problem with the actual hardware - next time you
>> >> >> encounter this issue, can you ssh into the node, mount /home by hand,
>> >> >> and see if you can reproduce it interactively?
>> >> >>
>> >> >>
>> >> >> > 2. After launching 20 c1.xlarge machines and running about 2500
>> >> >> > jobs, each about 5 minutes long, I encountered a problem after an
>> >> >> > hour or so. It seems that SGE stopped sending jobs from the queue
>> >> >> > to the instances. No error was found and the queue showed about 850
>> >> >> > pending jobs. This did not change for a while and I could not find
>> >> >> > any failure with qstat or qhost. No jobs were running on any nodes,
>> >> >> > and I waited a while for these to start without success. I tried
>> >> >> > the same thing again after a few hours, and it seems that the
>> >> >> > cluster stops sending jobs from the queue after about 1600 jobs
>> >> >> > have been submitted. This does not happen when SGE is installed on
>> >> >> > a single Ubuntu machine I have at home. I am trying to figure out
>> >> >> > what is wrong. Did you impose some limit on the number of jobs? Can
>> >> >> > this be fixed? I really need to submit many jobs - tens of
>> >> >> > thousands of jobs in my full runs. This one was relatively small
>> >> >> > and still did not pass.
>> >> >>
>> >> >> Looks like SGE thinks that the nodes are down or in alarm or error
>> >> >> state? To find out why SGE thinks there are no nodes available, run:
>> >> >> qstat -j
>> >> >>
>> >> >>
>> >> >>
>> >> >> > 3. I tried to start StarCluster with 50 nodes and got an error
>> >> >> > about exceeding a quota of 20. Is it your quota or Amazon's quota?
>> >> >> > Are there any other restrictions I should be aware of at the
>> >> >> > beginning? Also, after the system is unable to start the cluster,
>> >> >> > it thinks it is still running, and a terminate command is needed
>> >> >> > before another start can be issued - even though nothing got
>> >> >> > started.
>> >> >>
>> >> >> It's Amazon's quota. 50 is considered small by AWS standards, and
>> >> >> they can give it to you almost right away... You need to ask AWS to
>> >> >> give you a higher limit:
>> >> >> https://aws.amazon.com/contact-us/ec2-request/
>> >> >>
>> >> >> Note that last year we requested 10,000 nodes and the whole
>> >> >> process took less than 1 day:
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html
>> >> >>
>> >> >> Rayson
>> >> >>
>> >> >> ==================================================
>> >> >> Open Grid Scheduler - The Official Open Source Grid Engine
>> >> >> http://gridscheduler.sourceforge.net/
>> >> >>
>> >> >>
>> >> >> >
>> >> >> > This all happened on us-west-2 with the help of StarCluster 0.93.3
>> >> >> > and the Anaconda AMI - ami-a4d64194
>> >> >> >
>> >> >> > Here is some more information on what I am doing to help you
>> >> >> > answer the above.
>> >> >> >
>> >> >> > I am running Monte Carlo simulations to simulate chronic disease
>> >> >> > progression. I am using MIST to run over the cloud:
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > https://github.com/scipy/scipy2013_talks/blob/master/talks/jacob_barhak/readme.md
>> >> >> >
>> >> >> > The Reference Model is what I am running using MIST:
>> >> >> >
>> >> >> > http://youtu.be/7qxPSgINaD8
>> >> >> >
>> >> >> > I am launching many simulations in parallel, and it takes me days
>> >> >> > on a single 8-core machine. The cloud allows me to cut down this
>> >> >> > time to hours. This is why StarCluster is so useful. In the past I
>> >> >> > did this over other clusters, yet the cloud is still new to me.
>> >> >> >
>> >> >> > I would appreciate any recommendations I can get from you to
>> >> >> > improve the behaviors I am experiencing.
>> >>
>> >> http://commons.wikimedia.org/wiki/User:Raysonho
>> >
>> >
>
>
Received on Wed Jul 24 2013 - 22:11:43 EDT