Re: Star cluster issues when running MIST over the cloud
On Thu, Jul 25, 2013 at 2:54 PM, Jacob Barhak <jacob.barhak_at_gmail.com> wrote:
> 3. I am not using EBS. I am just using the NFS shared disk space on the
> master node. If I understand correctly, it should be sufficient for my needs
> now - I will check the disk space issue and get back to you. Nevertheless,
> further down the road I may need to use EBS, so it's good you mentioned this.
While you didn't explicitly ask for EBS, if the AMI is "EBS-backed",
then your root filesystem is still on an EBS volume.
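A quick way to confirm this from the master node (just a sketch - the
device names depend on the AMI, and the last command assumes the standard
EC2 instance metadata service is reachable):

# check which device backs / (EBS root devices typically show up as xvda/sda)
df -h /
mount | grep ' / '
# list this instance's block device mapping via the metadata service
curl -s http://169.254.169.254/latest/meta-data/block-device-mapping/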
Rayson
==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
>
> I hope the next post will have a solution.
>
> Jacob
>
>
>
> On Wed, Jul 24, 2013 at 9:11 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
>>
>> On Wed, Jul 24, 2013 at 6:47 PM, Jacob Barhak <jacob.barhak_at_gmail.com>
>> wrote:
>> > This time the simulations stopped with 1331 jobs waiting in the queue. I ran
>> > qstat -j and it provided me with the attached output. Basically, it seems
>> > that the system does not handle the dependencies I have in the simulation
>> > for some reason, or it just overloads. It seems to stop all queues at some
>> > point. At least this is what I understand from the output.
>>
>> The message "queue instance XX dropped because it is temporarily not
>> available" is usually a problem with the execution host. So if XX =
>> "all.q_at_node013", then ssh into node013, and then check the status of
>> the machine (free disk space, sge_execd process running, etc), and
>> also read the execd log file at:
>>
>> /opt/sge6/default/spool/exec_spool_local/<host name of the node you
>> want to debug>/messages
>>
>> And also read the cluster master's log file:
>>
>> /opt/sge6/default/spool/qmaster/messages
>>
>> Finally, check the SGE status of the node, and from the "states"
>> column, you can find out what is going on.
>>
>> root_at_master:/opt/sge6/default/spool/exec_spool_local/master# qstat -f
>> queuename                      qtype resv/used/tot. load_avg arch          states
>> ---------------------------------------------------------------------------------
>> all.q_at_master                  BIP   0/0/8          0.08     linux-x64
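>>
>> If a queue instance shows up there with an "E" (error) or "au"
>> (alarm/unreachable) state, then after fixing the underlying problem (e.g.
>> freeing disk space) you can clear the error state. A sketch, assuming the
>> affected node is node013 and your qmod build has the usual -cq flag:
>>
>> qmod -cq all.q_at_node013    # clear the error state of that queue instance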
>>
>>
>>
>> > This may explain why this happens with around 1300 jobs left. My simulation
>> > is basically composed of 3 parts; the last 2 parts consist of 1248 + 1 jobs
>> > that depend on the first 1248 jobs. My assumption - based on the output - is
>> > that the dependencies get lost or perhaps overload the system, yet I may be
>> > wrong, and this is why I am asking the experts.
>>
>> Did you specify the job dependencies when you submitted the jobs?
>>
>> See the "-hold_jid" parameter:
>> http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html
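>>
>> For example (just a sketch - the job names, scripts, and the use of array
>> jobs below are made up, since I don't know how MIST submits its jobs):
>>
>> qsub -N stage1 -t 1-1248 run_stage1.sh
>> qsub -N stage2 -hold_jid stage1 -t 1-1248 run_stage2.sh   # waits for all of stage1
>> qsub -N final  -hold_jid stage2 run_final.sh              # waits for all of stage2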
>>
>>
>>
>>
>> > Note that this does not happen with smaller simulations such as my test
>> > suite. It also does not happen on an SGE cluster I installed on a single
>> > machine. So is this problem an SGE issue, or should I report it to this
>> > StarCluster group?
>>
>> I have a feeling that the smaller simulations generate less output data,
>> and that's why the full-disk issue is not affecting you! :-D
>>
>>
>> > Also - and this is a different issue - I tried a larger simulation of about
>> > 25K jobs. It seems that the system overloads at some point and gives me the
>> > following error:
>> > Unable to run job: rule "default rule (spool dir)" in spooling context
>> > "flatfile
>> > spooling" failed writing an object
>> > job 11618 was rejected cause it can't be written: No space left on
>> > device.
>> > Exiting.
>> >
>> > In the above simulations the cluster was created using:
>> > starcluster start -s 20 -i c1.xlarge mycluster
>>
>> I think it is a very important hint!
>>
>> The c1.xlarge has 4 disks, but the OS is still sitting on the smaller root
>> partition. For the EBS-backed AMIs, you should see a mapping like this:
>>
>> # mount
>> ...
>> /dev/xvda1 on / type ext4 (rw)
>> /dev/xvdb1 on /mnt type ext3 (rw,_netdev)
>>
>> On my c1.xlarge instance running ami-765b3e1f, the OS partition (/)
>> has 6.6GB free, which is enough for the basic stuff, but if you have
>> output data, then 6.7GB these days is just too small!
>>
>> On the other hand, /mnt has 401GB free, so that's where you should be
>> storing your data.
>>
>> In case you didn't know, run "df -m /" and "df -m /mnt" to find the
>> amount of free disk space on your master node.
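>>
>> For example (a sketch - the paths and the script name are illustrative), you
>> can check the free space and then point the jobs' working directory and
>> output files at the big /mnt partition:
>>
>> df -m / /mnt
>> mkdir -p /mnt/simdata
>> qsub -wd /mnt/simdata -o /mnt/simdata -e /mnt/simdata run_job.sh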
>>
>>
>> > I assume that a master with 7GB memory and 4x420GB disk space is sufficient
>> > to support the operations I request. Outside the cloud I have a single
>> > physical machine with 16GB memory and less than 0.5TB of allocated disk
>> > space that handles this entire simulation on its own - yet it takes much
>> > more time, and therefore the ability to run on the cloud is important.
>> >
>> > If anyone can help me fix those issues I would be grateful, since I wish to
>> > release a new version of my code that can run simulations on the cloud at
>> > larger scales. So far I can run simulations only on small-scale clusters on
>> > the cloud due to the above issues.
>>
>> Let me know how it goes, and maybe I can IM you and/or fix your cluster
>> remotely for free if you are working for a non-profit organization. (Sorry,
>> our clients would complain if we just gave away free support to a select few
>> users...)
>>
>> Rayson
>>
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>>
>> >
>> > Again Rayson, thanks for your guidance and hopefully there is a simple
>> > explanation/fix for this.
>> >
>> >
>> > Jacob
>> >
>> >
>> >
>> >
>> > On Wed, Jul 24, 2013 at 7:29 AM, Rayson Ho <raysonlogin_at_gmail.com>
>> > wrote:
>> >>
>> >> "qconf -msconf" will try to open the editor, which is vi. And going
>> >> through starcluster sshmaster to launch vi may not be the best
>> >> approach.
>> >>
>> >> This is what I just tested, and it worked for me:
>> >>
>> >> 1) ssh into the master by running:
>> >> starcluster sshmaster mycluster
>> >>
>> >> 2) launch qconf -msconf and change schedd_job_info to true:
>> >> qconf -msconf
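>> >>
>> >> If you would rather avoid the interactive editor entirely, there is also a
>> >> non-interactive route (a sketch - it assumes your qconf build supports the
>> >> -ssconf/-Msconf pair and that you run it on the master):
>> >>
>> >> qconf -ssconf > /tmp/sched_conf      # dump the current scheduler configuration
>> >> sed -i 's/^schedd_job_info.*/schedd_job_info true/' /tmp/sched_conf
>> >> qconf -Msconf /tmp/sched_conf        # load the modified configuration from the file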
>> >>
>> >> Rayson
>> >>
>> >> ==================================================
>> >> Open Grid Scheduler - The Official Open Source Grid Engine
>> >> http://gridscheduler.sourceforge.net/
>> >>
>> >>
>> >> On Wed, Jul 24, 2013 at 6:05 AM, Jacob Barhak <jacob.barhak_at_gmail.com>
>> >> wrote:
>> >> > Thanks again, Rayson,
>> >> >
>> >> > Yet even with your generous help I am still stuck. Perhaps you can look at
>> >> > what I am doing and correct me.
>> >> >
>> >> > I tried running the commands you suggested to reconfigure the scheduler,
>> >> > and it seems the system hangs on me.
>> >> >
>> >> > Here is what I did:
>> >> >
>> >> > 1. Created a 1 node cluster.
>> >> > 2. Copied the configuration file to my machine using:
>> >> > starcluster get mycluster /opt/sge6/default/common/sched_configuration .
>> >> > 3. Modified line 12 in sched_configuration to read
>> >> > schedd_job_info TRUE
>> >> > 4. Copied the modified sched_configuration file to the default local
>> >> > directory on the cluster using:
>> >> > starcluster put mycluster sched_configuration .
>> >> > 5. Ran the configuration command you suggested:
>> >> > starcluster sshmaster mycluster "qconf -msconf sched_configuration"
>> >> >
>> >> > The system hangs at the last step and does not return unless I press
>> >> > Ctrl+Break - even Ctrl+C does not work. I waited a few minutes and then
>> >> > terminated the cluster. I double-checked this behavior using the -u root
>> >> > argument when running the commands, to ensure root privileges.
>> >> >
>> >> > I am using a Windows 7 machine to issue those commands. I use the PythonXY
>> >> > distribution and installed starcluster using easy_install. I am providing
>> >> > this information to see if there is anything wrong with my system
>> >> > compatibility-wise.
>> >> >
>> >> > I am attaching the configuration file and the transcript.
>> >> >
>> >> > I am getting funny characters and it seems the qconf command I issued
>> >> > does
>> >> > not recognize the change in schedd_job_info.
>> >> >
>> >> > Is there any other way I can find out what the problem is? If you recall,
>> >> > I am trying to figure out why jobs in the queue are not dispatched to run
>> >> > and keep waiting forever. This happens after I send a few hundred jobs.
>> >> >
>> >> > I hope you could find time to look at this once more, or perhaps
>> >> > someone
>> >> > else can help.
>> >> >
>> >> >
>> >> > Jacob
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Tue, Jul 23, 2013 at 6:44 PM, Rayson Ho <raysonlogin_at_gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> On Tue, Jul 23, 2013 at 2:08 AM, Jacob Barhak
>> >> >> <jacob.barhak_at_gmail.com>
>> >> >> wrote:
>> >> >> > The second issue, however, is reproducible. I just tried again a
>> >> >> > 20
>> >> >> > node
>> >> >> > cluster.
>> >> >>
>> >> >> Yes, it will always be reproducible - by default you can only start 20
>> >> >> on-demand instances in a region. Again, you need to fill out the "Request
>> >> >> to Increase Amazon EC2 Instance Limit" form if you want more than 20:
>> >> >>
>> >> >> https://aws.amazon.com/contact-us/ec2-request/
>> >> >>
>> >> >>
>> >> >> > This time I posted 2497 jobs to the queue - each about 1 minute long. The
>> >> >> > system stopped dispatching jobs from the queue at about the halfway point.
>> >> >> > There were 1310 jobs in the queue when the system stopped sending more jobs.
>> >> >> > When running "qstat -j" the system provided the following answer:
>> >> >> >
>> >> >> > scheduling info: (Collecting of scheduler job
>> >> >> > information
>> >> >> > is
>> >> >> > turned off)
>> >> >>
>> >> >> That's my fault -- I forgot to point out that the scheduler info is
>> >> >> off by default, and you need to run "qconf -msconf", and change the
>> >> >> parameter "schedd_job_info" to true. See:
>> >> >>
>> >> >>
>> >> >> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
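>> >> >>
>> >> >> Once that is set, you can verify it and then ask for the scheduling info
>> >> >> of a specific pending job (a sketch; use one of your actual pending job
>> >> >> ids):
>> >> >>
>> >> >> qconf -ssconf | grep schedd_job_info
>> >> >> qstat -j <pending_job_id>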
>> >> >>
>> >> >> Rayson
>> >> >>
>> >> >> ==================================================
>> >> >> Open Grid Scheduler - The Official Open Source Grid Engine
>> >> >> http://gridscheduler.sourceforge.net/
>> >> >>
>> >> >>
>> >> >>
>> >> >> >
>> >> >> > I am not familiar with the error messages, yet it seems I need to enable
>> >> >> > something that is turned off. If there is a quick, obvious solution for
>> >> >> > this, please let me know what to do; otherwise, are there any other
>> >> >> > diagnostic tools I can use?
>> >> >> >
>> >> >> > Again, thanks for the quick reply and I hope this is an easy fix.
>> >> >> >
>> >> >> > Jacob
>> >> >> >
>> >> >> >
>> >> >> > On Mon, Jul 22, 2013 at 2:51 PM, Rayson Ho <raysonlogin_at_gmail.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> On Sun, Jul 21, 2013 at 1:40 AM, Jacob Barhak
>> >> >> >> <jacob.barhak_at_gmail.com>
>> >> >> >> wrote:
>> >> >> >> > 1. Sometimes starcluster is unable to properly connect the instances on
>> >> >> >> > the start command and cannot mount /home. It happened once when I asked
>> >> >> >> > for 5 m1.small machines, and when I terminated this cluster and started
>> >> >> >> > again things went fine. Is this intermittent due to cloud traffic or is
>> >> >> >> > this a bug? Is there a way for me to check why?
>> >> >> >>
>> >> >> >> It can be a problem with the actual hardware - can you ssh into the node
>> >> >> >> and mount /home by hand the next time you encounter this issue, and see
>> >> >> >> if you can reproduce it when run interactively?
>> >> >> >>
>> >> >> >>
>> >> >> >> > 2. After launching 20 c1.xlarge machines and running about 2500 jobs,
>> >> >> >> > each about 5 minutes long, I encountered a problem after an hour or so.
>> >> >> >> > It seems that SGE stopped sending jobs from the queue to the instances.
>> >> >> >> > No error was found and the queue showed about 850 pending jobs. This did
>> >> >> >> > not change for a while and I could not find any failure with qstat or
>> >> >> >> > qhost. No jobs were running on any nodes and I waited a while for these
>> >> >> >> > to start, without success. I tried the same thing again after a few
>> >> >> >> > hours, and it seems that the cluster stops sending jobs from the queue
>> >> >> >> > after about 1600 jobs have been submitted. This does not happen when SGE
>> >> >> >> > is installed on a single Ubuntu machine I have at home. I am trying to
>> >> >> >> > figure out what is wrong. Did you impose some limit on the number of
>> >> >> >> > jobs? Can this be fixed? I really need to submit many jobs - tens of
>> >> >> >> > thousands of jobs in my full runs. This one was relatively small and
>> >> >> >> > still did not pass.
>> >> >> >>
>> >> >> >> Looks like SGE thinks that the nodes are down or in alarm or
>> >> >> >> error
>> >> >> >> state? To find out why SGE thinks there are no nodes available,
>> >> >> >> run:
>> >> >> >> qstat -j
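>> >> >> >>
>> >> >> >> and, to see whether any queue instances are in an alarm or error state
>> >> >> >> (and why), something like this should help (a sketch):
>> >> >> >>
>> >> >> >> qhost -q                 # per-host queue states
>> >> >> >> qstat -f -explain a      # show the reason for queues in alarm state
>> >> >> >> qstat -f -explain E      # show the reason for queues in error state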
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> > 3. I tried to start StarCluster with 50 nodes and got an error about
>> >> >> >> > exceeding a quota of 20. Is it your quota or Amazon's quota? Are there
>> >> >> >> > any other restrictions I should be aware of at the beginning? Also,
>> >> >> >> > after the system is unable to start the cluster, it thinks it is still
>> >> >> >> > running, and a terminate command is needed before another start can be
>> >> >> >> > issued - even though nothing got started.
>> >> >> >>
>> >> >> >> It's Amazon's quota. 50 is considered small by AWS standards, and they
>> >> >> >> can give it to you almost right away... You need to request a higher
>> >> >> >> limit from AWS:
>> >> >> >> https://aws.amazon.com/contact-us/ec2-request/
>> >> >> >>
>> >> >> >> Note that last year we requested 10,000 nodes and the whole process took
>> >> >> >> less than 1 day:
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html
>> >> >> >>
>> >> >> >> Rayson
>> >> >> >>
>> >> >> >> ==================================================
>> >> >> >> Open Grid Scheduler - The Official Open Source Grid Engine
>> >> >> >> http://gridscheduler.sourceforge.net/
>> >> >> >>
>> >> >> >>
>> >> >> >> >
>> >> >> >> > This all happened on us-west-2 with the help of StarCluster 0.93.3 and
>> >> >> >> > the Anaconda AMI - ami-a4d64194
>> >> >> >> >
>> >> >> >> > Here is some more information on what I am doing to help you answer the
>> >> >> >> > above.
>> >> >> >> >
>> >> >> >> > I am running Monte Carlo simulations to simulate chronic
>> >> >> >> > disease
>> >> >> >> > progression. I am using MIST to run over the cloud:
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > https://github.com/scipy/scipy2013_talks/blob/master/talks/jacob_barhak/readme.md
>> >> >> >> >
>> >> >> >> > The Reference Model is what I am running using MIST:
>> >> >> >> >
>> >> >> >> > http://youtu.be/7qxPSgINaD8
>> >> >> >> >
>> >> >> >> > I am launching many simulations in parallel and it takes me days on a
>> >> >> >> > single 8-core machine. The cloud allows me to cut this time down to
>> >> >> >> > hours. This is why StarCluster is so useful. In the past I did this over
>> >> >> >> > other clusters, yet the cloud is still new to me.
>> >> >> >> >
>> >> >> >> > I will appreciate any recommendations you can give me to address the
>> >> >> >> > behaviors I am experiencing.
>> >> >>
>> >> >> http://commons.wikimedia.org/wiki/User:Raysonho
>> >> >
>> >> >
>> >
>> >
>
>
Received on Thu Jul 25 2013 - 15:49:48 EDT