Re: Star cluster issues when running MIST over the cloud
You may want to export your /mnt, which has over 400GB. And if you
RAID the remaining 3 drives, you can get the full 1700GB of local
storage. EBS is not free, while the ephemeral (local) storage is;
the only catch is that the ephemeral storage is gone when you
terminate the instance.
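
Something like this should work on the master (a rough sketch; the
ephemeral device names are assumptions -- check "cat /proc/partitions"
first, since they vary by AMI and instance type, and substitute your
cluster's actual subnet in the export line):

    # Stripe the three unused ephemeral disks into one ~1.2TB volume
    mdadm --create /dev/md0 --level=0 --raid-devices=3 \
          /dev/xvdc /dev/xvdd /dev/xvde
    mkfs.ext3 /dev/md0
    mkdir -p /mnt/scratch
    mount /dev/md0 /mnt/scratch

    # Export it over NFS to the compute nodes
    echo "/mnt/scratch 10.0.0.0/8(rw,no_subtree_check,async)" >> /etc/exports
    exportfs -ra

    # On each node
    mkdir -p /mnt/scratch
    mount -t nfs master:/mnt/scratch /mnt/scratch

Then point the simulation output at /mnt/scratch instead of /home.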
Rayson
==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
On Thu, Jul 25, 2013 at 5:10 PM, Jacob Barhak <jacob.barhak_at_gmail.com> wrote:
> Hi Rayson,
>
> Yes. It seems you narrowed down the problem to disk space. This may well
> explain all the weird behavior.
>
> When I run the following command:
> df -h /home
> I am getting the following output:
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/xvda1      9.9G  5.0G  4.4G  54% /
>
> This means that even with a c1.xlarge instance my shared NFS space in /home
> does not grow to the 420GB promised. I use /home for storing and running my
> system.
>
> I really do not need EBS since I do not need persistent storage. Only in the
> more distant future might I need more than 400GB of space, and I can avoid
> even that if I really want to.
>
> Is there any quick and painless way for me to fix this in the cluster
> configuration file, or after the cluster is up and running? I am looking for
> something simple, such as a line or two of code or a line in the
> configuration file. The idea would be to incorporate this line or two into
> the documentation.
>
> I hope there is a quick solution that closes this issue.
>
> Jacob
>
>
> On Thu, Jul 25, 2013 at 2:49 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
>>
>> On Thu, Jul 25, 2013 at 2:54 PM, Jacob Barhak <jacob.barhak_at_gmail.com>
>> wrote:
>> > 3. I am not using EBS. I am just using the NFS shared disk space on the
>> > master node. If I understand correctly, it should be sufficient for my
>> > needs now - I will check the disk space issue and get back to you.
>> > Nevertheless, in the more distant future I may need to use EBS, so it's
>> > good you mentioned this.
>>
>> While you didn't explicitly ask for EBS, if the AMI is "EBS-backed",
>> then your root filesystem is still on an EBS volume.
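>>
>> If you want to double-check from inside the instance which devices are
>> EBS and which are the ephemeral (local) disks, the EC2 metadata service
>> lists the block device mapping (a rough sketch; the keys returned vary
>> by instance type):
>>
>>     curl -s http://169.254.169.254/latest/meta-data/block-device-mapping/
>>     curl -s http://169.254.169.254/latest/meta-data/block-device-mapping/ephemeral0
>>     df -h /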
>>
>> Rayson
>>
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>>
>>
>> >
>> > I hope the next post will have a solution.
>> >
>> > Jacob
>> >
>> >
>> >
>> > On Wed, Jul 24, 2013 at 9:11 PM, Rayson Ho <raysonlogin_at_gmail.com>
>> > wrote:
>> >>
>> >> On Wed, Jul 24, 2013 at 6:47 PM, Jacob Barhak <jacob.barhak_at_gmail.com>
>> >> wrote:
>> >> > This time simulations stopped with 1331 jobs waiting in the queue. I ran
>> >> > qstat -j and it provided me with the attached output. Basically, it seems
>> >> > that the system does not handle the dependencies I have in the simulation
>> >> > for some reason, or it just overloads. It seems to stop all queues at some
>> >> > point. At least this is what I understand from the output.
>> >>
>> >> The message "queue instance XX dropped because it is temporarily not
>> >> available" is usually a problem with the execution host. So if XX =
>> >> "all.q_at_node013", then ssh into node013, and then check the status of
>> >> the machine (free disk space, sge_execd process running, etc), and
>> >> also read the execd log file at:
>> >>
>> >> /opt/sge6/default/spool/exec_spool_local/<host name of the node you
>> >> want to debug>/messages
>> >>
>> >> And also read the cluster master's log file:
>> >>
>> >> /opt/sge6/default/spool/qmaster/messages
>> >>
>> >> Finally, check the SGE status of the node, and from the "states"
>> >> column, you can find out what is going on.
>> >>
>> >> root_at_master:/opt/sge6/default/spool/exec_spool_local/master# qstat -f
>> >> queuename          qtype resv/used/tot. load_avg arch       states
>> >> ---------------------------------------------------------------------------------
>> >> all.q_at_master     BIP   0/0/8          0.08     linux-x64
>> >>
>> >>
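>> >> A quick way to run those checks in one go (a rough sketch; substitute
>> >> the actual node name, and note the spool paths are the same ones
>> >> mentioned above):
>> >>
>> >>     ssh node013
>> >>     df -h /                # is the root partition full?
>> >>     pgrep -l sge_execd     # is the execution daemon still running?
>> >>     tail -50 /opt/sge6/default/spool/exec_spool_local/node013/messages
>> >>     exit
>> >>     tail -50 /opt/sge6/default/spool/qmaster/messages
>> >>     qstat -f               # any queues in an alarm/error state?
>> >>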
>> >>
>> >> > This may explain why this happens with around 1300 jobs left. My
>> >> > simulation is basically composed of 3 parts; the last 2 parts consist of
>> >> > 1248 + 1 jobs that depend on the first 1248 jobs. My assumption - based on
>> >> > the output - is that dependencies get lost or perhaps overload the system,
>> >> > yet I may be wrong and this is why I am asking the experts.
>> >>
>> >> Did you specify the job dependencies when you submitted the jobs?
>> >>
>> >> See the "-hold_jid" parameter:
>> >> http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html
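>> >>
>> >> For example (just a sketch with made-up job names and scripts), if the
>> >> first 1248 jobs share a common name, the dependent jobs can wait on all
>> >> of them by name:
>> >>
>> >>     qsub -N stage1 run_stage1.sh             # submitted 1248 times
>> >>     qsub -N stage2 -hold_jid stage1 run_stage2.sh
>> >>     qsub -N final  -hold_jid stage2 run_final.sh
>> >>
>> >> -hold_jid also accepts a comma-separated list of job IDs if you would
>> >> rather capture the IDs that qsub returns.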
>> >>
>> >>
>> >>
>> >>
>> >> > Note that this does not happen with smaller simulations such as my
>> >> > test
>> >> > suite. This also does not happen on an SGE cluster I installed on a
>> >> > single
>> >> > machine. So is this problem an SGE issue or should I report it to
>> >> > this
>> >> > group
>> >> > in Star Cluster?
>> >>
>> >> I have a feeling that smaller simulations generate less output data,
>> >> and that's why the disk-full issue is not affecting you! :-D
>> >>
>> >>
>> >> > Also and this is a different issue, I tried a larger simulation for
>> >> > about
>> >> > 25K jobs. It seems that the system overloads at some point and gives
>> >> > me
>> >> > the
>> >> > following error:
>> >> > Unable to run job: rule "default rule (spool dir)" in spooling
>> >> > context
>> >> > "flatfile
>> >> > spooling" failed writing an object
>> >> > job 11618 was rejected cause it can't be written: No space left on
>> >> > device.
>> >> > Exiting.
>> >> >
>> >> > In the above simulations the cluster was created using:
>> >> > starcluster start -s 20 -i c1.xlarge mycluster
>> >>
>> >> I think it is a very important hint!
>> >>
>> >> The c1.xlarge has 4 disks, but the OS is still sitting on the smaller
>> >> disk partition. For the EBS-backed AMIs, you should have this mapping:
>> >>
>> >> # mount
>> >> ...
>> >> /dev/xvda1 on / type ext4 (rw)
>> >> /dev/xvdb1 on /mnt type ext3 (rw,_netdev)
>> >>
>> >> On my c1.xlarge instance running ami-765b3e1f, the OS partition (/)
>> >> has 6.6GB free, which is enough for the basic stuff, but if you have
>> >> output data, that is just too small these days!
>> >>
>> >> On the other hand, /mnt has 401GB free, so that's where you should be
>> >> storing your data.
>> >>
>> >> In case you didn't know, run "df -m /" and "df -m /mnt" to find the
>> >> amount of free disk space on your master node.
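>> >>
>> >> One quick way to take advantage of /mnt without changing your code --
>> >> just a sketch, and the directory name and user are assumptions -- is to
>> >> keep the working data on /mnt and symlink it into place under /home:
>> >>
>> >>     mkdir -p /mnt/simdata
>> >>     chown sgeadmin:sgeadmin /mnt/simdata    # or whichever user runs the jobs
>> >>     ln -s /mnt/simdata /home/simdata
>> >>
>> >> The "No space left on device" spooling error above suggests the SGE
>> >> spool directory on / filled up as well, so freeing space on / should
>> >> also make that error go away.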
>> >>
>> >>
>> >> > I assume that a master with 7GB of memory and 4x420GB of disk space is
>> >> > sufficient to support the operations I request. Outside the cloud I have a
>> >> > single physical machine with 16GB of memory and less than 0.5TB of disk
>> >> > space allocated that handles this entire simulation on its own - yet it
>> >> > takes much more time, and therefore the ability to run on the cloud is
>> >> > important.
>> >> >
>> >> > If anyone can help me fix these issues I would be grateful, since I wish
>> >> > to release a new version of my code that can run simulations on the cloud
>> >> > at bigger scales. So far I can run simulations only on small-scale
>> >> > clusters on the cloud due to the above issues.
>> >>
>> >> Let me know how it goes, and maybe I can IM you and/or fix your
>> >> cluster remotely for free if you are working for a non-profit
>> >> organization. (Sorry, our clients would complain if we just gave away
>> >> free support to a select few users...)
>> >>
>> >> Rayson
>> >>
>> >> ==================================================
>> >> Open Grid Scheduler - The Official Open Source Grid Engine
>> >> http://gridscheduler.sourceforge.net/
>> >>
>> >> >
>> >> > Again Rayson, thanks for your guidance and hopefully there is a
>> >> > simple
>> >> > explanation/fix for this.
>> >> >
>> >> >
>> >> > Jacob
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Jul 24, 2013 at 7:29 AM, Rayson Ho <raysonlogin_at_gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> "qconf -msconf" will try to open the editor, which is vi. And going
>> >> >> through starcluster sshmaster to launch vi may not be the best
>> >> >> approach.
>> >> >>
>> >> >> This is what I just tested, and it worked for me:
>> >> >>
>> >> >> 1) ssh into the master by running:
>> >> >> starcluster sshmaster myscluster
>> >> >>
>> >> >> 2) launch qconf -msconf and change schedd_job_info to true:
>> >> >> qconf -msconf
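>> >> >>
>> >> >> If you would rather avoid the interactive editor entirely, a
>> >> >> non-interactive variant (a sketch; the temp file name is arbitrary) is
>> >> >> to dump the scheduler configuration, patch it, and load it back with
>> >> >> "qconf -Msconf":
>> >> >>
>> >> >>     qconf -ssconf > /tmp/sched.conf
>> >> >>     sed -i 's/^schedd_job_info.*/schedd_job_info true/' /tmp/sched.conf
>> >> >>     qconf -Msconf /tmp/sched.conf
>> >> >>
>> >> >> That also works through 'starcluster sshmaster mycluster "..."' since
>> >> >> nothing needs a terminal.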
>> >> >>
>> >> >> Rayson
>> >> >>
>> >> >> ==================================================
>> >> >> Open Grid Scheduler - The Official Open Source Grid Engine
>> >> >> http://gridscheduler.sourceforge.net/
>> >> >>
>> >> >>
>> >> >> On Wed, Jul 24, 2013 at 6:05 AM, Jacob Barhak
>> >> >> <jacob.barhak_at_gmail.com>
>> >> >> wrote:
>> >> >> > Thanks again, Rayson,
>> >> >> >
>> >> >> > Yet even with your generous help I am still stuck. Perhaps you can
>> >> >> > look
>> >> >> > at
>> >> >> > what I am doing and correct me.
>> >> >> >
>> >> >> > I tried running the commands you suggested to reconfigure the
>> >> >> > scheduler
>> >> >> > and
>> >> >> > it seems the system hangs on me.
>> >> >> >
>> >> >> > Here is what I did:
>> >> >> >
>> >> >> > 1. Created a 1 node cluster.
>> >> >> > 2. Copied the configuration file to my machine using:
>> >> >> > starcluster get myscluster
>> >> >> > /opt/sge6/default/common/sched_configuration
>> >> >> > .
>> >> >> > 3. Modified line 12 in sched_configuration to read
>> >> >> > schedd_job_info TRUE
>> >> >> > 4. Copied the modified sched_configuration file to the default
>> >> >> > local
>> >> >> > directory on the cluster using:
>> >> >> > starcluster put mycluster sched_configuration .
>> >> >> > 5. Ran the configuration command you suggested:
>> >> >> > starcluster sshmaster mycluster "qconf -msconf
>> >> >> > sched_configuration"
>> >> >> >
>> >> >> > The system hangs at the last stage and does not return unless I press
>> >> >> > Ctrl+Break - even Ctrl+C does not work. I waited a few minutes and then
>> >> >> > terminated the cluster. I double-checked this behavior using the "-u root"
>> >> >> > argument when running the commands to ensure root privileges.
>> >> >> >
>> >> >> > I am using a Windows 7 machine to issue those commands. I use the
>> >> >> > PythonXY distribution and installed starcluster using easy_install. I am
>> >> >> > providing this information in case there is anything wrong with my system
>> >> >> > compatibility-wise.
>> >> >> >
>> >> >> > I am attaching the configuration file and the transcript.
>> >> >> >
>> >> >> > I am getting funny characters and it seems the qconf command I
>> >> >> > issued
>> >> >> > does
>> >> >> > not recognize the change in schedd_job_info.
>> >> >> >
>> >> >> > Is there any other way I can find out what the problem is? If you recall,
>> >> >> > I am trying to figure out why jobs in the queue are not dispatched to run
>> >> >> > and keep waiting forever. This happens after I have sent a few hundred
>> >> >> > jobs.
>> >> >> >
>> >> >> > I hope you could find time to look at this once more, or perhaps
>> >> >> > someone
>> >> >> > else can help.
>> >> >> >
>> >> >> >
>> >> >> > Jacob
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Jul 23, 2013 at 6:44 PM, Rayson Ho <raysonlogin_at_gmail.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> On Tue, Jul 23, 2013 at 2:08 AM, Jacob Barhak
>> >> >> >> <jacob.barhak_at_gmail.com>
>> >> >> >> wrote:
>> >> >> >> > The second issue, however, is reproducible. I just tried again with a
>> >> >> >> > 20-node cluster.
>> >> >> >>
>> >> >> >> Yes, it will always be reproducible - by default you can only start 20
>> >> >> >> on-demand instances in a region. Again, you need to fill out the "Request
>> >> >> >> to Increase Amazon EC2 Instance Limit" form if you want more than 20:
>> >> >> >>
>> >> >> >> https://aws.amazon.com/contact-us/ec2-request/
>> >> >> >>
>> >> >> >>
>> >> >> >> > This time I posted 2497 jobs to the queue - each about 1 minute long.
>> >> >> >> > The system stopped dispatching jobs from the queue at about the halfway
>> >> >> >> > point. There were 1310 jobs in the queue when the system stopped sending
>> >> >> >> > more jobs.
>> >> >> >> > When running "qstat -j" the system provided the following
>> >> >> >> > answer:
>> >> >> >> >
>> >> >> >> > scheduling info: (Collecting of scheduler job information is turned off)
>> >> >> >>
>> >> >> >> That's my fault -- I forgot to point out that the scheduler info is
>> >> >> >> off by default; you need to run "qconf -msconf" and change the
>> >> >> >> parameter "schedd_job_info" to true. See:
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
>> >> >> >>
>> >> >> >> Rayson
>> >> >> >>
>> >> >> >> ==================================================
>> >> >> >> Open Grid Scheduler - The Official Open Source Grid Engine
>> >> >> >> http://gridscheduler.sourceforge.net/
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> >
>> >> >> >> > I am not familiar with the error messages, yet it seems I need to
>> >> >> >> > enable something that is turned off. If there is a quick, obvious
>> >> >> >> > solution for this, please let me know what to do; otherwise, are there
>> >> >> >> > any other diagnostic tools I can use?
>> >> >> >> >
>> >> >> >> > Again, thanks for the quick reply and I hope this is an easy
>> >> >> >> > fix.
>> >> >> >> >
>> >> >> >> > Jacob
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > On Mon, Jul 22, 2013 at 2:51 PM, Rayson Ho
>> >> >> >> > <raysonlogin_at_gmail.com>
>> >> >> >> > wrote:
>> >> >> >> >>
>> >> >> >> >> On Sun, Jul 21, 2013 at 1:40 AM, Jacob Barhak
>> >> >> >> >> <jacob.barhak_at_gmail.com>
>> >> >> >> >> wrote:
>> >> >> >> >> > 1. Sometimes starcluster is unable to properly connect the
>> >> >> >> >> > instances
>> >> >> >> >> > on
>> >> >> >> >> > the
>> >> >> >> >> > start command and cannot mount /home. It happened once when
>> >> >> >> >> > I
>> >> >> >> >> > asked
>> >> >> >> >> > for
>> >> >> >> >> > 5
>> >> >> >> >> > m1.small machines and when I terminated this cluster and
>> >> >> >> >> > started
>> >> >> >> >> > again
>> >> >> >> >> > things went fine. Is this intermittent due to cloud traffic
>> >> >> >> >> > or
>> >> >> >> >> > is
>> >> >> >> >> > this a
>> >> >> >> >> > bug? Is there a way for me to check why?
>> >> >> >> >>
>> >> >> >> >> It can be a problem with the actual hardware - the next time you
>> >> >> >> >> encounter this issue, can you ssh into the node, mount /home by hand,
>> >> >> >> >> and see if you can reproduce it interactively?
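>> >> >> >> >>
>> >> >> >> >> Something along these lines (a sketch; "node001" is just an example
>> >> >> >> >> node name) would tell us whether it is the NFS export or the node
>> >> >> >> >> itself:
>> >> >> >> >>
>> >> >> >> >>     starcluster sshnode mycluster node001
>> >> >> >> >>     showmount -e master              # is /home actually exported?
>> >> >> >> >>     mount -t nfs master:/home /home  # try the mount by hand
>> >> >> >> >>     dmesg | tail                     # any NFS/RPC errors?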
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> > 2. After launching 20 c1.xlarge machines and running about
>> >> >> >> >> > 2500
>> >> >> >> >> > jobs,
>> >> >> >> >> > each
>> >> >> >> >> > about 5 minutes long, I encountered a problem after an hour
>> >> >> >> >> > or
>> >> >> >> >> > so.
>> >> >> >> >> > It
>> >> >> >> >> > seems
>> >> >> >> >> > that SGE stopped sending jobs from the queue to the
>> >> >> >> >> > instances.
>> >> >> >> >> > No
>> >> >> >> >> > error
>> >> >> >> >> > was found and the queue showed about 850 pending jobs. This
>> >> >> >> >> > did
>> >> >> >> >> > not
>> >> >> >> >> > change
>> >> >> >> >> > for a while and I could not find any failure with qstat or
>> >> >> >> >> > qhost.
>> >> >> >> >> > No
>> >> >> >> >> > jobs
>> >> >> >> >> > were running on any nodes and I waited a while for these to
>> >> >> >> >> > start
>> >> >> >> >> > without
>> >> >> >> >> > success. I tried the same thing again after a few hours and
>> >> >> >> >> > it
>> >> >> >> >> > seems
>> >> >> >> >> > that
>> >> >> >> >> > the cluster stops sending jobs from the queue after about
>> >> >> >> >> > 1600
>> >> >> >> >> > jobs
>> >> >> >> >> > have
>> >> >> >> >> > been submitted. This does not happen when SGE is installed
>> >> >> >> >> > on a
>> >> >> >> >> > single
>> >> >> >> >> > Ubuntu machine I have at home. I am trying to figure out
>> >> >> >> >> > what
>> >> >> >> >> > is
>> >> >> >> >> > wrong.
>> >> >> >> >> > Did
>> >> >> >> >> > you impose some limit on the number of jobs? Can this be
>> >> >> >> >> > fixed?
>> >> >> >> >> > I
>> >> >> >> >> > really
>> >> >> >> >> > need to submit many jobs - tens of thousands of jobs in my full
>> >> >> >> >> > runs.
>> >> >> >> >> > This
>> >> >> >> >> > one
>> >> >> >> >> > was relatively small and still did not pass.
>> >> >> >> >>
>> >> >> >> >> It looks like SGE thinks that the nodes are down or in an alarm or
>> >> >> >> >> error state. To find out why SGE thinks there are no nodes
>> >> >> >> >> available,
>> >> >> >> >> run:
>> >> >> >> >> qstat -j
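>> >> >> >> >>
>> >> >> >> >> If "qstat -f" shows queues in an alarm or error state, the -explain
>> >> >> >> >> switch will usually tell you why (assuming your qstat has it; recent
>> >> >> >> >> Grid Engine builds do):
>> >> >> >> >>
>> >> >> >> >>     qstat -f -explain a
>> >> >> >> >>     qstat -f -explain E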
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> > 3. I tried to start Star Cluster with 50 nodes and got an
>> >> >> >> >> > error
>> >> >> >> >> > about
>> >> >> >> >> > exceeding a quota of 20. Is it your quota or Amazon's quota?
>> >> >> >> >> > Are
>> >> >> >> >> > there
>> >> >> >> >> > any
>> >> >> >> >> > other restrictions I should be aware of at the beginning?
>> >> >> >> >> > Also
>> >> >> >> >> > after
>> >> >> >> >> > the
>> >> >> >> >> > system is unable to start the cluster, it thinks it is still
>> >> >> >> >> > running
>> >> >> >> >> > and
>> >> >> >> >> > a
>> >> >> >> >> > terminate command is needed before another start can be
>> >> >> >> >> > issued
>> >> >> >> >> > -
>> >> >> >> >> > even
>> >> >> >> >> > though
>> >> >> >> >> > nothing got started.
>> >> >> >> >>
>> >> >> >> >> It's Amazon's quota. 50 is considered small by AWS standards,
>> >> >> >> >> and
>> >> >> >> >> they
>> >> >> >> >> can give it to you almost right away... You need to request
>> >> >> >> >> AWS
>> >> >> >> >> to
>> >> >> >> >> give you a higher limit:
>> >> >> >> >> https://aws.amazon.com/contact-us/ec2-request/
>> >> >> >> >>
>> >> >> >> >> Note that last year we requested 10,000 nodes and the
>> >> >> >> >> whole
>> >> >> >> >> process took less than 1 day:
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html
>> >> >> >> >>
>> >> >> >> >> Rayson
>> >> >> >> >>
>> >> >> >> >> ==================================================
>> >> >> >> >> Open Grid Scheduler - The Official Open Source Grid Engine
>> >> >> >> >> http://gridscheduler.sourceforge.net/
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> > This all happened on us-west-2 with the help of star cluster
>> >> >> >> >> > 0.93.3
>> >> >> >> >> > and
>> >> >> >> >> > the
>> >> >> >> >> > Anaconda AMI - ami-a4d64194
>> >> >> >> >> >
>> >> >> >> >> > Here is some more information on what I am doing to help you
>> >> >> >> >> > answer
>> >> >> >> >> > the
>> >> >> >> >> > above.
>> >> >> >> >> >
>> >> >> >> >> > I am running Monte Carlo simulations to simulate chronic
>> >> >> >> >> > disease
>> >> >> >> >> > progression. I am using MIST to run over the cloud:
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > https://github.com/scipy/scipy2013_talks/blob/master/talks/jacob_barhak/readme.md
>> >> >> >> >> >
>> >> >> >> >> > The Reference Model is what I am running using MIST:
>> >> >> >> >> >
>> >> >> >> >> > http://youtu.be/7qxPSgINaD8
>> >> >> >> >> >
>> >> >> >> >> > I am launching many simulations in parallel and it takes me
>> >> >> >> >> > days
>> >> >> >> >> > on a
>> >> >> >> >> > single
>> >> >> >> >> > 8-core machine. The cloud allows me to cut down this time to
>> >> >> >> >> > hours.
>> >> >> >> >> > This
>> >> >> >> >> > is
>> >> >> >> >> > why star cluster is so useful. In the past I did this over
>> >> >> >> >> > other
>> >> >> >> >> > clusters
>> >> >> >> >> > yet the cloud is still new to me.
>> >> >> >> >> >
>> >> >> >> >> > I would appreciate any recommendations I can get from you to improve
>> >> >> >> >> > the behaviors I am experiencing.
>> >> >> >>
>> >> >> >> http://commons.wikimedia.org/wiki/User:Raysonho
>> >> >> >
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>
Received on Fri Jul 26 2013 - 10:27:55 EDT