Re: Star cluster issues when running MIST over the cloud
"qconf -msconf" will try to open the editor, which is vi. And going
through starcluster sshmaster to launch vi may not be the best
approach.
This is what I just tested, and it worked for me:
1) ssh into the master by running:
starcluster sshmaster myscluster
2) launch qconf -msconf and change schedd_job_info to true:
qconf -msconf
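If you would rather avoid the interactive editor altogether, one option is to dump the scheduler configuration to a file, change the one line, and load the file back. The following is a minimal sketch, assuming a POSIX shell on the master node. The helper name set_schedd_job_info and the path /tmp/sched.conf are hypothetical and chosen only for illustration. It relies on "qconf -ssconf" (show the scheduler configuration) and "qconf -Msconf" with a capital M (modify it from a file). Treat it as untested on StarCluster.

```shell
# Sketch only: flip schedd_job_info to true in a dumped scheduler configuration.
# set_schedd_job_info is a hypothetical helper name, not part of Grid Engine.
set_schedd_job_info() {
    conf_file="$1"
    # Rewrite only the schedd_job_info line; every other line is copied as-is.
    sed 's/^schedd_job_info[[:space:]].*/schedd_job_info true/' "$conf_file" > "$conf_file.new" &&
        mv "$conf_file.new" "$conf_file"
}

# Assumed usage on the master node (the path is an example):
#   qconf -ssconf > /tmp/sched.conf      # dump the current scheduler configuration
#   set_schedd_job_info /tmp/sched.conf  # edit the dumped file in place
#   qconf -Msconf /tmp/sched.conf        # load the modified file (capital M)
```

Note that the lowercase "-msconf" form opens the editor rather than reading a file, which is likely why passing it a file name through sshmaster appears to hang.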
Rayson
==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
On Wed, Jul 24, 2013 at 6:05 AM, Jacob Barhak <jacob.barhak_at_gmail.com> wrote:
> Thanks Again Rayson,
>
> Yet even with your generous help I am still stuck. Perhaps you can look at
> what I am doing and correct me.
>
> I tried running the commands you suggested to reconfigure the scheduler and
> it seems the system hangs on me.
>
> Here is what I did:
>
> 1. Created a 1 node cluster.
> 2. Copied the configuration file to my machine using:
> starcluster get myscluster /opt/sge6/default/common/sched_configuration .
> 3. Modified line 12 in sched_configuration to read
> schedd_job_info TRUE
> 4. Copied the modified sched_configuration file to the default local
> directory on the cluster using:
> starcluster put mycluster sched_configuration .
> 5. Ran the configuration command you suggested:
> starcluster sshmaster mycluster "qconf -msconf sched_configuration"
>
> The system hangs in the last stage and does not return unless I press
> control break - even control C does not work. I waited a few minutes then
> terminated the cluster. I double checked this behavior using the -u root
> argument when running the commands to ensure root privileges.
>
> I am using a Windows 7 machine to issue those commands. I use the PythonXY
> distribution and installed starcluster using easy_install. I am providing
> this information to see if there is anything wrong with my system
> compatibility-wise.
>
> I am attaching the configuration file and the transcript.
>
> I am getting funny characters and it seems the qconf command I issued does
> not recognize the change in schedd_job_info.
>
> Is there any other way I can find out what the problem is? If you recall, I
> am trying to figure out why jobs in the queue are not dispatched to run and
> keep waiting forever. This happens after I have sent a few hundred jobs.
>
> I hope you could find time to look at this once more, or perhaps someone
> else can help.
>
>
> Jacob
>
>
>
>
> On Tue, Jul 23, 2013 at 6:44 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
>>
>> On Tue, Jul 23, 2013 at 2:08 AM, Jacob Barhak <jacob.barhak_at_gmail.com>
>> wrote:
>> > The second issue, however, is reproducible. I just tried again a 20 node
>> > cluster.
>>
>> Yes, it will always be reproducible: you can only start 20 on-demand
>> instances in a region. Again, you need to fill out the "Request to
>> Increase Amazon EC2 Instance Limit" form if you want more than 20:
>>
>> https://aws.amazon.com/contact-us/ec2-request/
>>
>>
>> > This time I posted 2497 jobs to the queue - each about 1 minute long.
>> > The system stopped sending jobs from the queue at about the halfway
>> > point. There were 1310 jobs in the queue when the system stopped
>> > sending more jobs.
>> > When running "qstat -j" the system provided the following answer:
>> >
>> > scheduling info: (Collecting of scheduler job information is turned off)
>>
>> That's my fault -- I forgot to point out that the scheduler info is
>> off by default, and you need to run "qconf -msconf", and change the
>> parameter "schedd_job_info" to true. See:
>>
>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
>>
>> Rayson
>>
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>>
>>
>>
>> >
>> > I am not familiar with the error messages, yet it seems I need to enable
>> > something that is turned off. If there is a quick, obvious solution for
>> > this please let me know what to do, otherwise are there any other
>> > diagnostic tools I can use?
>> >
>> > Again, thanks for the quick reply and I hope this is an easy fix.
>> >
>> > Jacob
>> >
>> >
>> > On Mon, Jul 22, 2013 at 2:51 PM, Rayson Ho <raysonlogin_at_gmail.com>
>> > wrote:
>> >>
>> >> On Sun, Jul 21, 2013 at 1:40 AM, Jacob Barhak <jacob.barhak_at_gmail.com>
>> >> wrote:
>> >> > 1. Sometimes starcluster is unable to properly connect the instances
>> >> > on the start command and cannot mount /home. It happened once when I
>> >> > asked for 5 m1.small machines and when I terminated this cluster and
>> >> > started again things went fine. Is this intermittent due to cloud
>> >> > traffic or is this a bug? Is there a way for me to check why?
>> >>
>> >> It could be a problem with the actual hardware. Next time you encounter
>> >> this issue, can you ssh into the node and mount /home by hand to see
>> >> if you can reproduce it when run interactively?
>> >>
>> >>
>> >> > 2. After launching 20 c1.xlarge machines and running about 2500 jobs,
>> >> > each about 5 minutes long, I encountered a problem after an hour or so.
>> >> > It seems that SGE stopped sending jobs from the queue to the instances.
>> >> > No error was found and the queue showed about 850 pending jobs. This
>> >> > did not change for a while and I could not find any failure with qstat
>> >> > or qhost. No jobs were running on any nodes and I waited a while for
>> >> > these to start without success. I tried the same thing again after a
>> >> > few hours and it seems that the cluster stops sending jobs from the
>> >> > queue after about 1600 jobs have been submitted. This does not happen
>> >> > when SGE is installed on a single Ubuntu machine I have at home. I am
>> >> > trying to figure out what is wrong. Did you impose some limit on the
>> >> > number of jobs? Can this be fixed? I really need to submit many jobs -
>> >> > tens of thousands of jobs in my full runs. This one was relatively
>> >> > small and still did not pass.
>> >>
>> >> Looks like SGE thinks that the nodes are down or in alarm or error
>> >> state? To find out why SGE thinks there are no nodes available, run:
>> >> qstat -j
>> >>
>> >>
>> >>
>> >> > 3. I tried to start Star Cluster with 50 nodes and got an error about
>> >> > exceeding a quota of 20. Is it your quota or Amazon's quota? Are there
>> >> > any other restrictions I should be aware of at the beginning? Also,
>> >> > after the system is unable to start the cluster it thinks it is still
>> >> > running and a terminate command is needed before another start can be
>> >> > issued - even though nothing got started.
>> >>
>> >> It's Amazon's quota. 50 is considered small by AWS standards, and they
>> >> can give it to you almost right away... You need to ask AWS to
>> >> give you a higher limit:
>> >> https://aws.amazon.com/contact-us/ec2-request/
>> >>
>> >> Note that last year we requested 10,000 nodes and the whole
>> >> process took less than 1 day:
>> >>
>> >>
>> >>
>> >> http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html
>> >>
>> >> Rayson
>> >>
>> >> ==================================================
>> >> Open Grid Scheduler - The Official Open Source Grid Engine
>> >> http://gridscheduler.sourceforge.net/
>> >>
>> >>
>> >> >
>> >> > This all happened on us-west-2 with the help of star cluster 0.93.3
>> >> > and the Anaconda AMI - ami-a4d64194
>> >> >
>> >> > Here is some more information on what I am doing to help you answer
>> >> > the above.
>> >> >
>> >> > I am running Monte Carlo simulations to simulate chronic disease
>> >> > progression. I am using MIST to run over the cloud:
>> >> >
>> >> >
>> >> >
>> >> > https://github.com/scipy/scipy2013_talks/blob/master/talks/jacob_barhak/readme.md
>> >> >
>> >> > The Reference Model is what I am running using MIST:
>> >> >
>> >> > http://youtu.be/7qxPSgINaD8
>> >> >
>> >> > I am launching many simulations in parallel and it takes me days on a
>> >> > single 8 core machine. The cloud allows me to cut down this time to
>> >> > hours. This is why star cluster is so useful. In the past I did this
>> >> > over other clusters yet the cloud is still new to me.
>> >> >
>> >> > I will appreciate any recommendations I can get from you to improve
>> >> > the behaviors I am experiencing.
>>
>> http://commons.wikimedia.org/wiki/User:Raysonho
>
>
Received on Wed Jul 24 2013 - 08:29:52 EDT