StarCluster - Mailing List Archive

Re: DRMAA jobs failing when load balancer enabled and jobs longer than 60 mins

From: Rayson Ho <no email>
Date: Thu, 6 Mar 2014 14:11:27 -0500

Hi John,

The 2nd case ("Output received from DRMAA when there are no execution
hosts available on initial job submission") is expected behavior: by
default, Grid Engine checks at submission time whether the job could
ever be scheduled, and if no execution host is available at that
point, it rejects the job. You can override this by passing "-w n" in
the DRMAA native specification; see the "-w" option in the qsub
manpage:

 http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html
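
As a rough illustration, here is a minimal Java sketch of adding "-w n"
to the native options before submission. It only builds the option
string, so it compiles without the DRMAA jar. The class name
NativeSpecSketch, the helper withoutSubmitValidation, and the queue name
all.q are made up for this example and are not taken from your code.

```java
// Sketch only: builds the string that would be handed to
// org.ggf.drmaa.JobTemplate.setNativeSpecification(String).
// NativeSpecSketch and withoutSubmitValidation are hypothetical names.
final class NativeSpecSketch {

    // Appends "-w n" to the existing native options, unless the caller
    // has already chosen a "-w" level, in which case the input is kept.
    static String withoutSubmitValidation(String existing) {
        String base = (existing == null) ? "" : existing.trim();
        for (String token : base.split("\\s+")) {
            if (token.equals("-w")) {
                return base;
            }
        }
        return base.isEmpty() ? "-w n" : base + " -w n";
    }

    public static void main(String[] args) {
        // With a real DRMAA session this would look roughly like:
        //   JobTemplate jt = session.createJobTemplate();
        //   jt.setNativeSpecification(withoutSubmitValidation("-q all.q"));
        // "-q all.q" is only an illustrative existing option.
        System.out.println(withoutSubmitValidation("-q all.q"));
    }
}
```

With the real binding, the returned string would be passed to
JobTemplate.setNativeSpecification() on the template before runJob().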

I am more interested in the first case (your "Output received from
DRMAA after an hour when load balancing" case). It seems to be related
to the load balancer taking nodes (AWS instances) offline. Can you run
the load balancer interactively so that we can get a log of what's
going on?
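
In the meantime, the IllegalStateException in your first log is thrown
from JobInfoImpl.getExitStatus(). As far as I know, that call is only
valid when hasExited() is true, so a job that was aborted or killed
would trigger it. Below is a minimal sketch of a more defensive check.
The JobOutcome interface is a stand-in I made up for the handful of
org.ggf.drmaa.JobInfo methods involved (the real methods may also
declare DrmaaException), and WaitResultSketch, describe, and fake are
hypothetical names, not part of your code or of DRMAA.

```java
// Stand-in for the subset of org.ggf.drmaa.JobInfo used here, so the
// sketch compiles without the DRMAA jar. This interface is an assumption
// for illustration; with the real library, use JobInfo directly.
interface JobOutcome {
    boolean hasExited();
    int getExitStatus();
    boolean hasSignaled();
    String getTerminatingSignal();
    boolean wasAborted();
}

final class WaitResultSketch {

    // Classifies a finished job and only calls getExitStatus() when
    // hasExited() is true, which avoids the IllegalStateException.
    static String describe(JobOutcome info) {
        if (info.wasAborted()) {
            return "ABORTED";
        }
        if (info.hasExited()) {
            return "EXITED(" + info.getExitStatus() + ")";
        }
        if (info.hasSignaled()) {
            return "SIGNALED(" + info.getTerminatingSignal() + ")";
        }
        return "UNKNOWN";
    }

    // Hypothetical test double that mimics the exception seen in the log.
    static JobOutcome fake(final boolean exited, final int status,
                           final boolean aborted) {
        return new JobOutcome() {
            public boolean hasExited() { return exited; }
            public int getExitStatus() {
                if (!exited) {
                    throw new IllegalStateException();
                }
                return status;
            }
            public boolean hasSignaled() { return false; }
            public String getTerminatingSignal() { return null; }
            public boolean wasAborted() { return aborted; }
        };
    }

    public static void main(String[] args) {
        System.out.println(describe(fake(true, 0, false)));  // normal exit
        System.out.println(describe(fake(false, 0, true)));  // aborted job
    }
}
```

A job reported as ABORTED or UNKNOWN could then be logged or
resubmitted by the caller instead of ending the whole DRMAA process.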

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html


On Thu, Mar 6, 2014 at 1:56 PM, Chris Dagdigian <dag_at_bioteam.net> wrote:
>
> FYI DRMAA is an API for job submission and control that is supported by
> a number of different cluster schedulers, making it an attractive target
> for individuals writing portable cluster-aware tools and for the many
> commercial ISVs who need to create software that talks to HPC resources.
>
> The formal website is: http://www.drmaa.org/
>
> -Chris
>
>
>> Rajat Banerjee <mailto:rajatb_at_post.harvard.edu>
>> March 6, 2014 1:49 PM
>> Hi John,
>> Could you explain a little more about what a DRMAA job is and what
>> resources it requires? I found something on Wikipedia, but it doesn't
>> seem relevant.
>>
>> I wrote big parts of the load balancer and am guessing that it does
>> not understand your inter-machine dependencies. Sounds like your job
>> is somewhat tolerant of hosts dropping off, but we can probably come
>> up with a better solution.
>>
>> Best,
>> Rajat
>>
>>
>>
>> _______________________________________________
>> StarCluster mailing list
>> StarCluster_at_mit.edu
>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>
>> Lilley, John F. <mailto:johnbot_at_caltech.edu>
>> March 6, 2014 12:11 PM
>> Hi,
>>
>>
>> I'm running a simple Java-based DRMAA job that runs a sleep command
>> on each of the StarCluster compute instances, but I'm having a problem.
>> If I create 10 non-load-balanced nodes and then submit a job from DRMAA
>> that runs sleep on each of those nodes in parallel, everything
>> completes fine. If I submit the same 10-node DRMAA sleep job with 1
>> non-load-balanced node available, everything also works: the jobs
>> eventually work their way through the single node serially and the
>> main DRMAA process is happy.
>>
>> If I then enable load balancing and submit 10 jobs that sleep 30 minutes
>> each from the main DRMAA process, it load balances beautifully by
>> adding 9 nodes, all 10 jobs complete, and the main process exits
>> gracefully. However, if I submit 10 jobs that sleep for 70 minutes,
>> most of them finish but then the DRMAA process bails before all 10
>> jobs are complete. My guess is that when the first sleep jobs start to
>> finish, the load balancer removes the nodes they ran on from the
>> available execution hosts, which throws the main DRMAA process (the
>> one monitoring the jobs) for a loop.
>>
>>
>> Perhaps there's a way I can make the DRMAA process more tolerant of
>> execution hosts being removed from the available pool? Another issue is
>> that unless I keep 1 execution host running all the time, the
>> DRMAA process refuses to start at all. I'd rather not have to keep an
>> execution host running just to accept DRMAA job submissions, if
>> possible. I would really appreciate any insights the community
>> has on running DRMAA jobs in StarCluster, and hearing whether anyone
>> has run into similar obstacles.
>>
>>
>> Thanks for the help!
>> John
>>
>>
>> Output received from DRMAA after an hour when load balancing (jobs
>> over 70 minutes)
>> --------------------------------------------------------------------------------------------------------------------
>> INFO [2014-03-05 19:59:26,077] [OGSJob.java:211] [main] Waiting for
>> job 61...
>> Exception in thread "main" java.lang.IllegalStateException
>> at com.sun.grid.drmaa.JobInfoImpl.getExitStatus(JobInfoImpl.java:75)
>> at nextgen.core.job.OGSJob.waitFor(OGSJob.java:213)
>> at nextgen.core.job.JobUtils.waitForAll(JobUtils.java:23)
>> at tests.DrmaaSleepTest.main(DrmaaSleepTest.java:50)
>> INFO [2014-03-05 19:59:37,064] [OGSUtils.java:84] [Thread-0] Ending
>> DRMAA session
>> --------------------------------------------------------------------------------------------------------------------
>>
>> Output received from DRMAA when there are no execution hosts available
>> on initial job submission (if an execution host is available, it
>> submits OK)
>> --------------------------------------------------------------------------------------------------------------------
>> user_at_master:~/$ java -jar DrmaaSleepTest.jar -m 5 -n 10 -s Sleep.jar
>> log4j:ERROR Could not find value for key log4j.appender.R
>> log4j:ERROR Could not instantiate appender named "R".
>> WARN [2014-03-06 05:09:54,984] [OGSUtils.java:65] [main] Starting a
>> DRMAA session.
>> WARN [2014-03-06 05:09:54,989] [OGSUtils.java:66] [main] There should
>> only be one active DRMAA session at a time.
>> INFO [2014-03-06 05:09:55,430] [OGSUtils.java:92] [main] Attached
>> shutdown hook to close DRMAA session upon JVM exit.
>> Exception in thread "main" org.ggf.drmaa.DeniedByDrmException:
>> warning:user your job is not allowed to run in any queue
>> error: no suitable queues
>> at com.sun.grid.drmaa.SessionImpl.nativeRunJob(Native Method)
>> at com.sun.grid.drmaa.SessionImpl.runJob(SessionImpl.java:349)
>> at nextgen.core.job.OGSJob.submit(OGSJob.java:188)
>> at tests.DrmaaSleepTest.main(DrmaaSleepTest.java:46)
>> INFO [2014-03-06 05:09:55,500] [OGSUtils.java:84] [Thread-0] Ending
>> DRMAA session
>> --------------------------------------------------------------------------------------------------------------------
>>
Received on Thu Mar 06 2014 - 14:11:29 EST
This archive was generated by hypermail 2.3.0.
