DRMAA jobs failing when load balancer enabled and jobs longer than 60 mins
Hi,
I’m running a simple Java-based DRMAA job that runs a sleep command on each of the StarCluster compute instances, but am having a problem. If I create 10 non-load-balanced nodes and then submit a job from DRMAA that runs sleep on each of those nodes in parallel, everything completes fine. If I submit the same 10-node DRMAA sleep job with only 1 non-load-balanced node available, everything also works fine: the jobs eventually work their way through the single node serially and the main DRMAA process is happy.
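For reference, the submission side boils down to the standard DRMAA 1.0 Java binding calls, roughly like this (a simplified sketch rather than my actual OGSJob code; the class name, sleep binary, and duration are just illustrative):

--------------------------------------------------------------------------------------------------------------------
import java.util.Collections;
import org.ggf.drmaa.DrmaaException;
import org.ggf.drmaa.JobTemplate;
import org.ggf.drmaa.Session;
import org.ggf.drmaa.SessionFactory;

public class SleepSubmit {
    public static void main(String[] args) throws DrmaaException {
        Session session = SessionFactory.getFactory().getSession();
        session.init("");  // only one active DRMAA session per JVM

        JobTemplate jt = session.createJobTemplate();
        jt.setRemoteCommand("/bin/sleep");              // the job just sleeps
        jt.setArgs(Collections.singletonList("4200"));  // 70 minutes
        String jobId = session.runJob(jt);              // SGE job id, e.g. "61"
        System.out.println("Submitted job " + jobId);

        session.deleteJobTemplate(jt);
        session.exit();  // real code waits for the job before exiting
    }
}
--------------------------------------------------------------------------------------------------------------------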
If I then enable load balancing and submit 10 jobs that sleep 30 minutes each from the main DRMAA process, it load balances beautifully by adding 9 nodes, all 10 jobs complete, and the main process exits gracefully. However, if I submit 10 jobs that sleep for 70 minutes, most of them finish but then the DRMAA process bails before all 10 jobs are complete. My guess is that when the first sleep jobs start to finish, the load balancer removes the nodes they ran on from the available execution hosts, throwing the main DRMAA process (which is monitoring the jobs) for a loop.
Perhaps there’s a way I can make the DRMAA process more tolerant of execution hosts being removed from the available pool? Another issue I have is that unless I keep at least 1 execution host running at all times, the DRMAA process refuses to start at all. I’d rather not have to keep an execution host running just to accept the DRMAA job submissions, if possible. I would really appreciate hearing any insights the community has on running DRMAA jobs in StarCluster, and whether anyone has run into similar obstacles.
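In case it helps, the monitoring side is essentially a session.wait() per job (again a sketch, not the exact OGSJob/JobUtils code). The IllegalStateException in the first trace below comes from calling getExitStatus() on a JobInfo for a job that did not exit normally, so I assume guarding that call would at least give a readable error, even if it doesn’t explain why the jobs are being torn down:

--------------------------------------------------------------------------------------------------------------------
import org.ggf.drmaa.DrmaaException;
import org.ggf.drmaa.JobInfo;
import org.ggf.drmaa.Session;
import org.ggf.drmaa.SessionFactory;

public class WaitForJob {
    public static void main(String[] args) throws DrmaaException {
        Session session = SessionFactory.getFactory().getSession();
        session.init("");

        String jobId = args[0];  // id returned earlier by session.runJob(...)
        JobInfo info = session.wait(jobId, Session.TIMEOUT_WAIT_FOREVER);

        // getExitStatus() throws IllegalStateException unless the job
        // actually exited, so check how the job finished first.
        if (info.wasAborted()) {
            System.out.println("Job " + jobId + " never ran");
        } else if (info.hasSignaled()) {
            System.out.println("Job " + jobId + " was killed by signal "
                    + info.getTerminatingSignal());
        } else if (info.hasExited()) {
            System.out.println("Job " + jobId + " exited with status "
                    + info.getExitStatus());
        } else {
            System.out.println("Job " + jobId + " finished abnormally");
        }
        session.exit();
    }
}
--------------------------------------------------------------------------------------------------------------------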
Thanks for the help!
John
Output received from DRMAA after about an hour with load balancing enabled (jobs sleeping 70 minutes each):
--------------------------------------------------------------------------------------------------------------------
INFO [2014-03-05 19:59:26,077] [OGSJob.java:211] [main] Waiting for job 61...
Exception in thread "main" java.lang.IllegalStateException
at com.sun.grid.drmaa.JobInfoImpl.getExitStatus(JobInfoImpl.java:75)
at nextgen.core.job.OGSJob.waitFor(OGSJob.java:213)
at nextgen.core.job.JobUtils.waitForAll(JobUtils.java:23)
at tests.DrmaaSleepTest.main(DrmaaSleepTest.java:50)
INFO [2014-03-05 19:59:37,064] [OGSUtils.java:84] [Thread-0] Ending DRMAA session
--------------------------------------------------------------------------------------------------------------------
Output received from DRMAA when there are no execution hosts available at initial job submission (if I have an execution host available, it submits OK):
--------------------------------------------------------------------------------------------------------------------
user_at_master:~/$ java -jar DrmaaSleepTest.jar -m 5 -n 10 -s Sleep.jar
log4j:ERROR Could not find value for key log4j.appender.R
log4j:ERROR Could not instantiate appender named "R".
WARN [2014-03-06 05:09:54,984] [OGSUtils.java:65] [main] Starting a DRMAA session.
WARN [2014-03-06 05:09:54,989] [OGSUtils.java:66] [main] There should only be one active DRMAA session at a time.
INFO [2014-03-06 05:09:55,430] [OGSUtils.java:92] [main] Attached shutdown hook to close DRMAA session upon JVM exit.
Exception in thread "main" org.ggf.drmaa.DeniedByDrmException: warning:user your job is not allowed to run in any queue
error: no suitable queues
at com.sun.grid.drmaa.SessionImpl.nativeRunJob(Native Method)
at com.sun.grid.drmaa.SessionImpl.runJob(SessionImpl.java:349)
at nextgen.core.job.OGSJob.submit(OGSJob.java:188)
at tests.DrmaaSleepTest.main(DrmaaSleepTest.java:46)
INFO [2014-03-06 05:09:55,500] [OGSUtils.java:84] [Thread-0] Ending DRMAA session
--------------------------------------------------------------------------------------------------------------------