StarCluster - Mailing List Archive

Re: DRMAA jobs failing when load balancer enabled and jobs longer than 60 mins (Lilley, John F.)

From: Rayson Ho <no email>
Date: Thu, 6 Mar 2014 14:26:10 -0500

Hi Mich,

Thanks for sharing the workaround.

The behavior is due to a relatively undocumented feature of DRMAA /
Grid Engine -- basically, DRMAA jobs in Grid Engine have "-w e" added to
the job submission request. The -w flag takes the following arguments:

          `e' error - jobs with invalid requests will be rejected.

          `w' warning - only a warning will be displayed for invalid requests.

          `n' none - switches off validation; the default for qsub,
          qalter, qrsh, qsh and qlogin.

          `p' poke - does not submit the job but prints a validation
          report based on a cluster as is with all resource utilizations
          in place.

          `v' verify - does not submit the job but prints a
          validation report based on an empty cluster.

http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html

Thus with "-w e", if Grid Engine is not happy with the job at
submission time (e.g. it thinks that it does not have enough nodes to
run the job), it rejects the job submission.

The correct way is to override the DRMAA request with "-w n" or "-w w"
if you are going to use load-balancing.
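
For illustration, here is a minimal sketch of one way to do that from
Python. It assumes the third-party "drmaa" Python binding and that the
override is passed through the job template's nativeSpecification
attribute; with_validation_level() and submit() are hypothetical helper
names, not part of DRMAA or StarCluster.

```python
import shlex

# Validation levels accepted by the Grid Engine -w option (see list above).
VALID_LEVELS = {"e", "w", "n", "p", "v"}


def with_validation_level(native_spec, level="n"):
    """Return native_spec with any existing -w option replaced by -w <level>."""
    if level not in VALID_LEVELS:
        raise ValueError("unknown -w level: %r" % (level,))
    tokens = shlex.split(native_spec or "")
    kept = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "-w":
            i += 2  # drop the flag together with its argument
            continue
        kept.append(tokens[i])
        i += 1
    kept += ["-w", level]
    return " ".join(shlex.quote(t) for t in kept)


def submit(command, args=(), native_spec="", level="n"):
    """Submit one job through DRMAA with the validation level overridden.

    Assumption: the 'drmaa' package is installed and can locate the
    Grid Engine libdrmaa library on this host.
    """
    import drmaa  # imported here so the helper above works without it

    with drmaa.Session() as session:
        jt = session.createJobTemplate()
        jt.remoteCommand = command
        jt.args = list(args)
        jt.nativeSpecification = with_validation_level(native_spec, level)
        try:
            return session.runJob(jt)
        finally:
            session.deleteJobTemplate(jt)
```

With this sketch, submit("/bin/sleep", ["10"], native_spec="-q all.q")
would send the native specification "-q all.q -w n". Use level="w" if
you still want to see validation warnings.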

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html


On Thu, Mar 6, 2014 at 2:13 PM, François-Michel L'Heureux
<fmlheureux_at_datacratic.com> wrote:
> Hi John
>
> I assume DRMAA is a replacement for OGS/SGE?
>
> About DRMAA bailing out, I don't know the product, but your guess is likely
> correct: it might crash when nodes go away. There is a somewhat similar issue
> with OGS where we need to clean it up when nodes go away. It doesn't crash
> though.
>
> For your second issue, regarding the execution host, again, I had a similar
> issue with OGS. The trick I used is that I left the master node as an
> execution host, but I set its number of slots to 0. Hence, OGS is happy
> because there is at least one exec host, and the load balancer runs just fine
> because when only the master node is online, there are no slots, so it
> immediately adds a node whenever jobs come in. I don't know if there is a
> concept of slots in DRMAA or if this version of the load balancer uses it, but
> if so, I think you could reproduce my trick.
>
> I hope it will help you.
>
> Mich
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
>
Received on Thu Mar 06 2014 - 14:26:12 EST