StarCluster - Mailing List Archive

Re: DRMAA jobs failing when load balancer enabled and jobs longer than 60 mins (Lilley, John F.)

From: François-Michel L'Heureux <no email>
Date: Thu, 6 Mar 2014 14:31:35 -0500

Hi!

From experience, I don't think it works for qrsh though.
Justin also just tried it and told me it doesn't work.

2014-03-06 14:26 GMT-05:00 Rayson Ho <raysonlogin_at_gmail.com>:

> Hi Mich,
>
> Thanks for sharing the workaround.
>
> The behavior is due to a relatively undocumented feature of DRMAA /
> Grid Engine -- basically DRMAA jobs in Grid Engine have "-w e" added to
> the job submission request. The -w flag takes the following arguments:
>
> `e' error - jobs with invalid requests will be rejected.
>
> `w' warning - only a warning will be displayed for invalid
> requests.
>
> `n' none - switches off validation; the default for qsub,
> qalter, qrsh, qsh and qlogin.
>
> `p' poke - does not submit the job but prints a validation
> report based on a cluster as is with all resource utilizations in
> place.
>
> `v' verify - does not submit the job but prints a
> validation report based on an empty cluster.
>
> http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html
>
> Thus with "-w e", if Grid Engine is not happy with the job at
> submission time (e.g. it thinks that it does not have enough nodes to
> run the job), then it will reject the job submission.
>
> The correct way is to override the DRMAA request with "-w n" or "-w w"
> if you are going to use load-balancing.
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>
>
> On Thu, Mar 6, 2014 at 2:13 PM, François-Michel L'Heureux
> <fmlheureux_at_datacratic.com> wrote:
> > Hi John
> >
> > I assume DRMAA is a replacement for OGS/SGE?
> >
> > About DRMAA bailing out, I don't know the product, but your guess is
> > likely correct: it might crash when nodes go away. There is a somewhat
> > similar issue with OGS where we need to clean it up when nodes go away.
> > It doesn't crash though.
> >
> > For your second issue, regarding execution hosts, again, I had a similar
> > issue with OGS. The trick I used is that I left the master node as an
> > execution host, but I defined its number of slots to 0. Hence, OGS is
> > happy because there is at least one exec host, and the load balancer
> > runs just fine because when only the master node is online, there are
> > no slots, so it immediately adds nodes whenever jobs come in. I don't
> > know if there is a concept of slots in DRMAA or if this version of the
> > load balancer uses it, but if so, I think you could reproduce my trick.
> >
> > I hope it helps you.
> >
> > Mich
> >
> > _______________________________________________
> > StarCluster mailing list
> > StarCluster_at_mit.edu
> > http://mailman.mit.edu/mailman/listinfo/starcluster
> >
>
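[Archive note: the override Rayson describes can be applied through DRMAA's nativeSpecification attribute, which passes raw qsub flags to Grid Engine. A minimal sketch follows; the helper name is illustrative, the h_rt request is an arbitrary example, and it assumes qsub's usual behavior of honoring the last -w flag given, so an appended "-w n" overrides the "-w e" that DRMAA adds by default.]

```python
# Sketch only: relax DRMAA's default "-w e" validation so the load
# balancer can add nodes after submission. Helper name is hypothetical.
def relax_validation(native_spec: str) -> str:
    """Append '-w n' (no validation) to a qsub native specification.

    Assumes qsub honors the last -w flag given, which overrides the
    '-w e' that DRMAA adds to Grid Engine job submissions by default.
    """
    return (native_spec + " -w n").strip()

# With the drmaa-python package on a Grid Engine cluster, this would be
# used roughly like (not runnable without a live cluster):
#   with drmaa.Session() as s:
#       jt = s.createJobTemplate()
#       jt.remoteCommand = "/bin/sleep"
#       jt.args = ["3900"]  # a job longer than 60 minutes
#       jt.nativeSpecification = relax_validation("-l h_rt=7200")
#       s.runJob(jt)

print(relax_validation("-l h_rt=7200"))  # -l h_rt=7200 -w n
```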
Received on Thu Mar 06 2014 - 14:31:56 EST
This archive was generated by hypermail 2.3.0.
