Re: DRMAA jobs failing when load balancer enabled and jobs longer than 60 mins (Lilley, John F.)
From experience, I don't think it works for qrsh though.
Justin also just tried it and told me it doesn't work.
2014-03-06 14:26 GMT-05:00 Rayson Ho <raysonlogin_at_gmail.com>:
> Hi Mich,
> Thanks for sharing the workaround.
> The behavior is due to a relatively undocumented feature of DRMAA /
> Grid Engine -- basically, DRMAA jobs in Grid Engine have "-w e" added to
> the job submission request. The -w flag takes the following arguments:
> `e' error - jobs with invalid requests will be rejected.
> `w' warning - only a warning will be displayed for invalid requests.
> `n' none - switches off validation; the default for qsub,
> qalter, qrsh, qsh and qlogin.
> `p' poke - does not submit the job but prints a validation
> report based on a cluster as is with all resource utilizations in place.
> `v' verify - does not submit the job but prints a
> validation report based on an empty cluster.
> Thus with "-w e", if Grid Engine is not happy with the job at
> submission time (e.g. it thinks that it does not have enough nodes to
> run the job), then it will reject the job submission.
> The correct way is to override the DRMAA request with "-w n" or "-w w"
> if you are going to use load-balancing.
> Open Grid Scheduler - The Official Open Source Grid Engine
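The override Rayson describes could be sketched as follows with the Python drmaa bindings. This is only a minimal sketch, and it assumes the "-w" option is passed through the job template's native specification; the helper names with_validation and submit are made up for illustration, and the token handling is deliberately naive (it does not cope with quoted arguments).

```python
"""Sketch: replace the "-w e" validation level on a DRMAA job request."""

# Validation levels accepted by the -w flag, as listed in the message above.
VALIDATION_LEVELS = {"e", "w", "n", "p", "v"}


def with_validation(native_spec, level="n"):
    """Return native_spec with any existing -w option replaced by `-w level`.

    Hypothetical helper: splits on whitespace only, so quoted arguments
    containing spaces are not supported.
    """
    if level not in VALIDATION_LEVELS:
        raise ValueError("unknown -w level: %r" % (level,))
    kept = []
    skip_next = False
    for token in native_spec.split():
        if skip_next:
            # This token is the value of a previous -w; drop it.
            skip_next = False
            continue
        if token == "-w":
            skip_next = True
            continue
        kept.append(token)
    return " ".join(kept + ["-w", level])


def submit(command, args=(), native_spec="", level="n"):
    """Submit one job with the chosen validation level (hypothetical wrapper).

    Assumes the drmaa Python package and a working libdrmaa are installed;
    the import is local so the helper above can be used without them.
    """
    import drmaa

    session = drmaa.Session()
    session.initialize()
    try:
        jt = session.createJobTemplate()
        try:
            jt.remoteCommand = command
            jt.args = list(args)
            jt.nativeSpecification = with_validation(native_spec, level)
            return session.runJob(jt)
        finally:
            session.deleteJobTemplate(jt)
    finally:
        session.exit()
```

For example, with_validation("-q all.q") yields "-q all.q -w n", and submit("/bin/sleep", ["3600"]) would then hand the job to Grid Engine with submission-time validation switched off, so a cluster that currently has no free nodes should not cause an immediate rejection.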
> On Thu, Mar 6, 2014 at 2:13 PM, François-Michel L'Heureux
> <fmlheureux_at_datacratic.com> wrote:
> > Hi John
> > I assume DRMAA is a replacement for OGS/SGE?
> > About DRMAA bailing out, I don't know the product, but your guess is
> > probably correct: it might crash when nodes go away. There is a somewhat
> > similar issue with OGS, where we need to clean it up when nodes go away.
> > It doesn't crash though.
> > For your second issue, regarding the execution host, again, I had a
> > similar issue with OGS. The trick I used is that I left the master node
> > as an execution host, but I set its number of slots to 0. Hence, OGS is
> > happy because there is at least one exec host, and the load balancer runs
> > just fine because when only the master node is online, there are no
> > slots, so it immediately adds nodes whenever jobs come in. I don't know
> > if there is a concept of slots in DRMAA or if this version of the load
> > balancer uses it; if so, I think you could reproduce my trick.
> > I hope it will help you.
> > Mich
> > _______________________________________________
> > StarCluster mailing list
> > StarCluster_at_mit.edu
> > http://mailman.mit.edu/mailman/listinfo/starcluster
Received on Thu Mar 06 2014 - 14:31:56 EST