Re: DRMAA jobs failing when load balancer enabled and jobs longer than 60 mins (Lilley, John F.)
Hi Rayson and Mich,
Thanks for the feedback. Hardcoding the switches completely resolved the issue!
John
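
For reference, a minimal sketch of what hardcoding the switch can look like from Python, assuming the drmaa-python bindings and a working libdrmaa on the submit host. The names `with_validation_level` and `submit` are hypothetical helpers, not something from this thread or from StarCluster. The idea is just to make sure the native specification passed to Grid Engine ends with "-w n" (or "-w w"), as Rayson suggests below.

```python
import shlex

# Validation levels documented for the qsub -w flag.
VALID_LEVELS = {"e", "w", "n", "p", "v"}


def with_validation_level(native_spec: str, level: str = "n") -> str:
    """Return native_spec with any existing '-w X' removed and '-w <level>' appended.

    Hypothetical helper: it only rewrites the option string and does not
    talk to Grid Engine.
    """
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown -w level: {level!r}")
    tokens = shlex.split(native_spec)
    kept = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "-w":
            i += 2  # drop the flag together with its argument
            continue
        kept.append(tokens[i])
        i += 1
    kept += ["-w", level]
    return shlex.join(kept)


def submit(command, args=(), native_spec=""):
    """Submit one job through DRMAA with validation switched off.

    Assumes drmaa-python is installed; the import is local so the helper
    above can be used without it.
    """
    import drmaa

    with drmaa.Session() as session:
        jt = session.createJobTemplate()
        jt.remoteCommand = command
        jt.args = list(args)
        jt.nativeSpecification = with_validation_level(native_spec, "n")
        try:
            return session.runJob(jt)
        finally:
            session.deleteJobTemplate(jt)
```

Calling `submit("/bin/sleep", ["4000"], "-q all.q")` would then send "-q all.q -w n" as the native specification. Whether that is enough to override the default in a given DRMAA implementation is worth checking on your own cluster.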
On Mar 6, 2014, at 11:36 AM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
> It should only work for DRMAA jobs.
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>
>
> On Thu, Mar 6, 2014 at 2:31 PM, François-Michel L'Heureux
> <fmlheureux_at_datacratic.com> wrote:
>> Hi!
>>
>> From experience, I don't think it works for qrsh though.
>> Justin also just tried it and told me it doesn't work.
>>
>> 2014-03-06 14:26 GMT-05:00 Rayson Ho <raysonlogin_at_gmail.com>:
>>
>>> Hi Mich,
>>>
>>> Thanks for sharing the workaround.
>>>
>>> The behavior is due to a largely undocumented feature of DRMAA in
>>> Grid Engine: DRMAA jobs in Grid Engine have "-w e" added to the job
>>> submission request. The -w flag takes the following arguments:
>>>
>>> `e' error - jobs with invalid requests will be rejected.
>>>
>>> `w' warning - only a warning will be displayed for invalid
>>> requests.
>>>
>>> `n' none - switches off validation; the default for qsub,
>>> qalter, qrsh, qsh and qlogin.
>>>
>>> `p' poke - does not submit the job but prints a validation
>>> report based on a cluster as is with all resource utilizations in
>>> place.
>>>
>>> `v' verify - does not submit the job but prints a
>>> validation report based on an empty cluster.
>>>
>>> http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html
>>>
>>> Thus with "-w e", if Grid Engine is not happy with the job at
>>> submission time (e.g. it thinks that it does not have enough nodes
>>> to run the job), it will reject the job submission.
>>>
>>> The correct way is to override the DRMAA request with "-w n" or "-w w"
>>> if you are going to use load-balancing.
>>>
>>> Rayson
>>>
>>>
>>>
>>> On Thu, Mar 6, 2014 at 2:13 PM, François-Michel L'Heureux
>>> <fmlheureux_at_datacratic.com> wrote:
>>>> Hi John
>>>>
>>>> I assume DRMAA is a replacement for OGS/SGE?
>>>>
>>>> About DRMAA bailing out, I don't know the product, but your guess is
>>>> likely correct: it might crash when nodes go away. There is a
>>>> somewhat similar issue with OGS, where we need to clean it up when
>>>> nodes go away. It doesn't crash though.
>>>>
>>>> For your second issue, regarding the execution host, again, I had a
>>>> similar issue with OGS. The trick I used is that I left the master
>>>> node as an execution host, but I set its number of slots to 0. Hence,
>>>> OGS is happy because there is at least one exec host, and the load
>>>> balancer runs just fine because when only the master node is online,
>>>> there are no slots, so it immediately adds nodes whenever jobs come
>>>> in. I don't know if there is a concept of slots in DRMAA or if this
>>>> version of the load balancer uses it, but if so, I think you could
>>>> reproduce my trick.
>>>>
>>>> I hope it will help you.
>>>>
>>>> Mich
>>>>
>>>> _______________________________________________
>>>> StarCluster mailing list
>>>> StarCluster_at_mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>
>>
>>
>
Received on Tue Mar 11 2014 - 16:55:10 EDT