Re: Instances are not accepting jobs when the slots are available.

From: Rayson Ho <no email>
Date: Fri, 18 Jul 2014 19:26:20 -0400

(The list blocked this email I was trying to send earlier -- it
reached Jin, but I am sending it to the list again so that in the
future when people google search similar issues, they will have a
complete record of what happened.)

The best way to handle this is to set up a parallel environment (PE) in Grid
Engine, so that the Grid Engine scheduler knows that your job uses more than
1 CPU core.
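
As a rough illustration of the PE request, here is a minimal Python sketch that builds a qsub command line asking for a number of slots. The PE name "smp" and the script name are assumptions for this example only; the PE has to exist on the cluster already (an administrator would normally create it with qconf), and its name may differ on your setup.

```python
import shlex


def build_qsub_command(script, slots, pe_name="smp"):
    """Return the argv for a qsub call that requests `slots` slots from a PE.

    `pe_name` is a hypothetical PE name; replace it with one that is
    actually configured on the cluster.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    # `-pe <pe_name> <slots>` tells the scheduler how many slots the job needs.
    return ["qsub", "-pe", pe_name, str(slots), script]


# Example: a job whose BLAS library will start 32 threads.
command = build_qsub_command("run_analysis.sh", 32)
print(shlex.join(command))
```

With a request like this, the scheduler counts the job against 32 slots instead of 1, so it will not pack other jobs onto cores the threads are already using.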

Also, Grid Engine has the ability to bind a job to a CPU core, but ATLAS
would still create 32 threads on a c3.8xlarge, because it still sees the full
hardware configuration. So with core binding, the job can only use 1 CPU
core, yet ATLAS still creates 32 threads, and all 32 threads will be fighting
for that one core. ATLAS also does not have a way to dynamically limit the
number of threads it creates. IMO, if R works with other BLAS libraries, then
switching to something like OpenBLAS and telling it to use only the number of
cores Grid Engine assigns to the job would fix this issue. And if you have a
large number of R jobs, then using a serial BLAS library can be a good choice
too, because you can then run 32 jobs that each use 1 virtual core of the
c3.8xlarge.
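
One way to tell OpenBLAS how many cores the job was given is sketched below as a small Python wrapper. It assumes the job runs under Grid Engine, which sets the NSLOTS environment variable to the number of granted slots, and that R is linked against OpenBLAS, which reads OPENBLAS_NUM_THREADS (and also honors OMP_NUM_THREADS). The function name and the R script name are made up for this example.

```python
import os

# Environment variables that OpenBLAS consults for its thread count.
BLAS_THREAD_VARS = ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS")


def blas_env_for_job(environ=None):
    """Return a copy of the environment with BLAS threads capped at NSLOTS.

    Falls back to 1 thread when NSLOTS is missing or not a number, which
    is the safe choice for a plain serial job.
    """
    env = dict(os.environ if environ is None else environ)
    try:
        slots = int(env.get("NSLOTS", "1"))
    except ValueError:
        slots = 1
    slots = max(slots, 1)
    for var in BLAS_THREAD_VARS:
        env[var] = str(slots)
    return env


if __name__ == "__main__":
    env = blas_env_for_job()
    print("BLAS threads for this job:", env["OPENBLAS_NUM_THREADS"])
    # To launch R with the capped thread count (hypothetical script name):
    # import subprocess
    # subprocess.run(["Rscript", "analysis.R"], env=env, check=True)
```

A job submitted without a PE gets NSLOTS=1, so the same wrapper also covers the "32 serial jobs per c3.8xlarge" case.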

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html

On Fri, Jul 18, 2014 at 12:01 PM, Jin Yu <yujin2004_at_gmail.com> wrote:

> Rayson,
>
> Thank you for pointing out that ATLAS in the StarCluster AMI can create
> multiple threads by default in matrix manipulations. I checked my running
> jobs again more carefully using htop and I found each job created exactly
> 32 threads! That is why I have an overload of ~100.
>
> I have used qalter to set a hard limit of one core on all the waiting
> jobs. I will report back if it is not working.
>
> Thanks!
> Jin
Received on Fri Jul 18 2014 - 19:26:23 EDT

This archive was generated by hypermail 2.3.0.
