Re: Instances are not accepting jobs when the slots are available.

From: Jin Yu <no email>
Date: Fri, 18 Jul 2014 15:39:13 -0500

I just found that OpenBLAS is already the default BLAS library in my customized
AMI, and the problem is solved by simply setting "export
OPENBLAS_NUM_THREADS=1". Thanks, Rayson and Chris, for your valuable
suggestions. Without your prompt help, I could not have figured out the issue
so fast.
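
A minimal sketch of how this setting could go into a Grid Engine job script, so that the BLAS thread count follows the number of slots the scheduler grants. It assumes the job runs under SGE, which sets NSLOTS for the job. The helper name blas_threads and the commented-out myscript.R payload are hypothetical, and falling back to 1 thread outside the scheduler is an assumption, not something taken from this thread:

```shell
#!/bin/sh
# Sketch of a Grid Engine job script (not a definitive setup).
#$ -cwd
#$ -S /bin/sh

# Grid Engine sets NSLOTS to the number of slots granted to the job.
# Fall back to 1 when the script runs outside the scheduler (assumption).
blas_threads() {
    echo "${NSLOTS:-1}"
}

# OpenBLAS reads this variable at startup to cap its thread count.
OPENBLAS_NUM_THREADS="$(blas_threads)"
export OPENBLAS_NUM_THREADS

# Rscript myscript.R   # hypothetical job payload, left commented out
```

For purely serial jobs, a plain "export OPENBLAS_NUM_THREADS=1" as above has the same effect; the NSLOTS form only matters if a job is later submitted to a PE with more than one slot.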

I have submitted 20,000 jobs totaling ~100,000 CPU hours. I am using the load
balancer to gradually increase the size of the cluster to ~5000 cores. Let's
see how well SGE and StarCluster handle this.

Thanks!
Jin

On Fri, Jul 18, 2014 at 12:29 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:

> The best way to handle this is to set up a PE (parallel environment) in Grid
> Engine, so that the Grid Engine scheduler knows that your job uses more than
> 1 CPU core.
>
> Also, Grid Engine has the ability to bind a job to a CPU core, but ATLAS
> would still create 32 threads on a c3.8xlarge, as it still gets the full
> view of the hardware configuration. So with core binding, the job can only
> use 1 CPU, but ATLAS still creates 32 threads, and all 32 threads will be
> fighting for CPU resources. And ATLAS does not have a dynamic way to limit
> the number of threads it creates. IMO, if R works with other BLAS libraries,
> then switching to something like OpenBLAS and telling it to use only the
> number of cores Grid Engine assigns to the job would fix this issue. And if
> you have a large number of R jobs, then using a serial BLAS library can be a
> good choice too, because you then have 32 jobs that each use 1 virtual core
> of the c3.8xlarge.
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>
>
> On Fri, Jul 18, 2014 at 12:01 PM, Jin Yu <yujin2004_at_gmail.com> wrote:
>
>> Rayson,
>>
>> Thank you for pointing out that ATLAS in the StarCluster AMI creates
>> multiple threads by default in matrix manipulations. I checked my running
>> jobs again more carefully using htop and found that each job created
>> exactly 32 threads! That is why I have an overload of ~100.
>>
>> I have used qalter to set a hard limit of one core on all the waiting
>> jobs. I will report back if it does not work.
>>
>> Thanks!
>> Jin
>>
>>
>> On Fri, Jul 18, 2014 at 10:04 AM, Jin Yu <yujin2004_at_gmail.com> wrote:
>>
>>> Ha, I see. This is probably the cause, and I want to validate it. Do you
>>> know how to configure a job in SGE to force it to use only one core?
>>>
>>> Thanks!
>>> Jin
>>>
>>>
>>> On Fri, Jul 18, 2014 at 12:09 AM, Rayson Ho <raysonlogin_at_gmail.com>
>>> wrote:
>>>
>>>> The BLAS library in the StarCluster AMI is ATLAS, which takes advantage
>>>> of multicore machines by running each BLAS call using multiple threads.
>>>>
>>>> I googled and found that SVD with ATLAS does use more than 1 core:
>>>> http://blog.revolutionanalytics.com/2010/06/performance-benefits-of-multithreaded-r.html
>>>>
>>>> This can explain the behavior you are getting...
>>>>
>>>> Rayson
>>>>
>>>>
>>>> On Thu, Jul 17, 2014 at 10:03 PM, Jin Yu <yujin2004_at_gmail.com> wrote:
>>>>
>>>>> Hi Rayson,
>>>>>
>>>>> My local VirtualBox VM also has more than one core, but the CPU usage
>>>>> is never more than 100% when I run it locally. I also checked it using
>>>>> htop; it has only one thread.
>>>>>
>>>>> I don't have any multicore or parallel modules in my R code, but I
>>>>> do have a couple of svd calls and exception handlers in it.
>>>>>
>>>>> Thanks!
>>>>> Jin
>>>>>
>>>>>
>>>>> On Thu, Jul 17, 2014 at 5:58 PM, Rayson Ho <raysonlogin_at_gmail.com>
>>>>> wrote:
>>>>>
>>>>>> What R program or module are you using? As there are 32 virtual
>>>>>> cores on the c3.8xlarge, if your code by default creates 1 thread per
>>>>>> virtual core, then there will be 3200% CPU usage on the c3.8xlarge.
>>>>>> Also, maybe your local machine has far fewer cores, and that's why this
>>>>>> is not happening?
>>>>>>
>>>>>> Rayson
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 17, 2014 at 6:06 PM, Jin Yu <yujin2004_at_gmail.com> wrote:
>>>>>>
>>>>>>> Here is a follow-up on my investigation of the unusually high CPU/core
>>>>>>> usage on EC2 instances.
>>>>>>>
>>>>>>> In the last post, I reported two observations: 1. unusually high
>>>>>>> CPU/core usage by the R process on EC2 instances, which is designed to
>>>>>>> use one core on the local machine; and 2. an unusually high percentage
>>>>>>> of kernel time in the CPU usage.
>>>>>>>
>>>>>>> I looked further into the R processes using htop and found that a lot
>>>>>>> of threads were created in each of them, and that there are tons of
>>>>>>> sched_yield() system calls in each thread.
>>>>>>>
>>>>>>> Do these phenomena with StarCluster on EC2 ring a bell for anyone?
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Jin
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
Received on Fri Jul 18 2014 - 16:39:14 EDT
