StarCluster - Mailing List Archive

Re: Instances are not accepting jobs when the slots are available.

From: Jin Yu <no email>
Date: Fri, 18 Jul 2014 10:04:58 -0500

Ha, I see. That is probably the cause, and I want to validate it. Do you
know how to configure a job in SGE to force it to use only one core per
job?

Thanks!
Jin

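One hedged way to validate the multithreaded-BLAS explanation is to confine a job to a single core and see whether the load on the node drops. The sketch below is Python rather than an SGE setting. `run_pinned` is a hypothetical helper name, it relies on the Linux-only `os.sched_setaffinity`, and `job.R` is a placeholder script name. It does not change how many threads ATLAS starts; it only makes those threads share one core.

```python
import os
import subprocess


def run_pinned(cmd, **run_kwargs):
    """Run cmd with its CPU affinity limited to one core (Linux only).

    run_pinned is a hypothetical helper, not part of SGE or StarCluster.
    """
    # Pick the lowest-numbered core this process is currently allowed to use.
    core = min(os.sched_getaffinity(0))

    def pin_child():
        # Runs in the child just before exec; threads the child creates
        # later inherit the same single-core affinity mask.
        os.sched_setaffinity(0, {core})

    return subprocess.run(cmd, preexec_fn=pin_child, **run_kwargs)


# Example (job.R is a placeholder script name):
# run_pinned(["Rscript", "job.R"])
```

Grid Engine 6.2u5 and later also document a `-binding` option for `qsub` (for example `-binding linear:1`). Whether core binding is effective on the StarCluster AMI is not confirmed anywhere in this thread, so treat both approaches as things to test.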

On Fri, Jul 18, 2014 at 12:09 AM, Rayson Ho <raysonlogin_at_gmail.com> wrote:

> The BLAS library in the StarCluster AMI is ATLAS, which takes advantage of
> multicore machines by running each BLAS call using multiple threads.
>
> I googled and found that SVD with ATLAS does use more than 1 core:
> http://blog.revolutionanalytics.com/2010/06/performance-benefits-of-multithreaded-r.html
>
> This can explain the behavior you are getting...
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>
>
> On Thu, Jul 17, 2014 at 10:03 PM, Jin Yu <yujin2004_at_gmail.com> wrote:
>
>> Hi Rayson,
>>
>> My local VirtualBox VM also has more than one core, but the CPU usage is
>> never more than 100% when I run it locally. I also checked it using htop;
>> it only has one thread.
>>
>> I don't have any multicore or parallel modules in my R code, but I do
>> have a couple of svd calls and exception handlers in it.
>>
>> Thanks!
>> Jin
>>
>>
>> On Thu, Jul 17, 2014 at 5:58 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
>>
>>> What R program or module are you using? As there are 32 virtual cores
>>> on the C3.8xlarge, if your code by default creates 1 thread per virtual
>>> core, then there can be up to 3200% CPU usage on the C3.8xlarge. Also,
>>> maybe your local machine has far fewer cores and that's why this is not
>>> happening?
>>>
>>> Rayson
>>>
>>>
>>>
>>> On Thu, Jul 17, 2014 at 6:06 PM, Jin Yu <yujin2004_at_gmail.com> wrote:
>>>
>>>> Here is a follow-up on my investigation of the unusually high CPU/core
>>>> usage on the EC2 instances.
>>>>
>>>> In the last post, I reported two observations: 1. unusually high
>>>> CPU/core usage by the R process on the EC2 instances, although it is
>>>> designed to use one core and does so on the local machine; and 2. an
>>>> unusually high percentage of kernel time in the CPU usage.
>>>>
>>>> I looked further into the R processes using htop and found that a lot of
>>>> threads had been created in each of them, and that each thread makes
>>>> tons of sched_yield() system calls.
>>>>
>>>> Do these phenomena with StarCluster on EC2 ring a bell for anyone?
>>>>
>>>> Thanks!
>>>> Jin
>>>>
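The per-process thread count seen in htop can also be read directly from /proc, which makes it easy to log from inside a job. This is a minimal sketch assuming a Linux node; `thread_count` is a hypothetical helper name, not an SGE or StarCluster tool.

```python
import os


def thread_count(pid):
    """Return the number of kernel threads in process `pid` (Linux only).

    Each thread of a process appears as one entry under /proc/<pid>/task.
    """
    return len(os.listdir(f"/proc/{pid}/task"))


# Example: report the thread count of the current process.
if __name__ == "__main__":
    print(thread_count(os.getpid()))
```

A single-threaded R process should report 1 here. A count close to the number of virtual cores would be consistent with a BLAS library that starts one thread per core.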
>>>>
>>>>
>>>>
>>>> On Thu, Jul 17, 2014 at 3:48 PM, Jin Yu <yujin2004_at_gmail.com> wrote:
>>>>
>>>>> Hi Chris,
>>>>>
>>>>> Thanks for your prompt reply and for pointing me to the unusually high
>>>>> load on the instances! I found something more mysterious on the EC2
>>>>> instances (C3.8xlarge, to be more specific):
>>>>>
>>>>> 1. Some of my jobs are using as much as 900% CPU, although
>>>>> these jobs are designed to use only one core and do so on my local
>>>>> machine. This leads to the unexpectedly high system load. Below is
>>>>> an example snapshot of these processes.
>>>>>
>>>>> 2. All 8 running jobs together take 3000% CPU, which is close to the
>>>>> full capacity of 32 cores, and kernel time accounts for up to 70% of
>>>>> the CPU time.
>>>>>
>>>>> Are these problems related to the virtualized nature of the EC2
>>>>> instances? Can you give me a hint on how to investigate them?
>>>>>
>>>>> Thanks!
>>>>> Jin
>>>>>
>>>>>
>>>>>
>>>>> [image: Inline image 1]
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 17, 2014 at 1:48 PM, Chris Dagdigian <dag_at_bioteam.net> wrote:
>>>>>
>>>>>>
>>>>>> Hi Jin,
>>>>>>
>>>>>> The cluster is not accepting jobs into those open slots because your
>>>>>> compute nodes are reporting alarm state "a" - your first host has a
>>>>>> reported load average of 148!
>>>>>>
>>>>>> Alarm state 'a' means "load threshold alarm level reached". It basically
>>>>>> means that the server load is high enough that the nodes are refusing
>>>>>> new work until the load average goes down.
>>>>>>
>>>>>> All of those load alarm thresholds are configurable values within SGE, so
>>>>>> you can revise them upwards if you want.
>>>>>>
>>>>>> Regards,
>>>>>> Chris
>>>>>>
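To make the alarm condition concrete, here is a small sketch of the arithmetic, using the load averages from the qstat listing quoted further down. It assumes the queue still has the stock Grid Engine setting `load_thresholds np_load_avg=1.75`, which `qconf -sq all.q` should confirm; the thread itself does not show the configured value. The function names are hypothetical.

```python
# ASSUMPTION: stock Grid Engine queue setting (load_thresholds np_load_avg=1.75).
# The actual value on this cluster is not shown in the thread.
ASSUMED_NP_LOAD_THRESHOLD = 1.75


def np_load_avg(load_avg, cores):
    """Load average normalized by the number of cores on the host."""
    return load_avg / cores


def in_load_alarm(load_avg, cores, threshold=ASSUMED_NP_LOAD_THRESHOLD):
    """True if the normalized load exceeds the queue's load threshold."""
    return np_load_avg(load_avg, cores) > threshold


# Load averages reported in the thread; each c3.8xlarge has 32 virtual cores.
reported = {"node016": 148.35, "node017": 83.86, "node018": 346.00}

if __name__ == "__main__":
    for host, load in reported.items():
        print(host, round(np_load_avg(load, 32), 2), in_load_alarm(load, 32))
```

Under that assumption, a 32-core node goes into alarm once its load average passes 1.75 * 32 = 56. The three reported nodes sit at roughly 4.6, 2.6 and 10.8 per core, so all of them are over the threshold. Raising the threshold with `qconf -mq all.q` would let more jobs in, but it would not remove the extra threads that are driving the load.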
>>>>>>
>>>>>> Jin Yu wrote:
>>>>>> > Hello,
>>>>>> >
>>>>>> > I just started a cluster of 20 c3.8xlarge instances, which have 32
>>>>>> > virtual cores each. As I understand it, each instance should have
>>>>>> > 32 slots available to run jobs by default. But after running it
>>>>>> > for a while, I found that a lot of nodes are not running at full
>>>>>> > speed.
>>>>>> >
>>>>>> > In the following example, you can see that node016 has only 13 jobs
>>>>>> > running and node017 has 9 jobs running, while node018 has 32 jobs
>>>>>> > running. I have another ~10000 jobs waiting in the queue, so it is
>>>>>> > not a matter of running out of jobs.
>>>>>> >
>>>>>> > Can anyone give me a hint as to what is going on here?
>>>>>> >
>>>>>> > Thanks!
>>>>>> > Jin
>>>>>> >
>>>>>> >
>>>>>> > all.q@node016 BIP 0/13/32 148.35 linux-x64 a
>>>>>> >     784 0.55500 job.part.a sgeadmin r 07/17/2014 11:25:59 1
>>>>>> >     982 0.55500 job.part.a sgeadmin r 07/17/2014 14:43:59 1
>>>>>> >    1056 0.55500 job.part.a sgeadmin r 07/17/2014 16:34:44 1
>>>>>> >    1057 0.55500 job.part.a sgeadmin r 07/17/2014 16:34:44 1
>>>>>> >    1058 0.55500 job.part.a sgeadmin r 07/17/2014 16:34:59 1
>>>>>> >    1121 0.55500 job.part.a sgeadmin r 07/17/2014 17:22:44 1
>>>>>> >    1122 0.55500 job.part.a sgeadmin r 07/17/2014 17:22:44 1
>>>>>> >    1123 0.55500 job.part.a sgeadmin r 07/17/2014 17:22:44 1
>>>>>> >    1124 0.55500 job.part.a sgeadmin r 07/17/2014 17:22:44 1
>>>>>> >    1125 0.55500 job.part.a sgeadmin r 07/17/2014 17:22:44 1
>>>>>> >    1126 0.55500 job.part.a sgeadmin r 07/17/2014 17:22:44 1
>>>>>> >    1127 0.55500 job.part.a sgeadmin r 07/17/2014 17:22:44 1
>>>>>> >    1128 0.55500 job.part.a sgeadmin r 07/17/2014 17:22:44 1
>>>>>> > ---------------------------------------------------------------------------------
>>>>>> > all.q@node017 BIP 0/9/32 83.86 linux-x64 a
>>>>>> >     568 0.55500 job.part.a sgeadmin r 07/17/2014 04:01:14 1
>>>>>> >    1001 0.55500 job.part.a sgeadmin r 07/17/2014 15:07:29 1
>>>>>> >    1002 0.55500 job.part.a sgeadmin r 07/17/2014 15:07:29 1
>>>>>> >    1072 0.55500 job.part.a sgeadmin r 07/17/2014 16:53:29 1
>>>>>> >    1116 0.55500 job.part.a sgeadmin r 07/17/2014 17:19:29 1
>>>>>> >    1117 0.55500 job.part.a sgeadmin r 07/17/2014 17:19:29 1
>>>>>> >    1118 0.55500 job.part.a sgeadmin r 07/17/2014 17:19:44 1
>>>>>> >    1119 0.55500 job.part.a sgeadmin r 07/17/2014 17:19:59 1
>>>>>> >    1120 0.55500 job.part.a sgeadmin r 07/17/2014 17:19:59 1
>>>>>> > ---------------------------------------------------------------------------------
>>>>>> > all.q@node018 BIP 0/32/32 346.00 linux-x64 a
>>>>>> >
>>>>>> > _______________________________________________
>>>>>> > StarCluster mailing list
>>>>>> > StarCluster_at_mit.edu
>>>>>> > http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>



(image/png attachment: ScreenClip.png)

Received on Fri Jul 18 2014 - 11:05:02 EDT
This archive was generated by hypermail 2.3.0.
