StarCluster - Mailing List Archive

Re: Instances are not accepting jobs when the slots are available.

From: Jin Yu <no email>
Date: Thu, 17 Jul 2014 21:03:20 -0500

Hi Rayson,

My local VirtualBox VM also has more than one core, but the CPU usage never
exceeds 100% when I run the job locally. I also checked with htop: the
process has only one thread.

I don't use any multicore or parallel modules in my R code, but I do have a
couple of svd calls and exception handlers.
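
One thought: svd() in R goes through LAPACK/BLAS, so if the R on the cluster
AMI is linked against a multithreaded BLAS (OpenBLAS or MKL, say), a single
svd call could spin up one worker thread per core even though my script is
single-threaded. I haven't confirmed which BLAS the AMI ships, but capping
the BLAS thread count in the job script before launching R would be a cheap
experiment (which variable takes effect depends on the BLAS build, and the
script name below is just a stand-in):

    # Cap worker threads to 1 before starting R; which variable matters
    # depends on how the BLAS was built (unconfirmed here).
    export OMP_NUM_THREADS=1        # OpenMP-based BLAS builds
    export OPENBLAS_NUM_THREADS=1   # OpenBLAS's own setting
    export MKL_NUM_THREADS=1        # Intel MKL, if R is linked against it
    Rscript myjob.R                 # stand-in for the actual job script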

Thanks!
Jin


On Thu, Jul 17, 2014 at 5:58 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:

> What R program or module are you using? Since there are 32 virtual cores
> on the c3.8xlarge, if your code by default creates one thread per virtual
> core, you will see up to 3200% CPU usage on the c3.8xlarge. Could it be
> that your local machine has far fewer cores, and that is why this does not
> happen there?
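>
> A quick way to check is to count the threads of one of the running R
> processes on a node (the PID below is a placeholder):
>
>     $ ps -o pid,nlwp,comm -p <pid>   # nlwp = number of threads
>
> If nlwp is close to 32 on the c3.8xlarge but 1 on your local VM, then a
> library underneath your R code is spawning one thread per core.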
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>
>
> On Thu, Jul 17, 2014 at 6:06 PM, Jin Yu <yujin2004_at_gmail.com> wrote:
>
>> Here is a follow-up on my investigation of the unusually high CPU/core
>> usage on the EC2 instances.
>>
>> In my last post, I reported two observations: 1. unusually high CPU/core
>> usage by R processes that are designed to use a single core (and do so on
>> my local machine); and 2. an unusually high percentage of kernel time in
>> the CPU usage.
>>
>> I looked further into the R processes with htop and found that many
>> threads had been created in each of them, and each thread was issuing a
>> large number of sched_yield() system calls.
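>>
>> For anyone who wants to reproduce the measurement, a per-syscall summary
>> can be captured with strace (-f follows the process's threads; the PID is
>> a placeholder):
>>
>>     $ sudo strace -c -f -p <pid>   # attach, Ctrl-C after ~10s for a summary
>>
>> The summary is dominated by sched_yield, which as far as I understand is
>> typical of threads busy-waiting (spinning) rather than sleeping.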
>>
>> Does this behavior with StarCluster on EC2 ring a bell for anyone?
>>
>> Thanks!
>> Jin
>>
>> On Thu, Jul 17, 2014 at 3:48 PM, Jin Yu <yujin2004_at_gmail.com> wrote:
>>
>>> Hi Chris,
>>>
>>> Thanks for your prompt reply and for pointing me to the unusually high
>>> load on the instances! I have since found something even more mysterious
>>> on the EC2 instances (c3.8xlarge, to be specific):
>>>
>>> 1. Some of my jobs are using as much as 900% CPU, although they are
>>> designed to use only one core and behave that way on my local machine.
>>> This is what leads to the unexpectedly high system load. Below is an
>>> example snapshot of these processes.
>>>
>>> 2. All 8 running jobs together take about 3000% CPU, which is close to
>>> the full capacity of the 32 cores, yet kernel time accounts for up to
>>> 70% of that CPU time.
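>>>
>>> (A convenient way to watch the per-core user/system split is mpstat
>>> from the sysstat package, which may need to be installed on the AMI:
>>>
>>>     $ mpstat -P ALL 5   # per-CPU %usr/%sys, sampled every 5 seconds
>>>
>>> which makes the kernel-time share easy to confirm.)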
>>>
>>> Are these problems related to the virtualized nature of the EC2
>>> instances? Can you give me a hint on how to investigate them?
>>>
>>> Thanks!
>>> Jin
>>>
>>>
>>>
>>> [image: Inline image 1]
>>>
>>> On Thu, Jul 17, 2014 at 1:48 PM, Chris Dagdigian <dag_at_bioteam.net>
>>> wrote:
>>>
>>>>
>>>> Hi Jin,
>>>>
>>>> The cluster is not accepting jobs into those open slots because your
>>>> compute nodes are reporting alarm state "a" - your first host has a
>>>> reported load average of 148!
>>>>
>>>> Alarm state 'a' means "load threshold alarm level reached"; in other
>>>> words, the server load is high enough that the node refuses new work
>>>> until the load average comes back down.
>>>>
>>>> All of those load alarm thresholds are configurable values within SGE,
>>>> so you can revise them upwards if you want.
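>>>>
>>>> For example, you can inspect the queue-level threshold and then raise
>>>> it with qconf (the 4.0 below is only an illustrative value, not a
>>>> recommendation):
>>>>
>>>>     $ qconf -sq all.q | grep load_thresholds
>>>>     # typically shows the SGE default: np_load_avg=1.75
>>>>     $ qconf -mattr queue load_thresholds np_load_avg=4.0 all.q
>>>>
>>>> Note that np_load_avg is the load average normalized by the number of
>>>> processors, so 1.75 on a 32-core node corresponds to a raw load
>>>> average of 56.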
>>>>
>>>> Regards,
>>>> Chris
>>>>
>>>>
>>>> Jin Yu wrote:
>>>> > Hello,
>>>> >
>>>> > I just started a cluster of 20 c3.8xlarge instances, each of which
>>>> > has 32 virtual cores. In my understanding, each instance should have
>>>> > 32 slots available for running jobs by default. But after running for
>>>> > a while, I found that many nodes are not running at full capacity.
>>>> >
>>>> > In the example below, you can see that node016 has only 13 jobs
>>>> > running and node017 has 9, while node018 has 32. I have another
>>>> > ~10000 jobs waiting in the queue, so it is not a matter of running
>>>> > out of jobs.
>>>> >
>>>> > Can anyone give me a hint as to what is going on here?
>>>> >
>>>> > Thanks!
>>>> > Jin
>>>> >
>>>> >
>>>> > queuename                      qtype resv/used/tot. load_avg arch       states
>>>> > ---------------------------------------------------------------------------------
>>>> > all.q@node016                  BIP   0/13/32        148.35   linux-x64  a
>>>> >     784 0.55500 job.part.a sgeadmin     r     07/17/2014 11:25:59     1
>>>> >     982 0.55500 job.part.a sgeadmin     r     07/17/2014 14:43:59     1
>>>> >    1056 0.55500 job.part.a sgeadmin     r     07/17/2014 16:34:44     1
>>>> >    1057 0.55500 job.part.a sgeadmin     r     07/17/2014 16:34:44     1
>>>> >    1058 0.55500 job.part.a sgeadmin     r     07/17/2014 16:34:59     1
>>>> >    1121 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>> >    1122 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>> >    1123 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>> >    1124 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>> >    1125 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>> >    1126 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>> >    1127 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>> >    1128 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>> > ---------------------------------------------------------------------------------
>>>> > all.q@node017                  BIP   0/9/32         83.86    linux-x64  a
>>>> >     568 0.55500 job.part.a sgeadmin     r     07/17/2014 04:01:14     1
>>>> >    1001 0.55500 job.part.a sgeadmin     r     07/17/2014 15:07:29     1
>>>> >    1002 0.55500 job.part.a sgeadmin     r     07/17/2014 15:07:29     1
>>>> >    1072 0.55500 job.part.a sgeadmin     r     07/17/2014 16:53:29     1
>>>> >    1116 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:29     1
>>>> >    1117 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:29     1
>>>> >    1118 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:44     1
>>>> >    1119 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:59     1
>>>> >    1120 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:59     1
>>>> > ---------------------------------------------------------------------------------
>>>> > all.q@node018                  BIP   0/32/32        346.00   linux-x64  a
>>>> > _______________________________________________
>>>> > StarCluster mailing list
>>>> > StarCluster_at_mit.edu
>>>> > http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>
>>>
>>>
>>
>



(image/png attachment: ScreenClip.png)

Received on Thu Jul 17 2014 - 22:03:38 EDT