StarCluster - Mailing List Archive

Re: Instances are not accepting jobs when the slots are available.

From: Chris Dagdigian <no email>
Date: Thu, 17 Jul 2014 14:48:55 -0400

Hi Jin,

The cluster is not accepting jobs into those open slots because your
compute nodes are reporting alarm state "a" - your first host has a
reported load average of 148!

Alarm state 'a' means "load threshold alarm level reached": the server
load is high enough that the node refuses new work until the load
average goes back down.

All of those load alarm thresholds are configurable values within SGE,
so you can revise them upwards if you want.
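For reference, SGE derives a per-core load figure (np_load_avg, the load average divided by the processor count) and raises the 'a' alarm once it crosses the queue's load_thresholds setting; the stock default is np_load_avg=1.75. A rough sketch of the arithmetic for node016, assuming that default threshold (the qconf commands in the comments are the standard way to inspect and edit the queue definition):

```shell
# Sketch of the alarm arithmetic, not an official SGE tool.
#
# Inspect the configured threshold for all.q:
#   qconf -sq all.q | grep load_thresholds   # default: np_load_avg=1.75
# Edit it (opens the queue definition in $EDITOR):
#   qconf -mq all.q
#
# Plugging in node016's numbers (load 148.35 on 32 virtual cores):
awk 'BEGIN {
  load = 148.35; nproc = 32; thr = 1.75   # thr: assumed default np_load_avg
  np = load / nproc
  if (np > thr) s = "yes"; else s = "no"
  printf "np_load_avg=%.2f threshold=%.2f alarm=%s\n", np, thr, s
}'
# prints: np_load_avg=4.64 threshold=1.75 alarm=yes
```

At 4.64, node016 is well past the 1.75 default, which is why it stops accepting jobs even with 19 slots free.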

Regards,
Chris


Jin Yu wrote:
> Hello,
>
> I just started a cluster of 20 c3.8xlarge instances, each with 32
> virtual cores. In my understanding, each instance should have 32
> slots available to run jobs by default. But after running it for a
> while, I found that many nodes are not running at full capacity.
>
> In the example below, you can see node016 has only 13 jobs running
> and node017 has 9, while node018 has 32 jobs running. I have another
> ~10000 jobs waiting in the queue, so it is not a matter of running
> out of jobs.
>
> Can anyone give me a hint what is going on here?
>
> Thanks!
> Jin
>
>
> all.q@node016 BIP 0/13/32 148.35 linux-x64 a
> 784 0.55500 job.part.a sgeadmin r 07/17/2014 11:25:59 1
> 982 0.55500 job.part.a sgeadmin r 07/17/2014 14:43:59 1
> 1056 0.55500 job.part.a sgeadmin r 07/17/2014 16:34:44 1
> 1057 0.55500 job.part.a sgeadmin r 07/17/2014 16:34:44 1
> 1058 0.55500 job.part.a sgeadmin r 07/17/2014 16:34:59 1
> 1121 0.55500 job.part.a sgeadmin r 07/17/2014 17:22:44 1
> 1122 0.55500 job.part.a sgeadmin r 07/17/2014 17:22:44 1
> 1123 0.55500 job.part.a sgeadmin r 07/17/2014 17:22:44 1
> 1124 0.55500 job.part.a sgeadmin r 07/17/2014 17:22:44 1
> 1125 0.55500 job.part.a sgeadmin r 07/17/2014 17:22:44 1
> 1126 0.55500 job.part.a sgeadmin r 07/17/2014 17:22:44 1
> 1127 0.55500 job.part.a sgeadmin r 07/17/2014 17:22:44 1
> 1128 0.55500 job.part.a sgeadmin r 07/17/2014 17:22:44 1
> ---------------------------------------------------------------------------------
> all.q@node017 BIP 0/9/32 83.86 linux-x64 a
> 568 0.55500 job.part.a sgeadmin r 07/17/2014 04:01:14 1
> 1001 0.55500 job.part.a sgeadmin r 07/17/2014 15:07:29 1
> 1002 0.55500 job.part.a sgeadmin r 07/17/2014 15:07:29 1
> 1072 0.55500 job.part.a sgeadmin r 07/17/2014 16:53:29 1
> 1116 0.55500 job.part.a sgeadmin r 07/17/2014 17:19:29 1
> 1117 0.55500 job.part.a sgeadmin r 07/17/2014 17:19:29 1
> 1118 0.55500 job.part.a sgeadmin r 07/17/2014 17:19:44 1
> 1119 0.55500 job.part.a sgeadmin r 07/17/2014 17:19:59 1
> 1120 0.55500 job.part.a sgeadmin r 07/17/2014 17:19:59 1
> ---------------------------------------------------------------------------------
> all.q@node018 BIP 0/32/32 346.00 linux-x64 a
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
Received on Thu Jul 17 2014 - 14:49:00 EDT
This archive was generated by hypermail 2.3.0.
