Re: loadbalance


From: Ryan Golhar <no email>
Date: Thu, 19 Sep 2013 01:51:42 -0400

It's happening again.

output from qstat (truncated):

job-ID  prior    name        user      state  submit/start at      queue             slots  ja-task-ID
------------------------------------------------------------------------------------------------------
  1211  0.55500  j1-00596-0  ec2-user  r      09/19/2013 05:27:35  all.q_at_node003  8
  1212  0.55500  j1-00604-0  ec2-user  r      09/19/2013 05:27:35  all.q_at_node005  8
  1214  0.55500  j1-00650-0  ec2-user  r      09/19/2013 05:27:35  all.q_at_node002  8
  1215  0.55500  j1-00984-0  ec2-user  r      09/19/2013 05:27:35  all.q_at_node009  8
  1216  0.55500  j1-01025-0  ec2-user  r      09/19/2013 05:27:35  all.q_at_node017  8
  1217  0.55500  j1-01026-0  ec2-user  r      09/19/2013 05:27:35  all.q_at_node010  8
  1218  0.55500  j1-01026    ec2-user  r      09/19/2013 05:27:35  all.q_at_node006  8
  1219  0.55500  j1-01053    ec2-user  r      09/19/2013 05:27:35  all.q_at_node007  8
  1220  0.55500  j1-01106-0  ec2-user  r      09/19/2013 05:27:35  all.q_at_node012  8
  1221  0.55500  j1-01119-0  ec2-user  r      09/19/2013 05:27:35  all.q_at_node016  8
  1222  0.55500  j1-01175    ec2-user  r      09/19/2013 05:27:35  all.q_at_node020  8
  1223  0.55500  j1-01178    ec2-user  r      09/19/2013 05:27:35  all.q_at_node018  8
  1224  0.55500  j1-01184-0  ec2-user  r      09/19/2013 05:27:35  all.q_at_node019  8
  1225  0.55500  j1-01184-0  ec2-user  r      09/19/2013 05:27:35  all.q_at_node015  8
  1226  0.55500  j1-01184    ec2-user  r      09/19/2013 05:27:35  all.q_at_node014  8
  1227  0.55500  j1-01190-0  ec2-user  r      09/19/2013 05:27:35  all.q_at_node011  8
  1228  0.55500  j1-01190-0  ec2-user  r      09/19/2013 05:27:35  all.q_at_master   8
  1229  0.55500  j1-01190    ec2-user  r      09/19/2013 05:27:35  all.q_at_node013  8
  1230  0.55500  j1-01244-0  ec2-user  r      09/19/2013 05:27:35  all.q_at_node008  8
  1231  0.55500  j1-01244-0  ec2-user  r      09/19/2013 05:27:35  all.q_at_node001  8
  1232  0.55500  j1-01244    ec2-user  r      09/19/2013 05:45:05  all.q_at_node004  8
  1233  0.55500  j1-01260-0  ec2-user  qw     09/19/2013 05:27:28                    8
  1234  0.55500  j1-01260-0  ec2-user  qw     09/19/2013 05:27:28                    8
  1235  0.55500  j1-01260    ec2-user  qw     09/19/2013 05:27:28                    8
  1236  0.55500  j1-01265-0  ec2-user  qw     09/19/2013 05:27:28                    8
  1237  0.55500  j1-01265-0  ec2-user  qw     09/19/2013 05:27:28                    8
  1238  0.55500  j1-01265    ec2-user  qw     09/19/2013 05:27:28                    8
  1239  0.55500  j1-01272-0  ec2-user  qw     09/19/2013 05:27:28                    8

qacct:
Total System Usage
    WALLCLOCK      UTIME      STIME        CPU     MEMORY        IO    IOW
==========================================================================
        20647  12987.584   6567.578  25492.947  35430.706  1771.872  0.000

qhost:
HOSTNAME  ARCH       NCPU  LOAD   MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global    -          -     -      -       -       -       -
master    linux-x64  8     10.76  6.8G    3.8G    0.0     0.0
node001   linux-x64  8     0.03   6.8G    537.4M  0.0     0.0
node002   linux-x64  8     0.07   6.8G    539.7M  0.0     0.0
node003   linux-x64  8     4.27   6.8G    3.7G    0.0     0.0
node004   linux-x64  8     3.52   6.8G    283.0M  0.0     0.0
node005   linux-x64  8     4.36   6.8G    3.7G    0.0     0.0
node006   linux-x64  8     0.04   6.8G    642.3M  0.0     0.0
node007   linux-x64  8     0.12   6.8G    468.3M  0.0     0.0
node008   linux-x64  8     4.70   6.8G    3.8G    0.0     0.0
node009   linux-x64  8     0.04   6.8G    607.2M  0.0     0.0
node010   linux-x64  8     4.31   6.8G    3.7G    0.0     0.0
node011   linux-x64  8     3.96   6.8G    3.6G    0.0     0.0
node012   linux-x64  8     1.61   6.8G    280.4M  0.0     0.0
node013   linux-x64  8     1.31   6.8G    582.7M  0.0     0.0
node014   linux-x64  8     1.27   6.8G    375.2M  0.0     0.0
node015   linux-x64  8     1.19   6.8G    996.4M  0.0     0.0
node016   linux-x64  8     1.43   6.8G    349.8M  0.0     0.0
node017   linux-x64  8     1.40   6.8G    567.0M  0.0     0.0
node018   linux-x64  8     1.36   6.8G    262.4M  0.0     0.0
node019   linux-x64  8     1.43   6.8G    278.5M  0.0     0.0
node020   linux-x64  8     1.36   6.8G    402.6M  0.0     0.0
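
For anyone wanting to cross-check the two listings above, here is a minimal Python sketch that flags hosts which qstat shows as running a job while qhost reports a near-idle load. It is not part of StarCluster or Grid Engine; the function names and the 0.5 load threshold are assumptions chosen for illustration, and it expects one record per line (unwrapped output). The sample rows are copied from the output in this message.

```python
def parse_qhost(text):
    """Return {hostname: load} from qhost rows, one host per line."""
    loads = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 8 columns; skip the header and the 'global' row.
        if len(fields) != 8 or fields[0] in ("HOSTNAME", "global"):
            continue
        try:
            loads[fields[0]] = float(fields[3])
        except ValueError:
            continue  # load shown as '-' (no value reported)
    return loads


def running_hosts(text):
    """Return {hostname: [job ids]} for qstat rows in state 'r'."""
    hosts = {}
    for line in text.splitlines():
        fields = line.split()
        # Running rows carry a queue instance, so they have 9 columns;
        # pending ('qw') rows have no queue column and are skipped.
        if len(fields) == 9 and fields[4] == "r":
            # The archive shows 'all.q_at_node003'; real output uses '@'.
            host = fields[7].replace("_at_", "@").split("@")[-1]
            hosts.setdefault(host, []).append(fields[0])
    return hosts


def idle_but_busy(qstat_text, qhost_text, threshold=0.5):
    """Hosts with a running job whose load is below the (assumed) threshold."""
    loads = parse_qhost(qhost_text)
    jobs = running_hosts(qstat_text)
    return sorted(h for h in jobs if h in loads and loads[h] < threshold)


QSTAT = """\
1211 0.55500 j1-00596-0 ec2-user r 09/19/2013 05:27:35 all.q_at_node003 8
1214 0.55500 j1-00650-0 ec2-user r 09/19/2013 05:27:35 all.q_at_node002 8
1231 0.55500 j1-01244-0 ec2-user r 09/19/2013 05:27:35 all.q_at_node001 8
1233 0.55500 j1-01260-0 ec2-user qw 09/19/2013 05:27:28 8
"""

QHOST = """\
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
global - - - - - - -
node001 linux-x64 8 0.03 6.8G 537.4M 0.0 0.0
node002 linux-x64 8 0.07 6.8G 539.7M 0.0 0.0
node003 linux-x64 8 4.27 6.8G 3.7G 0.0 0.0
"""

if __name__ == "__main__":
    print(idle_but_busy(QSTAT, QHOST))
```

On the three sample hosts this reports node001 and node002, which both hold an 8-slot job in state r but show loads of 0.03 and 0.07.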

On Wed, Sep 18, 2013 at 3:01 PM, Rajat Banerjee <rajatb_at_post.harvard.edu> wrote:

> That looks normal. Can you please send qstat, qacct, and qhost output from
> when you're seeing the problem? Thanks.
>
>
> On Wed, Sep 18, 2013 at 2:47 PM, Ryan Golhar <ngsbioinformatics_at_gmail.com> wrote:
>
>> I've since terminated the cluster and am experimenting with a different
>> setup, but here's the output from qstat and qhost:
>>
>> ec2-user_at_master:~$ qstat
>> job-ID  prior    name        user      state  submit/start at      queue             slots  ja-task-ID
>> ------------------------------------------------------------------------------------------------------
>>      4  0.55500  j1-00493-0  ec2-user  r      09/18/2013 17:38:44  all.q_at_node001  8
>>      6  0.55500  j1-00508-0  ec2-user  r      09/18/2013 17:45:44  all.q_at_node002  8
>>      7  0.55500  j1-00525-0  ec2-user  r      09/18/2013 17:46:29  all.q_at_node003  8
>>      8  0.55500  j1-00541-0  ec2-user  r      09/18/2013 17:54:59  all.q_at_node004  8
>>      9  0.55500  j1-00565-0  ec2-user  r      09/18/2013 17:55:44  all.q_at_node005  8
>>     10  0.55500  j1-00596-0  ec2-user  r      09/18/2013 17:58:59  all.q_at_node006  8
>>     11  0.55500  j1-00604-0  ec2-user  r      09/18/2013 18:05:14  all.q_at_node007  8
>>     12  0.55500  j1-00625-0  ec2-user  r      09/18/2013 18:05:14  all.q_at_node008  8
>>     13  0.55500  j1-00650-0  ec2-user  r      09/18/2013 18:05:14  all.q_at_node009  8
>>     18  0.55500  j1-00734-0  ec2-user  r      09/18/2013 18:07:29  all.q_at_node010  8
>>     19  0.55500  j1-00738-0  ec2-user  r      09/18/2013 18:16:59  all.q_at_node011  8
>>     20  0.55500  j1-00739-0  ec2-user  r      09/18/2013 18:16:59  all.q_at_node012  8
>>     21  0.55500  j1-00770    ec2-user  r      09/18/2013 18:16:59  all.q_at_node013  8
>>     22  0.55500  j1-00806-0  ec2-user  r      09/18/2013 18:16:59  all.q_at_node014  8
>>     23  0.55500  j1-00825-0  ec2-user  r      09/18/2013 18:16:59  all.q_at_node015  8
>>     24  0.55500  j1-00826-0  ec2-user  r      09/18/2013 18:16:59  all.q_at_node016  8
>>     25  0.55500  j1-00846-0  ec2-user  r      09/18/2013 18:16:59  all.q_at_node017  8
>>     26  0.55500  j1-00847-0  ec2-user  r      09/18/2013 18:16:59  all.q_at_node018  8
>>     27  0.55500  j1-00913    ec2-user  r      09/18/2013 18:16:59  all.q_at_node019  8
>>     28  0.55500  j1-00914-0  ec2-user  r      09/18/2013 18:16:59  all.q_at_node020  8
>>     29  0.55500  j1-00914    ec2-user  r      09/18/2013 18:26:29  all.q_at_node021  8
>>     30  0.55500  j1-00922    ec2-user  r      09/18/2013 18:26:29  all.q_at_node022  8
>>     31  0.55500  j1-00977    ec2-user  r      09/18/2013 18:26:29  all.q_at_node023  8
>>     32  0.55500  j1-00984-0  ec2-user  r      09/18/2013 18:26:29  all.q_at_node024  8
>>     33  0.55500  j1-00984    ec2-user  r      09/18/2013 18:26:29  all.q_at_node025  8
>>     34  0.55500  j1-00998-0  ec2-user  r      09/18/2013 18:26:29  all.q_at_node026  8
>>     35  0.55500  j1-01010-0  ec2-user  r      09/18/2013 18:26:29  all.q_at_node027  8
>>     36  0.55500  j1-01019-0  ec2-user  r      09/18/2013 18:26:29  all.q_at_node028  8
>>     37  0.55500  j1-01025-0  ec2-user  r      09/18/2013 18:26:29  all.q_at_node029  8
>>     38  0.55500  j1-01026-0  ec2-user  r      09/18/2013 18:26:29  all.q_at_node030  8
>>
>> ec2-user_at_master:~$ qhost
>> HOSTNAME  ARCH       NCPU  LOAD   MEMTOT  MEMUSE  SWAPTO  SWAPUS
>> -------------------------------------------------------------------------------
>> global    -          -     -      -       -       -       -
>> node001   linux-x64  8     7.74   6.8G    3.8G    0.0     0.0
>> node002   linux-x64  8     7.93   6.8G    3.7G    0.0     0.0
>> node003   linux-x64  8     7.68   6.8G    3.7G    0.0     0.0
>> node004   linux-x64  8     7.86   6.8G    3.8G    0.0     0.0
>> node005   linux-x64  8     7.87   6.8G    3.7G    0.0     0.0
>> node006   linux-x64  8     7.66   6.8G    3.7G    0.0     0.0
>> node007   linux-x64  8     0.01   6.8G    564.8M  0.0     0.0
>> node008   linux-x64  8     0.01   6.8G    493.6M  0.0     0.0
>> node009   linux-x64  8     0.02   6.8G    564.4M  0.0     0.0
>> node010   linux-x64  8     7.85   6.8G    3.7G    0.0     0.0
>> node011   linux-x64  8     7.53   6.8G    3.7G    0.0     0.0
>> node012   linux-x64  8     7.57   6.8G    3.6G    0.0     0.0
>> node013   linux-x64  8     7.71   6.8G    3.7G    0.0     0.0
>> node014   linux-x64  8     7.49   6.8G    3.7G    0.0     0.0
>> node015   linux-x64  8     7.51   6.8G    3.7G    0.0     0.0
>> node016   linux-x64  8     7.50   6.8G    3.6G    0.0     0.0
>> node017   linux-x64  8     7.89   6.8G    3.7G    0.0     0.0
>> node018   linux-x64  8     7.50   6.8G    3.7G    0.0     0.0
>> node019   linux-x64  8     7.52   6.8G    3.7G    0.0     0.0
>> node020   linux-x64  8     7.68   6.8G    3.6G    0.0     0.0
>> node021   linux-x64  8     7.16   6.8G    3.6G    0.0     0.0
>> node022   linux-x64  8     6.99   6.8G    3.6G    0.0     0.0
>> node023   linux-x64  8     6.80   6.8G    3.6G    0.0     0.0
>> node024   linux-x64  8     7.20   6.8G    3.6G    0.0     0.0
>> node025   linux-x64  8     6.86   6.8G    3.6G    0.0     0.0
>> node026   linux-x64  8     7.24   6.8G    3.6G    0.0     0.0
>> node027   linux-x64  8     6.88   6.8G    3.7G    0.0     0.0
>> node028   linux-x64  8     6.28   6.8G    3.6G    0.0     0.0
>> node029   linux-x64  8     7.42   6.8G    3.6G    0.0     0.0
>> node030   linux-x64  8     0.10   6.8G    390.4M  0.0     0.0
>> node031   linux-x64  8     0.06   6.8G    135.0M  0.0     0.0
>> node032   linux-x64  8     0.04   6.8G    135.3M  0.0     0.0
>> node033   linux-x64  8     0.07   6.8G    135.6M  0.0     0.0
>> node034   linux-x64  8     0.10   6.8G    134.9M  0.0     0.0
>>
>>
>> I never saw anything unusual.
>>
>>
>> On Wed, Sep 18, 2013 at 10:40 AM, Rajat Banerjee <rajatb_at_post.harvard.edu> wrote:
>>
>>> Ryan,
>>> Could you put the output of qhost and qstat into a text file and send it
>>> back to the list? That's what feeds the load balancer those stats.
>>>
>>> Thanks,
>>> Rajat
>>>
>>>
>>> On Tue, Sep 17, 2013 at 11:47 PM, Ryan Golhar <ngsbioinformatics_at_gmail.com> wrote:
>>>
>>>> I'm running a cluster with over 800 jobs queued, and I'm running
>>>> loadbalance. Every other query by loadbalance shows an Avg job duration
>>>> and Avg job wait time of 0 secs. Why is this? It hasn't caused a problem
>>>> yet, but it seems odd.
>>>>
>>>> >>> Loading full job history
>>>> Execution hosts: 19
>>>> Queued jobs: 791
>>>> Oldest queued job: 2013-09-17 22:19:23
>>>> Avg job duration: 3559 secs
>>>> Avg job wait time: 12389 secs
>>>> Last cluster modification time: 2013-09-18 00:11:31
>>>> >>> Not adding nodes: already at or above maximum (1)
>>>> >>> Sleeping...(looping again in 60 secs)
>>>>
>>>> Execution hosts: 19
>>>> Queued jobs: 791
>>>> Oldest queued job: 2013-09-17 22:19:23
>>>> Avg job duration: 0 secs
>>>> Avg job wait time: 0 secs
>>>> Last cluster modification time: 2013-09-18 00:11:31
>>>> >>> Not adding nodes: already at or above maximum (1)
>>>> >>> Sleeping...(looping again in 60 secs)
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> StarCluster mailing list
>>>> StarCluster_at_mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>
>>>>
>>>
>>
>
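
To pin down how often the zeroed statistics show up, the loadbalance console output quoted in the original message could be scanned with a short script. The following is a minimal Python sketch, not StarCluster code: the function names are hypothetical, and it assumes each polling iteration begins with an "Execution hosts:" line, as in the quoted output. It only locates the affected iterations and says nothing about why the averages drop to 0.

```python
import re

# Numeric stats lines to capture; leading '>' quote markers are tolerated.
FIELD = re.compile(
    r"^[>\s]*(Execution hosts|Queued jobs|Avg job duration|Avg job wait time):"
    r"\s*(\d+)"
)


def parse_iterations(log_text):
    """Split loadbalance output into one dict of stats per polling iteration."""
    iterations = []
    for line in log_text.splitlines():
        match = FIELD.match(line)
        if not match:
            continue
        key, value = match.group(1), int(match.group(2))
        if key == "Execution hosts":
            iterations.append({})  # assumed to mark the start of an iteration
        if iterations:
            iterations[-1][key] = value
    return iterations


def zeroed_iterations(iterations):
    """Indices of iterations reporting 0-sec averages while jobs are queued."""
    return [
        index
        for index, stats in enumerate(iterations)
        if stats.get("Queued jobs", 0) > 0
        and stats.get("Avg job duration") == 0
        and stats.get("Avg job wait time") == 0
    ]


LOG = """\
>>> Loading full job history
Execution hosts: 19
Queued jobs: 791
Oldest queued job: 2013-09-17 22:19:23
Avg job duration: 3559 secs
Avg job wait time: 12389 secs
Last cluster modification time: 2013-09-18 00:11:31
>>> Not adding nodes: already at or above maximum (1)
>>> Sleeping...(looping again in 60 secs)

Execution hosts: 19
Queued jobs: 791
Oldest queued job: 2013-09-17 22:19:23
Avg job duration: 0 secs
Avg job wait time: 0 secs
Last cluster modification time: 2013-09-18 00:11:31
>>> Not adding nodes: already at or above maximum (1)
>>> Sleeping...(looping again in 60 secs)
"""

if __name__ == "__main__":
    print(zeroed_iterations(parse_iterations(LOG)))
```

Run against the two iterations quoted in the thread, this flags the second one (index 1), where 791 jobs are queued but both averages read 0 secs.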
Received on Thu Sep 19 2013 - 01:51:46 EDT

Next message: Ryan Golhar: "Trying to remove nodes gives an error"
Previous message: Rajat Banerjee: "Re: loadbalance"
In reply to: Rajat Banerjee: "Re: loadbalance"


This archive was generated by hypermail 2.3.0.
