StarCluster - Mailing List Archive

Re: loadbalance

From: Rajat Banerjee <no email>
Date: Wed, 18 Sep 2013 15:01:25 -0400

That looks normal. Can you please send qstat, qacct, and qhost output from
when you're seeing the problem? Thanks.


On Wed, Sep 18, 2013 at 2:47 PM, Ryan Golhar <ngsbioinformatics_at_gmail.com> wrote:

> I've since terminated the cluster and am experimenting with a different
> setup, but here's the output from qstat and qhost:
>
> ec2-user@master:~$ qstat
> job-ID  prior    name        user      state  submit/start at      queue          slots  ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
>      4  0.55500  j1-00493-0  ec2-user  r      09/18/2013 17:38:44  all.q@node001      8
>      6  0.55500  j1-00508-0  ec2-user  r      09/18/2013 17:45:44  all.q@node002      8
>      7  0.55500  j1-00525-0  ec2-user  r      09/18/2013 17:46:29  all.q@node003      8
>      8  0.55500  j1-00541-0  ec2-user  r      09/18/2013 17:54:59  all.q@node004      8
>      9  0.55500  j1-00565-0  ec2-user  r      09/18/2013 17:55:44  all.q@node005      8
>     10  0.55500  j1-00596-0  ec2-user  r      09/18/2013 17:58:59  all.q@node006      8
>     11  0.55500  j1-00604-0  ec2-user  r      09/18/2013 18:05:14  all.q@node007      8
>     12  0.55500  j1-00625-0  ec2-user  r      09/18/2013 18:05:14  all.q@node008      8
>     13  0.55500  j1-00650-0  ec2-user  r      09/18/2013 18:05:14  all.q@node009      8
>     18  0.55500  j1-00734-0  ec2-user  r      09/18/2013 18:07:29  all.q@node010      8
>     19  0.55500  j1-00738-0  ec2-user  r      09/18/2013 18:16:59  all.q@node011      8
>     20  0.55500  j1-00739-0  ec2-user  r      09/18/2013 18:16:59  all.q@node012      8
>     21  0.55500  j1-00770    ec2-user  r      09/18/2013 18:16:59  all.q@node013      8
>     22  0.55500  j1-00806-0  ec2-user  r      09/18/2013 18:16:59  all.q@node014      8
>     23  0.55500  j1-00825-0  ec2-user  r      09/18/2013 18:16:59  all.q@node015      8
>     24  0.55500  j1-00826-0  ec2-user  r      09/18/2013 18:16:59  all.q@node016      8
>     25  0.55500  j1-00846-0  ec2-user  r      09/18/2013 18:16:59  all.q@node017      8
>     26  0.55500  j1-00847-0  ec2-user  r      09/18/2013 18:16:59  all.q@node018      8
>     27  0.55500  j1-00913    ec2-user  r      09/18/2013 18:16:59  all.q@node019      8
>     28  0.55500  j1-00914-0  ec2-user  r      09/18/2013 18:16:59  all.q@node020      8
>     29  0.55500  j1-00914    ec2-user  r      09/18/2013 18:26:29  all.q@node021      8
>     30  0.55500  j1-00922    ec2-user  r      09/18/2013 18:26:29  all.q@node022      8
>     31  0.55500  j1-00977    ec2-user  r      09/18/2013 18:26:29  all.q@node023      8
>     32  0.55500  j1-00984-0  ec2-user  r      09/18/2013 18:26:29  all.q@node024      8
>     33  0.55500  j1-00984    ec2-user  r      09/18/2013 18:26:29  all.q@node025      8
>     34  0.55500  j1-00998-0  ec2-user  r      09/18/2013 18:26:29  all.q@node026      8
>     35  0.55500  j1-01010-0  ec2-user  r      09/18/2013 18:26:29  all.q@node027      8
>     36  0.55500  j1-01019-0  ec2-user  r      09/18/2013 18:26:29  all.q@node028      8
>     37  0.55500  j1-01025-0  ec2-user  r      09/18/2013 18:26:29  all.q@node029      8
>     38  0.55500  j1-01026-0  ec2-user  r      09/18/2013 18:26:29  all.q@node030      8
>
>
> ec2-user@master:~$ qhost
> HOSTNAME  ARCH       NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global    -          -     -     -       -       -       -
> node001   linux-x64  8     7.74  6.8G    3.8G    0.0     0.0
> node002   linux-x64  8     7.93  6.8G    3.7G    0.0     0.0
> node003   linux-x64  8     7.68  6.8G    3.7G    0.0     0.0
> node004   linux-x64  8     7.86  6.8G    3.8G    0.0     0.0
> node005   linux-x64  8     7.87  6.8G    3.7G    0.0     0.0
> node006   linux-x64  8     7.66  6.8G    3.7G    0.0     0.0
> node007   linux-x64  8     0.01  6.8G    564.8M  0.0     0.0
> node008   linux-x64  8     0.01  6.8G    493.6M  0.0     0.0
> node009   linux-x64  8     0.02  6.8G    564.4M  0.0     0.0
> node010   linux-x64  8     7.85  6.8G    3.7G    0.0     0.0
> node011   linux-x64  8     7.53  6.8G    3.7G    0.0     0.0
> node012   linux-x64  8     7.57  6.8G    3.6G    0.0     0.0
> node013   linux-x64  8     7.71  6.8G    3.7G    0.0     0.0
> node014   linux-x64  8     7.49  6.8G    3.7G    0.0     0.0
> node015   linux-x64  8     7.51  6.8G    3.7G    0.0     0.0
> node016   linux-x64  8     7.50  6.8G    3.6G    0.0     0.0
> node017   linux-x64  8     7.89  6.8G    3.7G    0.0     0.0
> node018   linux-x64  8     7.50  6.8G    3.7G    0.0     0.0
> node019   linux-x64  8     7.52  6.8G    3.7G    0.0     0.0
> node020   linux-x64  8     7.68  6.8G    3.6G    0.0     0.0
> node021   linux-x64  8     7.16  6.8G    3.6G    0.0     0.0
> node022   linux-x64  8     6.99  6.8G    3.6G    0.0     0.0
> node023   linux-x64  8     6.80  6.8G    3.6G    0.0     0.0
> node024   linux-x64  8     7.20  6.8G    3.6G    0.0     0.0
> node025   linux-x64  8     6.86  6.8G    3.6G    0.0     0.0
> node026   linux-x64  8     7.24  6.8G    3.6G    0.0     0.0
> node027   linux-x64  8     6.88  6.8G    3.7G    0.0     0.0
> node028   linux-x64  8     6.28  6.8G    3.6G    0.0     0.0
> node029   linux-x64  8     7.42  6.8G    3.6G    0.0     0.0
> node030   linux-x64  8     0.10  6.8G    390.4M  0.0     0.0
> node031   linux-x64  8     0.06  6.8G    135.0M  0.0     0.0
> node032   linux-x64  8     0.04  6.8G    135.3M  0.0     0.0
> node033   linux-x64  8     0.07  6.8G    135.6M  0.0     0.0
> node034   linux-x64  8     0.10  6.8G    134.9M  0.0     0.0
>
>
> I never saw anything unusual.
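[Editor's note: the qhost snapshot above already shows the imbalance the thread is chasing: nodes 007-009 and 030-034 sit near zero load while the rest are saturated with 8-slot jobs. Saved qhost output can be scanned for such idle hosts with a short script; this is a minimal sketch (the 0.5 load threshold is an arbitrary assumption, not a StarCluster setting):]

```python
# Flag execution hosts whose load average is near zero.
# Parses plain-text `qhost` output; data rows look like:
#   node001   linux-x64  8  7.74  6.8G  3.8G  0.0  0.0
# The 0.5 idle threshold is an arbitrary assumption.

def idle_hosts(qhost_text, threshold=0.5):
    idle = []
    for line in qhost_text.splitlines():
        fields = line.split()
        # Skip headers, separators, and the "global" summary row.
        if len(fields) == 8 and fields[0].startswith("node"):
            try:
                load = float(fields[3])
            except ValueError:
                continue
            if load < threshold:
                idle.append((fields[0], load))
    return idle

sample = """\
node006   linux-x64  8  7.66  6.8G  3.7G    0.0  0.0
node007   linux-x64  8  0.01  6.8G  564.8M  0.0  0.0
node008   linux-x64  8  0.01  6.8G  493.6M  0.0  0.0
"""
print(idle_hosts(sample))  # -> [('node007', 0.01), ('node008', 0.01)]
```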
>
>
> On Wed, Sep 18, 2013 at 10:40 AM, Rajat Banerjee <rajatb_at_post.harvard.edu> wrote:
>
>> Ryan,
>> Could you put the output of qhost and qstat into a text file and send it
>> back to the list? That's what feeds the load balancer those stats.
>>
>> Thanks,
>> Rajat
>>
>>
>> On Tue, Sep 17, 2013 at 11:47 PM, Ryan Golhar <
>> ngsbioinformatics_at_gmail.com> wrote:
>>
>>> I'm running a cluster with over 800 jobs queued, and I'm running
>>> loadbalance. Every other query by loadbalance shows an avg job duration
>>> and wait time of 0 secs. Why is this? It hasn't caused a problem yet,
>>> but it seems odd...
>>>
>>> >>> Loading full job history
>>> Execution hosts: 19
>>> Queued jobs: 791
>>> Oldest queued job: 2013-09-17 22:19:23
>>> Avg job duration: 3559 secs
>>> Avg job wait time: 12389 secs
>>> Last cluster modification time: 2013-09-18 00:11:31
>>> >>> Not adding nodes: already at or above maximum (1)
>>> >>> Sleeping...(looping again in 60 secs)
>>>
>>> Execution hosts: 19
>>> Queued jobs: 791
>>> Oldest queued job: 2013-09-17 22:19:23
>>> Avg job duration: 0 secs
>>> Avg job wait time: 0 secs
>>> Last cluster modification time: 2013-09-18 00:11:31
>>> >>> Not adding nodes: already at or above maximum (1)
>>> >>> Sleeping...(looping again in 60 secs)
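[Editor's note: one way alternating zeros like this can arise is if a stats poll occasionally comes back with an empty job-history sample, so the average falls back to 0 instead of carrying the last value forward. A toy illustration of that failure mode (a hedged sketch with hypothetical function names, not StarCluster's actual code):]

```python
# Toy illustration: an average over an empty job-history sample
# comes out as 0 secs even though jobs are running and queued.
# (Hypothetical sketch -- not StarCluster's actual implementation.)

def avg_job_duration(durations_secs):
    # If the accounting query returned no finished jobs for this
    # polling window, the average defaults to 0 rather than the
    # previously reported value.
    if not durations_secs:
        return 0
    return sum(durations_secs) // len(durations_secs)

print(avg_job_duration([3500, 3600, 3577]))  # normal poll -> 3559
print(avg_job_duration([]))                  # empty poll  -> 0
```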
>>>
>>>
>>>
>>> _______________________________________________
>>> StarCluster mailing list
>>> StarCluster_at_mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>
>>>
>>
>
>
>
Received on Wed Sep 18 2013 - 15:01:50 EDT
This archive was generated by hypermail 2.3.0.
