Re: [Starcluster] Load Balancer Problems
Hey Rajat,
So I tested out the load balancer some more today and we ran into two
problems. The first is that we submitted an array job to the queue, and the
balancer treats the whole array as a single job, so it doesn't recognize that
it needs to open up more nodes. The second is that the logic for closing down
nodes isn't taking hour boundaries into account beyond the first 45 minutes:
if an instance has been up for 61 minutes, we've already bought the second
hour and don't want to just close that instance. I have attached the XML output.
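To make the first problem concrete: the "7-1000:1" field on the pending job line is an array task range, so counting queued work means expanding that range rather than counting one job. Here's a rough sketch of what I mean (element names are guesses from typical `qstat -xml` output, not the balancer's actual parser):

```python
import xml.etree.ElementTree as ET

def count_pending_tasks(qstat_xml):
    """Count pending tasks in `qstat -xml` output, expanding array
    ranges like '7-1000:1' so an array job isn't counted as one job."""
    total = 0
    for job in ET.fromstring(qstat_xml).iter('job_list'):
        if job.get('state') != 'pending':
            continue
        tasks = job.findtext('tasks')
        if tasks and '-' in tasks:
            # '7-1000:1' -> start 7, end 1000, step 1
            span, _, step = tasks.partition(':')
            start, _, end = span.partition('-')
            total += (int(end) - int(start)) // int(step or 1) + 1
        else:
            total += 1
    return total
```

Expanding "7-1000:1" gives 994 pending tasks instead of 1, which is the number the balancer would need when deciding whether to add nodes.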
Best
Amaro Taylor
RES Group, Inc.
1 Broadway • Cambridge, MA 02142 • U.S.A.
Tel: 310 880-1906 (Direct) • Fax: 617-812-8042 • Email:
amaro.taylor_at_resgroupinc.com
Disclaimer: The information contained in this email message may be
confidential. Please be careful if you forward, copy or print this message.
If you have received this email in error, please immediately notify the
sender and delete the message.
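P.S. For the second problem, this is roughly the check we have in mind (just a sketch with made-up names, not the balancer's actual code):

```python
def should_keep_node(uptime_minutes, kill_window=15):
    """Keep a node unless it is near the end of its current billed hour.

    EC2 bills by the whole hour, so an instance up for 61 minutes has
    already paid for its second hour; shutting it down then wastes
    almost a full hour of paid capacity.
    """
    minutes_into_hour = uptime_minutes % 60
    return minutes_into_hour < (60 - kill_window)
```

A node at 61 minutes of uptime is one minute into an hour we've already paid for, so it should be kept until it nears the next hour boundary; a node at 50 minutes is a shutdown candidate.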
On Sun, Aug 1, 2010 at 4:12 PM, Rajat Banerjee <rbanerj_at_fas.harvard.edu> wrote:
> Hi,
> I made a fix and committed it.
>
>
> http://github.com/rqbanerjee/StarCluster/commit/bace3075d9ab2f891f1b50981f5ef657e7bb0cfb
>
> You can pull from github to get the latest code. I switched my basic
> "qstat -xml" call to a broader query: 'qstat -q all.q -u "*" -xml'. It
> seems to return the entire job queue on my cluster. Please let me know if
> it returns the right job queue on yours.
>
> Thanks,
> Rajat
>
> On Fri, Jul 30, 2010 at 4:48 PM, Rajat Banerjee <rbanerj_at_fas.harvard.edu>
> wrote:
> > Hey Amaro,
> > Thanks for the feedback. It looks like your SGE queue is much more
> > sophisticated than mine. If I run "qstat -xml" it outputs a ton of
> > info, but I'm guessing that yours would not.
> >
> > I assume you're using the latest code, in "develop" mode? (Did you run
> > "python setup.py develop" when you started working?)
> >
> > If so, open up the python file starcluster/balancers/sge/__init__.py
> > and change this line #342:
> >
> > qstatXml = '\n'.join(master.ssh.execute('source /etc/profile && qstat -xml',
> >                                         log_output=False))
> >
> > to the following:
> >
> > qstatXml = '\n'.join(master.ssh.execute('source /etc/profile && qstat -xml -q all.q -f -u "*"',
> >                                         log_output=False))
> >
> > I modified the args to qstat. If that works for you, I can test it and
> > check it into the branch.
> > Thanks,
> > Rajat
> >
> > On Fri, Jul 30, 2010 at 4:40 PM, Amaro Taylor
> > <amaro.taylor_at_resgroupinc.com> wrote:
> >> Hey,
> >>
> >> So I was testing out the load balancer today and it doesn't appear to be
> >> working. Here is the output I was getting, along with the output from the
> >> job on StarCluster.
> >>
> >> ssh.py:248 - ERROR - command source /etc/profile && qacct -j -b 201007301725 failed with status 1
> >>>>> Oldest job is from None. # queued jobs = 0. # hosts = 2.
> >>>>> Avg job duration = 0 sec, Avg wait time = 0 sec.
> >>>>> Cluster change was made less than 180 seconds ago (2010-07-30
> >>>>> 20:24:13.398974).
> >>>>> Not changing cluster size until cluster stabilizes.
> >>>>> Sleeping, looping again in 60 seconds.
> >>
> >>
> >> It says 0 queued jobs, but that's not accurate.
> >> This is what qstat says on the master node:
> >>
> >>
> >> #########################################################################
> >>       1 0.55500 Bone_Estim sgeadmin     qw    07/30/2010 20:26:20     1 7-1000:1
> >> sgeadmin_at_domU-12-31-39-01-5D-67:~/jacobian-parallel/test/bone$ qstat -q all.q -f -u "*"
> >> queuename                      qtype resv/used/tot. load_avg arch          states
> >> ---------------------------------------------------------------------------------
> >> all.q_at_domU-12-31-39-01-5C-97.c  BIP   0/1/1          0.52     lx24-x86
> >>       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:29:03     1 6
> >> ---------------------------------------------------------------------------------
> >> all.q_at_domU-12-31-39-01-5D-67.c  BIP   0/1/1          1.22     lx24-x86
> >>       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:28:33     1 5
> >>
> >> ############################################################################
> >>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> >> ############################################################################
> >>       1 0.55500 Bone_Estim sgeadmin     qw    07/30/2010 20:26:20     1 7-1000:1
> >> sgeadmin_at_domU-12-31-39-01-5D-67:~/jacobian-parallel/test/bone$ qstat -q all.q -f -u "*"
> >> queuename                      qtype resv/used/tot. load_avg arch          states
> >> ---------------------------------------------------------------------------------
> >> all.q_at_domU-12-31-39-01-5C-97.c  BIP   0/1/1          0.63     lx24-x86
> >>       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:31:03     1 8
> >> ---------------------------------------------------------------------------------
> >> all.q_at_domU-12-31-39-01-5D-67.c  BIP   0/1/1          1.38     lx24-x86
> >>       1 0.55500 Bone_Estim sgeadmin     r     07/30/2010 20:28:33     1 5
> >>
> >> Any suggestions?
> >>
> >>
> >>
> >> Best,
> >> Amaro Taylor
> >>
> >> _______________________________________________
> >> Starcluster mailing list
> >> Starcluster_at_mit.edu
> >> http://mailman.mit.edu/mailman/listinfo/starcluster
> >>
> >>
> >
>
- application/octet-stream attachment: qstat.out
Received on Mon Aug 02 2010 - 12:52:04 EDT