StarCluster - Mailing List Archive

Re: Possible bug in loadbalancer?

From: Rajat Banerjee <no email>
Date: Tue, 17 May 2011 09:37:30 -0400

Hi Don,
Did this halt your ELB and force you to restart it? If not, I wouldn't
worry about it. When SGE is making internal changes (adding a node,
removing a node, or just starting up), the calls to qhost or qstat
will occasionally return a non-zero exit code, which causes the two
errors you noticed:
PID: 1822 ssh.py:397 - ERROR - command 'source /etc/profile && qhost -xml' failed with status 127
PID: 1822 ssh.py:397 - ERROR - command 'source /etc/profile && qstat -q all.q -u "*" -xml' failed with status 127


If the ELB was able to continue working, I would not consider it a
bug. qstat and qhost will be called again in the next polling interval
and the data store will be populated with job data then.
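
To illustrate the "skip this interval and try again" idea, here is a
minimal Python sketch. It is not the StarCluster code: parse_sge_xml and
poll_once are hypothetical names, run_command stands in for whatever runs
the command on the master node and returns its output, and the -1 return
value and the "host" element name are assumptions made for the example.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def parse_sge_xml(output):
    """Return a DOM for the command output, or None if it is not XML.

    A failed qhost/qstat call typically yields an empty string or a shell
    error message, neither of which the XML parser can handle.
    """
    if not output or not output.strip():
        return None
    try:
        return xml.dom.minidom.parseString(output)
    except ExpatError:
        return None


def poll_once(run_command):
    """Run one polling step; run_command is a hypothetical callable."""
    doc = parse_sge_xml(run_command("source /etc/profile && qhost -xml"))
    if doc is None:
        # Nothing usable this time; the caller retries on the next interval.
        return -1
    # Assumed element name, used here only to have something to count.
    return len(doc.getElementsByTagName("host"))
```

With a guard like this, a failed call makes poll_once return -1 instead
of raising ExpatError, and the next interval simply tries again.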

Thanks for your feedback, and let us know if you have more issues.
Raj


On Tue, May 17, 2011 at 12:15 AM, Don MacMillen <macd_at_physware.com> wrote:
> Hi,
>
> Trolling through the log files I found several errors which were all of the
> following form. I believe I have been using ELB correctly, but anything is
> possible and I do not yet have much experience with it. This version is from
> the git repo which I cloned last Friday.
>
> I doubt that I can do much to reproduce this error, so if this trace helps,
> that's great. I will keep an eye out for other misbehavior as well.
>
> Thanks for the great work you guys have put into starcluster and the elb.
>
> Regards,
>
> Don MacMillen
> PhysWare
>
>
> PID: 1822 __init__.py:481 - INFO - Jobstats cache is not full. Pulling full job history.
> PID: 1822 __init__.py:486 - DEBUG - getting past 10800 seconds worth of job history.
> PID: 1822 ssh.py:397 - ERROR - command 'source /etc/profile && qhost -xml' failed with status 127
> PID: 1822 ssh.py:397 - ERROR - command 'source /etc/profile && qstat -q all.q -u "*" -xml' failed with status 127
> PID: 1822 ssh.py:400 - DEBUG - command source /etc/profile && qacct -j -b 201105162111 failed with status 127
> PID: 1822 __init__.py:524 - DEBUG - sizes: qhost: 30, qstat: 30, qacct: 30.
> PID: 1822 cli.py:184 - DEBUG - Traceback (most recent call last):
>   File "build/bdist.linux-i686/egg/starcluster/cli.py", line 160, in main
>     sc.execute(args)
>   File "build/bdist.linux-i686/egg/starcluster/commands/loadbalance.py", line 91, in execute
>     lb.run(cluster)
>   File "build/bdist.linux-i686/egg/starcluster/balancers/sge/__init__.py", line 570, in run
>     if self.get_stats() == -1:
>   File "build/bdist.linux-i686/egg/starcluster/balancers/sge/__init__.py", line 525, in get_stats
>     self.stat.parse_qhost(qhostxml)
>   File "build/bdist.linux-i686/egg/starcluster/balancers/sge/__init__.py", line 50, in parse_qhost
>     doc = xml.dom.minidom.parseString(string)
>   File "/usr/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString
>     return expatbuilder.parseString(string)
>   File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 940, in parseString
>     return builder.parseString(string)
>   File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 223, in parseString
>     parser.Parse(string, True)
> ExpatError: syntax error: line 1, column 0
>
> PID: 1822 cli.py:129 - ERROR - Oops! Looks like you've found a bug in StarCluster
> PID: 1822 cli.py:130 - ERROR - Debug file written to: /tmp/starcluster-debug-staruser.log
> PID: 1822 cli.py:131 - ERROR - Look for lines starting with PID: 1822
> PID: 1822 cli.py:132 - ERROR - Please submit this file, minus any private information,
> PID: 1822 cli.py:133 - ERROR - to starcluster_at_mit.edu
> PID: 1822 ssh.py:534 - DEBUG - __del__ called
> PID: 1822 ssh.py:534 - DEBUG - __del__ called
Received on Tue May 17 2011 - 09:37:53 EDT