Hi Rayson & Justin,
Attached please find the crash report generated by loadbalance, along with
the output of qhost -xml run on the master node. Hopefully these provide
some clue as to what went wrong.
Thanks for the help!
-Wei
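
For anyone else hitting this: the "ExpatError: syntax error: line 1, column 0" in the quoted traceback is consistent with the parser receiving text that does not begin with XML at all, for example an error message printed when qhost exits with status 1. Below is a minimal sketch of a guard that checks the exit status and the first character before parsing. It is not StarCluster's code. The names run_command, parse_qhost_xml, host_names and QhostOutputError are made up for illustration, the <host name="..."> layout is a simplified assumption about the qhost XML, and it targets Python 3.7+ rather than the Python 2.6 in the traceback.

```python
import subprocess
import sys
import xml.dom.minidom
from xml.parsers.expat import ExpatError


class QhostOutputError(Exception):
    """Hypothetical error type: the output cannot be treated as qhost XML."""


def run_command(argv):
    """Run a command locally and return (exit_status, stdout, stderr) as text."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def parse_qhost_xml(status, stdout, stderr=""):
    """Parse qhost XML, raising a readable error instead of a bare ExpatError."""
    # A non-zero exit status means stdout should not be trusted as XML.
    if status != 0:
        raise QhostOutputError(
            "qhost exited with status %d; stderr: %r" % (status, stderr.strip()))
    text = stdout.strip()
    # Empty output or plain text would fail in the parser at line 1, column 0.
    if not text.startswith("<"):
        raise QhostOutputError("output is not XML; starts with %r" % text[:60])
    try:
        return xml.dom.minidom.parseString(text)
    except ExpatError as exc:
        raise QhostOutputError("malformed XML: %s" % exc)


def host_names(doc):
    """Return the name attribute of every <host> element (assumed layout)."""
    return [h.getAttribute("name") for h in doc.getElementsByTagName("host")]


if __name__ == "__main__":
    # On the master node this would be run_command(["qhost", "-xml"]).
    status, out, err = run_command(sys.argv[1:] or ["qhost", "-xml"])
    try:
        print(host_names(parse_qhost_xml(status, out, err)))
    except QhostOutputError as exc:
        print("qhost output rejected: %s" % exc)
```

Run on the master node, this prints either the host names or the reason the output was rejected, which is the same information Rayson asked for (exit status plus raw output). Note that the failing command in the log is run as 'source /etc/profile && qhost -xml', so the environment may differ from a plain local call.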
On Wed, Jan 11, 2012 at 1:39 PM, Rayson Ho <raysonlogin_at_yahoo.com> wrote:
> The XML parser does not like the output of "qhost -xml". (We changed some
> minor XML output code in Grid Scheduler recently,
> but as you have encountered this before in earlier versions, looks like
> our changes are not the cause of this issue.)
>
>
> I just started a 1 node cluster and let the loadbalancer add another node,
> and it all seemed to work fine...from the error message
> in your email, qhost exited with 1, and a number of things can cause qhost
> to exit with code 1.
>
>
> Can you run from the interactive shell the following command on one of the
> nodes on EC2 when you encounter this problem
> again??
>
> % qhost -xml
>
> And then send us the output. It can be an issue related to how the XML is
> generated in Grid Engine/Grid Scheduler, or it can be
> something else in the XML parser.
>
> Rayson
>
> =================================
> Open Grid Scheduler / Grid Engine
> http://gridscheduler.sourceforge.net/
>
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
>
>
>
> ________________________________
> From: Wei Tao <wei.tao_at_tsibiocomputing.com>
> To: starcluster_at_mit.edu
> Sent: Wednesday, January 11, 2012 10:01 AM
> Subject: [StarCluster] loadbalance error
>
>
> Hi all,
>
> I was running loadbalance. After a while, I got the following error. Can
> someone shed some light on this? This happened before with earlier versions
> of Starcluster as well.
>
> >>> Loading full job history
> !!! ERROR - command 'source /etc/profile && qhost -xml' failed with status 1
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.py", line 251, in main
>     sc.execute(args)
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/commands/loadbalance.py", line 89, in execute
>     lb.run(cluster)
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 583, in run
>     if self.get_stats() == -1:
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 529, in get_stats
>     self.stat.parse_qhost(qhostxml)
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 49, in parse_qhost
>     doc = xml.dom.minidom.parseString(string)
>   File "/usr/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString
>     return expatbuilder.parseString(string)
>   File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 940, in parseString
>     return builder.parseString(string)
>   File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 223, in parseString
>     parser.Parse(string, True)
> ExpatError: syntax error: line 1, column 0
>
> ---------------------------------------------------------------------------
> MemoryError Traceback (most recent call last)
>
> /usr/local/bin/starcluster in <module>()
> 7 if __name__ == '__main__':
> 8 sys.exit(
> ----> 9 load_entry_point('StarCluster==0.93', 'console_scripts', 'starcluster')()
> 10 )
> 11
>
> /usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in main()
> 306 logger.configure_sc_logging()
> 307 warn_debug_file_moved()
> --> 308 StarClusterCLI().main()
> 309
> 310 if __name__ == '__main__':
>
> /usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in main(self)
> 283 log.debug(traceback.format_exc())
> 284 print
> --> 285 self.bug_found()
> 286
> 287
>
> /usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in bug_found(self)
> 150 crashfile = open(static.CRASH_FILE, 'w')
> 151 crashfile.write(header % "CRASH DETAILS")
> --> 152 crashfile.write(session.stream.getvalue())
> 153 crashfile.write(header % "SYSTEM INFO")
> 154 crashfile.write("StarCluster: %s\n" % __version__)
>
> /usr/lib/python2.6/StringIO.pyc in getvalue(self)
> 268 """
> 269 if self.buflist:
> --> 270 self.buf += ''.join(self.buflist)
> 271 self.buflist = []
> 272 return self.buf
>
> MemoryError:
>
> Thanks!
>
> -Wei
--
Wei Tao, Ph.D.
TSI Biocomputing LLC
617-564-0934
Received on Tue Jan 17 2012 - 18:21:37 EST