StarCluster - Mailing List Archive

Re: loadbalance error

From: Rayson Ho <no email>
Date: Wed, 11 Jan 2012 10:39:18 -0800 (PST)

The XML parser does not like the output of "qhost -xml". (We changed some minor XML output code in Grid Scheduler recently,
but since you have encountered this before with earlier versions, it looks like our changes are not the cause of this issue.)


I just started a 1-node cluster and let the load balancer add another node, and it all seemed to work fine. According to the
error message in your email, qhost exited with status 1, and a number of things can cause qhost to exit with that code.


Can you run the following command from an interactive shell on one of the EC2 nodes the next time you encounter this
problem?

% qhost -xml

And then send us the output. It can be an issue related to how the XML is generated in Grid Engine/Grid Scheduler, or it can be
something else in the XML parser.
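
As a rough illustration of the failure mode (not StarCluster's actual code), here is a minimal Python 3 sketch that checks
qhost's exit status and refuses to hand non-XML text to the parser. The function names parse_qhost_xml and run_qhost are
hypothetical, and the sketch assumes that "qhost -xml" emits <host> elements carrying a "name" attribute. A "syntax error:
line 1, column 0" from expat is what you would expect if the text starts with something other than XML, for example an
error message.

```python
import subprocess
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def parse_qhost_xml(text):
    """Return the host names found in `qhost -xml` output.

    Returns None when the text is empty or is not well-formed XML.
    Assumption: hosts appear as <host name="..."> elements.
    """
    if not text or not text.strip():
        return None
    try:
        doc = xml.dom.minidom.parseString(text)
    except ExpatError:
        # Raised when the input is not XML, e.g. an error message.
        return None
    return [h.getAttribute("name") for h in doc.getElementsByTagName("host")]


def run_qhost():
    """Run `qhost -xml` locally and parse it only if the command succeeded."""
    proc = subprocess.run(["qhost", "-xml"], capture_output=True, text=True)
    if proc.returncode != 0:
        # Report stderr instead of passing unusable output to the XML parser.
        print("qhost exited with %d: %s" % (proc.returncode, proc.stderr.strip()))
        return None
    return parse_qhost_xml(proc.stdout)
```

Calling run_qhost() on a node where qhost fails would then report the exit status and stderr rather than raising an
ExpatError from deep inside the XML parser.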

Rayson

=================================
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/



________________________________
From: Wei Tao <wei.tao_at_tsibiocomputing.com>
To: starcluster_at_mit.edu
Sent: Wednesday, January 11, 2012 10:01 AM
Subject: [StarCluster] loadbalance error


Hi all,

I was running loadbalance. After a while, I got the following error. Can someone shed some light on this? This has happened before with earlier versions of StarCluster as well.

>>> Loading full job history
!!! ERROR - command 'source /etc/profile && qhost -xml' failed with status 1
Traceback (most recent call last):
  File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.py", line 251, in main
    sc.execute(args)
  File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/commands/loadbalance.py", line 89, in execute
    lb.run(cluster)
  File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 583, in run
    if self.get_stats() == -1:
  File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 529, in get_stats
    self.stat.parse_qhost(qhostxml)
  File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 49, in parse_qhost
    doc = xml.dom.minidom.parseString(string)
  File "/usr/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
ExpatError: syntax error: line 1, column 0

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)

/usr/local/bin/starcluster in <module>()
      7 if __name__ == '__main__':
      8     sys.exit(
----> 9         load_entry_point('StarCluster==0.93', 'console_scripts', 'starcluster')()
     10     )
     11 

/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in main()
    306     logger.configure_sc_logging()
    307     warn_debug_file_moved()
--> 308     StarClusterCLI().main()
    309 
    310 if __name__ == '__main__':

/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in main(self)
    283             log.debug(traceback.format_exc())
    284             print
--> 285             self.bug_found()
    286 
    287 

/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in bug_found(self)
    150         crashfile = open(static.CRASH_FILE, 'w')
    151         crashfile.write(header % "CRASH DETAILS")
--> 152         crashfile.write(session.stream.getvalue())
    153         crashfile.write(header % "SYSTEM INFO")
    154         crashfile.write("StarCluster: %s\n" % __version__)

/usr/lib/python2.6/StringIO.pyc in getvalue(self)
    268         """
    269         if self.buflist:
--> 270             self.buf += ''.join(self.buflist)
    271             self.buflist = []
    272         return self.buf

MemoryError: 

Thanks!

-Wei
_______________________________________________
StarCluster mailing list
StarCluster_at_mit.edu
http://mailman.mit.edu/mailman/listinfo/starcluster
Received on Wed Jan 11 2012 - 13:39:21 EST
