StarCluster - Mailing List Archive

Re: loadbalance error

From: Rajat Banerjee <no email>
Date: Wed, 11 Jan 2012 18:27:17 -0500

I don't think it is a memory leak within the load balancer. elb does not
endlessly add to the host queue; see the first few lines of parse_qhost:

def parse_qhost(self, string):
    """
    this function parses qhost -xml output and makes a neat array
    takes in a string, so we can pipe in output from ssh.exec('qhost -xml')
    """
    self.hosts = []  # clear the old hosts

https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py

This looks like an XML parser issue. Maybe ELB is using the parser
incorrectly. How large is the qhost -xml output, really? I would expect
roughly one XML record per host in the cluster, with each record carrying
many parameters about the status of that host.
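
For what it's worth, an ExpatError at line 1, column 0 usually means the
parser was handed empty or otherwise non-XML input - consistent with qhost
exiting with status 1 and printing an error (or nothing) to stdout. Here is
a minimal sketch of a guard the call site could add before parseString
(parse_qhost_safely is hypothetical, not StarCluster code):

import xml.dom.minidom
from xml.parsers.expat import ExpatError

def parse_qhost_safely(xml_string):
    # qhost -xml should yield roughly one <host> element per node under a
    # <qhost> root; output that doesn't start with '<' is suspect.
    if not xml_string or not xml_string.lstrip().startswith('<'):
        raise ValueError("qhost did not return XML: %r" % str(xml_string)[:200])
    try:
        return xml.dom.minidom.parseString(xml_string)
    except ExpatError as e:
        # 'line 1, column 0' almost always means empty or garbage input
        raise ValueError("bad qhost output (%s): %r" % (e, xml_string[:200]))

That would turn the crash below into an error message that carries the
offending output.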

On Wed, Jan 11, 2012 at 5:09 PM, <starcluster-request_at_mit.edu> wrote:

> Send StarCluster mailing list submissions to
> starcluster_at_mit.edu
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://mailman.mit.edu/mailman/listinfo/starcluster
> or, via email, send a message with subject or body 'help' to
> starcluster-request_at_mit.edu
>
> You can reach the person managing the list at
> starcluster-owner_at_mit.edu
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of StarCluster digest..."
>
> Today's Topics:
>
> 1. Re: loadbalance error (Rayson Ho)
> 2. Re: loadbalance error (Wei Tao)
> 3. Re: loadbalance error (Rayson Ho)
>
>
> ---------- Forwarded message ----------
> From: Rayson Ho <raysonlogin_at_yahoo.com>
> To: Wei Tao <wei.tao_at_tsibiocomputing.com>, "starcluster_at_mit.edu" <
> starcluster_at_mit.edu>
> Cc:
> Date: Wed, 11 Jan 2012 10:39:18 -0800 (PST)
> Subject: Re: [StarCluster] loadbalance error
> The XML parser does not like the output of "qhost -xml". (We changed some
> minor XML output code in Grid Scheduler recently, but since you have
> encountered this with earlier versions as well, it looks like our changes
> are not the cause of this issue.)
>
>
> I just started a 1-node cluster and let the loadbalancer add another node,
> and it all seemed to work fine... From the error message in your email,
> qhost exited with status 1, and a number of things can cause qhost to exit
> with code 1.
>
> When you encounter this problem again, can you run the following command
> from an interactive shell on one of the EC2 nodes?
>
> % qhost -xml
>
> And then send us the output. It could be an issue with how the XML is
> generated in Grid Engine/Grid Scheduler, or it could be something else in
> the XML parser.
>
> Rayson
>
> =================================
> Open Grid Scheduler / Grid Engine
> http://gridscheduler.sourceforge.net/
>
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
>
>
>
> ________________________________
> From: Wei Tao <wei.tao_at_tsibiocomputing.com>
> To: starcluster_at_mit.edu
> Sent: Wednesday, January 11, 2012 10:01 AM
> Subject: [StarCluster] loadbalance error
>
>
> Hi all,
>
> I was running loadbalance. After a while, I got the following error. Can
> someone shed some light on this? This happened with earlier versions of
> StarCluster as well.
>
> >>> Loading full job history
> !!! ERROR - command 'source /etc/profile && qhost -xml' failed with status 1
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.py", line 251, in main
>     sc.execute(args)
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/commands/loadbalance.py", line 89, in execute
>     lb.run(cluster)
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 583, in run
>     if self.get_stats() == -1:
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 529, in get_stats
>     self.stat.parse_qhost(qhostxml)
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/balancers/sge/__init__.py", line 49, in parse_qhost
>     doc = xml.dom.minidom.parseString(string)
>   File "/usr/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString
>     return expatbuilder.parseString(string)
>   File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 940, in parseString
>     return builder.parseString(string)
>   File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 223, in parseString
>     parser.Parse(string, True)
> ExpatError: syntax error: line 1, column 0
>
> ---------------------------------------------------------------------------
> MemoryError                               Traceback (most recent call last)
>
> /usr/local/bin/starcluster in <module>()
>       7 if __name__ == '__main__':
>       8     sys.exit(
> ----> 9         load_entry_point('StarCluster==0.93', 'console_scripts', 'starcluster')()
>      10     )
>      11
>
> /usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in main()
>     306     logger.configure_sc_logging()
>     307     warn_debug_file_moved()
> --> 308     StarClusterCLI().main()
>     309
>     310 if __name__ == '__main__':
>
> /usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in main(self)
>     283             log.debug(traceback.format_exc())
>     284             print
> --> 285             self.bug_found()
>     286
>     287
>
> /usr/local/lib/python2.6/dist-packages/StarCluster-0.93-py2.6.egg/starcluster/cli.pyc in bug_found(self)
>     150         crashfile = open(static.CRASH_FILE, 'w')
>     151         crashfile.write(header % "CRASH DETAILS")
> --> 152         crashfile.write(session.stream.getvalue())
>     153         crashfile.write(header % "SYSTEM INFO")
>     154         crashfile.write("StarCluster: %s\n" % __version__)
>
> /usr/lib/python2.6/StringIO.pyc in getvalue(self)
>     268         """
>     269         if self.buflist:
> --> 270             self.buf += ''.join(self.buflist)
>     271             self.buflist = []
>     272         return self.buf
>
> MemoryError:
>
> Thanks!
>
> -Wei
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
>
>
>
>
> ---------- Forwarded message ----------
> From: Wei Tao <wei.tao_at_tsibiocomputing.com>
> To: Rayson Ho <raysonlogin_at_yahoo.com>
> Cc: "starcluster_at_mit.edu" <starcluster_at_mit.edu>
> Date: Wed, 11 Jan 2012 15:54:52 -0500
> Subject: Re: [StarCluster] loadbalance error
> Thank you, Rayson. I will watch out for the error and run the command you
> suggested if it happens again.
>
> Separately, I ran into another odd behavior of loadbalance. When
> loadbalance removes nodes, there seems to be a timing gap between when a
> node is marked for removal and when it actually becomes inaccessible to
> job submission. In most cases this works fine, but today a job was
> submitted to a node *after* loadbalance had marked it for removal. The
> node was terminated by loadbalance, but because the job had been submitted
> before the node was killed, the job shows up on that node in qstat in the
> "auo" state. When I then tried to remove the node explicitly with the
> removenode command, that failed because the node was no longer there. I
> understand that loadbalance is still experimental, but it seems a good
> idea to tighten the timing of events so that a node is off limits to
> further job submission at the exact moment loadbalance marks it for
> removal. Any gap may have unintended side effects.
>
> Thanks!
>
> -Wei
>
>
> On Wed, Jan 11, 2012 at 1:39 PM, Rayson Ho <raysonlogin_at_yahoo.com> wrote:
>
>> [quoted text snipped]
>
>
>
> --
> Wei Tao, Ph.D.
> TSI Biocomputing LLC
> 617-564-0934
>
>
> ---------- Forwarded message ----------
> From: Rayson Ho <raysonlogin_at_yahoo.com>
> To: Wei Tao <wei.tao_at_tsibiocomputing.com>
> Cc: "starcluster_at_mit.edu" <starcluster_at_mit.edu>
> Date: Wed, 11 Jan 2012 14:09:36 -0800 (PST)
> Subject: Re: [StarCluster] loadbalance error
> I think it is a synchronization issue that can be handled by having the
> load balancer disable nodes before removing them, but there are other
> ways to handle this as well - for example, configuring Grid Engine to
> rerun jobs automatically.
>
> Currently, the load balancer checks whether a node is running jobs (in
> the is_node_working() function), and if nothing is running on the node,
> it removes the node by calling remove_node().
>
> This morning was the first time I tried the load balancer, and I've only
> spent a quick 10 minutes looking at the balancer code... others may have
> things to add and/or other suggestions.
>
> Rayson
>
> =================================
> Open Grid Scheduler / Grid Engine
> http://gridscheduler.sourceforge.net/
>
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
>
>
>
>
> ________________________________
> From: Wei Tao <wei.tao_at_tsibiocomputing.com>
> To: Rayson Ho <raysonlogin_at_yahoo.com>
> Cc: "starcluster_at_mit.edu" <starcluster_at_mit.edu>
> Sent: Wednesday, January 11, 2012 3:54 PM
> Subject: Re: [StarCluster] loadbalance error
>
>
> [quoted text snipped]
>
>
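
P.S. On the timing gap Wei describes: one way to close the window would be
to take a node's queue instances offline before draining and terminating
it, so SGE stops scheduling jobs onto it. A rough sketch using standard SGE
commands (the queue and host names are illustrative, and this is not what
the load balancer currently does):

% qmod -d 'all.q@node001'      # disable the queue instance; no new jobs land here
% qstat -f -q 'all.q@node001'  # poll until no jobs remain on that host
% qconf -de node001            # then drop the host from the execution host list

After that, terminating the instance (e.g. with removenode) should be safe.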
Received on Wed Jan 11 2012 - 18:27:40 EST