StarCluster - Mailing List Archive

Re: Loadbalancer error - node does not exist

From: Jacob Barhak <no email>
Date: Sat, 4 Oct 2014 22:40:30 -0500

Hi Amanda,

Did you check your I/O when running your application?

In the past, I generated too much NFS traffic because my jobs touched many files, and that slowed things down considerably.

It may not be your issue, yet it is worth checking anyway.
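
One way to check this is to sample the NFS client RPC counters on a node while the application runs. The following is only a minimal sketch, assuming a Linux node that exposes the standard /proc/net/rpc/nfs file (the same counters nfsstat reports); the function names and the 5-second interval are illustrative and are not part of StarCluster.

```python
import time


def parse_rpc_counters(text):
    """Return (calls, retransmissions) from the 'rpc' line of /proc/net/rpc/nfs."""
    for line in text.splitlines():
        fields = line.split()
        # The client-side line has the form: rpc <calls> <retrans> <authrefrsh>
        if len(fields) >= 3 and fields[0] == "rpc":
            return int(fields[1]), int(fields[2])
    raise ValueError("no 'rpc' line found")


def rpc_rate(before, after, seconds):
    """Return (calls per second, retransmissions per second) between two samples."""
    calls = after[0] - before[0]
    retrans = after[1] - before[1]
    return calls / float(seconds), retrans / float(seconds)


def sample_nfs(path="/proc/net/rpc/nfs", interval=5.0):
    """Read the counters twice, `interval` seconds apart, and return the rates."""
    with open(path) as f:
        before = parse_rpc_counters(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_rpc_counters(f.read())
    return rpc_rate(before, after, interval)


if __name__ == "__main__":
    calls_per_s, retrans_per_s = sample_nfs()
    print("NFS RPC calls/s: %.1f, retransmissions/s: %.1f"
          % (calls_per_s, retrans_per_s))
```

A call rate that climbs as nodes are added, or a retransmission rate that stays above zero, would suggest the NFS server is struggling to keep up. What counts as too high depends on the instance type, so compare against a run where the cluster behaves well.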

I hope you resolve your issue regardless.

         Jacob

Sent from my iPhone

On Oct 4, 2014, at 1:07 PM, Amanda Joy Kedaigle <mandyjoy_at_mit.edu> wrote:

> Update: It seems like it might start happening whenever the cluster reaches its maximum capacity of 16 nodes. Any ideas about what to look for would be appreciated, as this is getting expensive.
>
> Amanda
>
> From: Amanda Joy Kedaigle
> Sent: Thursday, October 02, 2014 12:36 PM
> To: starcluster_at_mit.edu
> Subject: Loadbalancer error - node does not exist
>
> Hi, I'm running the Elastic LoadBalancer to keep our cluster down to one node when we're not using it, and then ramp up as needed. Generally (i.e. when I run tests and watch it), it works just fine. But twice now, we've had it fail to remove nodes overnight and give the following error, leaving the cluster at full blast with no jobs to run. It says the nodes don't exist, but they are there both on the AWS EC2 console and when I run qhost on the cluster. Any ideas as to the cause? Thanks!
>
> >>> Removing node013 from SGE
> !!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':
> !!! ERROR - Failed to remove node node013
> Traceback (most recent call last):
>   File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 754, in _eval_remove_node
>     self._cluster.remove_node(node)
>   File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1050, in remove_node
>     force=force)
>   File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1076, in remove_nodes
>     reverse=True)
>   File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1690, in run_plugins
>     self.run_plugin(plug, method_name=method_name, node=node)
>   File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/cluster.py", line 1715, in run_plugin
>     func(*args)
>   File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/plugins/sge.py", line 204, in on_remove_node
>     self._remove_from_sge(node)
>   File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/plugins/sge.py", line 166, in _remove_from_sge
>     master.ssh.execute('qconf -de %s' % node.alias)
>   File "/home/mandyjoy/ENV/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/sshutils.py", line 579, in execute
>     msg, command, exit_status, out_str)
> RemoteCommandFailed: remote command 'source /etc/profile && qconf -de node013' failed with status 1:
> denied: execution host "node013" does not exist
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
Received on Sat Oct 04 2014 - 23:40:42 EDT