StarCluster - Mailing List Archive

Load Balancer issues

From: Avner May <no email>
Date: Tue, 2 Jun 2015 15:06:00 -0400

Hi all,

I am writing because I have been having a lot of issues with the load
balancer. The most common one is that it fails to remove instances
effectively. It works very slowly through the instances it wants to
terminate (the pace is frustrating whether or not the operation succeeds),
and one by one fails to terminate each of them. I am then forced to kill a
subset of the nodes in my cluster manually. But this leaves the scheduler
confused about how many nodes are actually in the cluster, so when I later
submit jobs again, it thinks it has enough nodes to handle the load and
doesn't create new instances. So I have to submit a ton of dummy jobs
(e.g., "qsub -V -b y -cwd hostname") to trick the scheduler into thinking
that it has more queued jobs than "available" machines. These issues are
quite annoying.
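
To illustrate why the dummy jobs help, here is a rough sketch of the kind of
decision I assume the balancer is making. This is a simplification for
illustration only, not StarCluster's actual logic; the function name and all
of its parameters are hypothetical.

```python
def nodes_to_add(queued_jobs, believed_nodes, slots_per_node, max_nodes):
    """Return how many nodes a naive balancer would request.

    Hypothetical model: the balancer only scales up when the number of
    queued jobs exceeds the slots it *believes* the cluster has.
    """
    believed_slots = believed_nodes * slots_per_node
    shortfall = queued_jobs - believed_slots
    if shortfall <= 0:
        # The balancer thinks existing nodes can absorb the queue.
        return 0
    # Ceiling division using integers only.
    needed = -(-shortfall // slots_per_node)
    # Never exceed the configured maximum cluster size.
    return max(0, min(needed, max_nodes - believed_nodes))


# Scheduler still believes in 10 nodes, but only 4 survived a manual kill.
stale = nodes_to_add(queued_jobs=8, believed_nodes=10,
                     slots_per_node=1, max_nodes=100)
accurate = nodes_to_add(queued_jobs=8, believed_nodes=4,
                        slots_per_node=1, max_nodes=100)
# Padding the queue with 6 dummy jobs works around the stale count.
padded = nodes_to_add(queued_jobs=14, believed_nodes=10,
                      slots_per_node=1, max_nodes=100)
```

Under these assumptions the stale node count yields no new nodes, while
either an accurate count or a queue padded with dummy jobs triggers a
scale-up.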

Additionally, just now I had an issue where the load balancer failed to
launch a machine:

!!! ERROR - Error occured while running plugin 'starcluster.clustersetup.DefaultClusterSetup':
!!! ERROR - Failed to add new host
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\balancers\sge\__init__.py", line 719, in _eval_add_node
    self._cluster.add_nodes(need_to_add)
  File "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\cluster.py", line 1042, in add_nodes
    self.run_plugins(method_name="on_add_node", node=node)
  File "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\cluster.py", line 1690, in run_plugins
    self.run_plugin(plug, method_name=method_name, node=node)
  File "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\cluster.py", line 1715, in run_plugin
    func(*args)
  File "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\clustersetup.py", line 425, in on_add_node
    self._setup_etc_hosts(nodes)
  File "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\clustersetup.py", line 252, in _setup_etc_hosts
    self.pool.wait(numtasks=len(nodes))
  File "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\threadpool.py", line 177, in wait
    "An error occurred in ThreadPool", excs)
ThreadPoolException: An error occurred in ThreadPool
>>> Sleeping...(looping again in 60 secs)

After getting this error, for some reason the load balancer stopped
recognizing the existence of the cluster:

C:\Windows\system32>starcluster loadbalance --max_nodes=100 --min_nodes=1 --add_nodes_per_iter=17 babel2
StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster_at_mit.edu

!!! ERROR - cluster babel2 is not running

Is anyone else hitting similar issues with the load balancer?

Thanks,
Avner
Received on Tue Jun 02 2015 - 15:06:23 EDT