The traceback you cited points here:

Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\balancers\sge\__init__.py", line 719, in _eval_add_node

In the develop branch (linked below), line 719 has this, which is puzzling:

    log.info("No queued jobs older than %d seconds" %
             self.longest_allowed_queue_time)

https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py
Three questions:
1) Are you using an up-to-date version?
2) Did you try to override wait_time (aka longest_allowed_queue_time) in your
config file or on the load balancer command line? Otherwise this makes very
little sense: your stack trace looks like add_node failed, not the load
balancer.
3) Are any plugins running?
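
To illustrate why question 2 matters, here is a minimal Python sketch, not StarCluster's actual code. It assumes that an overridden value could reach that log line as a string instead of an integer; I have not confirmed that this happens. If it did, the %d formatting itself would raise a TypeError. The function name format_queue_message is hypothetical and only stands in for the log.info call.

```python
# Hypothetical stand-in for the balancer's log line; not StarCluster's own code.
def format_queue_message(longest_allowed_queue_time):
    # Same format string as the log.info call quoted above.
    return "No queued jobs older than %d seconds" % longest_allowed_queue_time


# An integer value formats without trouble.
print(format_queue_message(900))

# ASSUMPTION: the override arrives as a string (e.g. read from a config
# file without conversion). In that case %d raises TypeError.
try:
    format_queue_message("900")
except TypeError as exc:
    print("formatting failed: %s" % exc)
```

If the value is an integer in your setup, this is not the cause and the failure is more likely in add_node or a plugin, as your stack trace suggests.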
On Tue, Jun 2, 2015 at 3:06 PM, Avner May <avnermay_at_cs.columbia.edu> wrote:
> Hi all,
>
> I was writing because I have been having a lot of issues with the load
> balancer. The most common issue I have is that it fails to remove
> instances effectively. In a super slow fashion, it goes through the
> instances it wants to terminate (this pace is frustrating independent of
> the failure/success of the operation), and one by one fails to terminate
> each one. Then, I am forced to kill a subset of the nodes in my cluster
> manually. But this results in the scheduler being confused by how many
> nodes are actually in the network, so when I later submit jobs to the
> cluster again, it thinks it has enough nodes to handle that load, and
> doesn't create new instances. So I am forced to create a ton of dummy jobs
> (eg, "qsub -V -b y -cwd hostname"), to trick the scheduler into thinking
> that it has more queued jobs than "available" machines. These issues are
> quite annoying.
>
> Additionally, just now I had an issue where the load balancer failed to
> launch a machine:
>
> !!! ERROR - Error occured while running plugin
> 'starcluster.clustersetup.DefaultClusterSetup':
> !!! ERROR - Failed to add new host
> Traceback (most recent call last):
> File
> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\balancers\sge\__init__.py",
> line 719, in _eval_add_node
> self._cluster.add_nodes(need_to_add)
> File
> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\cluster.py",
> line 1042, in add_nodes
> self.run_plugins(method_name="on_add_node", node=node)
> File
> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\cluster.py",
> line 1690, in run_plugins
> self.run_plugin(plug, method_name=method_name, node=node)
> File
> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\cluster.py",
> line 1715, in run_plugin
> func(*args)
> File
> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\clustersetup.py",
> line 425, in on_add_node
> self._setup_etc_hosts(nodes)
> File
> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\clustersetup.py",
> line 252, in _setup_etc_hosts
> self.pool.wait(numtasks=len(nodes))
> File
> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\threadpool.py",
> line 177, in wait
> "An error occurred in ThreadPool", excs)
> ThreadPoolException: An error occurred in ThreadPool
> >>> Sleeping...(looping again in 60 secs)
>
> After getting this error, for some reason the load balancer stopped
> recognizing the existence of the cluster:
>
> C:\Windows\system32>starcluster loadbalance --max_nodes=100 --min_nodes=1
> --add_nodes_per_iter=17 babel2
> StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
> Software Tools for Academics and Researchers (STAR)
> Please submit bug reports to starcluster_at_mit.edu
>
> !!! ERROR - cluster babel2 is not running
>
> Is anyone else hitting similar issues with the load balancer?
>
> Thanks,
> Avner
Received on Tue Jun 02 2015 - 15:25:10 EDT