Sorry, I haven't had time to dig deeply into this. Ironically, it just
occurred on my cluster too. The load balancer kept rolling through the
errors and only lost one node.
From a superficial analysis, StarCluster's add_node code spawns a thread
for each new host to set up the /etc/hosts file. If one fails with an
exception, the thread is joined and the exception is dumped out somewhere,
though not where I'd expect.
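To make that concrete, here is a rough sketch of the pattern as I understand
it. This is not StarCluster's threadpool code; PoolError, run_on_all and the
host names are all made up for illustration. Each host gets a thread, worker
exceptions are collected, and one aggregate error is raised only after every
thread has been joined, which is why the original failure ends up buried.

```python
import threading


class PoolError(Exception):
    """Hypothetical aggregate error; carries the per-host exceptions."""

    def __init__(self, msg, exceptions):
        Exception.__init__(self, msg)
        self.exceptions = exceptions


def run_on_all(hosts, task):
    """Run task(host) in one thread per host and wait for all of them.

    Exceptions raised inside a worker are recorded rather than lost,
    then re-raised together as a single PoolError after the joins.
    """
    errors = []
    lock = threading.Lock()

    def worker(host):
        try:
            task(host)
        except Exception as exc:
            # Record which host failed and why; the thread still exits cleanly.
            with lock:
                errors.append((host, exc))

    threads = [threading.Thread(target=worker, args=(h,)) for h in hosts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if errors:
        raise PoolError("An error occurred in the pool", errors)
```

A caller that wants the real cause has to look inside the aggregate error
(here, the exceptions attribute) instead of relying on the top-level message.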
My stack trace, unlike yours, included this paramiko error:
>>> Configuring /etc/hosts on each node
No handlers could be found for logger "paramiko.transport" |
0%
3/3 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
100%
!!! ERROR - Error occured while running plugin
'starcluster.clustersetup.DefaultClusterSetup':
!!! ERROR - Failed to add new host
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/balancers/sge/__init__.py", line 719, in _eval_add_node
    self._cluster.add_nodes(need_to_add)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/cluster.py", line 1042, in add_nodes
    self.run_plugins(method_name="on_add_node", node=node)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/cluster.py", line 1690, in run_plugins
    self.run_plugin(plug, method_name=method_name, node=node)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/cluster.py", line 1715, in run_plugin
    func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/clustersetup.py", line 425, in on_add_node
    self._setup_etc_hosts(nodes)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/clustersetup.py", line 252, in _setup_etc_hosts
    self.pool.wait(numtasks=len(nodes))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/StarCluster-0.95.6-py2.7.egg/starcluster/threadpool.py", line 177, in wait
    "An error occurred in ThreadPool", excs)
ThreadPoolException: An error occurred in ThreadPool
>>> Sleeping...(looping again in 60 secs)
Stack Overflow has a simple idea for how to solve that:
http://stackoverflow.com/questions/19152578/no-handlers-could-be-found-for-logger-paramiko
It hasn't recurred.
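For anyone who doesn't want to follow the link, the general idea is to give
the paramiko logger a handler before any SSH work starts. The snippet below
is my own sketch of that idea, not a copy of the linked answer, and the
function name and logfile argument are just illustrative. It only uses the
standard logging module; where you call it (for example, at the top of
whatever script launches things) is an assumption on my part.

```python
import logging


def quiet_paramiko_warning(logfile=None):
    """Attach a handler to the "paramiko" logger (illustrative helper).

    "paramiko.transport" is a child of "paramiko", so a handler here is
    enough to stop the "No handlers could be found" message.
    """
    logger = logging.getLogger("paramiko")
    if logfile:
        # Keep paramiko's messages, which may help explain a failed node.
        handler = logging.FileHandler(logfile)
    else:
        # Discard the messages; this only silences the warning.
        handler = logging.NullHandler()
    logger.addHandler(handler)
    return handler
```

Passing a logfile is probably the more useful choice here, since the paramiko
messages may show why the connection to a node failed.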
As for your stack trace logs, I see that you're running Windows, so I don't
think I'll be able to solve your problems.
On Wed, Jun 3, 2015 at 9:19 AM, Avner May <avnermay_at_cs.columbia.edu> wrote:
> Attached are 2 more logs of load balancer crashes.
>
> On Tue, Jun 2, 2015 at 4:27 PM, Avner May <avnermay_at_cs.columbia.edu>
> wrote:
>
>> 1) I am using StarCluster version 0.95.6
>> C:\Windows\system32>starcluster addnode mycluster
>> StarCluster - (http://star.mit.edu/cluster) (*v. 0.95.6*)
>> Software Tools for Academics and Researchers (STAR)
>> Please submit bug reports to starcluster_at_mit.edu
>>
>> 2) I did not try to override wait_time
>>
>> 3) SGE plugin is running
>>
>> And this particular failure occurred when the load balancer was trying to
>> add nodes.
>>
>> Thanks,
>> Avner
>>
>> On Tue, Jun 2, 2015 at 3:24 PM, Rajat Banerjee <rajatb_at_post.harvard.edu>
>> wrote:
>>
>>> The log line you cited:
>>> Traceback (most recent call last):
>>> File
>>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\balancers\sge\__init__.py",
>>> line 719, in _eval_add_node
>>>
>>>
>>> has this, which is puzzling:
>>> log.info("No queued jobs older than %d seconds" % self
>>> .longest_allowed_queue_time)
>>>
>>> https://github.com/jtriley/StarCluster/blob/develop/starcluster/balancers/sge/__init__.py
>>>
>>> Three questions -
>>> 1) Are you using an up-to-date version?
>>> 2) Did you try to override wait_time, aka longest_allowed_queue_time, in
>>> your config file or on the load balancer command line? Otherwise it makes
>>> very little sense; your stack trace looks like add_node failed, not the
>>> load balancer.
>>> 3) Any plugins running?
>>>
>>> On Tue, Jun 2, 2015 at 3:06 PM, Avner May <avnermay_at_cs.columbia.edu>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I was writing because I have been having a lot of issues with the load
>>>> balancer. The most common issue I have is that it fails to remove
>>>> instances effectively. In a super slow fashion, it goes through the
>>>> instances it wants to terminate (this pace is frustrating independent of
>>>> the failure/success of the operation), and one by one fails to terminate
>>>> each one. Then, I am forced to kill a subset of the nodes in my cluster
>>>> manually. But this results in the scheduler being confused by how many
>>>> nodes are actually in the network, so when I later submit jobs to the
>>>> cluster again, it thinks it has enough nodes to handle that load, and
>>>> doesn't create new instances. So I am forced to create a ton of dummy jobs
>>>> (eg, "qsub -V -b y -cwd hostname"), to trick the scheduler into
>>>> thinking that it has more queued jobs than "available" machines. These
>>>> issues are quite annoying.
>>>>
>>>> Additionally, just now I had an issue where the load balancer failed to
>>>> launch a machine:
>>>>
>>>> !!! ERROR - Error occured while running plugin
>>>> 'starcluster.clustersetup.DefaultClusterSetup':
>>>> !!! ERROR - Failed to add new host
>>>> Traceback (most recent call last):
>>>> File
>>>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\balancers\sge\__init__.py",
>>>> line 719, in _eval_add_node
>>>> self._cluster.add_nodes(need_to_add)
>>>> File
>>>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\cluster.py",
>>>> line 1042, in add_nodes
>>>> self.run_plugins(method_name="on_add_node", node=node)
>>>> File
>>>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\cluster.py",
>>>> line 1690, in run_plugins
>>>> self.run_plugin(plug, method_name=method_name, node=node)
>>>> File
>>>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\cluster.py",
>>>> line 1715, in run_plugin
>>>> func(*args)
>>>> File
>>>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\clustersetup.py",
>>>> line 425, in on_add_node
>>>> self._setup_etc_hosts(nodes)
>>>> File
>>>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\clustersetup.py",
>>>> line 252, in _setup_etc_hosts
>>>> self.pool.wait(numtasks=len(nodes))
>>>> File
>>>> "C:\Python27\lib\site-packages\starcluster-0.95.6-py2.7.egg\starcluster\threadpool.py",
>>>> line 177, in wait
>>>> "An error occurred in ThreadPool", excs)
>>>> ThreadPoolException: An error occurred in ThreadPool
>>>> >>> Sleeping...(looping again in 60 secs)
>>>>
>>>> After getting this error, for some reason the load balancer stopped
>>>> recognizing the existence of the cluster:
>>>>
>>>> C:\Windows\system32>starcluster loadbalance --max_nodes=100
>>>> --min_nodes=1 --add_nodes_per_iter=17 babel2
>>>> StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
>>>> Software Tools for Academics and Researchers (STAR)
>>>> Please submit bug reports to starcluster_at_mit.edu
>>>>
>>>> !!! ERROR - cluster babel2 is not running
>>>>
>>>> Is anyone else hitting similar issues with the load balancer?
>>>>
>>>> Thanks,
>>>> Avner
>>>>
Received on Thu Jun 04 2015 - 19:06:12 EDT