StarCluster - Mailing List Archive

commlib error

From: Amanda Joy Kedaigle <no email>
Date: Mon, 22 Sep 2014 21:13:13 +0000

Hi,

I am trying to run StarCluster's load balancer so that only one node stays up until jobs are submitted to the cluster. I know it's an experimental feature, but I'm wondering whether anyone has run into this error before or has any suggestions. The cluster was whittled down to 1 node after a weekend of inactivity, and now, when jobs are submitted to the queue, SGE fails instead of adding nodes.

>>> Loading full job history
*** WARNING - Failed to retrieve stats (1/5):
Traceback (most recent call last):
  File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 552, in get_stats
    return self._get_stats()
  File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 522, in _get_stats
    qhostxml = '\n'.join(master.ssh.execute('qhost -xml'))
  File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/sshutils.py", line 578, in execute
    msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && qhost -xml' failed with status 1:
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 63231 on host "master": got send error
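
The "Connection refused" in the output above usually means nothing accepted the TCP connection on the qmaster port at that moment. A minimal sketch for checking that from the master node follows. It assumes Python 3 and only the standard library; the helper name `port_is_open` is made up for illustration and is not part of StarCluster or SGE.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper for illustration; not a StarCluster or SGE API.
    """
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling `port_is_open("master", 63231)` with the host and port taken from the error message would return False while the connection is still being refused. That would point at the qmaster daemon or its port rather than at the load balancer itself, though it does not say why the daemon is unreachable.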

Thanks for any help!
Amanda
Received on Mon Sep 22 2014 - 17:13:38 EDT
