Hi Amanda,
It looks like you cannot communicate with the master node anymore. The
error message appears because StarCluster failed to execute the command
'source /etc/profile && qhost -xml' on the master, which returned a
'Connection refused' error.
Can you paste the output of the following two commands?
> starcluster listclusters (should list the status of all your active
clusters and running nodes)
> starcluster sshmaster <your cluster name> (I'm expecting this to fail)
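If it is easier to collect both outputs in one go, here is a minimal sketch that runs the two commands above and captures what they print. It assumes Python 3.5+ is available to run the script and that `starcluster` is on your PATH. The helper names `diagnostic_commands` and `run_diagnostics` are made up for illustration and are not part of StarCluster. Since `sshmaster` normally opens an interactive shell, the sketch closes stdin and applies a timeout so it cannot hang.

```python
import subprocess


def diagnostic_commands(cluster_name):
    """Return the two commands from this message as argv lists."""
    return [
        ["starcluster", "listclusters"],
        ["starcluster", "sshmaster", cluster_name],
    ]


def run_diagnostics(cluster_name, runner=subprocess.run):
    """Run each command; return (command, exit code, output) tuples.

    `runner` is injectable (an assumption of this sketch) so the
    function can be exercised without a real cluster.
    """
    results = []
    for argv in diagnostic_commands(cluster_name):
        label = " ".join(argv)
        try:
            proc = runner(
                argv,
                stdin=subprocess.DEVNULL,   # no interactive input
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,   # merge stderr into stdout
                universal_newlines=True,    # decode output as text
                timeout=120,                # give up after two minutes
            )
            results.append((label, proc.returncode, proc.stdout))
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Command missing, not executable, or timed out.
            results.append((label, None, str(exc)))
    return results
```

Calling `run_diagnostics("<your cluster name>")` and printing each tuple should give you something you can paste straight into a reply.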
Raj
On Mon, Sep 22, 2014 at 5:13 PM, Amanda Joy Kedaigle <mandyjoy_at_mit.edu>
wrote:
> Hi,
>
> I am trying to run starcluster's loadbalancer to keep only one node
> running until jobs are submitted to the cluster. I know it's an
> experimental feature, but I'm wondering if anyone has run into this error
> before, or has any suggestions. The cluster has been whittled down to 1
> node after a weekend of inactivity, and now it seems that when jobs are
> submitted to the queue, instead of adding nodes, SGE fails.
>
> >>> Loading full job history
> *** WARNING - Failed to retrieve stats (1/5):
> Traceback (most recent call last):
>   File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 552, in get_stats
>     return self._get_stats()
>   File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/balancers/sge/__init__.py", line 522, in _get_stats
>     qhostxml = '\n'.join(master.ssh.execute('qhost -xml'))
>   File "/net/dorsal/apps/python2.7/lib/python2.7/site-packages/StarCluster-0.95.5-py2.7.egg/starcluster/sshutils.py", line 578, in execute
>     msg, command, exit_status, out_str)
> RemoteCommandFailed: remote command 'source /etc/profile && qhost -xml' failed with status 1:
> error: commlib error: got select error (Connection refused)
> error: unable to send message to qmaster using port 63231 on host "master": got send error
>
> Thanks for any help!
> Amanda
>
Received on Tue Sep 23 2014 - 09:33:55 EDT