Actually, on looking closer, this may not be a bug in StarCluster per se.
I think one of my nodes may have crashed, and when StarCluster was summing
the number of processors over the nodes, it failed to ssh to that node.
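
To illustrate the failure mode, here is a minimal sketch of summing processor counts over nodes while tolerating an unreachable one. It is not StarCluster's code: FakeNode, count_processors, the node aliases, the CPU counts, and the local SSHConnectionError class are all hypothetical stand-ins, and the only thing assumed from the traceback is that reading num_processors raises an SSH connection error when the node cannot be reached.

```python
class SSHConnectionError(Exception):
    """Hypothetical stand-in for the error raised when ssh to a node fails."""


class FakeNode:
    """Hypothetical node whose num_processors lookup needs a working ssh link."""

    def __init__(self, alias, cpus, reachable=True):
        self.alias = alias
        self.cpus = cpus
        self.reachable = reachable

    @property
    def num_processors(self):
        # A crashed node cannot be reached, so the lookup raises instead of returning.
        if not self.reachable:
            raise SSHConnectionError(
                "failed to connect to host %s on port 22" % self.alias)
        return self.cpus


def count_processors(nodes):
    """Sum processors over reachable nodes and report the ones that failed."""
    total = 0
    unreachable = []
    for node in nodes:
        try:
            total += node.num_processors
        except SSHConnectionError:
            unreachable.append(node.alias)
    return total, unreachable


# Made-up cluster: one node is marked as crashed.
nodes = [
    FakeNode("master", 2),
    FakeNode("node001", 2),
    FakeNode("node002", 2),
    FakeNode("node003", 2, reachable=False),
]
total, unreachable = count_processors(nodes)
print(total, unreachable)
```

Without the try/except, a single dead node makes the whole sum fail, which would match what the traceback shows.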
Dan
On Thu, Feb 21, 2013 at 12:59 PM, Daniel Povey <dpovey_at_gmail.com> wrote:
> I attach a crash report.
> I think this may be an error in mapping a node name to an internet name.
> The host ec2-23-22-72-123.compute-1.amazonaws.com was not actually
> node004 which I was trying to remove, it was node003.
> Do you think the github version will be better than the released version
> at the moment? I do have the latest release.
> Dan
>
>
> >>> Removing node004 from SGE
> !!! ERROR - command 'source /etc/profile && qconf -de node004' failed with status 1
> !!! ERROR - command 'pkill -9 sge_execd' failed with status 1
> >>> Updating SGE parallel environment 'orte'
> 4/4 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> error occurred in job (id=139906233857792): failed to connect to host ec2-23-22-72-123.compute-1.amazonaws.com on port 22
> Traceback (most recent call last):
>   File "/opt/lib/python2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/threadpool.py", line 31, in run
>     job.run()
>   File "/opt/lib/python2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/threadpool.py", line 58, in run
>     r = self.method(*self.args, **self.kwargs)
>   File "/opt/lib/python2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/plugins/sge.py", line 50, in <lambda>
>     num_processors = sum(self.pool.map(lambda n: n.num_processors, nodes))
>   File "/opt/lib/python2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/node.py", line 169, in num_processors
>     'cat /proc/cpuinfo | grep processor | wc -l')[0])
>   File "/opt/lib/python2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/sshutils/__init__.py", line 519, in execute
>     channel = self.transport.open_session()
>   File "/opt/lib/python2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/sshutils/__init__.py", line 136, in transport
>     port=self._port, timeout=self._timeout)
>   File "/opt/lib/python2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/sshutils/__init__.py", line 103, in connect
>     raise exception.SSHConnectionError(host, port)
> SSHConnectionError: failed to connect to host ec2-23-22-72-123.compute-1.amazonaws.com on port 22
>
>
Received on Thu Feb 21 2013 - 13:25:20 EST