Hi Starcluster
I have been building some plugins that take a while to run because they build and install large libraries. As a result, I have seen ssh terminate my connection while the process is still running. The dropped connection seems to return exit code 1, although the process continues on the cluster.
For my custom plugins, I added the following line to apply a keepalive to the ssh transport:
    def run(self, nodes, master, user, user_shell, volumes):
        …
        for node in nodes:
            node.ssh.transport.set_keepalive(30)
        …
This works at the plugin level, but you might consider adding it somewhere in StarCluster itself, probably in the connect method of SSHClient:
https://github.com/jtriley/StarCluster/blob/develop/starcluster/sshutils.py#L100
Here are the relevant methods from paramiko:
https://github.com/paramiko/paramiko/blob/master/paramiko/packet.py#L175
https://github.com/paramiko/paramiko/blob/master/paramiko/transport.py#L762
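As a rough sketch of the suggestion above, the keepalive could be applied in one place right after the transport is established. The helper name `apply_keepalive` and the `FakeTransport` stand-in are hypothetical and only there for illustration. The one real call assumed is paramiko's `Transport.set_keepalive(interval)`. Where this would actually hook into StarCluster's SSHClient is an assumption on my part.

```python
def apply_keepalive(transport, interval=30):
    """Ask the SSH transport to send a keepalive packet every `interval` seconds.

    `transport` is expected to behave like paramiko.Transport, which
    provides set_keepalive(interval). An interval of 0 disables keepalives.
    """
    if interval < 0:
        raise ValueError("interval must be >= 0 seconds")
    transport.set_keepalive(interval)
    return transport


class FakeTransport(object):
    """Hypothetical stand-in for paramiko.Transport (illustration only)."""

    def __init__(self):
        self.keepalive = 0

    def set_keepalive(self, interval):
        # Record the requested interval instead of touching a real socket.
        self.keepalive = interval
```

With a real connection, the same helper would presumably be called as `apply_keepalive(node.ssh.transport, 30)`, which is equivalent to the one-line plugin change shown earlier.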
Another step that would help is to set a longer disconnect timeout in the default /etc/ssh/sshd_config in the cluster AMI.
For instance I have used one of my plugins to set:
ClientAliveInterval 600
ClientAliveCountMax 3
That should keep unresponsive ssh connections open for half an hour (3 x 600 seconds).
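To double-check the arithmetic, here is a small sketch that reads the two settings out of sshd_config text and multiplies them. The function name `client_alive_timeout` is hypothetical, and the parsing is deliberately simplified: it assumes plain whitespace-separated `Keyword value` lines with integer seconds, and it ignores Match blocks, `Keyword=value` syntax, and time suffixes.

```python
def client_alive_timeout(sshd_config_text):
    """Return seconds before sshd drops an unresponsive client, or None.

    Computed as ClientAliveInterval * ClientAliveCountMax. Returns None when
    ClientAliveInterval is absent or 0, meaning client-alive checks are off.
    """
    settings = {}
    for raw_line in sshd_config_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts
        # Keywords are case-insensitive; keep the first value seen.
        settings.setdefault(key.lower(), value.strip())

    # sshd defaults: interval 0 (disabled), count max 3.
    interval = int(settings.get("clientaliveinterval", 0))
    count_max = int(settings.get("clientalivecountmax", 3))
    if interval == 0:
        return None
    return interval * count_max
```

For the two lines above, `client_alive_timeout("ClientAliveInterval 600\nClientAliveCountMax 3")` gives 1800 seconds, i.e. 30 minutes.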
David Stuebe
Scientist & Software Engineer
55 Village Square Drive
South Kingstown, RI 02879-8248
Tel: +1 (401) 789-6224
Email: David.Stuebe_at_rpsgroup.com
www: asascience.com <http://www.asascience.com/> | rpsgroup.com <http://www.rpsgroup.com/>
A member of the RPS Group plc
Received on Fri Apr 25 2014 - 11:19:30 EDT