StarCluster - Mailing List Archive

set_keepalive

From: David Stuebe <no email>
Date: Fri, 25 Apr 2014 15:19:16 +0000

Hi StarCluster,

I have been building some plugins that take a while to run because they build and install large libraries. As a result, I have seen ssh terminate my connection while the process is still running. The command then appears to return exit code 1, even though the process continues on the cluster.

For my custom plugins I added the following line to apply a keep alive to the ssh transport.

def run(self, nodes, master, user, user_shell, volumes):
    # Send an SSH keepalive every 30 seconds on each node's connection.
    for node in nodes:
        node.ssh.transport.set_keepalive(30)


This works at the plugin level, but you might consider adding it to StarCluster itself, probably in the connect method of SSHClient:
https://github.com/jtriley/StarCluster/blob/develop/starcluster/sshutils.py#L100
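
A minimal sketch of what that could look like, assuming the client holds a paramiko transport once it is connected. The names enable_keepalive, KeepaliveSSHClient, transport_factory and DEFAULT_KEEPALIVE are hypothetical and are not part of StarCluster; the only real API call used is paramiko's Transport.set_keepalive(interval).

```python
DEFAULT_KEEPALIVE = 30  # seconds; assumed default, same value as the plugin above


def enable_keepalive(transport, interval=DEFAULT_KEEPALIVE):
    """Turn on SSH keepalive packets for an open transport.

    transport: a paramiko.Transport, or None if no connection exists yet.
    interval:  seconds between keepalives; paramiko treats 0 as "disabled".
    Returns True if the setting was applied, False otherwise.
    """
    if transport is None:
        return False
    transport.set_keepalive(interval)
    return True


class KeepaliveSSHClient(object):
    """Hypothetical stand-in for a client class; not StarCluster's SSHClient."""

    def __init__(self, transport_factory, keepalive=DEFAULT_KEEPALIVE):
        # transport_factory is any callable that returns a connected transport.
        self._transport_factory = transport_factory
        self._keepalive = keepalive
        self.transport = None

    def connect(self):
        # Open the connection, then apply the keepalive once, in one place,
        # so individual plugins no longer have to do it themselves.
        self.transport = self._transport_factory()
        enable_keepalive(self.transport, self._keepalive)
        return self
```

Doing this inside connect would mean every long-running plugin gets the keepalive automatically, and the interval could be exposed as a config option rather than hard-coded.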


Here are the relevant methods from paramiko:
https://github.com/paramiko/paramiko/blob/master/paramiko/packet.py#L175
https://github.com/paramiko/paramiko/blob/master/paramiko/transport.py#L762


Another step that would help is to set a longer disconnect timeout in the default /etc/ssh/sshd_config in the cluster AMI.

For instance I have used one of my plugins to set:
ClientAliveInterval 600
ClientAliveCountMax 3

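With these settings sshd probes an idle client every 600 seconds and drops the connection only after 3 probes go unanswered, so ssh connections should stay open for about half an hour.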
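
As a rough sketch of how a plugin might prepare that change, the helper below rewrites sshd_config text so that the two directives are set exactly once, and computes the resulting timeout. The function names set_sshd_options and unresponsive_timeout are hypothetical. The sketch assumes plain "Keyword value" lines, puts the new lines at the top of the file so they sit outside any Match block, and leaves writing the file back and restarting sshd to the caller.

```python
import re


def set_sshd_options(config_text, options):
    """Return config_text with each option in `options` set exactly once.

    Active (uncommented) lines for those keywords are dropped, and the new
    "Keyword value" lines are placed at the top of the file.
    """
    keys = "|".join(re.escape(key) for key in options)
    active = re.compile(r"^\s*(?:%s)\s" % keys, re.IGNORECASE)
    kept = [line for line in config_text.splitlines()
            if not active.match(line)]
    new_lines = ["%s %s" % (key, value)
                 for key, value in sorted(options.items())]
    return "\n".join(new_lines + kept) + "\n"


def unresponsive_timeout(interval, count_max):
    """Seconds before sshd drops a client that answers no keepalive probes."""
    return interval * count_max
```

For the values above, unresponsive_timeout(600, 3) gives 1800 seconds, which is the half hour mentioned.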

David Stuebe
Scientist & Software Engineer

55 Village Square Drive
South Kingstown, RI 02879-8248

Tel: +1 (401) 789-6224
Email: David.Stuebe_at_rpsgroup.com<mailto:David.Stuebe_at_rpsgroup.com>
www: asascience.com<http://www.asascience.com/> | rpsgroup.com<http://www.rpsgroup.com/>

A member of the RPS Group plc
Received on Fri Apr 25 2014 - 11:19:30 EDT
