StarCluster - Mailing List Archive

Re: "connection closed" when running "starcluster sshmaster"

From: Rayson Ho <no email>
Date: Fri, 17 Jan 2014 13:18:44 -0500

On Fri, Jan 17, 2014 at 11:11 AM, Signell, Richard <rsignell_at_usgs.gov> wrote:
> Rayson,
>
> Okay, I tried your suggestion:
>
> 1. I can ssh to node001 just fine.
>

That's good, so it means that the private key on your side works. And
on Amazon's side, the slave node also has a working public key.
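
If you want to double-check which public key goes with your private key, something like the following might help. It is only a sketch: the helper name key_listed is made up for illustration, and /root/.ssh/authorized_keys is assumed to be where the node keeps root's authorized keys.

```shell
#!/bin/sh
# Sketch only: key_listed is a hypothetical helper, not part of StarCluster.

# Succeed if the base64 blob of the public key line in $1 (in the format
# printed by "ssh-keygen -y") appears anywhere in the file $2.
# Only the blob is compared, so key options or a different trailing
# comment in the file do not affect the result.
key_listed() {
    blob=$(printf '%s\n' "$1" | awk '{print $2}')
    [ -n "$blob" ] && grep -qF -- "$blob" "$2"
}

# Usage (not run here; paths are assumptions taken from this thread):
#   pub=$(ssh-keygen -y -f /home/rsignell/.ssh/mykey2.rsa)   # on your machine
#   key_listed "$pub" /root/.ssh/authorized_keys && echo "key is authorized"
```

The private key never has to leave your machine: print the public key locally, then paste it into a shell on the node and run the check there.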


> 2. I cannot ssh to master:
>
...
>
> Does that give a clue?

So it seems like the master node doesn't like your key. You can try:

- First SSH into node001, and then run "ssh master".

- And if the above still fails, try to use the Grid Engine mechanism
to get back into the master node. So, first SSH into node001. Then run
"qrsh -l h=master":

root@node001:~# qrsh -l h=master
groups: cannot find name for group ID 20003
root@master:~#

(Ignore the complaint about the extra group ID (GID); it is an
additional GID added by Grid Engine.)
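
The first option above (SSH into node001, then "ssh master") can also be done as a single command from your own machine. A rough sketch, assuming node001's hostname and the key path from earlier in this thread; the helper name hop_cmd and the 10-second timeout are my own choices, and the script only prints the command so you can review it before running it:

```shell
#!/bin/sh
# Sketch only. KEY and NODE_HOST come from this thread; the 10-second
# timeout and the helper name hop_cmd are assumptions for illustration.
KEY="${KEY:-$HOME/.ssh/mykey2.rsa}"
NODE_HOST="${NODE_HOST:-ec2-54-196-2-68.compute-1.amazonaws.com}"

# Print (not run) an ssh command that logs into node001 as root and runs
# the given command there.
#   -t                 requests a terminal, needed for the nested login
#   ConnectTimeout=10  gives up if no connection is made within 10 seconds
hop_cmd() {
    printf 'ssh -t -o ConnectTimeout=10 -i %s root@%s %s\n' \
        "$KEY" "$NODE_HOST" "$1"
}

# Go through node001 and from there SSH to the master.
hop_cmd "ssh master"
```

Note that ConnectTimeout only covers setting up the connection; it will not end a session that stalls after login.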


> [also, is it okay to have this type of discussion on this list]?

As long as it is StarCluster, Amazon EC2, and Grid Engine related,
then I am happy to help! :-D

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html


>
> -Rich
>
> On Fri, Jan 17, 2014 at 10:56 AM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
>> Did you overwrite your SSH private key
>> (/home/rsignell/.ssh/mykey2.rsa) with a new one?
>>
>> Also, can you run the SSH client directly from the command line with
>> verbose (-v) on and see if that gives you anything?
>>
>> Example:
>>
>> % ssh -v -i /home/rsignell/.ssh/mykey2.rsa
>> root@ec2-54-196-2-68.compute-1.amazonaws.com
>>
>> Rayson
>>
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>>
>>
>> On Fri, Jan 17, 2014 at 8:03 AM, Signell, Richard <rsignell_at_usgs.gov> wrote:
>>> Rayson,
>>>
>>> I tried ssh'ing into node001 as you suggested, but the process just
>>> seems to hang. I waited 5 minutes, tried to ctrl-c, ctrl-z, nothing
>>> worked. Finally killed terminal.
>>>
>>> What should I try next?
>>>
>>> rsignell@gam:~$ starcluster -d sshnode rps_cluster node001
>>> StarCluster - (http://star.mit.edu/cluster) (v. 0.94.3)
>>> Software Tools for Academics and Researchers (STAR)
>>> Please submit bug reports to starcluster_at_mit.edu
>>>
>>> 2014-01-17 07:59:34,900 config.py:567 - DEBUG - Loading config
>>> 2014-01-17 07:59:34,900 config.py:138 - DEBUG - Loading file:
>>> /home/rsignell/.starcluster/config
>>> ...
>>>
>>> 2014-01-17 07:59:34,935 awsutils.py:74 - DEBUG - creating self._conn
>>> w/ connection_authenticator kwargs = {'proxy_user': None,
>>> 'proxy_pass': None, 'proxy_port': None, 'proxy': None, 'is_secure':
>>> True, 'path': '/', 'region': None, 'validate_certs': True, 'port':
>>> None}
>>> 2014-01-17 07:59:35,797 cluster.py:711 - DEBUG - existing nodes: {}
>>> 2014-01-17 07:59:35,797 cluster.py:719 - DEBUG - adding node
>>> i-7a50f654 to self._nodes list
>>> 2014-01-17 07:59:35,797 cluster.py:719 - DEBUG - adding node
>>> i-7950f657 to self._nodes list
>>> 2014-01-17 07:59:35,798 cluster.py:727 - DEBUG - returning self._nodes
>>> = [<Node: master (i-7950f657)>, <Node: node001 (i-7a50f654)>]
>>> 2014-01-17 07:59:35,905 cluster.py:711 - DEBUG - existing nodes:
>>> {u'i-7a50f654': <Node: node001 (i-7a50f654)>, u'i-7950f657': <Node:
>>> master (i-7950f657)>}
>>> 2014-01-17 07:59:35,906 cluster.py:714 - DEBUG - updating existing
>>> node i-7a50f654 in self._nodes
>>> 2014-01-17 07:59:35,906 cluster.py:714 - DEBUG - updating existing
>>> node i-7950f657 in self._nodes
>>> 2014-01-17 07:59:35,906 cluster.py:727 - DEBUG - returning self._nodes
>>> = [<Node: master (i-7950f657)>, <Node: node001 (i-7a50f654)>]
>>> 2014-01-17 07:59:36,119 node.py:1039 - DEBUG - Using native OpenSSH client
>>> 2014-01-17 07:59:36,119 node.py:1050 - DEBUG - ssh_cmd: ssh -i
>>> /home/rsignell/.ssh/mykey2.rsa
>>> root@ec2-54-196-2-68.compute-1.amazonaws.com
>>> [wait, wait.... nothing....]
>>>
>>>
>>> On Thu, Jan 16, 2014 at 5:24 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
>>>> The SSH daemon is responding (and the EC2 security group is not
>>>> blocking traffic), which is good.
>>>>
>>>> However, if logging onto the master was working a few hours ago and no
>>>> longer is, then try to log onto the Grid Engine execution node by using,
>>>> for example, "starcluster sshnode rps_cluster node001". If SSHing into
>>>> the execution node works, then it is likely to be an issue with the
>>>> StarCluster master instance.
>>>>
>>>> Rayson
>>>>
>>>> ==================================================
>>>> Open Grid Scheduler - The Official Open Source Grid Engine
>>>> http://gridscheduler.sourceforge.net/
>>>> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>>>>
>>>>
>>>> On Thu, Jan 16, 2014 at 4:55 PM, Signell, Richard <rsignell_at_usgs.gov> wrote:
>>>>> I set up a machine this morning and
>>>>> starcluster sshmaster rps_cluster
>>>>> was working fine to ssh in.
>>>>>
>>>>> But now I'm getting "Connection closed by 54.204.55.67"
>>>>>
>>>>> It seems that the cluster is running:
>>>>>
>>>>> rsignell@gam:~$ starcluster listclusters
>>>>> StarCluster - (http://star.mit.edu/cluster) (v. 0.94.3)
>>>>> Software Tools for Academics and Researchers (STAR)
>>>>> Please submit bug reports to starcluster_at_mit.edu
>>>>>
>>>>> ---------------------------------------------
>>>>> rps_cluster (security group: @sc-rps_cluster)
>>>>> ---------------------------------------------
>>>>> Launch time: 2014-01-16 08:18:09
>>>>> Uptime: 0 days, 08:34:07
>>>>> Zone: us-east-1a
>>>>> Keypair: mykey2
>>>>> EBS volumes: N/A
>>>>> Cluster nodes:
>>>>> master running i-7950f657 ec2-54-204-55-67.compute-1.amazonaws.com
>>>>> node001 running i-7a50f654 ec2-54-196-2-68.compute-1.amazonaws.com
>>>>> Total nodes: 2
>>>>>
>>>>> And I don't see anything obvious in the verbose debug output:
>>>>>
>>>>> rsignell@gam:~$ starcluster -d sshmaster rps_cluster
>>>>> StarCluster - (http://star.mit.edu/cluster) (v. 0.94.3)
>>>>> Software Tools for Academics and Researchers (STAR)
>>>>> Please submit bug reports to starcluster_at_mit.edu
>>>>>
>>>>> 2014-01-16 16:53:13,515 config.py:567 - DEBUG - Loading config
>>>>> 2014-01-16 16:53:13,515 config.py:138 - DEBUG - Loading file:
>>>>> /home/rsignell/.starcluster/config
>>>>> 2014-01-16 16:53:13,517 config.py:322 - DEBUG - include setting not
>>>>> specified. Defaulting to []
>>>>> 2014-01-16 16:53:13,518 config.py:322 - DEBUG - web_browser setting
>>>>> not specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,518 config.py:322 - DEBUG - refresh_interval
>>>>> setting not specified. Defaulting to 30
>>>>> 2014-01-16 16:53:13,518 config.py:322 - DEBUG - include setting not
>>>>> specified. Defaulting to []
>>>>> 2014-01-16 16:53:13,518 config.py:322 - DEBUG - web_browser setting
>>>>> not specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,519 config.py:322 - DEBUG - refresh_interval
>>>>> setting not specified. Defaulting to 30
>>>>> 2014-01-16 16:53:13,519 config.py:322 - DEBUG - aws_proxy_pass setting
>>>>> not specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,519 config.py:322 - DEBUG - aws_validate_certs
>>>>> setting not specified. Defaulting to True
>>>>> 2014-01-16 16:53:13,520 config.py:322 - DEBUG - aws_ec2_path setting
>>>>> not specified. Defaulting to /
>>>>> 2014-01-16 16:53:13,520 config.py:322 - DEBUG - aws_region_name
>>>>> setting not specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,521 config.py:322 - DEBUG - aws_region_host
>>>>> setting not specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,521 config.py:322 - DEBUG - aws_s3_path setting
>>>>> not specified. Defaulting to /
>>>>> 2014-01-16 16:53:13,521 config.py:322 - DEBUG - aws_proxy_user setting
>>>>> not specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,521 config.py:322 - DEBUG - aws_is_secure setting
>>>>> not specified. Defaulting to True
>>>>> 2014-01-16 16:53:13,522 config.py:322 - DEBUG - aws_s3_host setting
>>>>> not specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,522 config.py:322 - DEBUG - aws_port setting not
>>>>> specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,522 config.py:322 - DEBUG - ec2_private_key
>>>>> setting not specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,522 config.py:322 - DEBUG - ec2_cert setting not
>>>>> specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,523 config.py:322 - DEBUG - aws_proxy setting not
>>>>> specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,523 config.py:322 - DEBUG - aws_proxy_port setting
>>>>> not specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,523 config.py:322 - DEBUG - device setting not
>>>>> specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,523 config.py:322 - DEBUG - partition setting not
>>>>> specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,524 config.py:322 - DEBUG - device setting not
>>>>> specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,524 config.py:322 - DEBUG - partition setting not
>>>>> specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,525 config.py:322 - DEBUG - disable_queue setting
>>>>> not specified. Defaulting to False
>>>>> 2014-01-16 16:53:13,525 config.py:322 - DEBUG - volumes setting not
>>>>> specified. Defaulting to []
>>>>> 2014-01-16 16:53:13,525 config.py:322 - DEBUG - availability_zone
>>>>> setting not specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - spot_bid setting not
>>>>> specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - master_instance_type
>>>>> setting not specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - disable_cloudinit
>>>>> setting not specified. Defaulting to False
>>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - force_spot_master
>>>>> setting not specified. Defaulting to False
>>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - extends setting not
>>>>> specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - master_image_id
>>>>> setting not specified. Defaulting to None
>>>>> 2014-01-16 16:53:13,527 config.py:322 - DEBUG - userdata_scripts
>>>>> setting not specified. Defaulting to []
>>>>> 2014-01-16 16:53:13,527 config.py:322 - DEBUG - permissions setting
>>>>> not specified. Defaulting to []
>>>>> 2014-01-16 16:53:13,529 awsutils.py:74 - DEBUG - creating self._conn
>>>>> w/ connection_authenticator kwargs = {'proxy_user': None,
>>>>> 'proxy_pass': None, 'proxy_port': None, 'proxy': None, 'is_secure':
>>>>> True, 'path': '/', 'region': None, 'validate_certs': True, 'port':
>>>>> None}
>>>>> 2014-01-16 16:53:13,872 cluster.py:711 - DEBUG - existing nodes: {}
>>>>> 2014-01-16 16:53:13,872 cluster.py:719 - DEBUG - adding node
>>>>> i-7a50f654 to self._nodes list
>>>>> 2014-01-16 16:53:13,873 cluster.py:719 - DEBUG - adding node
>>>>> i-7950f657 to self._nodes list
>>>>> 2014-01-16 16:53:13,873 cluster.py:727 - DEBUG - returning self._nodes
>>>>> = [<Node: master (i-7950f657)>, <Node: node001 (i-7a50f654)>]
>>>>> 2014-01-16 16:53:14,063 cluster.py:711 - DEBUG - existing nodes:
>>>>> {u'i-7a50f654': <Node: node001 (i-7a50f654)>, u'i-7950f657': <Node:
>>>>> master (i-7950f657)>}
>>>>> 2014-01-16 16:53:14,064 cluster.py:714 - DEBUG - updating existing
>>>>> node i-7a50f654 in self._nodes
>>>>> 2014-01-16 16:53:14,064 cluster.py:714 - DEBUG - updating existing
>>>>> node i-7950f657 in self._nodes
>>>>> 2014-01-16 16:53:14,064 cluster.py:727 - DEBUG - returning self._nodes
>>>>> = [<Node: master (i-7950f657)>, <Node: node001 (i-7a50f654)>]
>>>>> 2014-01-16 16:53:14,168 node.py:1039 - DEBUG - Using native OpenSSH client
>>>>> 2014-01-16 16:53:14,169 node.py:1050 - DEBUG - ssh_cmd: ssh -i
>>>>> /home/rsignell/.ssh/mykey2.rsa
>>>>> root@ec2-54-204-55-67.compute-1.amazonaws.com
>>>>> Connection closed by 54.204.55.67
>>>>>
>>>>>
>>>>> I didn't see any "common problems" or "troubleshooting" sections in
>>>>> the starcluster documentation, and I checked the FAQ and the mailing
>>>>> list archives, but I probably overlooked something, as this certainly
>>>>> seems like a newbie question (which I am).
>>>>>
>>>>> Thanks,
>>>>> Rich
>>>>> --
>>>>> Dr. Richard P. Signell (508) 457-2229
>>>>> USGS, 384 Woods Hole Rd.
>>>>> Woods Hole, MA 02543-1598
>>>>> _______________________________________________
>>>>> StarCluster mailing list
>>>>> StarCluster_at_mit.edu
>>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>
>>>
>>>
>>> --
>>> Dr. Richard P. Signell (508) 457-2229
>>> USGS, 384 Woods Hole Rd.
>>> Woods Hole, MA 02543-1598
>
>
>
> --
> Dr. Richard P. Signell (508) 457-2229
> USGS, 384 Woods Hole Rd.
> Woods Hole, MA 02543-1598
Received on Fri Jan 17 2014 - 13:18:46 EST
This archive was generated by hypermail 2.3.0.
