StarCluster - Mailing List Archive

Re: "connection closed" when running "starcluster ssmaster"

From: Rayson Ho <no email>
Date: Fri, 17 Jan 2014 16:11:22 -0500

Hmm, it looks like some files on your master machine are corrupted, as
/opt/sge6 is also missing... You may want to reboot the instance. The
easiest way to do that without relying on StarCluster or other
services is to go to the AWS web management console
(https://aws.amazon.com/console/), sign in, select the "master"
instance, right-click it, and choose reboot.

It shouldn't take more than a few minutes to reboot the master
instance. However, if too many files are corrupted, the reboot will
fail. If you have important data on the instance, you will then need
to stop the instance and mount its EBS volume as a data partition on
another EC2 instance. (That's slightly more complicated...) On the
other hand, if it is a test cluster, it is much faster and cheaper to
destroy the cluster and create a new one.
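
(For reference, a rough sketch of that data-rescue path using the AWS
CLI; the volume ID, rescue instance ID, and device names below are
just placeholders to replace with your own, and the attached volume
may show up as /dev/xvdf or similar on the rescue instance:

% aws ec2 stop-instances --instance-ids i-7950f657
% aws ec2 detach-volume --volume-id vol-xxxxxxxx
% aws ec2 attach-volume --volume-id vol-xxxxxxxx --instance-id i-yyyyyyyy --device /dev/sdf
# then, on the rescue instance:
% sudo mkdir -p /mnt/rescue
% sudo mount /dev/xvdf /mnt/rescue

)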

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html


On Fri, Jan 17, 2014 at 3:54 PM, Signell, Richard <rsignell_at_usgs.gov> wrote:
> root_at_node001:~# qrsh -l h=master
> The program 'qrsh' is currently not installed. You can install it by typing:
> apt-get install gridengine-client
>
> I then tried installing gridengine-client, but I wasn't sure of all
> the choices, and got some worrisome output:
>
> setting default_transport: error
> setting relay_transport: error
> /etc/aliases does not exist, creating it.
> WARNING: /etc/aliases exists, but does not have a root alias.
>
>
> At the end, when I try the command, it still fails:
>
> root_at_node001:~# qrsh -l h=master
> error: cell directory "/opt/sge6/default" doesn't exist
>
>
> On Fri, Jan 17, 2014 at 1:18 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
>> On Fri, Jan 17, 2014 at 11:11 AM, Signell, Richard <rsignell_at_usgs.gov> wrote:
>>> Rayson,
>>>
>>> Okay, I tried your suggestion:
>>>
>>> 1. I can ssh to node001 just fine.
>>>
>>
>> That's good, so it means that the private key on your side works. And
>> on Amazon's side, the slave node also has a working public key.
>>
>>
>>> 2. I cannot ssh to master:
>>>
>> ...
>>>
>>> Does that give a clue?
>>
>> So it seems like the master node doesn't like your key. You can try:
>>
>> - First SSH into node001, and then run "ssh master".
>>
>> - And if the above still fails, try to use the Grid Engine mechanism
>> to get back into the master node. So, first SSH into node001. Then run
>> "qrsh -l h=master":
>>
>> root_at_node001:~# qrsh -l h=master
>> groups: cannot find name for group ID 20003
>> root_at_master:~#
>>
>> (Ignore the complaint about the extra Group ID (GID); it is an
>> additional GID added by Grid Engine.)
>>
>>
>>> [also, is it okay to have this type of discussion on this list]?
>>
>> As long as it is StarCluster, Amazon EC2, and Grid Engine related,
>> then I am happy to help! :-D
>>
>> Rayson
>>
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>>
>>
>>>
>>> -Rich
>>>
>>> On Fri, Jan 17, 2014 at 10:56 AM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
>>>> Did you overwrite your SSH private key
>>>> (/home/rsignell/.ssh/mykey2.rsa) with a new one?
>>>>
>>>> Also, can you run the SSH client directly from the command line with
>>>> verbose (-v) on and see if that gives you anything?
>>>>
>>>> Example:
>>>>
>>>> % ssh -v -i /home/rsignell/.ssh/mykey2.rsa
>>>> root_at_ec2-54-196-2-68.compute-1.amazonaws.com
>>>>
>>>> Rayson
>>>>
>>>> ==================================================
>>>> Open Grid Scheduler - The Official Open Source Grid Engine
>>>> http://gridscheduler.sourceforge.net/
>>>> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>>>>
>>>>
>>>> On Fri, Jan 17, 2014 at 8:03 AM, Signell, Richard <rsignell_at_usgs.gov> wrote:
>>>>> Rayson,
>>>>>
>>>>> I tried ssh'ing into node001 as you suggested, but the process just
>>>>> seems to hang. I waited 5 minutes and tried Ctrl-C and Ctrl-Z; nothing
>>>>> worked, so I finally killed the terminal.
>>>>>
>>>>> What should I try next?
>>>>>
>>>>> rsignell_at_gam:~$ starcluster -d sshnode rps_cluster node001
>>>>> StarCluster - (http://star.mit.edu/cluster) (v. 0.94.3)
>>>>> Software Tools for Academics and Researchers (STAR)
>>>>> Please submit bug reports to starcluster_at_mit.edu
>>>>>
>>>>> 2014-01-17 07:59:34,900 config.py:567 - DEBUG - Loading config
>>>>> 2014-01-17 07:59:34,900 config.py:138 - DEBUG - Loading file:
>>>>> /home/rsignell/.starcluster/config
>>>>> ...
>>>>>
>>>>> 2014-01-17 07:59:34,935 awsutils.py:74 - DEBUG - creating self._conn
>>>>> w/ connection_authenticator kwargs = {'proxy_user': None,
>>>>> 'proxy_pass': None, 'proxy_port': None, 'proxy': None, 'is_secure':
>>>>> True, 'path': '/', 'region': None, 'validate_certs': True, 'port':
>>>>> None}
>>>>> 2014-01-17 07:59:35,797 cluster.py:711 - DEBUG - existing nodes: {}
>>>>> 2014-01-17 07:59:35,797 cluster.py:719 - DEBUG - adding node
>>>>> i-7a50f654 to self._nodes list
>>>>> 2014-01-17 07:59:35,797 cluster.py:719 - DEBUG - adding node
>>>>> i-7950f657 to self._nodes list
>>>>> 2014-01-17 07:59:35,798 cluster.py:727 - DEBUG - returning self._nodes
>>>>> = [<Node: master (i-7950f657)>, <Node: node001 (i-7a50f654)>]
>>>>> 2014-01-17 07:59:35,905 cluster.py:711 - DEBUG - existing nodes:
>>>>> {u'i-7a50f654': <Node: node001 (i-7a50f654)>, u'i-7950f657': <Node:
>>>>> master (i-7950f657)>}
>>>>> 2014-01-17 07:59:35,906 cluster.py:714 - DEBUG - updating existing
>>>>> node i-7a50f654 in self._nodes
>>>>> 2014-01-17 07:59:35,906 cluster.py:714 - DEBUG - updating existing
>>>>> node i-7950f657 in self._nodes
>>>>> 2014-01-17 07:59:35,906 cluster.py:727 - DEBUG - returning self._nodes
>>>>> = [<Node: master (i-7950f657)>, <Node: node001 (i-7a50f654)>]
>>>>> 2014-01-17 07:59:36,119 node.py:1039 - DEBUG - Using native OpenSSH client
>>>>> 2014-01-17 07:59:36,119 node.py:1050 - DEBUG - ssh_cmd: ssh -i
>>>>> /home/rsignell/.ssh/mykey2.rsa
>>>>> root_at_ec2-54-196-2-68.compute-1.amazonaws.com
>>>>> [wait, wait.... nothing....]
>>>>>
>>>>>
>>>>> On Thu, Jan 16, 2014 at 5:24 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
>>>>>> The SSH daemon is responding (and the EC2 security group is not
>>>>>> blocking traffic), which is good.
>>>>>>
>>>>>> However, since logging onto the master was working a few hours ago
>>>>>> and is not anymore, try to log onto the Grid Engine execution node
>>>>>> by using, for example, "starcluster sshnode rps_cluster node001". If
>>>>>> SSHing into the execution node works, then it is likely an issue with
>>>>>> the StarCluster master instance.
>>>>>>
>>>>>> Rayson
>>>>>>
>>>>>> ==================================================
>>>>>> Open Grid Scheduler - The Official Open Source Grid Engine
>>>>>> http://gridscheduler.sourceforge.net/
>>>>>> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 16, 2014 at 4:55 PM, Signell, Richard <rsignell_at_usgs.gov> wrote:
>>>>>>> I set up a machine this morning and
>>>>>>> starcluster sshmaster rps_cluster
>>>>>>> was working fine to ssh in.
>>>>>>>
>>>>>>> But now I'm getting "Connection closed by 54.204.55.67"
>>>>>>>
>>>>>>> It seems that the cluster is running:
>>>>>>>
>>>>>>> rsignell_at_gam:~$ starcluster listclusters
>>>>>>> StarCluster - (http://star.mit.edu/cluster) (v. 0.94.3)
>>>>>>> Software Tools for Academics and Researchers (STAR)
>>>>>>> Please submit bug reports to starcluster_at_mit.edu
>>>>>>>
>>>>>>> ---------------------------------------------
>>>>>>> rps_cluster (security group: _at_sc-rps_cluster)
>>>>>>> ---------------------------------------------
>>>>>>> Launch time: 2014-01-16 08:18:09
>>>>>>> Uptime: 0 days, 08:34:07
>>>>>>> Zone: us-east-1a
>>>>>>> Keypair: mykey2
>>>>>>> EBS volumes: N/A
>>>>>>> Cluster nodes:
>>>>>>> master running i-7950f657 ec2-54-204-55-67.compute-1.amazonaws.com
>>>>>>> node001 running i-7a50f654 ec2-54-196-2-68.compute-1.amazonaws.com
>>>>>>> Total nodes: 2
>>>>>>>
>>>>>>> And I don't see anything obvious in the verbose debug output:
>>>>>>>
>>>>>>> rsignell_at_gam:~$ starcluster -d sshmaster rps_cluster
>>>>>>> StarCluster - (http://star.mit.edu/cluster) (v. 0.94.3)
>>>>>>> Software Tools for Academics and Researchers (STAR)
>>>>>>> Please submit bug reports to starcluster_at_mit.edu
>>>>>>>
>>>>>>> 2014-01-16 16:53:13,515 config.py:567 - DEBUG - Loading config
>>>>>>> 2014-01-16 16:53:13,515 config.py:138 - DEBUG - Loading file:
>>>>>>> /home/rsignell/.starcluster/config
>>>>>>> 2014-01-16 16:53:13,517 config.py:322 - DEBUG - include setting not
>>>>>>> specified. Defaulting to []
>>>>>>> 2014-01-16 16:53:13,518 config.py:322 - DEBUG - web_browser setting
>>>>>>> not specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,518 config.py:322 - DEBUG - refresh_interval
>>>>>>> setting not specified. Defaulting to 30
>>>>>>> 2014-01-16 16:53:13,518 config.py:322 - DEBUG - include setting not
>>>>>>> specified. Defaulting to []
>>>>>>> 2014-01-16 16:53:13,518 config.py:322 - DEBUG - web_browser setting
>>>>>>> not specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,519 config.py:322 - DEBUG - refresh_interval
>>>>>>> setting not specified. Defaulting to 30
>>>>>>> 2014-01-16 16:53:13,519 config.py:322 - DEBUG - aws_proxy_pass setting
>>>>>>> not specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,519 config.py:322 - DEBUG - aws_validate_certs
>>>>>>> setting not specified. Defaulting to True
>>>>>>> 2014-01-16 16:53:13,520 config.py:322 - DEBUG - aws_ec2_path setting
>>>>>>> not specified. Defaulting to /
>>>>>>> 2014-01-16 16:53:13,520 config.py:322 - DEBUG - aws_region_name
>>>>>>> setting not specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,521 config.py:322 - DEBUG - aws_region_host
>>>>>>> setting not specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,521 config.py:322 - DEBUG - aws_s3_path setting
>>>>>>> not specified. Defaulting to /
>>>>>>> 2014-01-16 16:53:13,521 config.py:322 - DEBUG - aws_proxy_user setting
>>>>>>> not specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,521 config.py:322 - DEBUG - aws_is_secure setting
>>>>>>> not specified. Defaulting to True
>>>>>>> 2014-01-16 16:53:13,522 config.py:322 - DEBUG - aws_s3_host setting
>>>>>>> not specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,522 config.py:322 - DEBUG - aws_port setting not
>>>>>>> specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,522 config.py:322 - DEBUG - ec2_private_key
>>>>>>> setting not specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,522 config.py:322 - DEBUG - ec2_cert setting not
>>>>>>> specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,523 config.py:322 - DEBUG - aws_proxy setting not
>>>>>>> specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,523 config.py:322 - DEBUG - aws_proxy_port setting
>>>>>>> not specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,523 config.py:322 - DEBUG - device setting not
>>>>>>> specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,523 config.py:322 - DEBUG - partition setting not
>>>>>>> specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,524 config.py:322 - DEBUG - device setting not
>>>>>>> specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,524 config.py:322 - DEBUG - partition setting not
>>>>>>> specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,525 config.py:322 - DEBUG - disable_queue setting
>>>>>>> not specified. Defaulting to False
>>>>>>> 2014-01-16 16:53:13,525 config.py:322 - DEBUG - volumes setting not
>>>>>>> specified. Defaulting to []
>>>>>>> 2014-01-16 16:53:13,525 config.py:322 - DEBUG - availability_zone
>>>>>>> setting not specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - spot_bid setting not
>>>>>>> specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - master_instance_type
>>>>>>> setting not specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - disable_cloudinit
>>>>>>> setting not specified. Defaulting to False
>>>>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - force_spot_master
>>>>>>> setting not specified. Defaulting to False
>>>>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - extends setting not
>>>>>>> specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,526 config.py:322 - DEBUG - master_image_id
>>>>>>> setting not specified. Defaulting to None
>>>>>>> 2014-01-16 16:53:13,527 config.py:322 - DEBUG - userdata_scripts
>>>>>>> setting not specified. Defaulting to []
>>>>>>> 2014-01-16 16:53:13,527 config.py:322 - DEBUG - permissions setting
>>>>>>> not specified. Defaulting to []
>>>>>>> 2014-01-16 16:53:13,529 awsutils.py:74 - DEBUG - creating self._conn
>>>>>>> w/ connection_authenticator kwargs = {'proxy_user': None,
>>>>>>> 'proxy_pass': None, 'proxy_port': None, 'proxy': None, 'is_secure':
>>>>>>> True, 'path': '/', 'region': None, 'validate_certs': True, 'port':
>>>>>>> None}
>>>>>>> 2014-01-16 16:53:13,872 cluster.py:711 - DEBUG - existing nodes: {}
>>>>>>> 2014-01-16 16:53:13,872 cluster.py:719 - DEBUG - adding node
>>>>>>> i-7a50f654 to self._nodes list
>>>>>>> 2014-01-16 16:53:13,873 cluster.py:719 - DEBUG - adding node
>>>>>>> i-7950f657 to self._nodes list
>>>>>>> 2014-01-16 16:53:13,873 cluster.py:727 - DEBUG - returning self._nodes
>>>>>>> = [<Node: master (i-7950f657)>, <Node: node001 (i-7a50f654)>]
>>>>>>> 2014-01-16 16:53:14,063 cluster.py:711 - DEBUG - existing nodes:
>>>>>>> {u'i-7a50f654': <Node: node001 (i-7a50f654)>, u'i-7950f657': <Node:
>>>>>>> master (i-7950f657)>}
>>>>>>> 2014-01-16 16:53:14,064 cluster.py:714 - DEBUG - updating existing
>>>>>>> node i-7a50f654 in self._nodes
>>>>>>> 2014-01-16 16:53:14,064 cluster.py:714 - DEBUG - updating existing
>>>>>>> node i-7950f657 in self._nodes
>>>>>>> 2014-01-16 16:53:14,064 cluster.py:727 - DEBUG - returning self._nodes
>>>>>>> = [<Node: master (i-7950f657)>, <Node: node001 (i-7a50f654)>]
>>>>>>> 2014-01-16 16:53:14,168 node.py:1039 - DEBUG - Using native OpenSSH client
>>>>>>> 2014-01-16 16:53:14,169 node.py:1050 - DEBUG - ssh_cmd: ssh -i
>>>>>>> /home/rsignell/.ssh/mykey2.rsa
>>>>>>> root_at_ec2-54-204-55-67.compute-1.amazonaws.com
>>>>>>> Connection closed by 54.204.55.67
>>>>>>>
>>>>>>>
>>>>>>> I didn't see any "common problems" or "troubleshooting" sections in
>>>>>>> the starcluster documentation, and I checked the FAQ and the mailing
>>>>>>> list archives, but I probably overlooked something, as this certainly
>>>>>>> seems like a newbie question (which I am).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Rich
>>>>>>> --
>>>>>>> Dr. Richard P. Signell (508) 457-2229
>>>>>>> USGS, 384 Woods Hole Rd.
>>>>>>> Woods Hole, MA 02543-1598
>>>>>>> _______________________________________________
>>>>>>> StarCluster mailing list
>>>>>>> StarCluster_at_mit.edu
>>>>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Dr. Richard P. Signell (508) 457-2229
>>>>> USGS, 384 Woods Hole Rd.
>>>>> Woods Hole, MA 02543-1598
>>>
>>>
>>>
>>> --
>>> Dr. Richard P. Signell (508) 457-2229
>>> USGS, 384 Woods Hole Rd.
>>> Woods Hole, MA 02543-1598
>
>
>
> --
> Dr. Richard P. Signell (508) 457-2229
> USGS, 384 Woods Hole Rd.
> Woods Hole, MA 02543-1598
Received on Fri Jan 17 2014 - 16:11:25 EST