StarCluster - Mailing List Archive

Re: crash report

From: Justin Riley <no email>
Date: Wed, 17 Oct 2012 13:57:27 -0400

Hi Olgert,

Thanks for reporting this. For some reason the SSH connection dropped,
and unfortunately the code doesn't handle that well, as you
experienced. I need to implement some sort of auto-reconnect (with a
cap on the number of retries) to fix this; a rough sketch of the idea
follows.
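
Something like this, where retry_ssh, "operation", and "reconnect" are
hypothetical names (a minimal sketch only, not actual StarCluster
code):

    import time

    from ssh import SSHException  # the 'ssh' paramiko fork used by 0.93.3

    def retry_ssh(operation, reconnect, max_retries=3, delay=5):
        """Run an SSH operation; if the server connection drops,
        reconnect and retry, giving up after max_retries attempts."""
        for attempt in range(max_retries + 1):
            try:
                return operation()
            except SSHException:
                if attempt == max_retries:
                    raise  # cap reached, re-raise the original error
                time.sleep(delay)
                reconnect()  # re-establish the transport/sftp session

The SSH calls (e.g. the sftp stat in your traceback) would then go
through a wrapper like that instead of failing outright.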

From your log it looks like the volume was created but never resized.
You have a couple of options in that case (a rough sketch of both
follows the list):

1. Delete the new volume and auto-generated snapshot of the original
volume and try again. No need to terminate the volhost cluster in this
case - it will be reused on the next launch.

2. If you still have the new 500GB volume, you can simply attach it to
an instance and run resize2fs on the device to finish the job. It's
also safe to delete the auto-generated snapshot of the original volume.
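
For reference, both options boil down to a few boto calls. A rough
sketch (the helper names are mine; it assumes your AWS credentials are
available in the environment, and the IDs are taken from your log):

    import boto

    conn = boto.connect_ec2()  # reads AWS credentials from the environment

    def cleanup_and_retry():
        """Option 1: delete the new volume and the auto-generated
        snapshot, then re-run 'starcluster resizevolume vol-5ae88a21 500'."""
        conn.delete_volume('vol-17146f6c')     # the new, never-resized volume
        conn.delete_snapshot('snap-01e56e73')  # snapshot of the original volume

    def finish_by_hand():
        """Option 2: attach the new volume to the still-running volume
        host, then grow the filesystem on the instance itself."""
        conn.attach_volume('vol-17146f6c', 'i-fa74fb80', '/dev/sdz')
        # then, as root on the instance (the device may appear as
        # /dev/xvdz depending on the kernel):
        #   e2fsck -f /dev/xvdz && resize2fs /dev/xvdz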

HTH,

~Justin

On 09/07/2012 11:08 AM, Denas, Olgert wrote:
> I encountered this while running starcluster.
>
> best, olgert
>
> [odenas@res5 ~]$ cat /home/users/odenas/.starcluster/logs/crash-report-29370.txt
> ---------- CRASH DETAILS ----------
> COMMAND: starcluster resizevolume vol-5ae88a21 500
> 2012-09-07 09:12:36,618 PID: 29370 config.py:551 - DEBUG - Loading config
> 2012-09-07 09:12:36,618 PID: 29370 config.py:118 - DEBUG - Loading file: /home/users/odenas/.starcluster/config
> 2012-09-07 09:12:36,621 PID: 29370 awsutils.py:54 - DEBUG - creating self._conn w/ connection_authenticator kwargs = {'proxy_user': None, 'proxy_pass': None, 'proxy_port': None, 'proxy': None, 'is_secure': True, 'path': '/', 'region': None, 'port': None}
> 2012-09-07 09:12:36,799 PID: 29370 createvolume.py:77 - INFO - No keypair specified, picking one from config...
> 2012-09-07 09:12:36,873 PID: 29370 createvolume.py:83 - INFO - Using keypair: reskey
> 2012-09-07 09:12:36,951 PID: 29370 cluster.py:664 - DEBUG - existing nodes: {}
> 2012-09-07 09:12:36,951 PID: 29370 cluster.py:680 - DEBUG - returning self._nodes = []
> 2012-09-07 09:12:37,127 PID: 29370 awsutils.py:165 - INFO - Creating security group @sc-volumecreator...
> 2012-09-07 09:12:38,074 PID: 29370 volume.py:77 - INFO - No instance in group @sc-volumecreator for zone us-east-1c, launching one now.
> 2012-09-07 09:12:38,604 PID: 29370 cluster.py:772 - INFO - Reservation:r-ff05b698
> 2012-09-07 09:12:38,604 PID: 29370 cluster.py:1235 - INFO - Waiting for volume host to come up... (updating every 30s)
> 2012-09-07 09:12:39,424 PID: 29370 cluster.py:664 - DEBUG - existing nodes: {}
> 2012-09-07 09:12:39,424 PID: 29370 cluster.py:672 - DEBUG - adding node i-fa74fb80 to self._nodes list
> 2012-09-07 09:12:39,822 PID: 29370 cluster.py:680 - DEBUG - returning self._nodes = [<Node: volhost-us-east-1c (i-fa74fb80)>]
> 2012-09-07 09:12:39,822 PID: 29370 cluster.py:1193 - INFO - Waiting for all nodes to be in a 'running' state...
> 2012-09-07 09:12:39,925 PID: 29370 cluster.py:664 - DEBUG - existing nodes: {u'i-fa74fb80': <Node: volhost-us-east-1c (i-fa74fb80)>}
> 2012-09-07 09:12:39,926 PID: 29370 cluster.py:667 - DEBUG - updating existing node i-fa74fb80 in self._nodes
> 2012-09-07 09:12:39,926 PID: 29370 cluster.py:680 - DEBUG - returning self._nodes = [<Node: volhost-us-east-1c (i-fa74fb80)>]
> 2012-09-07 09:13:10,107 PID: 29370 cluster.py:664 - DEBUG - existing nodes: {u'i-fa74fb80': <Node: volhost-us-east-1c (i-fa74fb80)>}
> 2012-09-07 09:13:10,107 PID: 29370 cluster.py:667 - DEBUG - updating existing node i-fa74fb80 in self._nodes
> 2012-09-07 09:13:10,108 PID: 29370 cluster.py:680 - DEBUG - returning self._nodes = [<Node: volhost-us-east-1c (i-fa74fb80)>]
> 2012-09-07 09:13:10,108 PID: 29370 cluster.py:1211 - INFO - Waiting for SSH to come up on all nodes...
> 2012-09-07 09:13:10,204 PID: 29370 cluster.py:664 - DEBUG - existing nodes: {u'i-fa74fb80': <Node: volhost-us-east-1c (i-fa74fb80)>}
> 2012-09-07 09:13:10,204 PID: 29370 cluster.py:667 - DEBUG - updating existing node i-fa74fb80 in self._nodes
> 2012-09-07 09:13:10,204 PID: 29370 cluster.py:680 - DEBUG - returning self._nodes = [<Node: volhost-us-east-1c (i-fa74fb80)>]
> 2012-09-07 09:13:10,290 PID: 29370 __init__.py:75 - DEBUG - loading private key /home/users/odenas/.ssh/reskey.rsa
> 2012-09-07 09:13:10,291 PID: 29370 __init__.py:167 - DEBUG - Using private key /home/users/odenas/.ssh/reskey.rsa (rsa)
> 2012-09-07 09:13:10,291 PID: 29370 __init__.py:97 - DEBUG - connecting to host ec2-50-19-35-106.compute-1.amazonaws.com on port 22 as user root
> 2012-09-07 09:13:49,512 PID: 29370 cluster.py:664 - DEBUG - existing nodes: {u'i-fa74fb80': <Node: volhost-us-east-1c (i-fa74fb80)>}
> 2012-09-07 09:13:49,513 PID: 29370 cluster.py:667 - DEBUG - updating existing node i-fa74fb80 in self._nodes
> 2012-09-07 09:13:49,513 PID: 29370 cluster.py:680 - DEBUG - returning self._nodes = [<Node: volhost-us-east-1c (i-fa74fb80)>]
> 2012-09-07 09:13:49,600 PID: 29370 __init__.py:97 - DEBUG - connecting to host ec2-50-19-35-106.compute-1.amazonaws.com on port 22 as user root
> 2012-09-07 09:13:50,015 PID: 29370 __init__.py:186 - DEBUG - creating sftp connection
> 2012-09-07 09:13:50,927 PID: 29370 utils.py:93 - INFO - Waiting for cluster to come up took 1.205 mins
> 2012-09-07 09:13:51,010 PID: 29370 cluster.py:664 - DEBUG - existing nodes: {u'i-fa74fb80': <Node: volhost-us-east-1c (i-fa74fb80)>}
> 2012-09-07 09:13:51,010 PID: 29370 cluster.py:667 - DEBUG - updating existing node i-fa74fb80 in self._nodes
> 2012-09-07 09:13:51,010 PID: 29370 cluster.py:680 - DEBUG - returning self._nodes = [<Node: volhost-us-east-1c (i-fa74fb80)>]
> 2012-09-07 09:13:51,011 PID: 29370 volume.py:187 - INFO - Checking for required remote commands...
> 2012-09-07 09:13:51,072 PID: 29370 __init__.py:543 - DEBUG - /sbin/resize2fs
> 2012-09-07 09:13:51,072 PID: 29370 awsutils.py:1019 - INFO - Creating snapshot of volume: vol-5ae88a21
> 2012-09-07 09:13:52,184 PID: 29370 awsutils.py:1002 - INFO - Waiting for snapshot to complete: snap-01e56e73
> 2012-09-07 10:58:14,046 PID: 29370 volume.py:95 - INFO - Creating 500GB volume in zone us-east-1c from snapshot snap-01e56e73
> 2012-09-07 10:58:15,180 PID: 29370 volume.py:97 - INFO - New volume id: vol-17146f6c
> 2012-09-07 10:58:15,180 PID: 29370 cluster.py:1140 - INFO - Waiting for new volume to become 'available'...
> 2012-09-07 10:58:21,298 PID: 29370 cluster.py:1140 - INFO - Attaching volume vol-17146f6c to instance i-fa74fb80...
> 2012-09-07 10:58:32,634 PID: 29370 volume.py:222 - WARNING - There are still volume hosts running: i-fa74fb80
> 2012-09-07 10:58:32,635 PID: 29370 volume.py:225 - WARNING - Run 'starcluster terminate volumecreator' to terminate *all* volume host instances once they're no longer needed
> 2012-09-07 10:58:32,659 PID: 29370 cli.py:287 - DEBUG - Traceback (most recent call last):
>   File "/home/users/odenas/.local/lib/python2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/cli.py", line 255, in main
>     sc.execute(args)
>   File "/home/users/odenas/.local/lib/python2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/commands/resizevolume.py", line 86, in execute
>     new_volid = vc.resize(vol, size, dest_zone=self.opts.dest_zone)
>   File "/home/users/odenas/.local/lib/python2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/volume.py", line 316, in resize
>     device = self._get_volume_device()
>   File "/home/users/odenas/.local/lib/python2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/volume.py", line 117, in _get_volume_device
>     if inst.ssh.path_exists(dev):
>   File "/home/users/odenas/.local/lib/python2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/sshutils/__init__.py", line 306, in path_exists
>     self.stat(path)
>   File "/home/users/odenas/.local/lib/python2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/sshutils/__init__.py", line 371, in stat
>     return self.sftp.stat(path)
>   File "build/bdist.linux-x86_64/egg/ssh/sftp_client.py", line 337, in stat
>     t, msg = self._request(CMD_STAT, path)
>   File "build/bdist.linux-x86_64/egg/ssh/sftp_client.py", line 635, in _request
>     return self._read_response(num)
>   File "build/bdist.linux-x86_64/egg/ssh/sftp_client.py", line 667, in _read_response
>     raise SSHException('Server connection dropped: %s' % (str(e),))
> SSHException: Server connection dropped:
>
> ---------- SYSTEM INFO ----------
> StarCluster: 0.93.3
> Python: 2.7rc1 (r27rc1:81772, Jun 9 2010, 17:29:16) [GCC 4.1.2 20080704 (Red Hat 4.1.2-46)]
> Platform: Linux-2.6.18-274.el5-x86_64-with-redhat-5.8-Tikanga
> boto: 2.3.0
> ssh: 1.7.13
> Crypto: 2.6
> jinja2: 2.6
> decorator: 3.3.1
