StarCluster - Mailing List Archive

Re: Crash report

From: Justin Riley <no email>
Date: Wed, 09 Nov 2011 16:31:11 -0500

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Sumita,

Before I respond, I want to remind you that a new version of
StarCluster, 0.92.1[1], is out. Please upgrade at your earliest
convenience: among other useful new features[2], it adds additional
info to the crash reports that is useful when debugging issues.

With that said, the creation of the placement group for your
cc1.4xlarge cluster failed for some unknown reason. This is most likely
due to some flakiness with EC2 at the moment, given that StarCluster
doesn't do anything fancy when creating the placement group - it just
uses boto to make a CreatePlacementGroup EC2 API call[3] directly.
FWIW, I just tested by creating a new placement group with the same
'smallcluster9' cluster name and it worked fine for me...

Without a detailed error it's hard to figure out what the proper
course of action is besides just retrying the failed 'start' command.
Unfortunately, Amazon's error reporting for this particular failure is
not really documented very well (hardly at all for that matter). From
my experience it just gives you a 'success' or 'failure' True/False
response. Fortunately, a failure at this point means no instances were
launched so it doesn't cost you anything in terms of instance hours.
Simply 'terminate' the failed cluster and rerun the 'start' command
you used previously.

I've updated the latest git code[4] to better communicate the error
when a placement group creation failure occurs. The latest git code
will also log the error status returned by boto, so if it happens again
we can inspect it in more detail. However, to my knowledge, this error
status is only a True/False value, which is not very useful in
determining *what* went wrong and whether or not starcluster can
'just fix it' internally.
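
To illustrate the kind of check described above, here is a minimal
sketch. It is not the actual StarCluster code. It assumes the create
call returns a plain True/False status, as described above; the names
PlacementGroupError, create_placement_group_checked and the stub
connection FakeConn are hypothetical and exist only for this example.

```python
import logging

log = logging.getLogger("placement-sketch")


class PlacementGroupError(Exception):
    """Hypothetical error raised when the create call reports failure."""


def create_placement_group_checked(conn, name):
    """Create a placement group and fail loudly on a falsy status.

    `conn` is assumed to expose create_placement_group(name) and to
    return a True/False status; that is an assumption of this sketch.
    """
    status = conn.create_placement_group(name)
    # Log whatever came back so it can be inspected later.
    log.debug("create_placement_group(%r) returned: %r", name, status)
    if not status:
        raise PlacementGroupError(
            "failed to create placement group %r (status: %r)"
            % (name, status))
    return status


class FakeConn(object):
    """Stand-in for a real connection so the sketch runs offline."""

    def __init__(self, result):
        self.result = result

    def create_placement_group(self, name):
        return self.result


if __name__ == "__main__":
    print(create_placement_group_checked(FakeConn(True), "example-group"))
    try:
        create_placement_group_checked(FakeConn(False), "example-group")
    except PlacementGroupError as exc:
        print("error: %s" % exc)
```

Raising right where the status comes back would surface the failure
with a readable message, instead of the later AttributeError on a None
placement group seen in the crash report.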

If you're able to reliably reproduce this issue I would recommend
inspecting the error a little more closely using the development shell:

$ sudo easy_install ipython
$ starcluster shell
[~]> status = ec2.conn.create_placement_group('@sc-smallcluster9')
[~]> print status

This will use boto directly to create the placement group and return
the status. If you can get useful error messages from the above
procedure please reply and send the output.

HTH,

~Justin

[1] http://web.mit.edu/starcluster/docs/latest/
[2] http://web.mit.edu/starcluster/docs/latest/changelog.html#features
[3] http://docs.amazonwebservices.com/AWSEC2/latest/APIReference/ApiReference-query-CreatePlacementGroup.html
[4] https://github.com/jtriley/StarCluster

On 11/09/2011 11:38 AM, Sumita Sinha wrote:
> Hi,
>
> I was creating a 10-node cluster with EBS-type nodes. It crashed
> twice. Please find the crash report below.
>
>
> 2011-11-09 16:32:45,749 PID: 28520 config.py:524 - DEBUG - Loading config
> 2011-11-09 16:32:45,751 PID: 28520 config.py:118 - DEBUG - Loading file: /root/.starcluster/config
> 2011-11-09 16:32:45,753 PID: 28520 config.py:524 - DEBUG - Loading config
> 2011-11-09 16:32:45,754 PID: 28520 config.py:118 - DEBUG - Loading file: /root/.starcluster/config
> 2011-11-09 16:32:45,756 PID: 28520 awsutils.py:55 - DEBUG - creating self._conn w/ connection_authenticator kwargs = {'path': '/', 'region': None, 'port': None, 'is_secure': True}
> 2011-11-09 16:32:45,834 PID: 28520 start.py:179 - INFO - Using default cluster template: smallcluster
> 2011-11-09 16:32:45,834 PID: 28520 cluster.py:1487 - INFO - Validating cluster template settings...
> 2011-11-09 16:32:46,136 PID: 28520 cluster.py:888 - DEBUG - Launch map: node001 (ami: ami-12b6477b, type: cc1.4xlarge)...
> 2011-11-09 16:32:46,137 PID: 28520 cluster.py:888 - DEBUG - Launch map: node002 (ami: ami-12b6477b, type: cc1.4xlarge)...
> 2011-11-09 16:32:46,137 PID: 28520 cluster.py:888 - DEBUG - Launch map: node003 (ami: ami-12b6477b, type: cc1.4xlarge)...
> 2011-11-09 16:32:46,137 PID: 28520 cluster.py:888 - DEBUG - Launch map: node004 (ami: ami-12b6477b, type: cc1.4xlarge)...
> 2011-11-09 16:32:46,137 PID: 28520 cluster.py:888 - DEBUG - Launch map: node005 (ami: ami-12b6477b, type: cc1.4xlarge)...
> 2011-11-09 16:32:46,137 PID: 28520 cluster.py:888 - DEBUG - Launch map: node006 (ami: ami-12b6477b, type: cc1.4xlarge)...
> 2011-11-09 16:32:46,138 PID: 28520 cluster.py:888 - DEBUG - Launch map: node007 (ami: ami-12b6477b, type: cc1.4xlarge)...
> 2011-11-09 16:32:46,138 PID: 28520 cluster.py:888 - DEBUG - Launch map: node008 (ami: ami-12b6477b, type: cc1.4xlarge)...
> 2011-11-09 16:32:46,138 PID: 28520 cluster.py:888 - DEBUG - Launch map: node009 (ami: ami-12b6477b, type: cc1.4xlarge)...
> 2011-11-09 16:32:46,183 PID: 28520 cluster.py:1502 - INFO - Cluster template settings are valid
> 2011-11-09 16:32:46,184 PID: 28520 cluster.py:1387 - INFO - Starting cluster...
> 2011-11-09 16:32:46,184 PID: 28520 cluster.py:914 - INFO - Launching a 10-node cluster...
> 2011-11-09 16:32:46,184 PID: 28520 cluster.py:888 - DEBUG - Launch map: node001 (ami: ami-12b6477b, type: cc1.4xlarge)...
> 2011-11-09 16:32:46,184 PID: 28520 cluster.py:888 - DEBUG - Launch map: node002 (ami: ami-12b6477b, type: cc1.4xlarge)...
> 2011-11-09 16:32:46,185 PID: 28520 cluster.py:888 - DEBUG - Launch map: node003 (ami: ami-12b6477b, type: cc1.4xlarge)...
> 2011-11-09 16:32:46,185 PID: 28520 cluster.py:888 - DEBUG - Launch map: node004 (ami: ami-12b6477b, type: cc1.4xlarge)...
> 2011-11-09 16:32:46,185 PID: 28520 cluster.py:888 - DEBUG - Launch map: node005 (ami: ami-12b6477b, type: cc1.4xlarge)...
> 2011-11-09 16:32:46,185 PID: 28520 cluster.py:888 - DEBUG - Launch map: node006 (ami: ami-12b6477b, type: cc1.4xlarge)...
> 2011-11-09 16:32:46,185 PID: 28520 cluster.py:888 - DEBUG - Launch map: node007 (ami: ami-12b6477b, type: cc1.4xlarge)...
> 2011-11-09 16:32:46,186 PID: 28520 cluster.py:888 - DEBUG - Launch map: node008 (ami: ami-12b6477b, type: cc1.4xlarge)...
> 2011-11-09 16:32:46,186 PID: 28520 cluster.py:888 - DEBUG - Launch map: node009 (ami: ami-12b6477b, type: cc1.4xlarge)...
> 2011-11-09 16:32:46,186 PID: 28520 cluster.py:941 - DEBUG - Launching master (ami: ami-12b6477b, type: cc1.4xlarge)
> 2011-11-09 16:32:46,186 PID: 28520 cluster.py:941 - DEBUG - Launching node001 (ami: ami-12b6477b, type: cc1.4xlarge)
> 2011-11-09 16:32:46,186 PID: 28520 cluster.py:941 - DEBUG - Launching node002 (ami: ami-12b6477b, type: cc1.4xlarge)
> 2011-11-09 16:32:46,187 PID: 28520 cluster.py:941 - DEBUG - Launching node003 (ami: ami-12b6477b, type: cc1.4xlarge)
> 2011-11-09 16:32:46,187 PID: 28520 cluster.py:941 - DEBUG - Launching node004 (ami: ami-12b6477b, type: cc1.4xlarge)
> 2011-11-09 16:32:46,187 PID: 28520 cluster.py:941 - DEBUG - Launching node005 (ami: ami-12b6477b, type: cc1.4xlarge)
> 2011-11-09 16:32:46,187 PID: 28520 cluster.py:941 - DEBUG - Launching node006 (ami: ami-12b6477b, type: cc1.4xlarge)
> 2011-11-09 16:32:46,187 PID: 28520 cluster.py:941 - DEBUG - Launching node007 (ami: ami-12b6477b, type: cc1.4xlarge)
> 2011-11-09 16:32:46,187 PID: 28520 cluster.py:941 - DEBUG - Launching node008 (ami: ami-12b6477b, type: cc1.4xlarge)
> 2011-11-09 16:32:46,188 PID: 28520 cluster.py:941 - DEBUG - Launching node009 (ami: ami-12b6477b, type: cc1.4xlarge)
> 2011-11-09 16:32:46,247 PID: 28520 awsutils.py:160 - INFO - Creating security group @sc-smallcluster9...
> 2011-11-09 16:32:46,932 PID: 28520 awsutils.py:270 - INFO - Creating placement group @sc-smallcluster9...
> 2011-11-09 16:32:47,083 PID: 28520 cli.py:189 - DEBUG - Traceback (most recent call last):
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.92-py2.6.egg/starcluster/cli.py", line 157, in main
>     sc.execute(args)
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.92-py2.6.egg/starcluster/commands/start.py", line 193, in execute
>     validate_running=validate_running)
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.92-py2.6.egg/starcluster/cluster.py", line 1374, in start
>     return self._start(create=create, create_only=create_only)
>   File "<string>", line 2, in _start
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.92-py2.6.egg/starcluster/utils.py", line 86, in wrap_f
>     res = func(*arg, **kargs)
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.92-py2.6.egg/starcluster/cluster.py", line 1389, in _start
>     self.create_cluster()
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.92-py2.6.egg/starcluster/cluster.py", line 920, in create_cluster
>     self._create_flat_rate_cluster()
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.92-py2.6.egg/starcluster/cluster.py", line 945, in _create_flat_rate_cluster
>     force_flat=True)[0]
>   File "/usr/local/lib/python2.6/dist-packages/StarCluster-0.92-py2.6.egg/starcluster/cluster.py", line 735, in create_nodes
>     placement_group = self.placement_group.name
> AttributeError: 'NoneType' object has no attribute 'name'
>
>
> -- Regards Sumita Sinha
>
>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk668R8ACgkQ4llAkMfDcrkAXQCfcrwbYo2YotvNwIjgI9Rc8N+V
bngAn3U0hkJiUlE9CH3ppgC5DmdqHADq
=+l5S
-----END PGP SIGNATURE-----
Received on Wed Nov 09 2011 - 16:31:14 EST