StarCluster - Mailing List Archive

Spot Instance Startup Failure (Bug?)

From: MacMullan, Hugh <no email>
Date: Thu, 12 Apr 2012 17:30:14 +0000

Folks:



First: is there a good way to search the archives? I tried various Google queries to no good effect, and I hate to duplicate effort or messages.



More importantly: a possible bug? Sometimes when starting SPOT_BID clusters (~30% of the time?), 'start' apparently skips the "Waiting for open spot requests to become active..." step and processes only the master. When it works correctly, I see:



>>> Launching node001 (ami: ami-12b6477b, type: cc1.4xlarge)
SpotInstanceRequest:sir-9f38a214
>>> Launching node002 (ami: ami-12b6477b, type: cc1.4xlarge)
SpotInstanceRequest:sir-c4505a11
>>> Launching node003 (ami: ami-12b6477b, type: cc1.4xlarge)
SpotInstanceRequest:sir-cbb32414
>>> Waiting for cluster to come up... (updating every 20s)
>>> Waiting for open spot requests to become active...
0/3 | | 0%
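
For reference, the wait step shown in this output amounts to polling until every open spot request is active. Below is a minimal Python sketch of that idea. It is not StarCluster's actual code: the function name wait_for_active and the fetch_states callable (which takes a list of request ids and returns a dict of id to state) are hypothetical, and the timeout value is an assumption. Only the 20s interval comes from the REFRESH_INTERVAL setting in my config.

```python
import time


def wait_for_active(fetch_states, request_ids, interval=20, timeout=600,
                    sleep=time.sleep):
    """Poll until every spot request reports the 'active' state.

    fetch_states is a hypothetical callable: list of ids -> {id: state}.
    Returns the final state map, or raises RuntimeError on timeout.
    """
    waited = 0
    while True:
        states = fetch_states(request_ids)
        active = [r for r in request_ids if states.get(r) == "active"]
        print("%d/%d spot requests active" % (len(active), len(request_ids)))
        if len(active) == len(request_ids):
            return states
        if waited >= timeout:
            pending = sorted(set(request_ids) - set(active))
            raise RuntimeError("timed out; still waiting on %s" % pending)
        sleep(interval)
        waited += interval
```

The point of the sketch is that the loop must be given the full list of request ids up front. If that list comes back empty or short, the wait finishes immediately, which would look much like the failure described here.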



When it doesn't work correctly, I see the following: it skips the "Waiting for open spot requests to become active..." step shown above, goes straight to 'Waiting for all nodes', and the count is /1 instead of /4 (or whatever the CLUSTER_SIZE is).



# starcluster start -c spottest spottest
StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.3)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster_at_mit.edu

>>> Validating cluster template settings...
>>> Cluster template settings are valid
>>> Starting cluster...
>>> Launching a 4-node cluster...
>>> Launching master node (ami: ami-12b6477b, type: cc1.4xlarge)...
>>> Creating security group @sc-spottest...
>>> Opening tcp port range 22-22 for CIDR XXXXXXXXXX/22
>>> Creating placement group @sc-spottest...
Reservation:r-02fbac61
>>> Launching node001 (ami: ami-12b6477b, type: cc1.4xlarge)
SpotInstanceRequest:sir-6cb0f014
>>> Launching node002 (ami: ami-12b6477b, type: cc1.4xlarge)
SpotInstanceRequest:sir-b0ff9e11
>>> Launching node003 (ami: ami-12b6477b, type: cc1.4xlarge)
SpotInstanceRequest:sir-2ef6f814
>>> Waiting for cluster to come up... (updating every 20s)
>>> Waiting for all nodes to be in a 'running' state...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for SSH to come up on all nodes...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for cluster to come up took 3.547 mins
>>> The master node is ec2-184-72-156-11.compute-1.amazonaws.com
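
Since the failure is intermittent, it may help to check a saved copy of the 'start' output automatically. Here is a rough Python sketch that flags a run in which spot requests were submitted but the "Waiting for open spot requests" line never appeared, and that collects the node totals from the progress lines. The function name diagnose and the regular expressions are my own assumptions, based only on the output format shown in this message.

```python
import re

SPOT_WAIT = "Waiting for open spot requests to become active"
# Progress lines look like "1/1 ||||| 100%"; capture done/total.
PROGRESS = re.compile(r"^\s*(\d+)/(\d+)\s*\|")
CLUSTER = re.compile(r"Launching a (\d+)-node cluster")


def diagnose(log_text):
    """Summarise a saved 'starcluster start' log (hypothetical helper)."""
    lines = log_text.splitlines()
    match = CLUSTER.search(log_text)
    expected = int(match.group(1)) if match else None
    spot_requests = sum(
        1 for line in lines if line.strip().startswith("SpotInstanceRequest:")
    )
    node_totals = []
    in_node_wait = False
    for line in lines:
        if "Waiting for all nodes" in line:
            in_node_wait = True
        progress = PROGRESS.match(line)
        # Only count progress bars printed after the node wait begins.
        if progress and in_node_wait:
            node_totals.append(int(progress.group(2)))
    return {
        "expected_nodes": expected,
        "spot_requests": spot_requests,
        "skipped_spot_wait": spot_requests > 0 and SPOT_WAIT not in log_text,
        "node_totals": node_totals,
    }
```

On the failing run above, this should report 4 expected nodes, 3 spot requests, the spot wait as skipped, and node totals of 1 rather than 4.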



I've only tried this with the 'bigger' instance types (cc1 & cc2), so I don't know whether that has any bearing on the situation. My config:



[global]
DEFAULT_TEMPLATE=Rcluster
ENABLE_EXPERIMENTAL=True
REFRESH_INTERVAL=20

[aws info]
AWS_ACCESS_KEY_ID = XXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY = XXXXXXXXXXXXX
AWS_USER_ID = XXXXXXXXXXX
EC2_CERT = XXXXXXXXXXX.pem
EC2_PRIVATE_KEY = XXXXXXXXXXXXX.pem

[key mykey]
KEY_LOCATION=XXXXXXXXXXXXXXX.pem

[cluster spottest]
KEYNAME = mykey
CLUSTER_SIZE = 4
CLUSTER_USER = sgeadmin
CLUSTER_SHELL = bash
NODE_IMAGE_ID = ami-12b6477b
NODE_INSTANCE_TYPE = cc1.4xlarge
AVAILABILITY_ZONE = us-east-1c
VOLUMES = Rlocal-spottest
PLUGINS = setup-centos
PERMISSIONS = ssh-local
SPOT_BID = 1.50

[volume Rlocal-spottest]
VOLUME_ID = vol-XXXXXXXXXX
MOUNT_PATH = /usr/local

[plugin setup-centos]
setup_class = setup-centos.PackageInstaller
pkg_to_install = R

[permission ssh-local]
protocol = tcp
from_port = 22
to_port = 22
cidr_ip = XXXXXXXXXXX/22



This exact config works sometimes, other times not. Thanks for listening, or any advice you might have.

-Hugh
Received on Thu Apr 12 2012 - 13:30:16 EDT