StarCluster - Mailing List Archive

Re: Spot Instance Startup Failure (Bug?)

From: Justin Riley <no email>
Date: Fri, 20 Apr 2012 10:08:07 -0400

Hi Hugh,

After many tries I was able to reproduce this and can confirm it's a
transient issue related to polling for spot requests too quickly.

I'm working on a patch now, but in the meantime, if this happens, simply
press CTRL-C to interrupt the 'start' command and then run the same start
command again with the -x option. The second run should work as expected,
since by then enough time will have passed for the spot instance requests
to become visible.
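
To illustrate what "polling too quickly" means here, below is a minimal
Python sketch of the general idea. It is not the actual patch and not
StarCluster code. It keeps re-checking, with a growing delay, until every
spot request ID that was just submitted shows up in the listing, instead
of trusting the first (possibly empty) answer. The function name
wait_until_visible and the list_request_ids callable are hypothetical
stand-ins for whatever call lists spot requests; the interval and timeout
values are assumptions chosen only for illustration.

```python
import time


def wait_until_visible(list_request_ids, expected_ids,
                       interval=1.0, max_wait=60.0, sleep=time.sleep):
    """Poll until every expected spot request ID is listed.

    list_request_ids is a hypothetical callable that returns the spot
    request IDs currently visible; it stands in for a real API call and
    is NOT a StarCluster or boto function.
    """
    expected = set(expected_ids)
    waited = 0.0
    while True:
        visible = set(list_request_ids())
        missing = expected - visible
        if not missing:
            # All submitted requests are now visible; safe to proceed.
            return sorted(expected)
        if waited >= max_wait:
            raise TimeoutError(
                "spot requests never became visible: %s" % sorted(missing))
        # Wait before asking again, doubling the delay up to a 20s cap.
        sleep(interval)
        waited += interval
        interval = min(interval * 2, 20.0)
```

With a check like this, a listing that comes back empty right after the
requests are submitted is treated as "not visible yet" rather than "no
spot requests", which is the case that makes 'start' skip ahead to the
master.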

I've created an issue to keep track of this:

http://web.mit.edu/star/cluster/issues/105

~Justin

On 4/12/12 1:30 PM, MacMullan, Hugh wrote:
>
> Folks:
>
>
>
> First: any good way to search the archives? I tried various Google
> search strings to no good effect. I hate to duplicate effort/messages.
>
>
>
> More importantly: A possible bug? Sometimes when starting SPOT_BID
> clusters (~30% of the time?) I'm seeing 'start' skip (apparently)
> 'Waiting for open spot requests to become active...' and just process
> the master. When it works correctly, I see:
>
>
>
> >>> Launching node001 (ami: ami-12b6477b, type: cc1.4xlarge)
>
> SpotInstanceRequest:sir-9f38a214
>
> >>> Launching node002 (ami: ami-12b6477b, type: cc1.4xlarge)
>
> SpotInstanceRequest:sir-c4505a11
>
> >>> Launching node003 (ami: ami-12b6477b, type: cc1.4xlarge)
>
> SpotInstanceRequest:sir-cbb32414
>
> >>> Waiting for cluster to come up... (updating every 20s)
>
> >>> Waiting for open spot requests to become active...
>
> 0/3 | | 0%
>
>
>
> When it doesn't work correctly, I see the following, where it skips the
> 'Waiting for open spot requests to become active...' step shown above
> and goes straight to 'Waiting for all nodes', and the count is /1
> instead of /4 (or whatever the CLUSTER_SIZE is).
>
>
>
> # starcluster start -c spottest spottest
>
> StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.3)
>
> Software Tools for Academics and Researchers (STAR)
>
> Please submit bug reports to starcluster_at_mit.edu
>
>
>
> >>> Validating cluster template settings...
>
> >>> Cluster template settings are valid
>
> >>> Starting cluster...
>
> >>> Launching a 4-node cluster...
>
> >>> Launching master node (ami: ami-12b6477b, type: cc1.4xlarge)...
>
> >>> Creating security group @sc-spottest...
>
> >>> Opening tcp port range 22-22 for CIDR XXXXXXXXXX/22
>
> >>> Creating placement group @sc-spottest...
>
> Reservation:r-02fbac61
>
> >>> Launching node001 (ami: ami-12b6477b, type: cc1.4xlarge)
>
> SpotInstanceRequest:sir-6cb0f014
>
> >>> Launching node002 (ami: ami-12b6477b, type: cc1.4xlarge)
>
> SpotInstanceRequest:sir-b0ff9e11
>
> >>> Launching node003 (ami: ami-12b6477b, type: cc1.4xlarge)
>
> SpotInstanceRequest:sir-2ef6f814
>
> >>> Waiting for cluster to come up... (updating every 20s)
>
> >>> Waiting for all nodes to be in a 'running' state...
>
> 1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>
> >>> Waiting for SSH to come up on all nodes...
>
> 1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>
> >>> Waiting for cluster to come up took 3.547 mins
>
> >>> The master node is ec2-184-72-156-11.compute-1.amazonaws.com
>
>
>
> I haven't tried this with anything but 'bigger' stuff (cc1 & cc2), so I
> don't know if that has any bearing on the situation. My config:
>
>
>
> [global]
>
> DEFAULT_TEMPLATE=Rcluster
>
> ENABLE_EXPERIMENTAL=True
>
> REFRESH_INTERVAL=20
>
>
>
> [aws info]
>
> AWS_ACCESS_KEY_ID = XXXXXXXXXXXX
>
> AWS_SECRET_ACCESS_KEY = XXXXXXXXXXXXX
>
> AWS_USER_ID = XXXXXXXXXXX
>
> EC2_CERT = XXXXXXXXXXX.pem
>
> EC2_PRIVATE_KEY = XXXXXXXXXXXXX.pem
>
>
>
> [key mykey]
>
> KEY_LOCATION=XXXXXXXXXXXXXXX.pem
>
>
>
> [cluster spottest]
>
> KEYNAME = mykey
>
> CLUSTER_SIZE = 4
>
> CLUSTER_USER = sgeadmin
>
> CLUSTER_SHELL = bash
>
> NODE_IMAGE_ID = ami-12b6477b
>
> NODE_INSTANCE_TYPE = cc1.4xlarge
>
> AVAILABILITY_ZONE = us-east-1c
>
> VOLUMES = Rlocal-spottest
>
> PLUGINS = setup-centos
>
> PERMISSIONS = ssh-local
>
> SPOT_BID = 1.50
>
>
>
> [volume Rlocal-spottest]
>
> VOLUME_ID = vol-XXXXXXXXXX
>
> MOUNT_PATH = /usr/local
>
>
>
> [plugin setup-centos]
>
> setup_class = setup-centos.PackageInstaller
>
> pkg_to_install = R
>
>
>
> [permission ssh-local]
>
> protocol = tcp
>
> from_port = 22
>
> to_port = 22
>
> cidr_ip = XXXXXXXXXXX/22
>
>
>
> This exact config works sometimes, other times not. Thanks for
> listening, or any advice you might have.
>
> -Hugh
>
>
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster

Received on Fri Apr 20 2012 - 10:08:09 EDT
This archive was generated by hypermail 2.3.0.
