Folks:
First: is there a good way to search the list archives? I tried various Google queries with no luck, and I'd hate to duplicate effort or messages.
More importantly: a possible bug? Roughly 30% of the time when I start SPOT_BID clusters, 'start' appears to skip the "Waiting for open spot requests to become active..." step and only processes the master. When it works correctly, I see:
>>> Launching node001 (ami: ami-12b6477b, type: cc1.4xlarge)
SpotInstanceRequest:sir-9f38a214
>>> Launching node002 (ami: ami-12b6477b, type: cc1.4xlarge)
SpotInstanceRequest:sir-c4505a11
>>> Launching node003 (ami: ami-12b6477b, type: cc1.4xlarge)
SpotInstanceRequest:sir-cbb32414
>>> Waiting for cluster to come up... (updating every 20s)
>>> Waiting for open spot requests to become active...
0/3 | | 0%
When it doesn't work correctly, I see the following: it skips the "Waiting for open spot requests to become active..." step shown above and goes straight to 'Waiting for all nodes', and the count is /1 instead of /4 (or whatever CLUSTER_SIZE is).
# starcluster start -c spottest spottest
StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.3)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu
>>> Validating cluster template settings...
>>> Cluster template settings are valid
>>> Starting cluster...
>>> Launching a 4-node cluster...
>>> Launching master node (ami: ami-12b6477b, type: cc1.4xlarge)...
>>> Creating security group @sc-spottest...
>>> Opening tcp port range 22-22 for CIDR XXXXXXXXXX/22
>>> Creating placement group @sc-spottest...
Reservation:r-02fbac61
>>> Launching node001 (ami: ami-12b6477b, type: cc1.4xlarge)
SpotInstanceRequest:sir-6cb0f014
>>> Launching node002 (ami: ami-12b6477b, type: cc1.4xlarge)
SpotInstanceRequest:sir-b0ff9e11
>>> Launching node003 (ami: ami-12b6477b, type: cc1.4xlarge)
SpotInstanceRequest:sir-2ef6f814
>>> Waiting for cluster to come up... (updating every 20s)
>>> Waiting for all nodes to be in a 'running' state...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for SSH to come up on all nodes...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for cluster to come up took 3.547 mins
>>> The master node is ec2-184-72-156-11.compute-1.amazonaws.com
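To catch the bad runs automatically, something like the following rough Python sketch could scan captured 'start' output. It assumes the output has been saved to a string with one message per line (a real terminal progress bar may use carriage returns instead), and the function name check_start_log is made up for illustration; it is not part of StarCluster.

```python
import re

SPOT_WAIT = "Waiting for open spot requests to become active"
NODES_WAIT = "Waiting for all nodes to be in a 'running' state"


def check_start_log(log_text, cluster_size):
    """Return a list of problems found in captured 'starcluster start' output.

    Hypothetical helper: it only pattern-matches the messages quoted in this
    post and knows nothing about StarCluster internals.
    """
    lines = log_text.splitlines()
    problems = []

    # Each spot launch prints a line such as "SpotInstanceRequest:sir-6cb0f014".
    spot_ids = re.findall(r"SpotInstanceRequest:(sir-\w+)", log_text)
    if spot_ids and not any(SPOT_WAIT in line for line in lines):
        problems.append(
            "%d spot requests made, but the spot wait step is missing" % len(spot_ids)
        )

    # The progress line right after the 'running' message looks like "1/1 ||| 100%".
    for i, line in enumerate(lines[:-1]):
        if NODES_WAIT in line:
            match = re.match(r"\s*(\d+)/(\d+)", lines[i + 1])
            if match and int(match.group(2)) != cluster_size:
                problems.append(
                    "waited for %s node(s), expected %d" % (match.group(2), cluster_size)
                )
    return problems
```

Run against the failing output above with cluster_size=4, this should flag both symptoms: the missing spot wait step and the 1/1 count.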
I have only tried this with the larger instance types (cc1 and cc2), so I don't know whether that has any bearing on the situation. My config:
[global]
DEFAULT_TEMPLATE=Rcluster
ENABLE_EXPERIMENTAL=True
REFRESH_INTERVAL=20
[aws info]
AWS_ACCESS_KEY_ID = XXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY = XXXXXXXXXXXXX
AWS_USER_ID = XXXXXXXXXXX
EC2_CERT = XXXXXXXXXXX.pem
EC2_PRIVATE_KEY = XXXXXXXXXXXXX.pem
[key mykey]
KEY_LOCATION=XXXXXXXXXXXXXXX.pem
[cluster spottest]
KEYNAME = mykey
CLUSTER_SIZE = 4
CLUSTER_USER = sgeadmin
CLUSTER_SHELL = bash
NODE_IMAGE_ID = ami-12b6477b
NODE_INSTANCE_TYPE = cc1.4xlarge
AVAILABILITY_ZONE = us-east-1c
VOLUMES = Rlocal-spottest
PLUGINS = setup-centos
PERMISSIONS = ssh-local
SPOT_BID = 1.50
[volume Rlocal-spottest]
VOLUME_ID = vol-XXXXXXXXXX
MOUNT_PATH = /usr/local
[plugin setup-centos]
setup_class = setup-centos.PackageInstaller
pkg_to_install = R
[permission ssh-local]
protocol = tcp
from_port = 22
to_port = 22
cidr_ip = XXXXXXXXXXX/22
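For reference, here is a small Python sketch of the counts I would expect 'start' to wait on for a given cluster template. It reads the config with the standard configparser module and assumes, based only on the output above (a Reservation for the master, then three SpotInstanceRequests), that the master is launched on-demand and every other node is a spot request. The function name expected_counts is hypothetical.

```python
import configparser


def expected_counts(config_text, template):
    """Work out how many nodes and spot requests a cluster template implies.

    Assumption (from the log output in this post, not from StarCluster docs):
    with SPOT_BID set, the master is on-demand and the remaining
    CLUSTER_SIZE - 1 nodes are spot requests.
    """
    # interpolation=None keeps any '%' characters in values from being expanded.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)

    section = parser["cluster " + template]
    # Option lookups are case-insensitive, so CLUSTER_SIZE and SPOT_BID match.
    size = section.getint("CLUSTER_SIZE")
    spot_bid = section.getfloat("SPOT_BID", fallback=None)

    spot_requests = size - 1 if spot_bid is not None else 0
    return {"total_nodes": size, "spot_bid": spot_bid, "spot_requests": spot_requests}
```

For the spottest template above this gives 4 total nodes and 3 spot requests, which matches the 0/3 spot progress line in the good run.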
This exact config works sometimes, other times not. Thanks for listening, or any advice you might have.
-Hugh
Received on Thu Apr 12 2012 - 13:30:16 EDT