Thank you, Justin, for your reply.
1. I started with your GPU AMI "ami-4583572c", as you can see in the first
cluster template in the config file, which is commented out and not used
now. That was when I first created the initial volume in the AWS console,
and it was placed by default in a different availability zone,
"us-east-1c", while the AMI's instance was in "us-east-1a". So the first
time it didn't attach because of the zone mismatch, but the error message
was not clear, and I didn't understand the zone issue until I used the ec2
commands, which stated it clearly; over the next days the replies on this
mailing list confirmed that this was the reason. Then, when Rayson said I
should use the starcluster createvolume command, I deleted the volume,
recreated it as the instructions require, and terminated the volumecreator.
However, I don't think I saw it attached once I started sfmcluster; I think
it only appeared after I used the AWS console.
I then did my configuration, installations, and downloads, and created a
new image, "ami-fae74193", which is available publicly now if you want to
have a look. I used the command below, and yes, the volume was attached at
the time:
ec2-create-image instanceID --name sfmimage --description 'GPU Cluster
Ubunto with VisualSFM, MeshlabServer, FFMPEG' -K mykeypath/pkfile.pem
Now I am running the second cluster, "mysfmcluster", with my AMI
"ami-fae74193". I think I had to detach the volume from the AWS console,
and even force-detach it, to attach it to the new cluster. I kept both
clusters running for a while to test, and I am not sure whether trying to
detach while the first cluster was still running caused the problem, but
the detach took a while. I am also not sure whether I had to terminate the
first cluster before attaching to the second, but I remember I did
terminate it first.
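To make sure I understood the constraint correctly, here is a plain sketch of the check that was failing for me: an EBS volume can only attach to an instance in the same availability zone, not merely the same region. The zone names are the ones from my setup, and the createvolume line in the message is only an illustration of the fix:

```shell
# Sketch of the zone constraint: an EBS volume attaches only to instances
# in its own availability zone, not just the same region.
VOLUME_ZONE="us-east-1c"    # where the AWS console put my first volume
INSTANCE_ZONE="us-east-1a"  # where the AMI's instance was launched
if [ "$VOLUME_ZONE" != "$INSTANCE_ZONE" ]; then
    echo "zone mismatch: the volume cannot attach to this instance"
    echo "  (recreate it in the right zone, e.g. starcluster createvolume 30 us-east-1a)"
fi
```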
2. As mentioned in point 1, I created the first volume in the AWS console,
in a different availability zone, then recreated it with the starcluster
commands; in both cases it was 30 GB and unpartitioned, and I saw no
errors.
3. Yes, as seen in the attached file.
4. I just started the cluster now, and the volume is attached this time; it
might have been my mistake, or I didn't wait long enough for everything to
become available. The bad news, however, is that there is another problem.
It didn't stop me from using sshmaster afterwards; the screen output is
copied below and, I think, is also in the attached debug.log.
5. Both files are attached.
Thanks again for your support.
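In case it helps, the relevant wiring in my ~/.starcluster/config looks roughly like this (section and key names are from the StarCluster docs; "sfmdata" is just the label I use for the volume section, and I have trimmed everything else, such as KEYNAME):

```ini
[volume sfmdata]
VOLUME_ID = vol-69bd4807
MOUNT_PATH = /home

[cluster mysfmcluster]
NODE_IMAGE_ID = ami-fae74193
NODE_INSTANCE_TYPE = cg1.4xlarge
CLUSTER_SIZE = 1
VOLUMES = sfmdata
```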
$ starcluster start mysfmcluster
StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.3)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu
>>> Using default cluster template: mysfmcluster
>>> Validating cluster template settings...
>>> Cluster template settings are valid
>>> Starting cluster...
>>> Launching a 1-node cluster...
>>> Launching master node (ami: ami-fae74193, type: cg1.4xlarge)...
>>> Creating security group @sc-mysfmcluster...
>>> Creating placement group @sc-mysfmcluster...
>>> Waiting for cluster to come up... (updating every 30s)
>>> Waiting for open spot requests to become active...
>>> Waiting for all nodes to be in a 'running' state...
>>> Waiting for SSH to come up on all nodes...
>>> Waiting for cluster to come up took 8.399 mins
>>> The master node is ec2-23-20-139-233.compute-1.amazonaws.com
>>> Setting up the cluster...
>>> Attaching volume vol-69bd4807 to master node on /dev/sdz ...
>>> Configuring hostnames...
>>> Mounting EBS volume vol-69bd4807 on /home...
>>> Creating cluster user: None (uid: 1001, gid: 1001)
>>> Configuring scratch space for user(s): sgeadmin
>>> Configuring /etc/hosts on each node
>>> Starting NFS server on master
>>> Setting up NFS took 0.073 mins
>>> Configuring passwordless ssh for root
>>> Shutting down threads...
Traceback (most recent call last):
line 255, in main
line 194, in execute
line 1414, in start
return self._start(create=create, create_only=create_only)
File "<string>", line 2, in _start
line 87, in wrap_f
res = func(*arg, **kargs)
line 1437, in _start
line 1446, in setup_cluster
File "<string>", line 2, in _setup_cluster
line 87, in wrap_f
res = func(*arg, **kargs)
line 1460, in _setup_cluster
line 350, in run
line 225, in _setup_passwordless_ssh
line 418, in generate_key_for_user
key = self.ssh.load_remote_rsa_key(private_key)
line 210, in load_remote_rsa_key
key = ssh.RSAKey(file_obj=rfile)
File "build/bdist.macosx-10.6-universal/egg/ssh/rsakey.py", line 48, in
File "build/bdist.macosx-10.6-universal/egg/ssh/rsakey.py", line 167, in
data = self._read_private_key('RSA', file_obj, password)
File "build/bdist.macosx-10.6-universal/egg/ssh/pkey.py", line 323, in
raise PasswordRequiredException('Private key file is encrypted')
PasswordRequiredException: Private key file is encrypted
!!! ERROR - Oops! Looks like you've found a bug in StarCluster
!!! ERROR - Crash report written to:
!!! ERROR - Please remove any sensitive data from the crash report
!!! ERROR - and submit it to starcluster@mit.edu
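If it helps narrow things down, the exception at the bottom of the traceback suggests the private key StarCluster reads on the node is passphrase-protected. As far as I know, a traditional PEM key records this in a "Proc-Type: 4,ENCRYPTED" header, so a quick check is possible; the key material below is a fabricated stand-in purely for illustration:

```shell
# Sketch: detect a passphrase-protected traditional-PEM private key, the
# condition behind "PasswordRequiredException: Private key file is
# encrypted". The header lines here are fabricated for the demo.
cat > /tmp/id_rsa_demo <<'EOF'
-----BEGIN RSA PRIVATE KEY-----
Proc-Type: 4,ENCRYPTED
DEK-Info: AES-128-CBC,0123456789ABCDEF0123456789ABCDEF
EOF
if grep -q 'ENCRYPTED' /tmp/id_rsa_demo; then
    echo "key is passphrase-protected"
fi
```

If that header shows up on the real key on the node (perhaps one baked into my AMI), I assume regenerating the key without a passphrase would let the passwordless-ssh setup proceed.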
On 23 May 2012 04:49, Justin Riley <jtriley@mit.edu> wrote:
> StarCluster chooses which device to attach external EBS volumes on
> automatically - you do not and should not need to specify this in your
> config. Assuming you use 'createvolume' and update your config correctly
> things should "just work".
> You should not have to use the AWS console to attach volumes manually
> and if you're having to do this then I'd like to figure out why so we
> can fix it. This is a core feature of StarCluster and many users are
> using external EBS with StarCluster without issue so I'm extremely
> curious why you're having issues...
> With that said I'm having trouble pulling out all of the details I need
> from this long thread so I'll ask direct questions instead:
> 1. Which AMI are you using? Did you create the AMI yourself? If so how
> did you go about creating the AMI and did you have any external EBS
> volumes attached while creating the AMI?
> 2. How did you create the volume you were having issues mounting with
> StarCluster? StarCluster expects your volume to either be completely
> unpartitioned (format entire device) or only contain a single partition.
> If this isn't the case you should see an error when starting a cluster.
> 3. Did you add your volume to your cluster config correctly according to
> the docs? (i.e., did you add your volume to the VOLUMES list in your
> cluster template?)
> 4. StarCluster should be spitting out errors when creating the cluster
> if it fails to attach/mount/NFS-share any external EBS volumes - did you
> notice any errors? Can you please attach the complete screen output of a
> failed StarCluster run? Also it would be extremely useful if you could send
> me your ~/.starcluster/logs/debug.log for a failed run so that I can
> take a look.
> 5. Would you mind sending me a copy of your config with all of the
> sensitive data removed? I just want to make sure you've configured
> things as expected.
Received on Tue May 22 2012 - 21:37:21 EDT
- application/octet-stream attachment: debug.log