StarCluster - Mailing List Archive

Re: [Starcluster] starcluster fails to mount /home with NFS

From: Justin Riley <no email>
Date: Fri, 16 Apr 2010 13:06:58 -0400

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Gabriel,

I just tested my new, not yet released, StarCluster Ubuntu 9.10 karmic
AMIs, and they work with c1.xlarge just fine.

Unfortunately, these AMIs will not work with the current version of
StarCluster on PyPI, because they use the latest Sun Grid Engine, which
has slightly different config parameters.

However, if you want to use the GitHub code, which I'm hoping to release
next week anyway, this should work for you. The current work-in-progress
docs for the next release are here:

http://web.mit.edu/stardev/cluster/docs

Let me know if you want to give this a try. I'm going to make those new
9.10 karmic AMIs available today after some testing.

~Justin

On 04/16/2010 11:31 AM, Justin Riley wrote:
> Hi Gabriel,
>
> I'm also able to reproduce your troubles with NFS on c1.xlarge. I'm
> guessing you've tested the smallest 64-bit instance available as well
> and it has the same problem?
>
> Unfortunately I haven't been able to get to the bottom of this just yet.
> The RPC error is not very common and I think this may have something to
> do with the kernel but I'm not really sure.
>
> I'll keep you posted on what I discover. I have a feeling I'll need to
> release a new 64-bit AMI because of this, but we'll see.
>
> ~Justin
>
> On 04/16/2010 09:00 AM, Gabriel Hoffman wrote:
>> Justin,
>> On the master I get:
>
>> root@domU-12-31-39-0E-3E-11:~# /etc/init.d/portmap status
>> * portmap is running
>
>> root@domU-12-31-39-0E-3E-11:~# pgrep portmap -lf
>> 916 /sbin/portmap
>
>> On a node I get:
>
>> root@domU-12-31-39-00-13-C2:~# /etc/init.d/portmap status
>> * portmap is running
>
>> root@domU-12-31-39-00-13-C2:~# pgrep portmap -lf
>> 843 /sbin/portmap
>
>
>> This NFS problem happens repeatedly so I haven't been able to use Amazon
>> for my qsub jobs. Early on it worked ok on a 2 node cluster, but I
>> tried to scale up and it fails every time now. During the course of
>> trying to diagnose the problem I changed my security group settings,
>> although it seems that starcluster would occasionally reset them.
>> Currently I have:
>
>> Method  Protocol  From port  To port  Source
>> All     tcp       0          65535    0.0.0.0/0
>> All     tcp       0          65535    default group
>> All     udp       0          65535    0.0.0.0/0
>> All     udp       0          65535    default group
>> SSH     tcp       22         22       0.0.0.0/0
>
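One way to rule the security group in or out is to check the rules listed above against the ports NFS relies on. The following is only a rough sketch in Python: the rule tuples are transcribed from the message, the helper names are made up for illustration, and the source column is ignored. It uses the well-known ports for the portmapper (111) and NFS (2049); mountd usually gets a dynamically assigned port, so it is not checked by number here.

```python
# Rules as listed in the message above: (protocol, from_port, to_port, source)
RULES = [
    ("tcp", 0, 65535, "0.0.0.0/0"),
    ("tcp", 0, 65535, "default group"),
    ("udp", 0, 65535, "0.0.0.0/0"),
    ("udp", 0, 65535, "default group"),
    ("tcp", 22, 22, "0.0.0.0/0"),
]

# Well-known ports used by an NFS mount (mountd is usually dynamic).
NFS_PORTS = {"portmapper": 111, "nfs": 2049}


def port_allowed(rules, protocol, port):
    """Return True if any rule for this protocol covers the port."""
    return any(
        proto == protocol and low <= port <= high
        for proto, low, high, _source in rules
    )


def blocked_nfs_ports(rules):
    """List (service, protocol, port) triples that no rule permits."""
    blocked = []
    for name, port in NFS_PORTS.items():
        for protocol in ("tcp", "udp"):
            if not port_allowed(rules, protocol, port):
                blocked.append((name, protocol, port))
    return blocked


if __name__ == "__main__":
    print(blocked_nfs_ports(RULES))
```

With the rules as posted, nothing is reported as blocked, which suggests (but does not prove) that the security group is not what is stopping the mount.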
>
>> Thanks for your help,
>> Gabriel
>
>> On 04/15/2010 05:39 PM, Justin Riley wrote:
>> Hi Gabriel,
>
>> Thanks for joining the list and sorry to hear you're having issues using
>> StarCluster.
>
>> This looks like a problem with portmap starting. Could you check if
>> portmap is running on the nodes?
>
>> Try something like:
>
>> $ /etc/init.d/portmap status
>
>> and
>
>> $ pgrep portmap -lf
>
>> I'm testing now with the ami-a19e71c8 image and c1.xlarge instances to
>> see if I can reproduce your issue.
>
>> Has this happened to you more than once?
>
>> ~Justin
>
>
>
>> On 04/15/2010 04:56 PM, Gabriel Hoffman wrote:
>
>>>>> I am using starcluster to launch 3 c1.xlarge instances using
>>>>> ami-a19e71c8. starcluster -s works fine and does not generate any
>>>>> errors and my EBS is correctly mounted on /home in the master. My
>>>>> problem is that starcluster does not correctly mount this on the nodes
>>>>> using NFS. When I try to mount manually, I get the following error:
>>>>>
>>>>> root@domU-12-31-39-0F-6D-D1:/home# mount /home
>>>>> mount.nfs: mount to NFS server
>>>>> 'domU-12-31-39-0F-6C-F1.compute-1.internal' failed: RPC Error: Program
>>>>> not registered
>>>>>
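"Program not registered" usually means the portmapper on the server answered, but the RPC program the client asked for (for NFS, typically mountd or nfs) is not registered with it. A quick way to see what is registered is `rpcinfo -p <server>`. The sketch below is only an illustration in Python: the function names are hypothetical, the sample text is made up rather than captured from this cluster, and it assumes the usual five-column `rpcinfo -p` layout with the service name in the last column.

```python
import subprocess

# RPC services an NFS client expects to find registered on the server.
REQUIRED = ("portmapper", "mountd", "nfs")


def rpcinfo_output(host):
    """Return the text printed by `rpcinfo -p <host>`."""
    result = subprocess.run(
        ["rpcinfo", "-p", host], capture_output=True, text=True, check=True
    )
    return result.stdout


def registered_programs(rpcinfo_text):
    """Collect service names from the last column of `rpcinfo -p` output."""
    names = set()
    for line in rpcinfo_text.splitlines():
        fields = line.split()
        # Data rows look like: program, version, protocol, port, name
        if len(fields) >= 5 and fields[0].isdigit():
            names.add(fields[4])
    return names


def missing_services(rpcinfo_text, required=REQUIRED):
    """Return the required services that are not registered."""
    present = registered_programs(rpcinfo_text)
    return [name for name in required if name not in present]


# Illustrative output only (not taken from the cluster in this thread):
# the portmapper is up, but no NFS services have registered with it.
SAMPLE = """\
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
"""

if __name__ == "__main__":
    print(missing_services(SAMPLE))
```

On a real cluster one would pass `rpcinfo_output("<master hostname>")` instead of `SAMPLE`. If mountd or nfs come back as missing while portmapper is present, the NFS server side on the master is the place to look, not portmap on the nodes.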
>>>>> This also fails with m1.large and ami-a19e71c8. However, the NFS works
>>>>> fine when I use m1.small instances with ami-8f9e71e6.
>>>>>
>>>>> Any ideas about how to get starcluster to use NFS correctly with c1.xlarge?
>>>>>
>>>>> Here is my .starclustercfg:
>>>>>
>>>>>
>>>>> [section aws]
>>>>>
>>>>> ---
>>>>>
>>>>> [section ssh]
>>>>>
>>>>> ---
>>>>>
>>>>> [section cluster]
>>>>>
>>>>> # cluster size
>>>>> CLUSTER_SIZE = 3
>>>>>
>>>>> # create the following user on the cluster
>>>>> CLUSTER_USER = sgeadmin
>>>>>
>>>>> # optionally specify shell (defaults to bash)
>>>>> # options: bash, zsh, csh, ksh, tcsh
>>>>> CLUSTER_SHELL = bash
>>>>>
>>>>> # AMI for master node. Defaults to NODE_IMAGE_ID if not specified
>>>>> # The base i386 StarCluster AMI is ami-8f9e71e6
>>>>> # The base x86_64 StarCluster AMI is ami-a19e71c8
>>>>> MASTER_IMAGE_ID = ami-a19e71c8
>>>>>
>>>>>
>>>>> # AMI for worker nodes. Also used for the master node if MASTER_IMAGE_ID
>>>>> # is not specified
>>>>> # The base i386 StarCluster AMI is ami-8f9e71e6
>>>>> # The base x86_64 StarCluster AMI is ami-a19e71c8
>>>>> NODE_IMAGE_ID = ami-a19e71c8
>>>>>
>>>>> # instance type
>>>>> INSTANCE_TYPE = c1.xlarge
>>>>>
>>>>> # availability zone
>>>>> AVAILABILITY_ZONE = us-east-1a
>>>>>
>>>>> [section ebs]
>>>>>
>>>>> # NOTE: this section is optional, uncomment to use
>>>>> # attach volume to /home on master node
>>>>>
>>>>> ATTACH_VOLUME = vol-598b2130
>>>>>
>>>>> VOLUME_DEVICE = /dev/sdj
>>>>>
>>>>> VOLUME_PARTITION = /dev/sdj1
>>>>>
>>>>>
>>>>>
>>>>>
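The comments in the config above say that MASTER_IMAGE_ID falls back to NODE_IMAGE_ID when it is not set. A small sketch of that fallback, using Python's standard configparser rather than StarCluster's own config code (which may behave differently), can help confirm which AMI each node type ends up with. The sample text is a trimmed copy of the posted `[section cluster]` block, and `effective_images` is a hypothetical helper name.

```python
import configparser

# Trimmed from the config posted above; MASTER_IMAGE_ID left out on purpose.
SAMPLE_CFG = """
[section cluster]
CLUSTER_SIZE = 3
# The base x86_64 StarCluster AMI is ami-a19e71c8
NODE_IMAGE_ID = ami-a19e71c8
INSTANCE_TYPE = c1.xlarge
"""


def effective_images(cfg_text):
    """Return (master_ami, node_ami), applying the documented fallback."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    cluster = parser["section cluster"]
    node_ami = cluster["NODE_IMAGE_ID"]
    # MASTER_IMAGE_ID defaults to NODE_IMAGE_ID if not specified.
    master_ami = cluster.get("MASTER_IMAGE_ID", fallback=node_ami)
    return master_ami, node_ami


if __name__ == "__main__":
    print(effective_images(SAMPLE_CFG))
```

In the config as posted both values are set to ami-a19e71c8, so the master and the nodes run the same x86_64 image either way.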
>
>> --
>> Gabriel Hoffman
>
>> PhD Candidate
>> Genetics and Development
>> Cornell University
>
>
_______________________________________________
Starcluster mailing list
Starcluster_at_mit.edu
http://mailman.mit.edu/mailman/listinfo/starcluster

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkvImTIACgkQ4llAkMfDcrnG+wCgjA6KD1FixiX4MiTj9q+XsrLl
D5AAnR0UIalIrCAqAZ8sA5KXGwvM33mK
=B/Dq
-----END PGP SIGNATURE-----
Received on Fri Apr 16 2010 - 13:07:00 EDT
This archive was generated by hypermail 2.3.0.
