StarCluster - Mailing List Archive

Re: [Starcluster] starcluster fails to mount /home with NFS

From: Justin Riley <no email>
Date: Fri, 16 Apr 2010 11:31:50 -0400

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Gabriel,

I'm also able to reproduce your troubles with NFS on c1.xlarge. I'm
guessing you've tested the smallest 64-bit instance type available as
well and it has the same problem?

Unfortunately I haven't been able to get to the bottom of this just yet.
The RPC error is not very common, and I think it may have something to
do with the kernel, but I'm not really sure.
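
For reference, "RPC Error: Program not registered" generally means that
the portmapper on the server answered, but the RPC program the client
asked for (for an NFS mount, typically mountd or nfs) is not registered
with it. Below is a minimal sketch for narrowing that down from a node,
assuming rpcinfo is installed there. The helper name
missing_rpc_programs and the MASTER variable are hypothetical and are
not part of StarCluster.

```shell
#!/bin/sh
# Print the NFS-related RPC programs that are absent from an
# `rpcinfo -p` listing. The listing is read from stdin, so the helper
# also works on saved output. (Hypothetical helper, not part of StarCluster.)
missing_rpc_programs() {
    listing=$(cat)
    for prog in portmapper mountd nfs; do
        # In `rpcinfo -p` output the service name is the last column.
        if ! printf '%s\n' "$listing" |
            awk -v p="$prog" '$NF == p { found = 1 } END { exit !found }'; then
            printf '%s\n' "$prog"
        fi
    done
}

# Usage on a node; MASTER is a placeholder for the master's internal hostname:
#   rpcinfo -p "$MASTER" | missing_rpc_programs
```

If this prints mountd or nfs while portmapper is present, the NFS
server on the master probably never started or never registered, which
would be consistent with portmap itself reporting as running.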

I'll keep you posted with what I discover. I have a feeling I'll need to
release a new 64-bit AMI because of this, but we'll see.

~Justin

On 04/16/2010 09:00 AM, Gabriel Hoffman wrote:
> Justin,
> On the master I get:
>
> root@domU-12-31-39-0E-3E-11:~# /etc/init.d/portmap status
> * portmap is running
>
> root@domU-12-31-39-0E-3E-11:~# pgrep portmap -lf
> 916 /sbin/portmap
>
> On a node I get:
>
> root@domU-12-31-39-00-13-C2:~# /etc/init.d/portmap status
> * portmap is running
>
> root@domU-12-31-39-00-13-C2:~# pgrep portmap -lf
> 843 /sbin/portmap
>
>
> This NFS problem happens repeatedly so I haven't been able to use Amazon
> for my qsub jobs. Early on it worked ok on a 2 node cluster, but I
> tried to scale up and it fails every time now. During the course of
> trying to diagnose the problem I changed my security group settings,
> although it seems that starcluster would occasionally reset them.
> Currently I have:
>
> Method  Protocol  From port  To port  Source
> All     tcp       0          65535    0.0.0.0/0
> All     tcp       0          65535    default group
> All     udp       0          65535    0.0.0.0/0
> All     udp       0          65535    default group
> SSH     tcp       22         22       0.0.0.0/0
>
> Thanks for your help,
> Gabriel
>
> On 04/15/2010 05:39 PM, Justin Riley wrote:
> Hi Gabriel,
>
> Thanks for joining the list and sorry to hear you're having issues using
> StarCluster.
>
> This looks like a problem with portmap starting. Could you check if
> portmap is running on the nodes?
>
> Try something like:
>
> $ /etc/init.d/portmap status
>
> and
>
> $ pgrep portmap -lf
>
> I'm testing now with the ami-a19e71c8 image and c1.xlarge instances to
> see if I can reproduce your issue.
>
> Has this happened to you more than once?
>
> ~Justin
>
>
>
> On 04/15/2010 04:56 PM, Gabriel Hoffman wrote:
>
>>>> I am using starcluster to launch 3 c1.xlarge instances using
>>>> ami-a19e71c8. starcluster -s works fine and does not generate any
>>>> errors and my EBS is correctly mounted on /home in the master. My
>>>> problem is that starcluster does not correctly mount this on the nodes
>>>> using NFS. When I try to mount manually, I get the following error:
>>>>
>>>> root@domU-12-31-39-0F-6D-D1:/home# mount /home
>>>> mount.nfs: mount to NFS server
>>>> 'domU-12-31-39-0F-6C-F1.compute-1.internal' failed: RPC Error: Program
>>>> not registered
>>>>
>>>> This also fails with m1.large and ami-a19e71c8. However, the NFS works
>>>> fine when I use m1.small instances with ami-8f9e71e6.
>>>>
>>>> Any ideas about how to get starcluster to use NFS correctly with c1.xlarge?
>>>>
>>>> Here is my .starclustercfg:
>>>>
>>>>
>>>> [section aws]
>>>>
>>>> ---
>>>>
>>>> [section ssh]
>>>>
>>>> ---
>>>>
>>>> [section cluster]
>>>>
>>>> # cluster size
>>>> CLUSTER_SIZE = 3
>>>>
>>>> # create the following user on the cluster
>>>> CLUSTER_USER = sgeadmin
>>>>
>>>> # optionally specify shell (defaults to bash)
>>>> # options: bash, zsh, csh, ksh, tcsh
>>>> CLUSTER_SHELL = bash
>>>>
>>>> # AMI for master node. Defaults to NODE_IMAGE_ID if not specified
>>>> # The base i386 StarCluster AMI is ami-8f9e71e6
>>>> # The base x86_64 StarCluster AMI is ami-a19e71c8
>>>> MASTER_IMAGE_ID = ami-a19e71c8
>>>>
>>>>
>>>> # AMI for worker nodes. Also used for the master node if MASTER_IMAGE_ID
>>>> # is not specified
>>>> # The base i386 StarCluster AMI is ami-8f9e71e6
>>>> # The base x86_64 StarCluster AMI is ami-a19e71c8
>>>> NODE_IMAGE_ID = ami-a19e71c8
>>>>
>>>> # instance type
>>>> INSTANCE_TYPE = c1.xlarge
>>>>
>>>> # availability zone
>>>> AVAILABILITY_ZONE = us-east-1a
>>>>
>>>> [section ebs]
>>>>
>>>> # NOTE: this section is optional, uncomment to use
>>>> # attach volume to /home on master node
>>>>
>>>> ATTACH_VOLUME = vol-598b2130
>>>>
>>>> VOLUME_DEVICE = /dev/sdj
>>>>
>>>> VOLUME_PARTITION = /dev/sdj1
>>>>
>>>>
>>>>
>>>>

> --
> Gabriel Hoffman

> PhD Candidate
> Genetics and Development
> Cornell University


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkvIguYACgkQ4llAkMfDcrmTGACeJKDflhtWKgttY740IqK3gUqe
VKEAn2Tj27/jdhP24Ufu6h5Ysq0hrEfp
=vZSn
-----END PGP SIGNATURE-----
Received on Fri Apr 16 2010 - 11:31:52 EDT