StarCluster - Mailing List Archive

Re: Addnode SGE problem

From: Daniel Polhamus <no email>
Date: Mon, 26 Nov 2012 14:44:50 -0500

Thanks Ron, that was helpful. It looks like the NFS share of my EBS volume
isn't being set up correctly when using addnode. In my config file, I've
set MOUNT_PATH=/data, yet when I use addnode, only /home is exported:

>>> Configuring NFS exports path(s):
/home

Sure enough, changes I make under /home/danp on the master node are visible
on node001, whereas /data does not exist on node001 at all.
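
One way to double-check this on the master is to compare the directories
listed in /etc/exports against the paths the config is expected to share.
The following is only a rough sketch in Python: the helper names are made
up for illustration, the sample text is a hypothetical /etc/exports that
matches the symptom above (not a copy of the real file), and the parsing
ignores quoted paths and backslash line continuations.

```python
def exported_paths(exports_text):
    """Return the set of directories listed in /etc/exports-style text."""
    paths = set()
    for line in exports_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        paths.add(line.split()[0])  # first field is the exported directory
    return paths


def missing_exports(exports_text, expected):
    """Return the expected paths that are not exported, in the given order."""
    exported = exported_paths(exports_text)
    return [p for p in expected if p not in exported]


# Hypothetical /etc/exports content (an assumption, for illustration only).
sample = """
# shared by the master
/home node001(async,no_root_squash,no_subtree_check,rw)
"""
print(missing_exports(sample, ["/home", "/data"]))  # prints ['/data']
```

On a live master the same check would read the real file, for example
missing_exports(open("/etc/exports").read(), ["/home", "/data"]).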

Should I create a bug report for this, or is this message sufficient?
Dan


On Mon, Nov 26, 2012 at 2:20 PM, Ron Chen <ron_chen_123_at_yahoo.com> wrote:

> Do you have the danp user on node001?
>
> Also, you should check the execd's messages file
> ($SGE_ROOT/default/spool/<host>/messages) to find out why the job caused
> errors.
>
> http://gridscheduler.sourceforge.net/howto/troubleshooting.html
>
> -Ron
>
>
>
> ________________________________
> From: Daniel Polhamus <danp_at_metrumrg.com>
> To: starcluster_at_mit.edu
> Sent: Monday, November 26, 2012 1:55 PM
> Subject: [StarCluster] Addnode SGE problem
>
>
> Hi all,
>
> I've run into a problem with "addnode" that I'm having a difficult time
> diagnosing. Using the development version of starcluster, when I issue a
> starcluster addnode, the nodes added in the resulting cluster are unusable
> -- they result in SGE errors. Jobs run on the master node, but any nodes
> I've added are broken. If, however, I start the cluster with multiple
> nodes, then the resulting nodes are all usable (so it's not a user code issue).
> I have a hunch that this is due to the fact that we have several users
> working under the same account (as different AWS IAM users) and we are not
> all on the same StarCluster version. To be clear, we are all on varying
> stages of the developmental version (0.9999). Where do I begin debugging
> this? The hostfile seems to be set up correctly (see output below).
>
> Thanks,
> Dan
>
> danp@master:~$ cat /etc/hosts
> 127.0.0.1 localhost
>
> # The following lines are desirable for IPv6 capable hosts
> ::1 ip6-localhost ip6-loopback
> fe00::0 ip6-localnet
> ff00::0 ip6-mcastprefix
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
> ff02::3 ip6-allhosts
> 10.196.149.155 master
> 10.226.219.58 node001
>
> And here's what the errors look like:
>
> danp@master:~$ qstat -f
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@master                   BIP   0/0/8          1.29     lx24-amd64
> ---------------------------------------------------------------------------------
> all.q@node001                  BIP   0/0/8          0.70     lx24-amd64
>
> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>       1 0.55500 postList   danp         Eqw   11/26/2012 16:54:38     1
>       3 0.55500 postList   danp         Eqw   11/26/2012 16:54:39     1
>       5 0.55500 postList   danp         Eqw   11/26/2012 16:54:39     1
>       7 0.55500 postList   danp         Eqw   11/26/2012 16:54:39     1
>       8 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>       9 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>      10 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>      11 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>      13 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>      15 0.55500 postList   danp         Eqw   11/26/2012 16:54:41     1



-- 
Daniel G Polhamus, PhD
Metrum Research Group, LLC
2 Tunxis Rd, Suite 112
Tariffville, CT 06081
(888) 308-7049 ext 403
Received on Mon Nov 26 2012 - 14:45:12 EST
This archive was generated by hypermail 2.3.0.