Hi all,
I've run into a problem with "addnode" that I'm having a hard time
diagnosing. Using the development version of StarCluster, when I run
"starcluster addnode", the nodes added to the cluster are unusable
-- they produce SGE errors. Jobs run fine on the master node, but any nodes
I've added are broken. If, however, I start the cluster with multiple
nodes, then all of the resulting nodes are usable (so it's not a user code
issue). I have a hunch that this is because several users work under the
same AWS account (as different IAM users) and we are not all on the same
StarCluster version. To be clear, we are all on different revisions of the
development version (0.9999). Where do I begin debugging this? The hosts
file seems to be set up correctly (see output below).
Thanks,
Dan
danp@master:~$ cat /etc/hosts
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
10.196.149.155 master
10.226.219.58 node001
And here's what the errors look like:
danp@master:~$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@master                   BIP   0/0/8          1.29     lx24-amd64
---------------------------------------------------------------------------------
all.q@node001                  BIP   0/0/8          0.70     lx24-amd64
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
1 0.55500 postList danp Eqw 11/26/2012 16:54:38 1
3 0.55500 postList danp Eqw 11/26/2012 16:54:39 1
5 0.55500 postList danp Eqw 11/26/2012 16:54:39 1
7 0.55500 postList danp Eqw 11/26/2012 16:54:39 1
8 0.55500 postList danp Eqw 11/26/2012 16:54:40 1
9 0.55500 postList danp Eqw 11/26/2012 16:54:40 1
10 0.55500 postList danp Eqw 11/26/2012 16:54:40 1
11 0.55500 postList danp Eqw 11/26/2012 16:54:40 1
13 0.55500 postList danp Eqw 11/26/2012 16:54:40 1
15 0.55500 postList danp Eqw 11/26/2012 16:54:41 1
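A possible first debugging step, sketched here under assumptions: every pending job in the listing above is in the Eqw (error) state, and "qstat -j <job id>" prints an "error reason" line for such a job. The helper name eqw_jobs is hypothetical, and it assumes the job state is the 5th whitespace-separated field, as in the listing above.

```shell
# Hypothetical helper: print the IDs of jobs whose state column is Eqw.
# Reads `qstat -f` style text on stdin; assumes the state is the 5th field,
# which matches the pending-jobs listing in this post.
eqw_jobs() {
    awk '$5 == "Eqw" { print $1 }'
}

# Example use on a live cluster (not run here):
#   qstat -f | eqw_jobs | while read -r id; do
#       qstat -j "$id" | grep -i "error reason"
#   done
```

Once the underlying cause is fixed, "qmod -cj <job id>" clears the error state so the job can be scheduled again.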
Received on Mon Nov 26 2012 - 13:55:39 EST