StarCluster - Mailing List Archive

Re: Addnode SGE problem

From: Justin Riley <no email>
Date: Mon, 26 Nov 2012 16:36:08 -0500

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Thanks for reporting this. I'm working on a fix:

https://github.com/jtriley/StarCluster/issues/179

~Justin

On 11/26/2012 03:11 PM, Ron Chen wrote:
> I have not used that feature before, but if the latest dev version
> does not work, then filing a bug seems to be the right thing to
> do.
>
> https://github.com/jtriley/StarCluster/issues
>
> -Ron
>
> ________________________________
> From: Daniel Polhamus <danp_at_metrumrg.com>
> To: Ron Chen <ron_chen_123_at_yahoo.com>
> Cc: "starcluster_at_mit.edu" <starcluster_at_mit.edu>
> Sent: Monday, November 26, 2012 2:44 PM
> Subject: Re: [StarCluster] Addnode SGE problem
>
>
> Thanks Ron, that was helpful. It looks like the NFS share of my
> EBS volume isn't being set up correctly when using addnode. In my
> config file, I've set MOUNT_PATH=/data, yet when I use addnode:
>
> >>> Configuring NFS exports path(s):
> /home
>
> Sure enough, if I make changes using the master node in /home/danp,
> I can see those changes on node001 whereas there is no /data on
> node001.
>
> Should I create a bug report for this, or is this message
> sufficient?
>
> Dan
>
>
>
> On Mon, Nov 26, 2012 at 2:20 PM, Ron Chen <ron_chen_123_at_yahoo.com>
> wrote:
>
>> Do you have the danp user on node001?
>>
>> Also, you should check the execd's messages file
>> ($SGE_ROOT/default/spool/<host>/messages) to find out why the job
>> caused errors.
>>
>> http://gridscheduler.sourceforge.net/howto/troubleshooting.html
>>
>> -Ron
>>
>> ________________________________
>> From: Daniel Polhamus <danp_at_metrumrg.com>
>> To: starcluster_at_mit.edu
>> Sent: Monday, November 26, 2012 1:55 PM
>> Subject: [StarCluster] Addnode SGE problem
>>
>> Hi all,
>>
>> I've run into a problem with "addnode" that I'm having a
>> difficult time diagnosing. Using the development version of
>> starcluster, when I issue a starcluster addnode, the nodes added
>> in the resulting cluster are unusable -- they result in SGE
>> errors. Jobs run on the master node, but any nodes I've added
>> are broken. If, however, I start the cluster with multiple nodes,
>> then the resulting nodes are all usable (so it's not a user code
>> issue). I have a hunch that this is because we have
>> several users working under the same account (as different AWS
>> IAM users) and we are not all on the same StarCluster version.
>> To be clear, we are all on varying stages of the developmental
>> version (0.9999). Where do I begin debugging this? The hostfile
>> seems to be set up correctly (see output below).
>>
>> Thanks, Dan
>>
>> danp@master:~$ cat /etc/hosts
>> 127.0.0.1 localhost
>>
>> # The following lines are desirable for IPv6 capable hosts
>> ::1 ip6-localhost ip6-loopback
>> fe00::0 ip6-localnet
>> ff00::0 ip6-mcastprefix
>> ff02::1 ip6-allnodes
>> ff02::2 ip6-allrouters
>> ff02::3 ip6-allhosts
>> 10.196.149.155 master
>> 10.226.219.58 node001
>>
>> And here's what the errors look like:
>>
>> danp@master:~$ qstat -f
>> queuename                      qtype resv/used/tot. load_avg arch          states
>> ---------------------------------------------------------------------------------
>> all.q@master                   BIP   0/0/8          1.29     lx24-amd64
>> ---------------------------------------------------------------------------------
>> all.q@node001                  BIP   0/0/8          0.70     lx24-amd64
>>
>> ############################################################################
>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>> ############################################################################
>>       1 0.55500 postList   danp         Eqw   11/26/2012 16:54:38     1
>>       3 0.55500 postList   danp         Eqw   11/26/2012 16:54:39     1
>>       5 0.55500 postList   danp         Eqw   11/26/2012 16:54:39     1
>>       7 0.55500 postList   danp         Eqw   11/26/2012 16:54:39     1
>>       8 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>>       9 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>>      10 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>>      11 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>>      13 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>>      15 0.55500 postList   danp         Eqw   11/26/2012 16:54:41     1
>>
>> _______________________________________________
>> StarCluster mailing list
>> StarCluster_at_mit.edu
>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>
>
>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://www.enigmail.net/

iEYEARECAAYFAlCz4MgACgkQ4llAkMfDcrln6ACfdXPkXoi4jHuWgvw1FyiOodUC
af4AnRTFAER33awFUXBBQctQo53tb+h1
=AuEu
-----END PGP SIGNATURE-----
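
The two checks discussed in this thread (which jobs are stuck in Eqw, and whether /data is actually NFS-mounted on the added node) can be sketched as small shell helpers. This is only an illustration and is not part of StarCluster or of the fix tracked in the issue above: the function names eqw_job_ids, show_eqw_reasons and is_nfs_mounted are hypothetical, and the sketch assumes the qstat column layout shown in the thread (job state in the fifth column) and a mount table in /proc/mounts format.

```shell
# Hypothetical helpers; names are illustrative, not StarCluster or SGE commands.

# Print the IDs of jobs whose state column reads Eqw.
# Reads qstat output on stdin; assumes the state is the fifth field.
eqw_job_ids() {
    awk '$5 == "Eqw" { print $1 }'
}

# For each Eqw job, print the "error reason" line(s) that `qstat -j` reports.
# Needs a working SGE environment (qstat on PATH), so run it on the master.
show_eqw_reasons() {
    qstat | eqw_job_ids | while read -r job_id; do
        echo "job $job_id:"
        qstat -j "$job_id" | grep -i '^error reason'
    done
}

# Succeed if the given path appears as an NFS mount in a mount table
# supplied on stdin (fields: device, mount point, filesystem type, ...).
is_nfs_mounted() {
    awk -v path="$1" '
        $2 == path && $3 ~ /^nfs/ { found = 1 }
        END { exit !found }
    '
}
```

On node001, `is_nfs_mounted /data < /proc/mounts || echo "/data is not NFS-mounted"` would flag the missing share Dan describes, and `show_eqw_reasons` on the master prints whatever reason SGE recorded for each stuck job.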
Received on Mon Nov 26 2012 - 16:36:13 EST
This archive was generated by hypermail 2.3.0.
