StarCluster - Mailing List Archive

Re: Configure Nodes to submit jobs

From: greg <no email>
Date: Wed, 1 Oct 2014 10:23:49 -0400

Ok, it looks like only /home is shared according to the docs. I
wonder why df -h isn't showing it.
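
One possible explanation, offered as an assumption rather than something confirmed in this thread: /home is a local directory on the master, which exports it, so it would only appear as an NFS mount on the compute nodes. A minimal sketch for checking this from a node such as node001 follows. The function name list_nfs_mounts is made up for illustration, and the sketch assumes a Linux host that exposes /proc/mounts.

```shell
#!/bin/sh
# Hypothetical helper: list mount points whose filesystem type is NFS.
# Reads /proc/mounts by default; pass another file to inspect a saved copy.
list_nfs_mounts() {
    mounts_file="${1:-/proc/mounts}"
    # Fields in /proc/mounts: device mountpoint fstype options dump pass
    awk '$3 == "nfs" || $3 == "nfs4" { print $2 " (from " $1 ")" }' "$mounts_file"
}

# Run this on a compute node, not on the master.
if [ -r /proc/mounts ]; then
    list_nfs_mounts
fi
```

If /home is listed on node001 and /usr/local is not, that would match the missing test file shown in the quoted message below.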

On Wed, Oct 1, 2014 at 10:20 AM, greg <margeemail_at_gmail.com> wrote:
> Thanks. So I'm a bit confused about what's mounted where but it
> appears /usr/local isn't shared?
>
> root_at_master:~# df -h
> Filesystem Size Used Avail Use% Mounted on
> /dev/xvda1 7.9G 2.8G 4.8G 37% /
> none 4.0K 0 4.0K 0% /sys/fs/cgroup
> udev 3.7G 8.0K 3.7G 1% /dev
> tmpfs 752M 196K 752M 1% /run
> none 5.0M 0 5.0M 0% /run/lock
> none 3.7G 0 3.7G 0% /run/shm
> none 100M 0 100M 0% /run/user
> /dev/xvdaa 414G 199M 393G 1% /mnt
> root_at_master:~# touch /usr/local/testfile
> root_at_master:~# ls /usr/local/
> bin etc games include lib man sbin share src testfile <-
> here's my test file
> root_at_master:~# ssh node001
> root_at_node001:~# ls /usr/local
> bin etc games include lib man sbin share src <- test file is missing!
>
> On Wed, Oct 1, 2014 at 9:56 AM, Arman Eshaghi <arman.eshaghi_at_gmail.com> wrote:
>> Please have a look at http://linux.die.net/man/1/qconf or run
>> "man qconf".
>>
>> To check whether the scripts are available on a given host, you can run
>> "df -h". The output shows which paths are mounted from
>> an external host (your master node). If the script's path is not among
>> them, you can move the script to one of the shared folders.
>>
>> All the best,
>> Arman
>>
>>
>> On Wed, Oct 1, 2014 at 5:10 PM, greg <margeemail_at_gmail.com> wrote:
>>> Thanks Chris! I'll try those debugging techniques.
>>>
>>> So running "qconf -as <nodename>" turns that node into a job submitter?
>>>
>>> -Greg
>>>
>>> On Wed, Oct 1, 2014 at 8:09 AM, Chris Dagdigian <dag_at_bioteam.net> wrote:
>>>>
>>>> 'EQW' is a combination of multiple message states (e)(q)(w). The
>>>> standard "qw" is familiar to everyone; the E indicates something bad at
>>>> the job level.
>>>>
>>>> There are multiple levels of debugging, starting with easy and getting
>>>> more cumbersome. Almost all require admin or sudo level access.
>>>>
>>>> The 1st pass debug method is to run "qstat -j <jobID>" on the job that
>>>> is in EQW state; that should provide a bit more information about what
>>>> went wrong.
>>>>
>>>> After that you look at the .e and .o STDERR/STDOUT files from the script,
>>>> if any were created.
>>>>
>>>> After that you can use sudo privs to go into
>>>> $SGE_ROOT/$SGE_CELL/spool/qmaster/ and look at the messages file. There
>>>> are also per-node messages files you can look at.
>>>>
>>>> The next level of debugging after that usually involves setting the
>>>> sge_execd parameter KEEP_ACTIVE=true which triggers a behavior where SGE
>>>> will stop deleting the temporary files associated with a job life cycle.
>>>> Those files live down in the SGE spool at location
>>>> <executionhost>/active_jobs/<jobID>/ -- and they are invaluable in
>>>> debugging nasty subtle job failures.
>>>>
>>>> EQW should be easy to troubleshoot though - it indicates a fatal error
>>>> right at the beginning of the job dispatch or execution process. No
>>>> subtle things there.
>>>>
>>>>
>>>> And if your other question was about nodes being allowed to submit jobs
>>>> -- yes, you have to configure this. It can be done during SGE install
>>>> time or any time afterwards by doing "qconf -as <nodename>" from any
>>>> account with SGE admin privs. I have no idea if StarCluster does this
>>>> automatically or not but I'd expect that it probably does. If not, it's
>>>> an easy fix.
>>>>
>>>> -Chris
>>>>
>>>>
>>>> greg wrote:
>>>>> Hi guys,
>>>>>
>>>>> I'm afraid I'm still stuck on this. Besides my original question,
>>>>> which I'm still not sure about, does anyone have any general advice
>>>>> on debugging an EQW state? The same software runs fine in our local
>>>>> cluster.
>>>>>
>>>>> thanks again,
>>>>>
>>>>> Greg
>>> _______________________________________________
>>> StarCluster mailing list
>>> StarCluster_at_mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/starcluster
Received on Wed Oct 01 2014 - 10:24:48 EDT
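
The first debugging passes Chris describes in the thread can be sketched as two small shell helpers. This is only an illustration under stated assumptions: the function names eqw_reason and sge_message_files are hypothetical, the "error reason" prefix is assumed to be how "qstat -j" labels the failure line, and the per-node messages path is assumed to sit beside the qmaster spool directory, which depends on how spooling was configured.

```shell
#!/bin/sh
# Hypothetical helper: keep only the "error reason" lines from
# `qstat -j <jobID>` output supplied on stdin.
eqw_reason() {
    grep -i '^error reason'
}

# Hypothetical helper: print the messages files worth reading for a host.
# The qmaster path comes from the thread; the per-node path is an assumption.
sge_message_files() {
    host="$1"
    root="${SGE_ROOT:?SGE_ROOT is not set}"
    cell="${SGE_CELL:-default}"
    echo "$root/$cell/spool/qmaster/messages"
    echo "$root/$cell/spool/$host/messages"
}
```

Typical use would be "qstat -j <jobID> | eqw_reason" for the first pass, then "sge_message_files node001" to get the files to read with sudo.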