StarCluster - Mailing List Archive

Re: Configure Nodes to submit jobs

From: greg <no email>
Date: Wed, 1 Oct 2014 13:48:39 -0400

Hi everyone,

I found the bug. Apparently a library I installed in Python is only
available on the master node. What's a good way to install Python
libraries so they're available on all nodes? I guess virtualenv, but I'm
hoping for something simpler :-)

-Greg
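
One brute-force possibility, short of virtualenv, is to run the same pip
command on every node over ssh. What follows is only a minimal sketch, not a
StarCluster feature: the function name install_everywhere and the DRY_RUN
variable are made up for illustration, and it assumes passwordless ssh from
the master to each node and a working pip on every node. Node names are
passed in explicitly rather than guessed.

```shell
#!/bin/sh
# Hypothetical helper (not part of StarCluster or SGE): run the same
# "pip install" on every node named on the command line.
# Assumes passwordless ssh to each node and pip on each node's PATH.
# DRY_RUN (a made-up variable) defaults to 1, so commands are only printed.
install_everywhere() {
    pkg=$1
    shift
    for node in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run: show what would be executed on this node.
            echo "ssh $node pip install $pkg"
        else
            ssh "$node" pip install "$pkg"
        fi
    done
}

# Example (dry run): install_everywhere numpy master node001 node002
```

With DRY_RUN left unset the function only prints one ssh command per node;
setting DRY_RUN=0 would actually run them.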

On Wed, Oct 1, 2014 at 10:33 AM, Jennifer Staab <jstaab_at_cs.unc.edu> wrote:
> For software and scripts you can log in to each node and check that the
> software is installed and that you can view/find the scripts. Also you might
> check that the user you qsub'ed the jobs with has the correct permissions to
> run the scripts and software and to write output.
>
> An easier way is to use qstat -j <JOBID>, where <JOBID> is the job ID of one
> of the jobs with EQW status. It and/or the .o/.e files you set when you
> submitted the qsub job will give you the file location to read the error
> messages. If you didn't set .o and .e files in your qsub call (using the -e
> and -o options), I believe it defaults to files named <jobname>.o<jobid>
> (for output) and <jobname>.e<jobid> (for error). I believe Chris talked about
> this in his reply. This is how I discovered my scripts weren't shared: the
> .e file indicated the scripts couldn't be found.
>
> Also, as Chris said, you can use qconf to change attributes of the SGE setup.
> I have done this before on a StarCluster cluster: log in to the master node
> and, as long as you have admin privileges, you can use the qconf command to
> change attributes of the SGE setup.
>
> Good Luck.
>
> -Jennifer
>
>
> greg <margeemail_at_gmail.com> wrote:
>
> Thanks Jennifer! Being completely new to StarCluster, how can I
> check that my scripts are available to all nodes?
>
> -Greg
>
> On Wed, Oct 1, 2014 at 8:26 AM, Jennifer Staab <jstaab_at_cs.unc.edu> wrote:
>> Greg -
>>
>> Maybe check that your software and scripts are available to all nodes. I
>> have had StarCluster throw a bunch of EQW's when I accidentally didn't have
>> all the software and script components loaded in a directory that was
>> NFS'ed for all the nodes of my cluster and/or individually loaded on all
>> nodes of the cluster.
>>
>> And as Chris just stated use:
>> qstat -j <JOBID> ==> gives complete info on that job
>> qstat -j <JOBID> | grep error (looks for errors in job)
>>
>> When you get the error debugged you can use:
>> qmod -cj <JOBID> (will clear error state and restart job - like Eqw )
>>
>> Good Luck.
>>
>> -Jennifer
>>
>>
>>
>> On 10/1/14 8:09 AM, Chris Dagdigian wrote:
>>>
>>> 'EQW' is a combination of multiple message states (e)(q)(w). The
>>> standard "qw" is familiar to everyone, the E indicates something bad at
>>> the job level.
>>>
>>> There are multiple levels of debugging, starting with easy and getting
>>> more cumbersome. Almost all require admin or sudo level access.
>>>
>>> The 1st pass debug method is to run "qstat -j <jobID>" on the job that
>>> is in EQW state; that should provide a bit more information about what
>>> went wrong.
>>>
>>> After that you look at the .e and .o STDERR/STDOUT files from the script,
>>> if any were created.
>>>
>>> After that you can use sudo privs to go into
>>> $SGE_ROOT/$SGE_CELL/spool/qmaster/ and look at the messages file; there
>>> are also per-node messages files you can look at.
>>>
>>> The next level of debugging after that usually involves setting the
>>> sge_execd parameter KEEP_ACTIVE=true, which triggers a behavior where SGE
>>> will stop deleting the temporary files associated with a job life cycle.
>>> Those files live down in the SGE spool at location
>>> <executionhost>/active_jobs/<jobID>/ -- and they are invaluable in
>>> debugging nasty subtle job failures.
>>>
>>> EQW should be easy to troubleshoot though - it indicates a fatal error
>>> right at the beginning of the job dispatch or execution process. No
>>> subtle things there.
>>>
>>>
>>> And if your other question was about nodes being allowed to submit jobs
>>> -- yes you have to configure this. It can be done during SGE install
>>> time or any time afterwards by doing "qconf -as <nodename>" from any
>>> account with SGE admin privs. I have no idea if StarCluster does this
>>> automatically or not but I'd expect that it probably does. If not, it's
>>> an easy fix.
>>>
>>> -Chris
>>>
>>>
>>> greg wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> I'm afraid I'm still stuck on this. Besides my original question,
>>>> which I'm still not sure about, does anyone have any general advice
>>>> on debugging an EQW state? The same software runs fine in our local
>>>> cluster.
>>>>
>>>> thanks again,
>>>>
>>>> Greg
>>>
>>> _______________________________________________
>>> StarCluster mailing list
>>> StarCluster_at_mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>
>>
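
The first-pass checks quoted above (qstat -j, grep for errors, then qmod -cj
once the cause is fixed) can be collected in a small shell sketch. Only
"qstat -j" and "qmod -cj" come from the thread; the function names and the
QSTAT override variable are hypothetical, added so the filter can be tried
on a machine without SGE installed.

```shell
#!/bin/sh
# Hypothetical helpers wrapping the commands quoted in this thread.
# QSTAT is a made-up override; when unset, the real qstat is used.

# Show only the lines of "qstat -j <jobid>" that mention an error.
job_errors() {
    "${QSTAT:-qstat}" -j "$1" | grep -i error
}

# Once the underlying problem is fixed, clear the job's error state
# so the scheduler can try the job again.
clear_job_error() {
    qmod -cj "$1"
}

# Example: run "job_errors 42", fix the reported cause, then run
# "clear_job_error 42" (42 is a placeholder job ID).
```

Clearing the error state before the cause is fixed will most likely just put
the job back into Eqw, so the two steps are kept as separate functions.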
Received on Wed Oct 01 2014 - 13:48:41 EDT
This archive was generated by hypermail 2.3.0.
