StarCluster - Mailing List Archive

Re: Configure Nodes to submit jobs

From: greg <no email>
Date: Wed, 1 Oct 2014 09:38:56 -0400

Thanks Jennifer! Being completely new to StarCluster, how can I
check that my scripts are available to all nodes?

-Greg

On Wed, Oct 1, 2014 at 8:26 AM, Jennifer Staab <jstaab_at_cs.unc.edu> wrote:
> Greg -
>
> Maybe check that your software and scripts are available to all nodes. I
> have had StarCluster put a bunch of jobs into Eqw when I accidentally didn't
> have all the software and script components either in a directory that was
> NFS-shared to every node of my cluster or installed individually on each
> node of the cluster.
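
One way to run that check from the master node is sketched below. It assumes passwordless ssh between the nodes. The names check_on_hosts and RUN_ON and the example script path are illustrative, not part of StarCluster or SGE.

```shell
#!/bin/sh
# Command used to reach a host. Defaults to ssh; this hook is an
# illustrative assumption so the check can be pointed at another transport.
RUN_ON=${RUN_ON:-ssh}

# check_on_hosts PATH HOST...
# Reports, per host, whether PATH exists and is readable there.
# Returns non-zero if the path is missing on at least one host.
check_on_hosts() {
    path=$1; shift
    missing=0
    for host in "$@"; do
        # </dev/null keeps ssh from swallowing the caller's stdin
        if $RUN_ON "$host" test -r "$path" </dev/null; then
            echo "OK $host"
        else
            echo "MISSING $host"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Example, run on the master (qconf -sel lists the execution hosts;
# the script path is a placeholder):
#   check_on_hosts /path/to/myscript.sh $(qconf -sel)
```

Any host reported as MISSING either lacks the NFS mount or never had the file copied to it.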
>
> And, as Chris just stated, use:
> qstat -j <JOBID>               (gives complete info on that job)
> qstat -j <JOBID> | grep error  (shows only the error lines for that job)
>
> Once you have the error debugged, you can use:
> qmod -cj <JOBID>  (clears the job's error state, e.g. Eqw, so the job can be rescheduled)
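
The two qstat steps can be combined into a small loop over every job in Eqw. This is only a sketch: eqw_job_ids and error_reasons are hypothetical helper names, and the first one assumes the default qstat column layout, where the job state is the fifth field.

```shell
#!/bin/sh
# Print the IDs of jobs in Eqw state from plain `qstat` output on stdin.
# Assumes the default column layout (job-ID, prior, name, user, state, ...).
eqw_job_ids() {
    awk '$5 == "Eqw" { print $1 }'
}

# Keep only the "error reason" lines of `qstat -j <JOBID>` output on stdin.
error_reasons() {
    grep '^error reason'
}

# Typical use on a live cluster:
#   for id in $(qstat -u '*' | eqw_job_ids); do
#       echo "== job $id"
#       qstat -j "$id" | error_reasons
#   done
#   qmod -cj <JOBID>    # only after the underlying cause is fixed
```

Clearing the error state before fixing the cause usually just sends the job back into Eqw on the next dispatch attempt.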
>
> Good Luck.
>
> -Jennifer
>
>
>
> On 10/1/14 8:09 AM, Chris Dagdigian wrote:
>>
>> 'EQW' is a combination of multiple job state flags: (E)(q)(w). The
>> standard "qw" (queued, waiting) is familiar to everyone; the E indicates
>> an error at the job level.
>>
>> There are multiple levels of debugging, starting with easy and getting
>> more cumbersome. Almost all require admin or sudo level access.
>>
>> The 1st pass debug method is to run "qstat -j <jobID>" on the job that
>> is in EQW state; that should provide a bit more information about what
>> went wrong.
>>
>> After that, you look at the .e and .o (STDERR/STDOUT) files from the
>> script, if any were created.
>>
>> After that, you can use sudo privs to go into
>> $SGE_ROOT/$SGE_CELL/spool/qmaster/ and look at the messages file. There
>> are also per-node messages files you can look at.
>>
>> The next level of debugging after that usually involves setting the
>> execd_params option KEEP_ACTIVE=true, which makes SGE stop deleting the
>> temporary files associated with a job's life cycle. Those files live
>> down in the SGE spool at <executionhost>/active_jobs/<jobID>/ and they
>> are invaluable in debugging nasty, subtle job failures.
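
A sketch of turning this on and confirming that it took effect. keep_active_enabled is a hypothetical helper; qconf -mconf and qconf -sconf are the standard commands for editing and showing the cluster configuration.

```shell
#!/bin/sh
# Succeeds if the configuration on stdin (as printed by `qconf -sconf`)
# has KEEP_ACTIVE=true somewhere in its execd_params line.
keep_active_enabled() {
    grep -Eiq '^execd_params[[:space:]].*KEEP_ACTIVE=true'
}

# On a live cluster, from an SGE admin account:
#   qconf -mconf      # opens the global configuration in $EDITOR; set
#                     #   execd_params   KEEP_ACTIVE=true
#   qconf -sconf | keep_active_enabled && echo "KEEP_ACTIVE is on"
```

Remember to set it back once the problem is found, since the spool directories are no longer cleaned up while it is enabled.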
>>
>> EQW should be easy to troubleshoot though - it indicates a fatal error
>> right at the beginning of the job dispatch or execution process. Nothing
>> subtle there.
>>
>>
>> And if your other question was about nodes being allowed to submit jobs
>> -- yes, you have to configure this. It can be done at SGE install
>> time or any time afterwards by running "qconf -as <nodename>" from any
>> account with SGE admin privs. I have no idea whether StarCluster does this
>> automatically, but I'd expect that it probably does. If not, it's
>> an easy fix.
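
To see whether any node still needs that step, the current submit-host list (qconf -ss) can be compared against the execution hosts (qconf -sel). A minimal sketch; missing_submit_hosts is a hypothetical helper name.

```shell
#!/bin/sh
# missing_submit_hosts HOST...
# Reads the current submit-host list on stdin (one host per line, as
# printed by `qconf -ss`) and prints each HOST argument that is not in it.
missing_submit_hosts() {
    current=$(cat)
    for host in "$@"; do
        printf '%s\n' "$current" | grep -Fxq -- "$host" || printf '%s\n' "$host"
    done
}

# On a live cluster, from an account with SGE admin privileges:
#   for h in $(qconf -ss | missing_submit_hosts $(qconf -sel)); do
#       qconf -as "$h"
#   done
```

If the loop prints nothing, every execution host is already a submit host.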
>>
>> -Chris
>>
>>
>> greg wrote:
>>>
>>> Hi guys,
>>>
>>> I'm afraid I'm still stuck on this. Besides my original question,
>>> which I'm still not sure about, does anyone have any general advice
>>> on debugging an EQW state? The same software runs fine on our local
>>> cluster.
>>>
>>> thanks again,
>>>
>>> Greg
>>
>> _______________________________________________
>> StarCluster mailing list
>> StarCluster_at_mit.edu
>> http://mailman.mit.edu/mailman/listinfo/starcluster
>
>
Received on Wed Oct 01 2014 - 09:39:00 EDT
This archive was generated by hypermail 2.3.0.