StarCluster - Mailing List Archive

Re: Configure Nodes to submit jobs

From: Jennifer Staab <no email>
Date: Wed, 01 Oct 2014 08:26:13 -0400

Greg -

    Maybe check that your software and scripts are available on all
nodes. I have had StarCluster jobs land in Eqw when I accidentally
left out some of the software or script components: they need to be
either in a directory that is NFS-shared to all nodes of the cluster
or installed individually on every node.
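
One quick way to check this is to test for the file on every execution
host. The sketch below is only an illustration: it assumes passwordless
ssh between the nodes, it uses a made-up path (/home/myapp/run.sh), and
REMOTE_SH is a hypothetical override so the loop can be dry-run without
a cluster.

```shell
#!/bin/sh
# Sketch: report which hosts are missing a required file.
# REMOTE_SH is a hypothetical override (default: ssh) for dry runs.
check_path_on_hosts() {
    path=$1; shift
    missing=0
    for host in "$@"; do
        if ${REMOTE_SH:-ssh} "$host" test -e "$path" </dev/null; then
            echo "ok      $host"
        else
            echo "MISSING $host"
            missing=1
        fi
    done
    # Non-zero exit status if any host lacked the file.
    return "$missing"
}

# Only query the live cluster when SGE's qconf is on the PATH.
if command -v qconf >/dev/null 2>&1; then
    # qconf -sel prints one execution host name per line.
    # /home/myapp/run.sh is a placeholder path, not from the thread.
    check_path_on_hosts "${1:-/home/myapp/run.sh}" $(qconf -sel)
fi
```

Any host reported as MISSING is a candidate for where the job fails.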

And as Chris just stated, use:
qstat -j <JOBID>               (gives complete info on that job)
qstat -j <JOBID> | grep error  (shows only the error lines for that job)
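
To run that check over every job currently in an error state, something
like the following could work. It is only a sketch: it assumes the
default qstat listing layout (two header lines, job state in the fifth
column) and that qstat -j labels the cause with "error reason".
error_job_ids and error_reasons are helper names made up here.

```shell
#!/bin/sh
# Print the IDs of jobs whose state contains E (for example Eqw).
# Reads a qstat listing on stdin; assumes state is the fifth column.
error_job_ids() {
    awk 'NR > 2 && $5 ~ /E/ { print $1 }'
}

# Keep only the "error reason" lines of qstat -j output read from stdin.
error_reasons() {
    grep -i 'error reason'
}

# Only query the live cluster when qstat is on the PATH.
if command -v qstat >/dev/null 2>&1; then
    # -u '*' lists jobs of all users; sort -u collapses array tasks.
    for id in $(qstat -u '*' | error_job_ids | sort -u); do
        echo "== job $id =="
        qstat -j "$id" | error_reasons || echo "(no error reason recorded)"
    done
fi
```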

Once you have the error debugged, you can use:
qmod -cj <JOBID>  (clears the job's error state, e.g. Eqw, so it can be scheduled again)
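
If several jobs were stuck for the same reason, the clear can be looped.
A minimal sketch: clear_job_errors is a made-up helper, and QMOD is a
hypothetical override so the loop can be tried without touching the
scheduler.

```shell
#!/bin/sh
# Clear the error state of each job ID given as an argument.
# Run this only after the root cause has been fixed.
# QMOD is a hypothetical override (default: qmod) for dry runs.
clear_job_errors() {
    for id in "$@"; do
        ${QMOD:-qmod} -cj "$id" || echo "could not clear job $id" >&2
    done
}
```

Setting QMOD=echo before calling clear_job_errors prints the arguments
that would be passed to qmod instead of running it.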

Good Luck.

-Jennifer


On 10/1/14 8:09 AM, Chris Dagdigian wrote:
> 'EQW' is a combination of multiple job states (E)(q)(w). The standard
> "qw" is familiar to everyone; the E indicates something bad at the job
> level.
>
> There are multiple levels of debugging, starting with easy and getting
> more cumbersome. Almost all require admin or sudo level access.
>
> The 1st pass debug method is to run "qstat -j <jobID>" on the job that
> is in EQW state; that should provide a bit more information about what
> went wrong.
>
> After that you look at the .e and .o STDERR/STDOUT files from the
> script, if any were created.
>
> After that you can use sudo privs to go into
> $SGE_ROOT/$SGE_CELL/spool/qmaster/ and look at the messages file. There
> are also per-node messages files you can look at.
>
> The next level of debugging after that usually involves setting the
> sge_execd parameter KEEP_ACTIVE=true, which triggers a behavior where
> SGE will stop deleting the temporary files associated with a job life
> cycle. Those files live down in the SGE spool at location
> <executionhost>/active_jobs/<jobID>/ -- and they are invaluable in
> debugging nasty, subtle job failures.
>
> EQW should be easy to troubleshoot though - it indicates a fatal error
> right at the beginning of the job dispatch or execution process. Nothing
> subtle there.
>
>
> And if your other question was about nodes being allowed to submit jobs
> -- yes, you have to configure this. It can be done at SGE install
> time or any time afterwards by running "qconf -as <nodename>" from any
> account with SGE admin privs. I have no idea whether StarCluster does
> this automatically, but I'd expect that it probably does. If not, it's
> an easy fix.
>
> -Chris
>
>
> greg wrote:
>> Hi guys,
>>
>> I'm afraid I'm still stuck on this. Besides my original question,
>> which I'm still not sure about, does anyone have any general advice
>> on debugging an EQW state? The same software runs fine on our local
>> cluster.
>>
>> thanks again,
>>
>> Greg
Received on Wed Oct 01 2014 - 08:26:22 EDT