Re: Configure Nodes to submit jobs

From: Chris Dagdigian
Date: Wed, 01 Oct 2014 08:09:13 -0400

'EQW' is a combination of multiple job states (e)(q)(w): error, queued
and waiting. The standard "qw" is familiar to everyone; the E indicates
that something went wrong at the job level.

There are multiple levels of debugging, starting with the easy ones and
getting more cumbersome. Almost all require admin or sudo level access.

The first-pass debug method is to run "qstat -j <jobID>" on the job that
is in EQW state. That should provide a bit more information about what
went wrong.
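
As a rough sketch of that first pass, the two helpers below pick out jobs
in an error state and print only the error lines from the job details.
The function names are mine, and the parsing assumes the default qstat
layout (two header lines, state in the fifth column) and that
"qstat -j" reports failures on lines containing "error reason"; adjust
if your output looks different.

```shell
# Print the IDs of jobs whose state column contains "E" (for example Eqw).
# Assumption: default qstat layout, two header lines, state in column 5.
eqw_job_ids() {
    qstat -u "${1:-$USER}" | awk 'NR > 2 && $5 ~ /E/ { print $1 }'
}

# Print only the error-related lines from the details of one job.
# Assumption: failures show up on lines containing "error reason".
eqw_reason() {
    qstat -j "$1" 2>&1 | grep -i 'error reason'
}
```

Typical use would be "eqw_job_ids greg" followed by "eqw_reason <jobID>"
for each ID it prints.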

After that, look at the .e and .o STDERR/STDOUT files from the script,
if any were created.
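
A small sketch for locating those files, assuming the default naming of
<jobname>.o<jobID> and <jobname>.e<jobID> and that the job wrote them to
a directory you can name (the home directory by default, or the
submission directory with -cwd). The helper name is hypothetical, and
array-job files with a task suffix are not covered.

```shell
# List the stdout/stderr files that exist for a job ID in a directory.
# Assumption: default names <jobname>.o<jobID> and <jobname>.e<jobID>.
job_output_files() {
    jobid=$1
    dir=${2:-$HOME}
    for f in "$dir"/*.o"$jobid" "$dir"/*.e"$jobid"; do
        # An unmatched glob stays literal, so only print real files.
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

If it prints nothing, the job most likely failed before the script ever
started.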

After that you can use sudo privs to go into
$SGE_ROOT/$SGE_CELL/spool/qmaster/ and look at the messages file. There
are also per-node messages files you can look at.
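
To avoid reading the whole messages file, something like the following
can pull out the lines that mention one job. This is only a sketch: the
helper name is made up, the match is a plain whole-word grep on the job
ID (so it can also hit unrelated numbers), and you may need sudo to read
the file.

```shell
# Show the most recent lines in a messages file that mention a job ID.
# Defaults to the qmaster messages file; pass another path for a node.
job_messages() {
    jobid=$1
    msgfile=${2:-$SGE_ROOT/$SGE_CELL/spool/qmaster/messages}
    grep -w -- "$jobid" "$msgfile" | tail -n 20
}
```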

The next level of debugging after that usually involves setting the
sge_execd parameter KEEP_ACTIVE=true, which makes SGE stop deleting the
temporary files associated with a job's life cycle. Those files live
down in the SGE spool at <executionhost>/<jobID>/ and they are
invaluable in debugging nasty, subtle job failures.
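
For reference, KEEP_ACTIVE goes on the execd_params line of the cluster
configuration, which you edit with "qconf -mconf global" (or with a host
name instead of "global" for a single node). The check below is a sketch
with a made-up function name; it assumes the setting appears as
KEEP_ACTIVE=true or KEEP_ACTIVE=1 on that line.

```shell
# Succeed if KEEP_ACTIVE is switched on in the given configuration
# ("global" by default, or an execution host name).
# To change it, run: qconf -mconf global
# and set the line:  execd_params  KEEP_ACTIVE=true
keep_active_enabled() {
    qconf -sconf "${1:-global}" |
        grep -Eiq 'execd_params.*KEEP_ACTIVE=(true|1)'
}
```

Remember to turn it back off once you are done, since the spool will
otherwise keep growing with every job.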

EQW should be easy to troubleshoot though - it indicates a fatal error
right at the beginning of the job dispatch or execution process. Nothing
subtle there.

And if your other question was about nodes being allowed to submit jobs
-- yes, you have to configure this. It can be done at SGE install time
or any time afterwards by running "qconf -as <nodename>" from any
account with SGE admin privs. I have no idea whether StarCluster does
this automatically, but I'd expect that it probably does. If not, it's
an easy fix.
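
A quick way to check before changing anything is "qconf -ss", which
lists the current submit hosts. The wrapper below is a sketch with a
hypothetical name; it assumes the name you pass matches the form shown
by "qconf -ss" exactly (short name versus FQDN matters here).

```shell
# Add a host as a submit host unless it is already listed.
# Needs an account with SGE admin privs for the "qconf -as" step.
ensure_submit_host() {
    host=$1
    if qconf -ss | grep -qxF -- "$host"; then
        echo "$host is already a submit host"
    else
        qconf -as "$host"
    fi
}
```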


greg wrote:
> Hi guys,
> I'm afraid I'm still stuck on this. Besides my original question,
> which I'm still not sure about, does anyone have any general advice
> on debugging an EQW state? The same software runs fine in our local
> cluster.
> thanks again,
> Greg
