sge job array stalling


From: Marc Crepeau <no email>
Date: Wed, 13 Jul 2016 16:51:36 +0000

I tried to run an SGE job array using StarCluster on a cluster of 8 c3.4xlarge instances. The program I was trying to run is a Perl program called fragScaff.pl. You start it once and it creates a job array batch script, which it then submits with a qsub command. The original process runs in the background, continuously monitoring the spawned processes for completion. The job array batch script (run_array.csh) looks like this:

#/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -V
COMMAND=$(head -n $SGE_TASK_ID ./join_default_params.r1.fragScaff/job_array.txt | tail -n 1)
$COMMAND

and the job_array.txt file looks like this:

perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,0,99 -r 1 -M 200
perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,100,199 -r 1 -M 200
perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,200,299 -r 1 -M 200
perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,300,399 -r 1 -M 200
.
.
.
perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,697800,697848 -r 1 -M 200
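To make the selection step concrete, here is a minimal sketch of how the head/tail pipeline in run_array.csh maps a task ID to one line of the command list. The function name pick_line and the three-line demo file are hypothetical stand-ins for illustration, not part of fragScaff.pl or its generated script.

```shell
#!/bin/bash
# Hypothetical helper: print line N of FILE using the same
# "head -n N | tail -n 1" pipeline that run_array.csh uses.
pick_line() {
    local n="$1" file="$2"
    head -n "$n" "$file" | tail -n 1
}

# Throwaway three-line stand-in for job_array.txt (assumed content).
demo_file=$(mktemp)
printf '%s\n' 'echo chunk 0-99' 'echo chunk 100-199' 'echo chunk 200-299' > "$demo_file"

# In a real array job SGE sets SGE_TASK_ID; it is fixed at 2 here for the demo.
task_id=2
selected=$(pick_line "$task_id" "$demo_file")
echo "$selected"

rm -f "$demo_file"
```

With this pipeline, a task ID larger than the number of lines yields the last line again, because head then passes the whole file to tail.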

The qsub command looks like this:

qsub -t 1-6780 -N FSCF_nnnnnnnn -b y -l h_vmem=20G,virtual_free=20G ./join_default_params.r1.fragScaff/run_array.csh
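Because each task ID selects one line of job_array.txt, one simple sanity check is to compare the upper bound of the -t range with the number of lines in that file. The sketch below is an assumption about how such a check could be written, not something taken from the original run; check_range and the demo file are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: compare the upper bound N of a "-t 1-N" range
# with the number of lines in the command list FILE.
check_range() {
    local file="$1" upper="$2" lines
    lines=$(wc -l < "$file")
    lines=$((lines))   # normalize any whitespace padding from wc
    if [ "$lines" -eq "$upper" ]; then
        echo "match: $lines lines, $upper tasks"
    else
        echo "mismatch: $lines lines, $upper tasks"
    fi
}

# Throwaway three-line stand-in for job_array.txt (assumed content).
demo_file=$(mktemp)
printf '%s\n' 'cmd one' 'cmd two' 'cmd three' > "$demo_file"

check_range "$demo_file" 3
check_range "$demo_file" 5

rm -f "$demo_file"
```

Note that wc -l counts newline characters, so a file whose last line lacks a trailing newline is reported one line short.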

The first problem I had was that once the job array threads started spawning, the original thread, running on the master node, would be killed, apparently by the kernel. I figured maybe there was some memory resource constraint, so I edited the StarCluster config file so that the master node would not be an execution host, following the instructions here:

http://star.mit.edu/cluster/docs/0.93.3/plugins/sge.html

Then I re-launched the cluster and tried again. This time no job array jobs ran on the master node, and the original process was not killed. However, after a fraction of the subprocesses (maybe several hundred) had spawned and completed, the job array stalled. All remaining jobs ended up in the qw state, and a qhost command showed all the nodes idle.

Any ideas what might have happened and/or how I might diagnose the problem further?
Received on Wed Jul 13 2016 - 12:52:28 EDT
