StarCluster - Mailing List Archive

sge job array stalling

From: Marc Crepeau <no email>
Date: Thu, 14 Jul 2016 22:04:09 +0000

Well, I did a little investigating and learned a bit more about what’s going on. First, it turns out that it’s necessary to tell SGE how much memory each node has, and to make memory a “consumable resource”.

I did this using these commands:

qconf -rattr exechost complex_values slots=16,num_proc=16,h_vmem=31497297920 master
qconf -rattr exechost complex_values slots=16,num_proc=16,h_vmem=31497297920 node001
qconf -rattr exechost complex_values slots=16,num_proc=16,h_vmem=31497297920 node002
qconf -rattr exechost complex_values slots=16,num_proc=16,h_vmem=31497297920 node003
qconf -rattr exechost complex_values slots=16,num_proc=16,h_vmem=31497297920 node004
qconf -rattr exechost complex_values slots=16,num_proc=16,h_vmem=31497297920 node005
qconf -rattr exechost complex_values slots=16,num_proc=16,h_vmem=31497297920 node006
qconf -rattr exechost complex_values slots=16,num_proc=16,h_vmem=31497297920 node007
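
(In case it’s useful, the same thing can presumably be done with a small loop instead of eight separate commands; this is just a sketch using the host names and byte count from my cluster:)

for host in master node001 node002 node003 node004 node005 node006 node007; do
    qconf -rattr exechost complex_values slots=16,num_proc=16,h_vmem=31497297920 $host
done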

and by using the qconf -mc command to add the second “YES” (the consumable flag) in the following line of the complex configuration:

h_vmem h_vmem MEMORY <= YES YES 2g 0
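
(For anyone not familiar with that line, the columns in the qconf -mc configuration are, as far as I understand them, name, shortcut, type, relop, requestable, consumable, default and urgency, so the second “YES” is the consumable flag:)

#name    shortcut  type    relop  requestable  consumable  default  urgency
h_vmem   h_vmem    MEMORY  <=     YES          YES         2g       0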

After that, SGE and the kernel stopped killing my processes, so that was good. I was able to run processes on all the slave nodes plus the master.

But then the job array stalled out again. I used qstat -j and qacct -j to investigate why. It turns out that the disk space was full. That was a surprise to me because the c3.4xlarge instances are supposed to have 160 GB of storage, which should be more than enough. It turns out they do have 160 GB, but in a different partition than the one my job array was using. This is what I got with a df -h command:

Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      9.9G  9.4G   12M 100% /
udev             15G  8.0K   15G   1% /dev
tmpfs           5.9G  176K  5.9G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none             15G     0   15G   0% /run/shm
/dev/xvdaa      151G  188M  143G   1% /mnt

When I run my program from /home/sgeadmin/ it seems to use the small partition (root /). I tried to get it to use the larger /mnt partition by running the program from /mnt/sgeadmin but the job array didn’t work. The qsub command was issued but all the jobs went to qw state and stayed there. This is what qstat -j reported at that point:

job-array tasks: 1-6979:1
scheduling info: queue instance "all.q@node003" dropped because it is temporarily not available
                            queue instance "all.q@node007" dropped because it is temporarily not available
                            queue instance "all.q@node001" dropped because it is temporarily not available
                            queue instance "all.q@node005" dropped because it is temporarily not available
                            queue instance "all.q@node002" dropped because it is temporarily not available
                            queue instance "all.q@node006" dropped because it is temporarily not available
                            queue instance "all.q@master" dropped because it is temporarily not available
                            queue instance "all.q@node004" dropped because it is temporarily not available
                            All queues dropped because of overload or full
                            not all array task may be started due to 'max_aj_instances'
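
(I’m not sure yet why the queue instances are reported as “temporarily not available”; as far as I know the standard ways to dig into that are something like the following, where h_vmem is the consumable I configured above:)

qstat -f -explain a     # report why queue instances are in an alarm state
qhost -F h_vmem         # report how much of the h_vmem consumable each host has left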


How can I get the program to use the larger partition so that it doesn’t run out of disk space?
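
(One thing I’m considering, assuming it’s the fragScaff working directory that fills up the root partition, is to keep that directory on /mnt and leave a symlink in my NFS-shared home directory. The directory name below is just a placeholder, and I haven’t checked yet whether StarCluster shares /mnt across the nodes the way it shares /home:)

mkdir -p /mnt/sgeadmin/fragscaff_work                              # placeholder path on the big partition
ln -s /mnt/sgeadmin/fragscaff_work /home/sgeadmin/fragscaff_work
cd /home/sgeadmin/fragscaff_work                                   # jobs submitted with -cwd would then write under /mnt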

Below is my original message for context:

********************************************************************************************************************************************************************************************************************************************

I tried to run an SGE job array using StarCluster on a cluster of 8 c3.4xlarge instances. The program I was trying to run is a Perl program called fragScaff.pl. You start it once, and it creates a job array batch script which it then submits with a qsub command. The original process keeps running in the background, continuously monitoring the spawned jobs for completion. The job array batch script (run_array.csh) looks like this:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -V
COMMAND=$(head -n $SGE_TASK_ID ./join_default_params.r1.fragScaff/job_array.txt | tail -n 1)
$COMMAND

and the job_array.txt file looks like this:

perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,0,99 -r 1 -M 200
perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,100,199 -r 1 -M 200
perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,200,299 -r 1 -M 200
perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,300,399 -r 1 -M 200
.
.
.
perl fragScaff.pl -Q ./join_default_params.r1.fragScaff,697848,697800,697848 -r 1 -M 200
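
(For what it’s worth, I believe a single entry can be dry-run outside of SGE by setting the task id by hand, e.g. from the directory that contains join_default_params.r1.fragScaff:)

SGE_TASK_ID=3 bash ./join_default_params.r1.fragScaff/run_array.csh    # runs line 3 of job_array.txt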

The qsub command looks like this:

qsub -t 1-6780 -N FSCF_nnnnnnnn -b y -l h_vmem=20G,virtual_free=20G ./join_default_params.r1.fragScaff/run_array.csh

The first problem I had was that once the job array tasks started spawning, the original process, running on the master node, would be killed, apparently by the kernel. I figured maybe there was some memory resource constraint, so I edited the StarCluster config file so that the master node would not be an execution host, following the instructions here:

http://star.mit.edu/cluster/docs/0.93.3/plugins/sge.html

Then I re-launched the cluster and tried again. This time no job array jobs ran on the master node, and the original process was not killed. However, after a fraction of the subprocesses (maybe several hundred) had spawned and completed, the job array stalled out. All remaining jobs ended up in qw state, and a qhost command showed all the nodes idle.

Any ideas what might have happened and/or how I might diagnose the problem further?
Received on Thu Jul 14 2016 - 18:05:13 EDT