StarCluster - Mailing List Archive

Problem with SGE: jobs crash after running for some time

From: Santosh Kumar Divvala
Date: Fri, 9 Nov 2012 14:48:45 -0800

Hello,

I have recently started using StarCluster to schedule my jobs on EC2.

I am using the following command to run my jobs:
qsub -N MultiMatlab -pe orte 8 -e /home/ubuntu/outputs/ -o /home/ubuntu/outputs/ -j y <job to run>

where <job to run> is a compiled MATLAB binary (generated using
http://www.mathworks.com/help/toolbox/compiler/mcc.html and run using
http://www.mathworks.com/products/compiler/mcr/index.html) that
internally uses the MATLAB 'parfor' construct
(http://www.mathworks.com/help/distcomp/parfor.html).
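
For reference, the same qsub options could also be embedded in a job
script as "#$" directives. This is only a sketch: the file name
multimatlab.sh and the helper slots_granted are hypothetical, and the
command to run is passed in as arguments rather than filled in here.
NSLOTS and JOB_ID are environment variables that SGE sets for a running
job.

```shell
#!/bin/bash
# multimatlab.sh (hypothetical file name): the qsub options from the
# command above, written as embedded "#$" directives.
#$ -N MultiMatlab
#$ -pe orte 8
#$ -e /home/ubuntu/outputs/
#$ -o /home/ubuntu/outputs/
#$ -j y

# Print the slot count that SGE exports as NSLOTS; falls back to 1 when
# the script is run outside SGE.
slots_granted() {
    echo "${NSLOTS:-1}"
}

echo "job ${JOB_ID:-unknown} on $(hostname): $(slots_granted) slot(s)"

# The command to run is passed as arguments, e.g.
#   qsub multimatlab.sh <job to run>
# With no arguments the script only prints the line above.
if [ "$#" -gt 0 ]; then
    "$@"
fi
```

Logging the host and slot count at the top of each job makes it easier
to match an output file to the node it ran on.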

Although I am able to schedule my jobs successfully, many of them crash
after running for some time on the nodes. (I have made sure that I am
not exceeding the memory/CPU resources on each node.)
I have included the qstat output from before and after job 3 on node002
crashed. (In this case, I started four identical jobs, one on each of
the four nodes.)
The output of "qstat -explain a" (also included below) reports the
error as "error: no value for 'np_load_avg' because execd is in
unknown state".
I have tried modifying the queue configuration using "qconf -mq"
(setting np_load_avg=11.75 instead of the default 1.75), but to no
avail. I have included the qconf output below.
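
For context, np_load_avg is the load average divided by the number of
processors on the host. The sketch below just does that arithmetic for
the load_avg values in the listings. It assumes 8 processors per node
(inferred from the 8 slots, not stated anywhere), and the function names
np_load and exceeds are made up for illustration.

```shell
#!/bin/sh
# np_load LOAD_AVG NUM_PROC: print the load average divided by the
# processor count, rounded to two decimals.
np_load() {
    awk -v load="$1" -v nproc="$2" 'BEGIN { printf "%.2f\n", load / nproc }'
}

# exceeds VALUE THRESHOLD: succeed (exit 0) when VALUE > THRESHOLD.
# The "+0" forces a numeric rather than a string comparison.
exceeds() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# node001 in the first listing: load_avg 35.28, assumed 8 processors.
n=$(np_load 35.28 8)
echo "np_load_avg for node001: $n"

if exceeds "$n" 1.75; then
    echo "above the default threshold of 1.75"
fi
if ! exceeds "$n" 11.75; then
    echo "below the modified threshold of 11.75"
fi
```

This only illustrates the threshold comparison. The "u" state on
node002 is a separate matter: it means the execd on that host has
stopped reporting, so no load value is available at all.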

Are there any suggestions for fixing this issue? Could you kindly let
me know?

[21:55:49Fri Nov 09~]qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@node001 BIP 0/8/8 35.28 linux-x64
      2 0.55500 MultiMatla ubuntu r 11/09/2012 21:39:31 8
---------------------------------------------------------------------------------
all.q@node002 BIP 0/8/8 31.54 linux-x64
      3 0.55500 MultiMatla ubuntu r 11/09/2012 21:40:46 8
---------------------------------------------------------------------------------
all.q@node003 BIP 0/8/8 37.81 linux-x64
      4 0.55500 MultiMatla ubuntu r 11/09/2012 21:41:01 8
---------------------------------------------------------------------------------
all.q@node004 BIP 0/8/8 21.15 linux-x64
      5 0.55500 MultiMatla ubuntu r 11/09/2012 21:41:16 8


[21:55:51Fri Nov 09~]qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@node001 BIP 0/8/8 35.07 linux-x64
      2 0.55500 MultiMatla ubuntu r 11/09/2012 21:39:31 8
---------------------------------------------------------------------------------
all.q@node002 BIP 0/8/8 -NA- linux-x64 au
      3 0.55500 MultiMatla ubuntu r 11/09/2012 21:40:46 8
---------------------------------------------------------------------------------
all.q@node003 BIP 0/8/8 38.70 linux-x64
      4 0.55500 MultiMatla ubuntu r 11/09/2012 21:41:01 8
---------------------------------------------------------------------------------
all.q@node004 BIP 0/8/8 20.34 linux-x64
      5 0.55500 MultiMatla ubuntu r 11/09/2012 21:41:16 8


[21:56:46Fri Nov 09~]qstat -explain a
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@node001 BIP 0/8/8 35.07 linux-x64
      2 0.55500 MultiMatla ubuntu r 11/09/2012 21:39:31 8
---------------------------------------------------------------------------------
all.q@node002 BIP 0/8/8 -NA- linux-x64 au
 error: no value for "np_load_avg" because execd is in unknown state
      3 0.55500 MultiMatla ubuntu r 11/09/2012 21:40:46 8
---------------------------------------------------------------------------------
all.q@node003 BIP 0/8/8 38.70 linux-x64
      4 0.55500 MultiMatla ubuntu r 11/09/2012 21:41:01 8
---------------------------------------------------------------------------------
all.q@node004 BIP 0/8/8 20.34 linux-x64
      5 0.55500 MultiMatla ubuntu r 11/09/2012 21:41:16 8
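
A listing like the one above can be filtered for queue instances whose
state contains "u" (execd not reporting). The following is a rough
sketch using awk. The function name unknown_queues is made up, and the
sample lines follow the listing in this post, with queue instances
written in the usual queue@host form.

```shell
#!/bin/sh
# unknown_queues: read "qstat -f" style text on stdin and print the name
# of every queue instance whose states column contains "u".
# Queue lines are recognised by the resv/used/tot. triple in field 3, so
# header, separator, and job lines are skipped.
unknown_queues() {
    awk '$3 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ && NF >= 6 && $6 ~ /u/ { print $1 }'
}

# Sample input based on the second qstat listing in this post.
sample='queuename qtype resv/used/tot. load_avg arch states
all.q@node001 BIP 0/8/8 35.07 linux-x64
      2 0.55500 MultiMatla ubuntu r 11/09/2012 21:39:31 8
all.q@node002 BIP 0/8/8 -NA- linux-x64 au
      3 0.55500 MultiMatla ubuntu r 11/09/2012 21:40:46 8
all.q@node003 BIP 0/8/8 38.70 linux-x64'

printf '%s\n' "$sample" | unknown_queues
```

On a live cluster the same function could be fed directly, as in
"qstat -f | unknown_queues", for example from a periodic check on the
master node.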

root@master:~# qconf -mq all.q
qname all.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=11.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make orte
rerun FALSE
slots 1,[node001=8],[node002=8],[node003=8],[node004=8]
tmpdir /tmp
shell /bin/bash
prolog NONE
epilog NONE
shell_start_mode posix_compliant

thanks,
~Santosh
Received on Fri Nov 09 2012 - 17:49:06 EST