Since 'qstat -j 2' shows me
...
error reason 8: 08/22/2016 15:31:35 [1000:44925]: unable to find job file "/opt/sge6/default/spool/exec_spool_local/mynew1-node001/job_scripts/2"
error reason 9: 08/22/2016 15:31:35 [1000:44926]: unable to find job file "/opt/sge6/default/spool/exec_spool_local/mynew1-node001/job_scripts/2"
This sounds a *lot* like the race condition described at
https://confluence.si.edu/display/HPC/Job+Arrays#JobArrays-ParallelJobArrays
and
http://users.gridengine.sunsource.narkive.com/66KtbRva/sporadic-errors-in-array-tasks-with-a-pe
but adding '-b yes' doesn't seem to fix the problem. (There were no embedded SGE options in my script file.)
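For reference, the kind of submission I'm making looks roughly like this (script name and task range are placeholders, not my exact command):

qsub -b yes -t 1-100 -cwd ./myscript.sh

My understanding is that with '-b yes' the script is treated as a binary and run from its own path on the node rather than being spooled into the job_scripts directory, which is why I expected it to sidestep the missing-file race.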
Has anyone else encountered this? Found a workaround?
Also, FWIW:
less /opt/sge6/default/spool/exec_spool_local/mynew1-node001/messages
08/22/2016 15:31:36| main|mynew1-node001|E|shepherd of job 2.8 exited with exit status = 11
08/22/2016 15:31:36| main|mynew1-node001|C|exec of mailer "/bin/mail" failed: "No such file or directory"
08/22/2016 15:31:36| main|mynew1-node001|E|shepherd of job 2.9 exited with exit status = 11
Michael Cariaso
michael.cariaso_at_keygene.com
Bioinformatician
http://www.keygene.com
________________________________
From: starcluster-bounces_at_mit.edu <starcluster-bounces_at_mit.edu> on behalf of Mike Cariaso <mike.cariaso_at_keygene.com>
Sent: Tuesday, August 23, 2016 12:13 AM
To: starcluster_at_mit.edu
Subject: [StarCluster] workers go idle until a new worker is added ... ?
Using the latest version from https://github.com/datacratic (vanilla_improvements branch: https://github.com/datacratic/StarCluster/blob/vanilla_improvements/starcluster/plugins/sge.py),
I start a master node and zero workers, and put an array job into the queue. I then gradually add worker nodes. A new worker accepts as many tasks as its slots allow, but after those complete it never picks up additional work. When I add another new worker machine, it accepts some tasks and runs them successfully, but never goes back for more. Usually during this time one of the previously idle machines will also pick up some more tasks, but once those are finished it again sits waiting.
qstat -j 1.19 shows me 'unable to find job file "/opt/sge6/default/spool/exec_spool_local/mynew1-node002/job_scripts/1"'
and it's true that no such file is there. When I add a new machine, the job script appears on it, which suggests this isn't a file permission issue.
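(To double-check, I'm doing something along these lines, with the node name taken from the error above:

ssh mynew1-node002 ls /opt/sge6/default/spool/exec_spool_local/mynew1-node002/job_scripts/

just to confirm whether the script ever got spooled onto that node.)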
Some nodes remain out of action, and
starcluster addnode -x -a nodename clustername
doesn't seem to help.
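For what it's worth, the way I'm watching this is just with something like

qstat -f

which shows the slots on those nodes sitting empty while array tasks are still pending.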
Michael Cariaso
michael.cariaso_at_keygene.com
Bioinformatician
http://www.keygene.com