Hi Lyn, Hi Rayson,
You were trying to help; however, I described the problem insufficiently.
I just isolated the issue. It seems the problem is not the number of jobs.
It is the use of job dependencies with a wildcard in the job name passed to
-hold_jid. For example, launching 10k jobs named A0000 to A9999 and then 10k
jobs named B0000 to B9999, each depending on "A*", makes the system
unstable. However, if each job in B depends on "A0000" and several other A
jobs named explicitly, without the wildcard character, then SGE does not
have any issues.
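For reference, this is roughly how the problematic case was submitted (a
minimal sketch; work.sh stands in for the actual job script):
# Problematic form: every B job holds on the wildcard pattern "A*".
for i in $(seq -f "%04g" 0 9999); do qsub -N A$i work.sh; done
for i in $(seq -f "%04g" 0 9999); do qsub -N B$i -hold_jid "A*" work.sh; done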
I assume that when deleting a job, SGE tries to recalculate the dependencies
of every job that depends on it. In the example above, each of the 10k A
jobs deleted forces the "A*" pattern to be re-matched for each of the 10k B
jobs, on the order of 100 million recalculations overall. This is a fair
assumption considering the tests I made. However, I did not look at the SGE
code to confirm my assumption is correct. Perhaps one of the experts can
verify this?
In any case, I recommend avoiding wildcard dependencies on job names unless
absolutely necessary. In the future, perhaps qdel * could remove jobs in
dependency-tree order, or recalculate dependencies once after all deletions
have taken place.
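The safer alternative, which caused no issues in my tests, is to give each
B job an explicit comma-separated list of the A jobs it actually needs, for
example (job names and work.sh are placeholders as before):
qsub -N B0000 -hold_jid A0000,A0001,A0002 work.sh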
I hope this will help someone avoid pitfalls in the future.
On Feb 23, 2015 3:33 PM, "Jacob Barhak" <jacob.barhak_at_gmail.com> wrote:
> Thanks Lyn, Thanks Rayson,
> For those who may be reading this in the future looking for a solution,
> here is a partial one.
> It does not reduce the time needed to delete many jobs, yet it prevents
> the system from crashing repeatedly during the attempt to delete them.
> Here is what I did:
> while sleep 600; do timeout 480 qdel -u UserName ; done
> Just replace 600 with a safe period for the system to recover, 480 with the
> approximate time the system runs before memory is exhausted, and UserName
> with your user. Those numbers will change from system to system.
> This will delete a chunk at a time without crashing the system. I am still
> waiting after about 9 hours, yet I did not need to restart the server due
> to SGE crashing.
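> A possible variant (an untested sketch, assuming qstat's default two-line
> header and GNU xargs) is to delete jobs in fixed-size batches by job id:
> # Repeatedly delete up to 1000 of the user's jobs until none remain.
> while qstat -u UserName | grep -q .; do
>   qstat -u UserName | awk 'NR>2 {print $1}' | head -1000 | xargs -r qdel
>   sleep 60
> done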
> I hope this solution will help others.
> On Feb 23, 2015 2:03 AM, "Rayson Ho" <raysonlogin_at_gmail.com> wrote:
>> Is your local cluster using classic or BerkeleyDB spooling? If it is
>> classic over NFS, then qdel can be very slow.
>> One quick workaround is to hide the job spooling files manually: just
>> move the spooled jobs from $SGE_ROOT/$SGE_CELL/spool/qmaster/jobs to a
>> private backup directory.
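>> Something along these lines (a sketch; /path/to/backup is a placeholder):
>> mkdir -p /path/to/backup
>> mv $SGE_ROOT/$SGE_CELL/spool/qmaster/jobs/* /path/to/backup/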
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> On Sun, Feb 22, 2015 at 8:31 PM, Jacob Barhak <jacob.barhak_at_gmail.com> wrote:
>>> Hi to SGE experts,
>>> This is an SGE question rather than a StarCluster one. I am actually
>>> having this issue on a local cluster. And I did raise this issue a while
>>> ago, so sorry for the repetition. If you know of another list that can
>>> help, please direct me there.
>>> The qdel command does not respond well to a large number of jobs. More
>>> than 100k jobs makes things intolerable. It takes a long time and
>>> consumes too much memory when trying to delete all the jobs.
>>> Is there a shortcut someone is aware of to clear the entire queue without
>>> waiting for many hours or the server running out of memory?
>>> Will removing the StarCluster server and reinstalling it work? If so,
>>> how can the long configuration be bypassed? Are there a few files that
>>> can do the trick if handled properly?
>>> I hope someone has a quick solution.