SGE slow with 50k+ jobs

From: Jacob Barhak <no email>
Date: Sun, 30 Mar 2014 01:37:45 -0500

Hi to SGE experts,

This is not really a StarCluster-specific issue. It is more of an SGE issue I encountered, and I was hoping someone here could help with it.

My system runs a very large number of small jobs, in the tens of thousands. When I need a rushed solution to meet a deadline, I use StarCluster. However, if I have time, I run the simulations on a single 8-core machine that has SGE installed on Ubuntu 12.04. This machine is new, fast, freshly installed, and has an SSD drive, yet I am encountering issues.

1. When I launch about 70k jobs, submitting a single new job to the queue takes a while: about a second or so, compared to fractions of a second when the queue is empty.

2. Deleting all the jobs from the queue using qdel -u username takes a long time. It reports about 24 deletions to the screen every few seconds; at this rate it will take hours to delete the entire queue. It is still deleting as I write these words. This is far too slow.

3. The system was working OK for a few days, yet it now has trouble with qmaster. It reports the following:
error: commlib error: got select error (connection refused)
unable to send message to qmaster using port 6444 on host 'localhost': got send error.
qmon also reported that it cannot reach qmaster. I had to restart and then suspend and disable the queue.
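
To put the deletion rate from item 2 in numbers, here is a rough back-of-the-envelope estimate. It is only a sketch: the 70k job count and the 24 deletions per burst come from the observations above, while the 3-second burst interval is an assumption standing in for "every few seconds".

```python
def estimate_delete_seconds(total_jobs, jobs_per_burst, seconds_per_burst):
    """Estimate the time to empty the queue at a steady deletion rate."""
    bursts = total_jobs / jobs_per_burst
    return bursts * seconds_per_burst


# 70,000 jobs and 24 deletions per burst are taken from the report above.
# The 3.0 second burst interval is an ASSUMED value for "every few seconds".
seconds = estimate_delete_seconds(70000, 24, 3.0)
hours = seconds / 3600.0
print("estimated deletion time: %.1f hours" % hours)
```

Under that assumption the estimate comes out to roughly two and a half hours, which is consistent with "hours to delete the entire queue".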

Note that qstat -j currently reports:
All queues dropped because of overload or full

Note that I configured the schedule interval to 2 seconds, since many of my jobs are so fast that even 2 seconds is very inefficient for them. Some jobs, however, are longer and memory consuming, so I cannot simply add more slots, since that would launch too many jobs at once.
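
For illustration, one general way to reduce per-job scheduling overhead for very short jobs is to group many short commands into fewer, longer batch jobs. The following is a minimal sketch of that idea, not a tested SGE setup: the command strings, the simulate.py name, and the chunk size of 100 are all hypothetical, and it assumes the short commands can safely run one after another in a single shell script.

```python
from itertools import islice


def chunk_commands(commands, chunk_size):
    """Yield lists of at most chunk_size commands, preserving order."""
    iterator = iter(commands)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def make_batch_script(chunk):
    """Return the text of a shell script that runs the commands in sequence."""
    lines = ["#!/bin/sh"]
    lines.extend(chunk)
    return "\n".join(lines) + "\n"


# HYPOTHETICAL workload: 70,000 short commands; simulate.py is a made-up name.
commands = ["python simulate.py --seed %d" % i for i in range(70000)]

# ASSUMED chunk size of 100 short commands per batch script.
scripts = [make_batch_script(chunk) for chunk in chunk_commands(commands, 100)]
print("batch scripts to submit:", len(scripts))
```

Each generated script could then be written to a file and submitted with qsub, so the scheduler would track 700 jobs instead of 70,000. The longer, memory-consuming jobs would still need to be submitted on their own.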

Am I overloading the system with too many jobs? What is the limit on a single strong machine? How will this scale when I run this on StarCluster?

Any advice on how to efficiently handle many jobs, some of which are very short, will be appreciated. And I hope this interests the audience.


