Hi to SGE experts,
This is less a StarCluster-specific issue and more an SGE issue I encountered; I was hoping someone here can help.
My system runs a very large number of small jobs, in the tens of thousands. When I need a rushed solution to meet a deadline, I use StarCluster. However, if I have time, I run the simulations on a single 8-core machine that has SGE installed on Ubuntu 12.04. This machine is new and fast, with an SSD and a fresh install, yet I am encountering issues.
1. When I launch about 70k jobs, submitting a single new job to the queue takes a while: about a second or so, compared to a fraction of a second when the queue is empty.
2. Deleting all the jobs from the queue using qdel -u username takes a long time. It reports about 24 deletes to the screen every few seconds; at this rate it will take hours to empty the queue. It is still deleting as I write this, which is far too slow.
3. The system worked OK for a few days, yet it now has trouble with qmaster. It reports the following:
error: commlib error: got select error (connection refused)
unable to send message to qmaster using port 6444 on host 'localhost': got send error.
qmon also reported that it cannot reach qmaster. I had to restart, and then suspend and disable the queue.
Note that qstat -j currently reports:
All queues dropped because of overload or full
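To put a rough number on item 2, here is a back-of-the-envelope estimate of how long the qdel run would take. The 70k jobs and the 24 deletes per batch are what I observed; the 3 seconds per batch is only an assumption standing in for "every few seconds", and the function name is just for illustration.

```python
import math

def estimated_delete_time_hours(total_jobs, jobs_per_batch, seconds_per_batch):
    """Estimate the hours needed to drain the queue at the observed qdel rate."""
    # Number of screen updates needed, rounding up the last partial batch.
    batches = math.ceil(total_jobs / jobs_per_batch)
    return batches * seconds_per_batch / 3600.0

# 70k jobs and 24 deletes per batch come from the observations above.
# 3.0 seconds per batch is an ASSUMED value for "every few seconds".
hours = estimated_delete_time_hours(70_000, 24, 3.0)
print(f"about {hours:.1f} hours")
```

Under that assumption the delete alone takes a couple of hours, and a longer gap between batches would push it higher.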
Note that I configured the schedule interval to 2 seconds, since many of my jobs are so fast that even 2 seconds is very inefficient for them. Some jobs, however, are longer and memory-consuming, so I cannot add slots, since that would launch too many jobs at once.
Am I overloading the system with too many jobs? What is the limit on a single strong machine? How will this scale when I run this on StarCluster?
Any advice on how to efficiently handle many jobs, some of which are very short, will be appreciated. And I hope this interests the audience.
Received on Sun Mar 30 2014 - 02:37:48 EDT