I'm having a couple of issues with my load balancer that I'm not sure about. First, my cluster doesn't seem to be expanding despite jobs sitting in the queue for hours (which makes me skeptical of the statistics reported by the balancer). Second, after running for a long time the load balancer seems to crash with an internal error:
!!! ERROR - InternalError: An internal error has occurred
StarCluster - (http://web.mit.edu/starcluster) (v. 0.93)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster_at_mit.edu
I'm running the balancer like this:
nohup starcluster bal rnaseq -d -i 60 -m 20 -n 2 -w 300 -s 0 -l 5000 >> loadbalancer.rnaseq.log 2>&1 &
The logged output generally looks like:
>>> Loading full job history
Cluster size: 2
Queued jobs: 2
Oldest queued job: 2012-01-25 07:48:40
Avg job duration: 489 secs
Avg job wait time: 782 secs
Last cluster modification time: 2012-01-25 06:11:23
>>> Sleeping...(looping again in 60 secs)
Neither the avg job duration nor the avg job wait time is particularly close to reality, but that could just be because not enough data has been collected yet.
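To compare the reported numbers against what I actually see over time, something like the following could pull the stats out of loadbalancer.rnaseq.log. This is only a rough sketch: the field labels are copied from the output above, but the function name, dictionary keys, and regexes are my own and not part of StarCluster.

```python
import re

# Labels are taken from the balancer output shown above; everything else
# (names, keys, patterns) is illustrative and not StarCluster code.
STAT_PATTERNS = {
    "cluster_size": re.compile(r"^Cluster size: (\d+)"),
    "queued_jobs": re.compile(r"^Queued jobs: (\d+)"),
    "avg_duration_secs": re.compile(r"^Avg job duration: (\d+) secs"),
    "avg_wait_secs": re.compile(r"^Avg job wait time: (\d+) secs"),
}


def parse_balancer_log(text):
    """Return one dict of integer stats per polling iteration in the log."""
    iterations = []
    current = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        for key, pattern in STAT_PATTERNS.items():
            match = pattern.match(line)
            if match:
                current[key] = int(match.group(1))
        # Assumption: every iteration ends with the ">>> Sleeping..." line.
        if line.startswith(">>> Sleeping") and current:
            iterations.append(current)
            current = {}
    return iterations


sample = """>>> Loading full job history
Cluster size: 2
Queued jobs: 2
Oldest queued job: 2012-01-25 07:48:40
Avg job duration: 489 secs
Avg job wait time: 782 secs
Last cluster modification time: 2012-01-25 06:11:23
>>> Sleeping...(looping again in 60 secs)
"""

for stats in parse_balancer_log(sample):
    print(stats)
```

Running it over the whole log would show whether the averages drift toward realistic values as more job history accumulates.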
I didn't think it would be relevant, because as I understand it the balancer monitors whether jobs are in the queue, not whether slots are available, but I will mention that I modified the SGE setup for the master node. It is an m1.xlarge (4-core) instance, but I used qconf to allow only 2 slots to be allocated on it. As a result, many (most) of the queued jobs, which request 4 cores, are unable to execute on the head node. This often means there are queued jobs even though 2 slots are available for execution on that head node. If the load balancer looks only at the wait time of queued jobs and does not use available slots in any way, this shouldn't matter. I assume it doesn't, and am just mentioning it here in case my assumption is wrong. :)
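To make the slot situation concrete, here is a minimal sketch of why jobs can stay queued while slots are nominally free. It assumes each job must get all of its slots on a single host (which depends on how the parallel environment is configured); the host names, slot counts, and function names are made up for illustration and have nothing to do with StarCluster's or SGE's internals.

```python
def can_run(job_slots, free_slots_per_host):
    """True if any single host has enough free slots for the job.

    Assumption: a job cannot be split across hosts.
    """
    return any(free >= job_slots for free in free_slots_per_host.values())


def stuck_jobs(queued_job_slots, free_slots_per_host):
    """Return the slot requests that no single host can currently satisfy."""
    return [
        slots
        for slots in queued_job_slots
        if not can_run(slots, free_slots_per_host)
    ]


# Hypothetical snapshot: the master is capped at 2 slots and idle,
# the one worker node is fully busy.
free_slots = {"master": 2, "node001": 0}
queued = [4, 4]  # two queued jobs, each requesting 4 cores

print("total free slots:", sum(free_slots.values()))
print("jobs that cannot start:", stuck_jobs(queued, free_slots))
```

In this snapshot both jobs stay queued even though 2 slots are free, so a balancer that counted free slots would see spare capacity while one that only looked at queue wait time would not.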
Lastly, if I manually add nodes to the cluster, the balancer shrinks it back down to the minimum as time passes, as expected.
Does anyone have any thoughts on what I'm doing wrong here?
Received on Thu Jan 26 2012 - 12:15:49 EST