StarCluster Elastic Load Balancer

This load balancer grows and shrinks a Sun Grid Engine cluster according to the length of the cluster’s job queue. When the cluster is heavily loaded and processing a long job queue, ELB can gradually add more nodes, up to the specified max_nodes, to distribute the work and improve throughput. When the queue becomes empty, ELB can remove idle nodes in order to save money. The cluster will shrink down to one node (the master), terminating all of the other nodes as they become idle.

Goals

  • To increase the size of the cluster up to max_nodes when there is a large queue of waiting jobs
  • To decrease the size of the cluster down to 1 when there are no jobs waiting, to save money
  • To elastically balance the cluster’s load deterministically, predictably, and slowly.

Usage

To use the Elastic Load Balancer, you can start it from the command line:

starcluster loadbalance cluster_tag

or

starcluster bal cluster_tag

This will start the load balancer in an infinite loop. It can be terminated by pressing CTRL-C.

Configuration

At this time, all of the load balancer's parameters are defined, with descriptions, in the file starcluster/balancer/sge/__init__.py. Before release, these configuration options will be moved to the standard StarCluster configuration file.
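
As a rough illustration, the settings amount to a handful of module-level values along the following lines. Only max_nodes and add_nodes_per_iteration are named elsewhere on this page; the other identifiers, and the exact defaults shown, are assumptions made for the example rather than the module's actual attribute names.

    # Illustrative sketch only -- not the real contents of
    # starcluster/balancer/sge/__init__.py.
    polling_interval = 60                 # seconds between statistics polls
    max_nodes = 20                        # never grow the cluster beyond this (value is an assumption)
    min_nodes = 1                         # never shrink below the master
    add_nodes_per_iteration = 1           # nodes added per polling loop
    longest_allowed_queue_time = 15 * 60  # seconds a 'qw' job may wait before the cluster grows
    kill_after = 45                       # minutes past the hour before an idle node may be removed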

Capabilities

The load balancer runs a polling loop whose interval, 60 seconds by default, is set in its configuration. On each pass, the load balancer connects to the cluster, obtains statistics from Sun Grid Engine, and decides whether to grow or shrink the cluster. ELB considers only the queue length and the set of active machines.

ELB does not examine the load on individual hosts. If it were possible to migrate jobs from an overloaded host to an idle host, it would make sense to track each host's system load; however, since job migration is outside the scope of this project and of StarCluster in general, system load is not considered.
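
The overall control flow can be pictured as a simple loop. The sketch below is illustrative only: the callables it takes (get_stats, add_node, remove_idle_nodes) are stand-ins for the pieces that talk to SGE and Amazon, not StarCluster's actual API, and the add/remove tests are simplified versions of the criteria spelled out in the following sections.

    import time

    def balance_loop(get_stats, add_node, remove_idle_nodes, interval=60):
        """Illustrative outline of a balancing run (stop with CTRL-C)."""
        while True:
            stats = get_stats()        # snapshot: queue length and active machines
            if stats["queued_jobs"] and stats["nodes"] < stats["max_nodes"]:
                add_node()             # grow toward max_nodes
            elif not stats["queued_jobs"]:
                remove_idle_nodes()    # shrink toward the master only
            time.sleep(interval)       # default polling interval: 60 seconds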

This diagram illustrates the decisions that ELB will make in each loop:

../_images/balancer_decision_diagram.jpg

Operation

As mentioned above, the load balancer loops every 60 seconds, collecting statistics and using them to decide when to add or remove nodes.

Criteria for Adding a Node

A node will be added when all of the following criteria have been met:

  1. There are jobs in the queued waiting (SGE's moniker is 'qw') state

  2. The longest queued job has been waiting for more than 15 minutes

  3. The number of nodes does not meet or exceed the maximum number of nodes set in the configuration file

A user can set the number of nodes to be added per iteration. For instance, to add one node per iteration, which is the standard and recommended practice, set the add_nodes_per_iteration parameter to 1. Setting it to 2 adds two nodes per iteration, growing the cluster at a faster rate and consequently incurring higher charges from Amazon.
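
Expressed as code, the three criteria might look roughly like the sketch below. The function and parameter names are assumptions made for the example; only the 15-minute threshold and the max_nodes ceiling come from the text above.

    from datetime import datetime, timedelta

    def should_add_node(qw_submit_times, node_count, max_nodes,
                        now=None, max_wait=timedelta(minutes=15)):
        """qw_submit_times: submit times (datetimes) of jobs currently in 'qw'."""
        now = now or datetime.utcnow()
        if not qw_submit_times:                        # criterion 1: jobs are waiting
            return False
        if now - min(qw_submit_times) <= max_wait:     # criterion 2: longest wait > 15 minutes
            return False
        return node_count < max_nodes                  # criterion 3: still below the maximum

When this returns True, the balancer would request add_nodes_per_iteration new nodes (one per iteration by default, as described above).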

Criteria for Removing a Node

A node will be removed when all of the following criteria have been met:

  1. No jobs are in the queued waiting ('qw') state
  2. The node in question is idle, meaning it is not running an SGE job
  3. The node in question is not the master node
  4. The node in question has been up for more than 45 minutes past the hour.

Each node in the cluster will be analyzed in turn, and any and all nodes meeting the above criteria will be terminated in that polling loop. The entire cluster need not be idle for a node to be terminated: If Node001 is working on a job, but Node002 is idle and there are no queued waiting jobs, Node002 is a candidate for termination.
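
Expressed as code, the per-node check might look roughly like the sketch below; the names and data shapes are assumptions made for the example, not StarCluster's internals, and criterion 4 is covered in the next subsection.

    def nodes_to_remove(nodes, queued_jobs, past_45_minutes):
        """nodes: (alias, is_master, running_jobs) tuples; queued_jobs: number of
        'qw' jobs; past_45_minutes: callable implementing criterion 4."""
        if queued_jobs:                                # criterion 1: queue must be empty
            return []
        return [alias for alias, is_master, running_jobs in nodes
                if not is_master                       # criterion 3: never the master
                and not running_jobs                   # criterion 2: node is idle
                and past_45_minutes(alias)]            # criterion 4: billing-hour rule

In the example above, with Node001 busy, Node002 idle, and an empty queue, only Node002 would come back as a termination candidate.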

The 45 Minutes Past the Hour Rule

Since Amazon charges by the hour, you have already paid for a full hour of server time, and it would be wasteful to turn a node off the moment it becomes idle. By keeping that node up until 45 minutes of its hour have passed, we allow it to take on as much work as possible from the queue and use 75% of the hour you have already paid for.
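
In code, the check amounts to looking at how many minutes a node is into its current billing hour, along the lines of the sketch below; the function name and structure are assumptions for the example, not StarCluster's implementation.

    from datetime import datetime

    def minutes_past_the_hour(launch_time, now=None):
        """How far (in minutes) a node is into its current, already paid-for hour."""
        now = now or datetime.utcnow()
        uptime_minutes = (now - launch_time).total_seconds() // 60
        return uptime_minutes % 60   # a node may be removed only when this exceeds 45

For example, a node launched 1 hour and 50 minutes ago is 50 minutes past the hour and may be removed; one launched 1 hour and 10 minutes ago is kept, so it can soak up more work from the queue.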

Leaving a node up for this amount of time also increases the stability of the cluster. It is detrimental to the cluster and wasteful to be continuously adding and removing nodes.

The Process of Adding a Node

Adding a new node is a multi-stage process:

  1. Use the cluster class to start up a new node of the same instance type and AMI as the other slave nodes in the cluster.
  2. Wait for that node to come up. Name it with the next node number past the highest one in use: if Node001, Node003, and Node005 are started, the next node will be Node006.
  3. Set up an /etc/hosts file on each node in the cluster, mapping the new node name to its IP address.
  4. Create a cluster user account and cluster group on the new node.
  5. Set up the /etc/exports file, creating the NFS shares for /home and the SGE directory on the master, and then run exportfs so the shares are available to the slave nodes.
  6. Mount the NFS shares on the new node.
  7. Configure SGE: inform the master of the new host's address, inform the new host of the master, and execute the SGE commands to establish communications.
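
Put together, the sequence might be sketched as follows. The helper names (launch_slave, wait_for_ssh, run, and so on), the user and group names, and the exact shell commands are assumptions made for the illustration, not StarCluster's actual code; only the ordering of the steps comes from the list above.

    def next_alias(existing_aliases):
        """Step 2's naming rule: one past the highest number in use,
        e.g. ['Node001', 'Node003', 'Node005'] -> 'Node006'."""
        highest = max(int(a.replace("Node", "")) for a in existing_aliases)
        return "Node%03d" % (highest + 1)

    def add_node_outline(cluster, run):
        """Condensed outline of the seven steps; run(host, cmd) executes a
        shell command on a node, and every attribute here is hypothetical."""
        new_node = cluster.launch_slave()                  # 1. same instance type/AMI as the slaves
        cluster.wait_for_ssh(new_node)                     # 2. wait for boot; alias from next_alias()
        entry = "%s %s" % (new_node.ip, new_node.alias)
        for node in cluster.nodes:                         # 3. map the new name on every node
            run(node, "echo '%s' >> /etc/hosts" % entry)
        run(new_node, "groupadd cluster && useradd -g cluster -m cluster")  # 4. user and group
        run(cluster.master, "exportfs -a")                 # 5. re-export /home and the SGE share
        run(new_node, "mount -t nfs master:/home /home")   # 6. mount the NFS shares
        run(cluster.master, "qconf -ah %s" % new_node.alias)  # 7. tell SGE about the new host...
        # ...then start the SGE execution daemon on new_node so it can accept jobs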

The Process of Removing a Node

Removing a node is also a multi-stage process:

  1. Remove the node from SGE, so that no jobs can be sent to the node while it is in a transition period.
  2. Remove the node from the /etc/hosts file on other cluster machines.
  3. Remove the master's NFS export for the soon-to-be-terminated node, and call exportfs to cut it off.
  4. Terminate the node, sending a terminate command to Amazon through the Boto library.
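
As with adding a node, the removal sequence can be sketched as follows. The helper and attribute names and the specific shell commands are assumptions made for the illustration, not StarCluster's actual code; only the ordering of the four steps comes from the list above.

    def remove_node_outline(cluster, node, run):
        """Condensed outline of the four steps; run(host, cmd) executes a
        shell command on a node, and every attribute here is hypothetical."""
        # 1. Take the host out of SGE first so no new jobs land on it.
        run(cluster.master, "qconf -de %s" % node.alias)
        # 2. Drop its entry from /etc/hosts on every remaining machine.
        for other in cluster.nodes:
            if other is not node:
                run(other, "sed -i '/%s/d' /etc/hosts" % node.alias)
        # 3. Stop exporting the NFS shares to the departing node.
        run(cluster.master, "exportfs -u %s:/home" % node.alias)
        # 4. Terminate the instance through boto.
        node.instance.terminate()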

Because the node is removed from SGE first, and SGE appears to take about 15 seconds between a qsub command and a node beginning execution of a job, it is very unlikely that a job will be started on a host as it is going down. There is, however, a very small window of time within which this could happen.
