Hello list,
Here is my thesis describing my work on Elastic Load Balancing in
StarCluster. Many thanks to Justin Riley for his help in getting this done.
The entire PDF is located at:
http://www.hindoogle.com/thesis/BanerjeeR_Thesis0316.pdf
It is 71 pages long.
Here is the abstract:
Abstract
Computing in the cloud provides companies and colleges a new way to perform
sophisticated computational tasks. Amazon.com, Inc. (Amazon) is the leading
provider of cloud infrastructure, and their solutions are used by thousands
of companies, universities, and individuals. Amazon’s service, dubbed Elastic
Compute Cloud (EC2), allows users to rent servers by the hour, so that
computing power can be increased and decreased as needed. It eliminates the
need for companies to build and maintain expensive data centers. Instead,
customers can rent servers to perform tasks as needed and turn them off
when the tasks are completed.
The ability to quickly add and remove servers enables users to scale
computing capacity in business and academic settings alike. When one
needs to perform sophisticated calculations, process large data sets, or
serve many concurrent clients, having more computing power improves
throughput and responsiveness of the system. Tasks can be completed in less
time and client requests can be served faster. In a traditional environment
where a company or university builds and maintains every server in its data
center, it takes days or even weeks to add new computing capacity, and costs
a significant amount of money. Amazon EC2 allows for instant addition and
removal of capacity, and its services are reasonably priced. A new server
can be available in as little as five minutes and can then be terminated at
any time. Server usage is billed by the hour, so users pay only for the
hours they use. This flexibility, coupled with Amazon’s low prices, is a
boon to anyone who needs to perform complex computational tasks for short or
unpredictable time periods.
The need for enormous amounts of computing power for short periods of time
is a common characteristic of scientists performing High Performance
Computing (HPC). HPC tasks are crucially important to modern science and can
range from the modeling of microscopic molecular interactions in a protein
to a nuclear weapon simulation. Before the availability of cloud computing
resources, HPC users ran their computational tasks almost exclusively on
very expensive supercomputers, which can cost in excess of $500 per hour and
must be reserved ahead of time. These supercomputers are installed at many
major universities, corporations, and research laboratories, but are not
easily accessible because of their high cost. The recent installation of
IBM’s Roadrunner supercomputer at Los Alamos National Laboratory in New
Mexico cost over $133 million.
With program decomposition techniques, scientists can break up seemingly
intractable problems into smaller, more manageable subtasks that run
independently. The problem can then be solved by an extremely powerful
supercomputer, which distributes the subtasks among its many discrete
processors. The processors are connected by high-bandwidth, low-latency
communication channels. When discrete subtasks within the larger problem
need to share information, such as the attractive charges emitted by a
molecule in a protein folding simulation, that information is sent quickly
and frequently over the inter-processor links. Protein folding simulations
are particularly well suited to parallelization because small parts of the
molecule can be simulated independently, and then the individual results can
be combined to find the ideal structure of the complete protein. Parallelized
problems like this can be solved on a powerful, expensive supercomputer or
on a cluster of cheaper, more readily available computers. Some problems
have unique requirements, like continuous single-threaded access to a
high-powered processor, and those problems are out of the scope of this
project.
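As a toy illustration of this decomposition style (a Python sketch only, not
tied to any particular HPC code), a large computation can be split into
independent fragments whose partial results are combined at the end:

    from multiprocessing import Pool

    def simulate_fragment(fragment):
        """Hypothetical stand-in for simulating one small piece of a problem."""
        return sum(fragment)  # placeholder for the real per-fragment work

    if __name__ == "__main__":
        # Decompose one large problem into independent fragments...
        fragments = [range(i * 1000, (i + 1) * 1000) for i in range(8)]
        with Pool() as pool:
            partial_results = pool.map(simulate_fragment, fragments)
        # ...then combine the partial results into the final answer.
        print(sum(partial_results))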
A project called StarCluster brings the flexibility and low cost of
clustered, cloud computing to scientists and other users of High Performance
Computers. Users can launch a cluster of Amazon EC2 servers, also called
instances, through StarCluster and have a fully configured, ready-to-use
computational cluster online in less than ten minutes, for as little as
$0.08 per instance per hour. No reservations are required and a cluster of
up to 20 machines can be launched at any time the user desires.
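To make the workflow concrete, a typical session might look like this
(a sketch only; it assumes a cluster template named mycluster has already
been defined in the StarCluster configuration file):

    $ starcluster start mycluster      # launch and configure the cluster
    $ starcluster sshmaster mycluster  # log in to the master node, submit jobs
    $ starcluster terminate mycluster  # shut everything down when finished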
StarCluster has made high performance computing in the cloud an affordable
reality to many scientists who do not have access to expensive
supercomputers. StarCluster, which is free, has approximately 500 users
worldwide, most of whom are in academia. Using StarCluster incurs no
additional fees beyond the nominal cost of hourly EC2 usage.
StarCluster is a superb product for scientists who need supercomputing
power, and who know how much time and computational resources they need to
complete the tasks.
Despite its many strengths, StarCluster does not easily adapt to changing
workloads. This type of adaptability in the cloud is called elasticity. In
StarCluster, when a cluster of instances is launched, the scientist must
specify how many instances he or she wants. Those instances are launched
together, and can only be terminated together. Instances cannot be
terminated individually, even if one instance is idle. In some situations it
is impossible to predict the workload of a cluster: a scientist may
overestimate the duration of a task, or data processing may finish early
because an unanticipated network upgrade speeds up file transfers. There
are many reasons that a task could complete faster or slower than expected.
Keeping many idle instances running indefinitely wastes both money, in fees
paid to Amazon, and energy.
This project, Elastic Load Balancing in EC2, aims to address this weakness
in StarCluster by adding an Elastic Load Balancer to the project. The
Elastic Load Balancer (ELB) will add instances to the cluster to improve job
throughput when the cluster is heavily loaded, and terminate instances when
they are idle to save money and energy. The ELB will periodically poll the
cluster, analyze its workload, decide whether the cluster needs to be
and add or remove instances. Through this process, StarCluster will maximize
job throughput at busy times and save money at idle times.
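At a high level, the balancer's main loop might look like the Python sketch
below. This is illustrative only, not StarCluster's actual implementation;
the helpers get_queued_jobs, get_idle_nodes, add_node, and remove_node are
hypothetical stand-ins for code that parses the queueing system's state
(StarCluster configures Sun Grid Engine) and manages EC2 instances:

    import time

    POLL_INTERVAL = 60  # seconds between cluster polls
    MAX_NODES = 20      # matches EC2's default per-account instance limit
    MIN_NODES = 1       # never terminate the master node

    def balance(cluster):
        """Poll the cluster, then grow or shrink it to match the workload."""
        while True:
            queued = cluster.get_queued_jobs()  # hypothetical: waiting jobs
            idle = cluster.get_idle_nodes()     # hypothetical: nodes with no jobs
            if queued and len(cluster.nodes) < MAX_NODES:
                # Jobs are waiting: add an instance to improve throughput.
                cluster.add_node()
            elif idle and len(cluster.nodes) > MIN_NODES:
                # An instance is idle: terminate it to save money and energy.
                cluster.remove_node(idle[0])
            time.sleep(POLL_INTERVAL)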
Several powerful Elastic Load Balancers are commercially available for Cloud
and EC2 software setups, but StarCluster’s ELB is the only one specifically
targeted toward the High Performance Computing domain. Existing ELB
implementations are geared toward web server and application server
environments and will be discussed in the Prior Work section. HPC workloads
have a unique computing profile: jobs are long-running and seldom serve
external clients. This profile mandates a new Elastic Load Balancing
strategy.
Any comments or questions are welcome. Best,
Rajat