I've used Amazon's Auto Scaling for a few other projects. The most
common Auto Scaling use case is for scaling a web server cluster --
when the average CPU usage of the Auto Scaling Group goes above (say)
80%, then AWS automatically adds an instance to your cluster; on the
other hand when the load is below a certain threshold then AWS removes
an instance. When an ELB (Elastic Load Balancer) is used together with
Auto Scaling, your users and/or REST-based clients can use the single
ELB endpoint to talk to a potentially large number of instances.
In order for Auto Scaling to work for an HPC cluster (like StarCluster
running SGE), you need to define the "CloudWatch Metric Alarms" that
understand the SGE queue length. You can do it with Custom Metrics:
http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/publishingMetrics.html
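As a rough sketch of what publishing such a custom metric could look like (not something StarCluster ships): the script below counts pending SGE jobs from qstat output and pushes the count to CloudWatch. It assumes boto3 is installed on the master, that the state is the fifth column of the default qstat listing, and that any state containing "qw" counts as pending. The metric name, namespace and dimension are made up for illustration.

```python
import subprocess


def count_pending_jobs(qstat_output):
    """Count jobs whose state column contains 'qw' (queued and waiting)."""
    pending = 0
    for line in qstat_output.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID; header and separator rows do not.
        if len(fields) >= 5 and fields[0].isdigit() and "qw" in fields[4]:
            pending += 1
    return pending


def build_metric_data(pending, cluster_name):
    """Build the MetricData list for CloudWatch's PutMetricData call."""
    return [{
        "MetricName": "SGEPendingJobs",  # hypothetical metric name
        "Dimensions": [{"Name": "Cluster", "Value": cluster_name}],
        "Value": float(pending),
        "Unit": "Count",
    }]


def publish_queue_length(cluster_name, namespace="StarCluster/SGE"):
    """Run qstat on the master and push the pending-job count to CloudWatch."""
    import boto3  # imported here so the helpers above work without boto3

    output = subprocess.check_output(["qstat", "-u", "*"],
                                     universal_newlines=True)
    pending = count_pending_jobs(output)
    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace,  # hypothetical namespace
        MetricData=build_metric_data(pending, cluster_name),
    )
    return pending
```

You would run publish_queue_length() from cron on the master (say once a minute) and then point a CloudWatch alarm at the metric. Note that a pending array job shows up as a single row here, so the count is only an approximation of the real backlog.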
The harder part is that when a new instance starts (or instances -- you
can define how many Auto Scaling starts each time by setting
"scaling_adjustment" to a value higher than one), it needs to register
itself with the qmaster. Also, with the current node naming scheme
(e.g. node0029), the new instance would need to know its name when it
boots up. In theory you can get the current node count by querying AWS
for the number of instances with the security group name, but if some
nodes have been removed then the count no longer gives you the correct
name. And don't forget that you can run into race conditions when
adding nodes.
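To illustrate the naming problem, here is a minimal sketch that picks the lowest unused alias from the list of names already in the cluster instead of relying on the instance count. The function name and the 4-digit width (taken from the node0029 example above) are assumptions for illustration, not StarCluster's actual code.

```python
import re

NODE_RE = re.compile(r"^node(\d+)$")


def next_node_name(existing_names, width=4):
    """Return the lowest unused nodeNNNN alias, starting from node0001."""
    used = set()
    for name in existing_names:
        match = NODE_RE.match(name)
        if match:  # ignores non-node names such as "master"
            used.add(int(match.group(1)))
    index = 1
    while index in used:
        index += 1
    return "node{:0{w}d}".format(index, w=width)


# node0003 was removed earlier, so three nodes remain. Naming by count
# (3 + 1) would give node0004, which is already taken.
names = ["master", "node0001", "node0002", "node0004"]
print(next_node_name(names))
```

This does nothing about the race condition: two instances booting at the same time would both compute the same name, so the assignment would still have to be serialized somewhere (for example on the master).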
And finally, Auto Scaling can happily remove an instance that is running
jobs, while an idle instance is left untouched. While there are
parameters that you can set to change the instance termination
behavior, they are not flexible enough:
http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/us-termination-policy.html
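For comparison, a queue-aware balancer can make the removal decision from the scheduler's point of view. The sketch below is a simplified illustration of that idea, not the StarCluster load balancer's actual logic. The input dictionary, the function name and the min_nodes parameter are all assumptions.

```python
def pick_nodes_to_remove(running_jobs, pending_jobs, min_nodes=1):
    """Return idle nodes that are safe to terminate, or [] if work is waiting.

    running_jobs maps node name -> number of jobs currently running there.
    """
    if pending_jobs > 0:
        # Jobs are still queued, so shrinking the cluster makes no sense.
        return []
    idle = sorted(name for name, count in running_jobs.items() if count == 0)
    # Never shrink the cluster below min_nodes.
    max_removable = max(len(running_jobs) - min_nodes, 0)
    return idle[:max_removable]


running = {"node0001": 2, "node0002": 0, "node0003": 0}
print(pick_nodes_to_remove(running, pending_jobs=0))
```

The point is that only nodes with zero running jobs are ever candidates, which is the kind of rule the built-in termination policies cannot express.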
If you are interested in why the StarCluster load balancer was
developed, you can read Rajat Banerjee's master's thesis at:
http://www.hindoogle.com/thesis/BanerjeeR_Thesis0316.pdf
Rayson
==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
On Fri, Mar 28, 2014 at 2:37 PM, Dmitry Serenbrennikov
<dmitry_at_adchemy.com> wrote:
>
> This brings up something I've been thinking about. How hard would it be to rewrite the loadbalancer to work in the context of Amazon's AWS Autoscaling instead of as its own standalone piece? I am not familiar with autoscaling yet, but seems like there is a bit of an overlap in functionality. Does that sound like a good direction?
>
>
> Also, it would be great to abstract the loadbalancer's queue sensor and make SGE a plugin. Other implementations can include Celery or perhaps ipython. Just brainstorming for now.
>
>
> ________________________________
> From: Cory Dolphin <wcdolphin_at_gmail.com>
> Sent: Friday, March 28, 2014 10:56 AM
> To: Sergio Mafra
> Cc: Dmitry Serenbrennikov; starcluster_at_mit.edu
>
> Subject: Re: [StarCluster] StarCluster Plugins
>
> Is there a central repository of user-contributed plugins anywhere? Regardless, I have both a contribution and a question:
>
> 1. Has anyone experimented with using SGE profile in an IPython cluster? I wish to enable starcluster's load balancing, but since it depends on SGE's queue length, I do not believe it will behave as expected.
> 2. I had some issues with remote engines dying during a particular job, so I edited the ipcluster plugin to use supervisord to watch the processes. If anyone else finds it useful, it is on GitHub under the creative name StarClusterReliableIPCluster.
>
> Cheers,
> Cory
>
>
> On Mon, Mar 17, 2014 at 3:29 PM, Sergio Mafra <sergiohmafra_at_gmail.com> wrote:
>>
>> Good move Dmitry..
>>
>> If someone has more plugins who wants to share.. be welcomed here.
>>
>> All best,
>>
>> Sergio
>>
>>
>> On Mon, Mar 17, 2014 at 3:45 PM, Dmitry Serenbrennikov <dmitry_at_adchemy.com> wrote:
>>>
>>> Same here.
>>>
>>>
>>> Here's another plugin that I have written. It integrates Chef such that master nodes are configured as chef clients/nodes.
>>>
>>> https://gist.github.com/nitecoder/9605618
>>>
>>> This is a Chef plugin for StarCluster. It's really my first ever StarCluster plugin, plus I'm very new to Chef as well. So please take with a grain of salt. Any feedback is welcome! Especially, a few things are not quite to my liking:
>>> - Chef's authentication system seems to presume that validation.pem file is widely distributed. I'm attempting to limit the need for this file being everywhere (by using chef bootstrap). It works, but doesn't seem like the right approach. Am I just misunderstanding Chef somehow?
>>> - Currently it's not possible to tell if starcluster node is terminated or just stopped. If one deletes chef client on stop, then start doesn't work. I chose to delete the node but leave the client. But this also doesn't seem like the right approach.
>>> It's kind of clunky but seems to work. In particular, I've attempted to de-provision the node (if not the client) when cluster terminates. But I'm new to Chef and StarCluster both, so this could be totally missing the way Chef is supposed to operate.
>>>
>>>
>>> Any feedback is welcome!
>>>
>>>
>>> Thanks!
>>>
>>> -Dmitry
>>>
>>>
>>>
>>>
>>> ________________________________
>>> From: starcluster-bounces_at_mit.edu <starcluster-bounces_at_mit.edu> on behalf of Jacob Barhak <jacob.barhak_at_gmail.com>
>>> Sent: Sunday, March 16, 2014 9:21 PM
>>> To: Sergio Mafra
>>> Cc: starcluster_at_mit.edu
>>> Subject: Re: [StarCluster] StarCluster Plugins
>>>
>>> Thanks Sergio,
>>>
>>> You have a second there.
>>>
>>> People were very helpful to me in the past when I asked questions and they suggested plugins that solved my problems. However, I still do not understand too much about them to the point I can write one on my own, and I saw others in the list that ask similar questions about plugins.
>>>
>>> The real question is if anyone who understands enough is willing to take this effort of a centralized repository?
>>>
>>> If someone does choose to invest in this it will help for sure and may grow to be a useful tool set.
>>>
>>> Jacob
>>>
>>>
>>>
>>> On Sat, Mar 15, 2014 at 3:08 PM, Sergio Mafra <sergiohmafra_at_gmail.com> wrote:
>>>>
>>>> Hi fellows,
>>>>
>>>> I've got no doubt that StarCluster is a fantastic tool. It helps lots of people create and manage an array of clusters.
>>>> But there is more: the plugins. They can help us improve what is already great.
>>>> The problem is knowing how to develop and find them. They are scattered across the internet.
>>>> I think that if someone (loving soul) creates a catalog of plugins, starting with "How to develop a StarCluster plugin for dummies"... this should be something. Better if Justin and Rayson could enhance StarCluster's site (http://star.mit.edu/cluster/docs/latest/manual/plugins.html#plugin-system) to be like a micro git.
>>>> Tell me what you think.
>>>>
>>>> Sergio
>>>>
>>>> _______________________________________________
>>>> StarCluster mailing list
>>>> StarCluster_at_mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>
>>>
>>
>>
>>
>
>
>
Received on Fri Mar 28 2014 - 15:25:34 EDT