StarCluster - Mailing List Archive

Re: Large cluster (125 nodes) launch failure

From: Kyeong Soo (Joseph) Kim <kyeongsoo.kim_at_gmail.com>
Date: Sat, 26 Mar 2011 11:33:40 +0000

Matthew,

Thanks for the great information.
I am interested in this topic, of course (see my response to Rajat).

By the way, before diving into the world of StarCluster, I was
seriously considering Condor from the University of Wisconsin as well,
but I was not sure whether Condor is suitable for parallel jobs based
on MPI and so on, which was the main reason I chose StarCluster.

For my current network simulation experiments, about 90% of them are
just traditional sequential tasks, but still 10% or so (for really
large configurations) need parallel programming based on MPI.

Can you comment on your experience using the Celery queue for parallel
tasks?

Regards,
Joseph

On Fri, Mar 25, 2011 at 5:06 PM, Matthew Summers
<matthew.summers_at_liquidustech.com> wrote:
> On Fri, Mar 25, 2011 at 3:40 PM, Kyeong Soo (Joseph) Kim
> <kyeongsoo.kim_at_gmail.com> wrote:
> <--snip-->
>
>> In this regard, it would be really great if we could implement a
>> simpler batch queueing system on StarCluster itself and thereby
>> do away with SGE. For instance, implementing load balancing would
>> be much simpler and better and, if needed, it could terminate all
>> of the instances completely.
>
> <--snip-->
>
> For one, I am really interested in an alternative to SGE, given its
> uncertain status after the Oracle purchase of Sun. Having discussed
> this at some length with Mr. Riley during previous conversations, it's
> entirely likely that a good solution is possible using existing open
> source software, at least to a large extent. However, some components
> of SGE will be time-consuming to implement in comparable fashion, for
> example the internal SGE load balancing mechanism.
>
> In any event, we, at my company, have had an excellent experience
> using the python-based Celery Distributed Task Queue [1] on local
> development and production clusters. One potential downside for this
> is, in our case, the requirement of the Erlang-based RabbitMQ for
> message queuing (using AMQP). Celery itself does not demand RabbitMQ,
> but in our experience it worked the best for our use cases. It's
> entirely possible to replace RabbitMQ with something like ZeroMQ;
> however, this could require considerable effort.
>
> If anyone is interested in exploring this further, and in greater
> detail with the goal of an implementation, let's talk. Perhaps a new
> thread would be appropriate. I certainly think this could be highly
> beneficial and would be happy to participate.
>
> Kind Regards
>
> Matthew W. Summers
>
> [1] http://ask.github.com/celery/
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
>
Received on Sat Mar 26 2011 - 07:33:41 EDT