[Star cluster] error tolerance design when adding nodes


From: Jin Yu <no email>
Date: Sun, 20 Jul 2014 14:08:00 -0500

Hello,

For example, I have found that it is not uncommon for one or two
instances to be unreachable after adding 50 instances to the cluster.
The progress bar gets stuck while waiting for SSH, and I have to restart
those problematic instances manually.

I have not yet gone through the StarCluster code. Does StarCluster
already have some error-tolerance design for this situation?

Thanks!
Jin
Received on Sun Jul 20 2014 - 15:08:01 EDT

This archive was generated by hypermail 2.3.0.
