I just discovered the StarCluster system and enjoyed looking through the
site. It looks like a very well-thought-out design, and I hope to try it
sometime in the near future.
At Insilicos (www.insilicos.com) we have built a similar platform for
parallel computing in the AWS cloud. Like StarCluster, it uses MPI for
internode communication. We're curious how reliable MPI has been for you.
Occasionally, we have problems bringing up a cluster because mpdboot fails
to establish a communications ring. It seems fairly clear that this is due
to higher-than-usual latency between nodes. We take the obvious precaution
of making sure all nodes are provisioned in the same availability zone, and
have even tweaked the timeout tolerances inside mpdboot.py, but there are
still sporadic bad days when the problem occurs.
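
To make the failure mode concrete, here is a minimal sketch of one possible
mitigation: retrying the ring bring-up with an increasing delay. It is not
what Insilicos or StarCluster actually runs. The function names, the
"mpd.hosts" file name, and the attempt and delay values are all assumptions
chosen for illustration. It relies only on the standard MPD commands
mpdboot and mpdallexit being on the PATH.

```python
import subprocess
import time


def retry_with_backoff(action, attempts=4, base_delay=5.0, sleep=time.sleep):
    """Call action() until it returns True, doubling the wait after each failure.

    Returns the 1-based number of the attempt that succeeded, or 0 if
    every attempt failed. (Hypothetical helper, not part of MPD.)
    """
    for attempt in range(1, attempts + 1):
        if action():
            return attempt
        if attempt < attempts:
            # Wait longer each time, on the guess that latency is transient.
            sleep(base_delay * 2 ** (attempt - 1))
    return 0


def boot_ring(num_nodes, hostfile="mpd.hosts"):
    """Try once to start an MPD ring; True if mpdboot exits with status 0."""
    # Tear down any half-formed ring from an earlier attempt; its exit
    # status is ignored because there may be no ring to shut down.
    subprocess.call(["mpdallexit"],
                    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return subprocess.call(["mpdboot", "-n", str(num_nodes),
                            "-f", hostfile]) == 0
```

On a head node this might be invoked as
retry_with_backoff(lambda: boot_ring(8)), treating a return value of 0 as
a signal to give up and reprovision. Retrying only masks the symptom; it
does not explain why the ring fails to form in the first place.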
Anything you can share on your history (or lack thereof) with this problem,
and approaches to resolving it, would be much appreciated.
Received on Wed Feb 16 2011 - 18:21:58 EST