Running MPI over the cluster
Hi,
I am running a Python program for a distributed implementation of
gradient descent. The gradient descent step uses an Allreduce
operation to combine the workers' local gradients into the overall
gradient. I had been setting up clusters without StarCluster before,
but I recently needed larger clusters and moved to StarCluster. I was
surprised to see that the MPI.Allreduce operation is much faster on a
cluster launched by StarCluster than on one set up with traditional
methods. I am curious whether this is an artifact, or whether
StarCluster somehow optimizes the communication network to enable an
efficient Allreduce step. I am attaching the code for a cluster of 43
nodes (1 master and 42 workers), though I have replaced the data
loading with random data initialization to mimic the gradient step.
Any insight regarding this would be extremely helpful.
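(The attachment itself is not preserved in this archive. For readers, a minimal mpi4py sketch of the kind of step described above might look like the following; the dimension, seeding, and averaging are my assumptions, not the author's actual script, and the random gradient stands in for the real data-dependent one. It falls back to a single simulated rank if mpi4py is not installed, so it runs anywhere.)

```python
# Hypothetical sketch (not the attached script): each worker computes a
# local gradient on random data, then Allreduce sums the gradients so
# every rank ends up holding the same global gradient.
# With MPI available, launch as e.g.: mpiexec -n 43 python sketch.py
import numpy as np

try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
except ImportError:
    # Fallback so the sketch still runs without an MPI installation:
    # behave as a single "worker".
    MPI, comm, rank, size = None, None, 0, 1

DIM = 1000  # model dimension (placeholder; the real code loads data instead)
rng = np.random.default_rng(seed=rank)          # per-rank random data
local_grad = rng.standard_normal(DIM)           # stands in for a worker's gradient

global_grad = np.empty_like(local_grad)
if comm is not None:
    # Sum the per-worker gradients; every rank receives the same result.
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
else:
    global_grad[:] = local_grad

global_grad /= size  # average the gradient across workers
```

Whether StarCluster is actually faster here would depend on things a sketch cannot show, such as instance placement and the MPI implementation's collective algorithms, which is presumably the heart of the question.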
Thanks,
Saurav.
Received on Thu Oct 25 2018 - 04:44:33 EDT