Connection lost among nodes
I am having trouble scaling up a cluster to handle multiple samples. A
test with one 1 TB attached EBS drive and 30 nodes worked great, but moving to
10 EBS drives and 300 nodes crashed at about 70-80 nodes. There are no
errors in the STDOUT or STDERR output of the jobs. Looking in the node
messages (/opt/sge6/spool/...), most nodes show a "commlib error: got read
error", followed by the job failing and causing a shepherd error. The cluster
is set up on a private subnet, and as I mentioned I have 10 EBS drives
attached, which may be too many. I was also pushing the nodes with intense
work, with a load average of 3-4. My guess is that the larger cluster
increases the load and that this is what causes the issue. Does anyone know
if that could be the problem, or have any other ideas?
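In case it helps to quantify where the failures land, here is a rough Python
sketch I have been using to tally the commlib/shepherd errors per node. It
assumes the usual SGE layout of one "messages" file per exec host under the
spool root; the spool path and the error strings are assumptions on my part,
so adjust them for your install:

#!/usr/bin/env python3
"""Rough sketch: tally commlib/shepherd errors per node from SGE spool logs.

Assumes each exec host has a messages file under SPOOL_ROOT/<hostname>/;
the exact layout depends on the install, so adjust SPOOL_ROOT as needed.
"""
import glob
import os
from collections import Counter

SPOOL_ROOT = "/opt/sge6/spool"            # assumed spool root
PATTERNS = ("commlib error", "shepherd")  # error strings to look for

counts = Counter()
for messages in glob.glob(os.path.join(SPOOL_ROOT, "*", "messages")):
    node = os.path.basename(os.path.dirname(messages))
    with open(messages, errors="replace") as fh:
        for line in fh:
            if any(p in line.lower() for p in PATTERNS):
                counts[node] += 1

# Print the noisiest nodes first, to see whether the errors cluster on a
# few hosts or are spread evenly across the whole cluster.
for node, n in counts.most_common():
    print(f"{node}\t{n}")

So far the errors look spread out rather than confined to a few hosts, which
is why I suspect overall load rather than a single bad node.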
Thanks,
Nick
Received on Fri Mar 27 2015 - 10:49:38 EDT