starcluster starts but not all nodes added as exec nodes
I can frequently reproduce an issue where 'starcluster start' completes
without error, but not all nodes are added to the SGE pool, which I verify
by running 'qconf -sel' on the master. The latest example I have is creating
a 25-node cluster, where only the first 12 nodes are successfully installed.
The remaining instances are running and I can ssh to them, but they aren't
running sge_execd. There are only install log files for the first 12 nodes
in /opt/sge6/default/common/install_logs. I have not found any clues in the
starcluster debug log or the logs inside master:/opt/sge6/.
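For anyone who wants to check for the same symptom, here is a rough sketch of the comparison I do by hand: take the output of 'qconf -sel' on the master and list the hosts that should be there but are not. The host naming (master, node001, node002, ...) and the idea that the master is itself an exec host are assumptions about a default StarCluster setup, and the function names are made up for illustration. Adjust them to match your cluster.

```python
import subprocess


def expected_hosts(cluster_size):
    """Hostnames we expect for a cluster of this size.

    Assumption: one host named 'master' plus workers named
    node001, node002, ... (not confirmed for every StarCluster version).
    """
    return ["master"] + ["node%03d" % i for i in range(1, cluster_size)]


def parse_qconf_sel(output):
    """Turn 'qconf -sel' output (one host per line) into a set of short names."""
    hosts = set()
    for line in output.splitlines():
        name = line.strip()
        if name:
            # Drop any domain suffix so 'node001.ec2.internal' matches 'node001'.
            hosts.add(name.split(".")[0])
    return hosts


def missing_exec_hosts(qconf_output, cluster_size):
    """Expected hosts that are absent from the SGE exec host list."""
    registered = parse_qconf_sel(qconf_output)
    return [h for h in expected_hosts(cluster_size) if h not in registered]


def run_qconf_sel():
    """Run 'qconf -sel' locally; intended to be called on the master."""
    result = subprocess.run(
        ["qconf", "-sel"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the master, something like missing_exec_hosts(run_qconf_sel(), 25) would then give the list of nodes that never registered with SGE.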
I am running starcluster development snapshot 8ef48a3 downloaded on
2011-02-15, with the following relevant settings:
NODE_IMAGE_ID = ami-8cf913e5
NODE_INSTANCE_TYPE = m1.small
I have seen this behavior with the latest 32-bit and 64-bit starcluster
AMIs. Our workaround is to start a small cluster and progressively add nodes
one at a time, which is time-consuming.
Has anyone else noticed this, and does anyone have a better workaround or an
idea for a fix?
jeff
Received on Sat Mar 05 2011 - 17:15:42 EST