Unable to launch cluster: SGE master install failures

From: Lyn Gerner <no email>
Date: Wed, 22 Jan 2014 12:10:00 -1000

Hi All,

I am trying to launch a 3-node cluster (using StarCluster 0.94.3) and keep
getting an error during the SGE install on the master, which aborts the
install on the master and on the remaining nodes.

My starcluster config file specifies disable_queue = true and then invokes
the sge plugin with MASTER_IS_EXEC_HOST = False, so all it needs to do is
install and bring up qmaster.
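
For reference, the relevant part of the config looks roughly like the
fragment below. This is a sketch rather than a copy of my file: the cluster
template name "mycluster" is a placeholder, the setup_class path is what I
understand the bundled sge plugin to be called, and the other required
settings (keys, AMI, instance type) are left out.

```ini
# Sketch only: "mycluster" is a placeholder template name, and the
# setup_class path is assumed to be the bundled SGE plugin.
[cluster mycluster]
cluster_size = 3
disable_queue = True
plugins = sge

[plugin sge]
setup_class = starcluster.plugins.sge.SGEPlugin
master_is_exec_host = False
```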

The qmaster does come up; however, the cluster start keeps timing out with
the following:

>>> Installing Sun Grid Engine...
!!! ERROR - Error occured while running plugin 'sge':
!!! ERROR - remote command 'source /etc/profile && cd /opt/sge6 &&
!!! ERROR - TERM=rxvt ./inst_sge -m -noremote -auto ./ec2_sge.conf'
!!! ERROR - failed with status 1:
!!! ERROR - Reading configuration from file ./ec2_sge.conf
!!! ERROR - [H[2JInstall log can be found in: /opt/sge6/default/common/i
!!! ERROR - nstall_logs/qmaster_install_master_2014-01-22_21:55:08.log

In the install log, the installer waits for the SGE qmaster PID file to show
up, times out after 5 minutes, and tells me to check my autoinstall config
file.
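
In case it helps anyone reproduce the check the installer appears to be
making, here is a minimal shell sketch. It assumes the PID file is named
qmaster.pid and lives in the qmaster spool directory reported in the log
(/opt/sge6/default/spool/qmaster); both the file name and the helper name
check_qmaster_pid are my assumptions, not something taken from inst_sge.

```shell
# Hypothetical helper: report whether the qmaster PID file exists and
# whether the process it names is alive. Run as root so that kill -0
# can probe a process owned by sgeadmin.
check_qmaster_pid() {
    # Default spool dir is taken from the install log; override with $1.
    spool_dir="${1:-/opt/sge6/default/spool/qmaster}"
    pid_file="$spool_dir/qmaster.pid"   # assumed file name

    if [ ! -f "$pid_file" ]; then
        echo "missing: $pid_file"
        return 1
    fi

    pid=$(cat "$pid_file")
    # kill -0 sends no signal; it only tests that the PID can be signalled.
    if kill -0 "$pid" 2>/dev/null; then
        echo "running: pid $pid"
        return 0
    fi

    echo "stale: pid $pid"
    return 2
}
```

Calling check_qmaster_pid with no argument checks the default spool
directory. A "missing" result while sge_qmaster shows up in ps would match
what the installer reports.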

Here are the ps output and the installation log.

root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
# ps -ef|grep master
avahi 1038 1 0 20:42 ? 00:00:00 avahi-daemon: running
root 1442 1 0 20:43 ? 00:00:00 /usr/libexec/postfix/master
sgeadmin 1629 1 0 20:43 ? 00:00:00
root 18277 4408 0 21:30 pts/0 00:00:00 /bin/grep --color=auto

root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
# cat qmaster_install_master_2014-01-22_21:22:55.log
Starting qmaster installation!

Installing Grid Engine as admin user >sgeadmin<

Your $SGE_ROOT directory: /opt/sge6

Using SGE_QMASTER_PORT >63231<.

Using SGE_EXECD_PORT >63232<.

Using >default< as CELL_NAME.

Your $SGE_CLUSTER_NAME: starcluster

Using >/opt/sge6/default/spool/qmaster< as QMASTER_SPOOL_DIR.

Obviously this is not a complete Grid Engine distribution or this
is not your $SGE_ROOT directory.

Missing file or directory: start_gui_installer

Your file permissions will not be set. Exit.

Using >true< as IGNORE_FQDN_DEFAULT.
If it's >true<, the domain name will be ignored.

Making directories

Setting spooling method to dynamic
Dumping bootstrapping information
Initializing spooling database

Using >20000-20100< as gid range.
Using >/opt/sge6/default/spool< as EXECD_SPOOL_DIR.
Using >< as ADMIN_MAIL.
Adding default parallel environments (PE)

   starting sge_qmaster
Reached 5min timeout, while waiting for qmaster PID file.
sge_qmaster daemon didn't start. Please check your
autoinstall configuration file! Installation failed!

I get the same error on every attempt, and I am using an unmodified
ec2_sge.conf file.

I would appreciate any suggestions for how to get past this.

Thanks much,
Received on Wed Jan 22 2014 - 17:10:02 EST
