Unable to launch cluster: SGE master install failures
I am trying to launch a 3-node cluster (using StarCluster 0.94.3), and keep
getting an error during the SGE install on the master, which derails the
install on the master and on the remaining nodes.
My starcluster config file specifies disable_queue = true and then invokes
the sge plugin with MASTER_IS_EXEC_HOST = False, so all it needs to do is
install and bring up qmaster.
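For illustration, a setup of this shape could be sketched as follows. The
cluster template name `mycluster` is a placeholder, and the `setup_class`
path is an assumption about where the SGE plugin lives in 0.94.x; neither
is copied from the actual config file.

```ini
# Hypothetical cluster template name; other required settings omitted.
[cluster mycluster]
cluster_size = 3
# Skip the built-in SGE setup so the plugin below handles it instead.
disable_queue = true
plugins = sge

# Assumed plugin class path; check it against the installed StarCluster.
[plugin sge]
setup_class = starcluster.plugins.sge.SGEPlugin
MASTER_IS_EXEC_HOST = False
```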
The qmaster does come up; however, the cluster start keeps timing out with:
>>> Installing Sun Grid Engine...
!!! ERROR - Error occured while running plugin 'sge':
!!! ERROR - remote command 'source /etc/profile && cd /opt/sge6 &&
!!! ERROR - TERM=rxvt ./inst_sge -m -noremote -auto ./ec2_sge.conf'
!!! ERROR - failed with status 1:
!!! ERROR - Reading configuration from file ./ec2_sge.conf
!!! ERROR - Install log can be found in: /opt/sge6/default/common/i
!!! ERROR - nstall_logs/qmaster_install_master_2014-01-22_21:55:08.log
According to the install log, the installer waits for the SGE qmaster PID
file to show up, times out after 5 minutes, and tells me to check my
autoinstall config file.
Here are the ps output and the installation log.
# ps -ef|grep master
avahi 1038 1 0 20:42 ? 00:00:00 avahi-daemon: running
root 1442 1 0 20:43 ? 00:00:00 /usr/libexec/postfix/master
sgeadmin 1629 1 0 20:43 ? 00:00:00
root 18277 4408 0 21:30 pts/0 00:00:00 /bin/grep --color=auto
# cat qmaster_install_master_2014-01-22_21:22:55.log
Starting qmaster installation!
Installing Grid Engine as admin user >sgeadmin<
Your $SGE_ROOT directory: /opt/sge6
Using SGE_QMASTER_PORT >63231<.
Using SGE_EXECD_PORT >63232<.
Using >default< as CELL_NAME.
Your $SGE_CLUSTER_NAME: starcluster
Using >/opt/sge6/default/spool/qmaster< as QMASTER_SPOOL_DIR.
Obviously this is not a complete Grid Engine distribution or this
is not your $SGE_ROOT directory.
Missing file or directory: start_gui_installer
Your file permissions will not be set. Exit.
Using >true< as IGNORE_FQDN_DEFAULT.
If it's >true<, the domain name will be ignored.
Setting spooling method to dynamic
Dumping bootstrapping information
Initializing spooling database
Using >20000-20100< as gid range.
Using >/opt/sge6/default/spool< as EXECD_SPOOL_DIR.
Using >none_at_none.edu< as ADMIN_MAIL.
Adding default parallel environments (PE)
Reached 5min timeout, while waiting for qmaster PID file.
sge_qmaster daemon didn't start. Please check your
autoinstall configuration file! Installation failed!
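The wait that the log reports can be reproduced by hand with a small polling
loop. This is only a sketch of the general technique, not the code inst_sge
actually runs; the function name `wait_for_pidfile` and its arguments are
made up for illustration.

```shell
#!/bin/sh
# Poll for a non-empty PID file until it appears or a timeout expires.
# Usage: wait_for_pidfile FILE [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# (hypothetical helper, not part of SGE or StarCluster)
wait_for_pidfile() {
    pidfile=$1
    timeout=${2:-300}    # default of 5 minutes, mirroring the log message
    interval=${3:-5}     # seconds to sleep between checks
    waited=0
    while [ ! -s "$pidfile" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out after ${waited}s waiting for $pidfile" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "found $pidfile (pid $(cat "$pidfile"))"
    return 0
}
```

It could be pointed at the qmaster spool directory from the log, for example
`wait_for_pidfile /opt/sge6/default/spool/qmaster/qmaster.pid 300`, where the
file name `qmaster.pid` is an assumption. A non-zero return would mean the
file never appeared in the spool directory the installer is watching.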
It's this same error on every attempt, and I am using an unmodified
I would appreciate any suggestions for how to get past this.
Received on Wed Jan 22 2014 - 17:10:02 EST