Unable to launch cluster: SGE master install failures
Hi All,
I am trying to launch a 3-node cluster (using StarCluster 0.94.3) and keep
getting an error during the SGE install on the master, which aborts the
install on it and on the remaining nodes.
My starcluster config file specifies disable_queue = true and then invokes
the sge plugin with MASTER_IS_EXEC_HOST = False, so all it needs to do is
install and bring up qmaster.
The qmaster does come up; however, the cluster start keeps timing out with
the following:
>>> Installing Sun Grid Engine...
!!! ERROR - Error occured while running plugin 'sge':
!!! ERROR - remote command 'source /etc/profile && cd /opt/sge6 &&
!!! ERROR - TERM=rxvt ./inst_sge -m -noremote -auto ./ec2_sge.conf'
!!! ERROR - failed with status 1:
!!! ERROR - Reading configuration from file ./ec2_sge.conf
!!! ERROR - Install log can be found in: /opt/sge6/default/common/i
!!! ERROR - nstall_logs/qmaster_install_master_2014-01-22_21:55:08.log
According to the install log, the installer waits for the SGE qmaster pid
file to show up, times out after 5 minutes, and tells me to check my
autoinstall config file.
Here are the ps output and the installation log.
root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
# ps -ef|grep master
avahi 1038 1 0 20:42 ? 00:00:00 avahi-daemon: running [master.local]
root 1442 1 0 20:43 ? 00:00:00 /usr/libexec/postfix/master
sgeadmin 1629 1 0 20:43 ? 00:00:00 /opt/sge6/bin/linux-x64/sge_qmaster
root 18277 4408 0 21:30 pts/0 00:00:00 /bin/grep --color=auto master
root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
# cat qmaster_install_master_2014-01-22_21:22:55.log
Starting qmaster installation!
Installing Grid Engine as admin user >sgeadmin<
Your $SGE_ROOT directory: /opt/sge6
Using SGE_QMASTER_PORT >63231<.
Using SGE_EXECD_PORT >63232<.
Using >default< as CELL_NAME.
Your $SGE_CLUSTER_NAME: starcluster
Using >/opt/sge6/default/spool/qmaster< as QMASTER_SPOOL_DIR.
Obviously this is not a complete Grid Engine distribution or this
is not your $SGE_ROOT directory.
Missing file or directory: start_gui_installer
Your file permissions will not be set. Exit.
Using >true< as IGNORE_FQDN_DEFAULT.
If it's >true<, the domain name will be ignored.
Making directories
Setting spooling method to dynamic
Dumping bootstrapping information
Initializing spooling database
Using >20000-20100< as gid range.
Using >/opt/sge6/default/spool< as EXECD_SPOOL_DIR.
Using >none@none.edu< as ADMIN_MAIL.
Adding default parallel environments (PE)
starting sge_qmaster
Reached 5min timeout, while waiting for qmaster PID file.
sge_qmaster daemon didn't start. Please check your
autoinstall configuration file! Installation failed!
I get the same error on every attempt, and I am using an unmodified
ec2_sge.conf file.
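In case it helps anyone reproduce the check by hand, here is a minimal
Python 3 sketch of a pid-file wait. It is not the installer's actual logic:
it only polls for a pid file and then checks whether that process is alive.
The pid file path is an assumption (QMASTER_SPOOL_DIR from the log above plus
qmaster.pid), and the function names are made up for illustration.

```python
import os
import time


def read_pid(pid_file):
    """Return the PID stored in pid_file, or None if missing or unreadable."""
    try:
        with open(pid_file) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def wait_for_pid_file(pid_file, timeout=300.0, interval=5.0):
    """Poll until pid_file holds a PID; return it, or None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        pid = read_pid(pid_file)
        if pid is not None:
            return pid
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


def process_is_running(pid):
    """Return True if a process with this PID exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


if __name__ == "__main__":
    # Assumed location: QMASTER_SPOOL_DIR from the install log + qmaster.pid
    pid_file = "/opt/sge6/default/spool/qmaster/qmaster.pid"
    pid = wait_for_pid_file(pid_file, timeout=30.0, interval=1.0)
    if pid is None:
        print("no usable pid file at", pid_file)
    else:
        print("pid", pid, "running:", process_is_running(pid))
```

If this reports no pid file while ps still shows sge_qmaster running, that
would match what I am seeing: the daemon is up, but the file the installer
waits for is not where it is being looked for.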
I'd appreciate any suggestions for how to get past this.
Thanks much,
Lyn
Received on Wed Jan 22 2014 - 17:10:02 EST