StarCluster - Mailing List Archive

Re: Unable to launch cluster: SGE master install failures

From: Rayson Ho <no email>
Date: Wed, 22 Jan 2014 17:13:56 -0500

Which AMI did you use? It seems to be missing some files: the install log reports "Missing file or directory: start_gui_installer".
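
If it helps, here is a minimal sketch (not part of StarCluster or Grid Engine) for pulling the "Missing file or directory" lines out of an install log like the one quoted below. The function name find_missing_files and the sample text are purely illustrative, and the pattern assumes the message format shown in that log.

```python
import re

# Matches the message format seen in the quoted install log.
# The optional leading ">" tolerates mail-quoted copies of the log.
MISSING_RE = re.compile(r"^\s*>?\s*Missing file or directory:\s*(\S+)")


def find_missing_files(log_text):
    """Return the names reported as missing, in order of appearance."""
    missing = []
    for line in log_text.splitlines():
        match = MISSING_RE.match(line)
        if match:
            missing.append(match.group(1))
    return missing


# Illustrative sample taken from the log excerpt in this thread.
sample = """\
Obviously this is not a complete Grid Engine distribution or this
is not your $SGE_ROOT directory.

Missing file or directory: start_gui_installer
"""

print(find_missing_files(sample))
```

Running the same function over the full log from the AMI in question should list every file the installer could not find under $SGE_ROOT.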

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html


On Wed, Jan 22, 2014 at 5:10 PM, Lyn Gerner <schedulerqueen_at_gmail.com> wrote:
> Hi All,
>
> I am trying to launch a 3-node cluster (using 0.94.3), and keep getting an
> error during SGE install on the master, which blows the install of it and
> the remaining nodes out of the water.
>
> My starcluster config file specifies disable_queue = true and then invokes
> the sge plugin with MASTER_IS_EXEC_HOST = False, so all it needs to do is
> install and bring up qmaster.
>
> The qmaster does come up; however, the cluster start keeps timing out with
> the following:
>
> >>> Installing Sun Grid Engine...
> !!! ERROR - Error occured while running plugin 'sge':
> !!! ERROR - remote command 'source /etc/profile && cd /opt/sge6 &&
> !!! ERROR - TERM=rxvt ./inst_sge -m -noremote -auto ./ec2_sge.conf'
> !!! ERROR - failed with status 1:
> !!! ERROR - Reading configuration from file ./ec2_sge.conf
> !!! ERROR - Install log can be found in: /opt/sge6/default/common/i
> !!! ERROR - nstall_logs/qmaster_install_master_2014-01-22_21:55:08.log
>
> In the install log, it's waiting for the SGE qmaster pid file to show up,
> times out after 5 minutes, and tells me to check my autoinstall config file.
>
> Here are the ps output and the installation log.
>
> root_at_AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
> # ps -ef|grep master
> avahi 1038 1 0 20:42 ? 00:00:00 avahi-daemon: running [master.local]
> root 1442 1 0 20:43 ? 00:00:00 /usr/libexec/postfix/master
> sgeadmin 1629 1 0 20:43 ? 00:00:00 /opt/sge6/bin/linux-x64/sge_qmaster
> root 18277 4408 0 21:30 pts/0 00:00:00 /bin/grep --color=auto master
>
> root_at_AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
> # cat qmaster_install_master_2014-01-22_21:22:55.log
> Starting qmaster installation!
>
> Installing Grid Engine as admin user >sgeadmin<
>
>
>
> Your $SGE_ROOT directory: /opt/sge6
>
> Using SGE_QMASTER_PORT >63231<.
>
> Using SGE_EXECD_PORT >63232<.
>
> Using >default< as CELL_NAME.
>
>
> Your $SGE_CLUSTER_NAME: starcluster
>
> Using >/opt/sge6/default/spool/qmaster< as QMASTER_SPOOL_DIR.
>
>
>
>
>
> Obviously this is not a complete Grid Engine distribution or this
> is not your $SGE_ROOT directory.
>
> Missing file or directory: start_gui_installer
>
> Your file permissions will not be set. Exit.
>
>
> Using >true< as IGNORE_FQDN_DEFAULT.
> If it's >true<, the domain name will be ignored.
>
>
> Making directories
>
> Setting spooling method to dynamic
> Dumping bootstrapping information
> Initializing spooling database
>
>
> Using >20000-20100< as gid range.
> Using >/opt/sge6/default/spool< as EXECD_SPOOL_DIR.
> Using >none_at_none.edu< as ADMIN_MAIL.
> Adding default parallel environments (PE)
>
>
>
> starting sge_qmaster
> Reached 5min timeout, while waiting for qmaster PID file.
> sge_qmaster daemon didn't start. Please check your
> autoinstall configuration file! Installation failed!
>
>
> It's this same error on every attempt, and I am using an unmodified
> ec2_sge.conf file.
>
> Appreciate any suggestions for how to get over this.
>
> Thanks much,
> Lyn
>
> _______________________________________________
> StarCluster mailing list
> StarCluster_at_mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
>
Received on Wed Jan 22 2014 - 17:13:58 EST
This archive was generated by hypermail 2.3.0.