Hi All,
Okay, I'm in the Twilight Zone now. After starting a small cluster on the
23rd and doing minimal reconfig (qmod -d to disable the sge_execd on the
master, and qconf -mq all.q to change some slot counts), all of which
worked fine, I come back a few days later to find an unusable SGE config:
root_at_AWS-VTMXmaster-w2b ~
# qstat -f
error: sge_gethostbyname failed
/etc/hosts is correct for all its (internal) host addrs:
root_at_AWS-VTMXmaster-w2b ~
# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.250.65.204 master
10.251.30.12 node001
The gethostbyname utility works correctly (so does gethostbyaddr):
root_at_AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
# /opt/sge6/utilbin/linux-x64/gethostbyname master
Hostname: master
Aliases:
Host Address(es): 10.250.65.204
root_at_AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
# /opt/sge6/utilbin/linux-x64/gethostbyname node001
Hostname: node001
Aliases:
Host Address(es): 10.251.30.12
root_at_AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
# qstat -f
error: sge_gethostbyname failed
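One check I have not run yet, sketched here as a guess rather than a known fix: the shell prompt shows this machine as AWS-VTMXmaster-w2b, and that name is not in /etc/hosts. If qstat tries to resolve the local hostname (an assumption on my part), that lookup could be the one that fails. The helper name hosts_has_name is made up for illustration and is not part of SGE.

```shell
#!/bin/sh
# Sketch: report whether a name appears as a hostname or alias in a hosts file.
# hosts_has_name is a hypothetical helper, not an SGE utility.
hosts_has_name() {
    name=$1
    file=${2:-/etc/hosts}
    # Drop comments, then compare every field after the address with the name.
    awk -v n="$name" '
        { sub(/#.*/, "") }
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }
    ' "$file"
}

# Check the name the OS reports for this machine, plus the two cluster names.
for h in "$(hostname)" master node001; do
    if hosts_has_name "$h"; then
        echo "$h: listed in /etc/hosts"
    else
        echo "$h: NOT listed in /etc/hosts"
    fi
done
```

If the first line comes back as NOT listed, the name printed by hostname is the next thing to run through the gethostbyname utility.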
I went so far as to edit the hostname in /etc/sysconfig/network to contain
"master" and "node001" on the two nodes. Same error.
I have been all over the 'net looking for solutions, but have found nothing
with a clear resolution. gridengine.sunsource.net is gone. The follow-on at
http://gridengine.org/pipermail/users/ doesn't seem to be searchable,
except on an onerous, month-by-month click-thru basis (which hasn't yielded
anything useful as I slog thru it).
Short of a starcluster restart, I'd appreciate anyone's input on what
to try next.
Thanks much,
Lyn
Received on Fri Dec 27 2013 - 17:34:34 EST