StarCluster - Mailing List Archive

Re: Twilight Zone: sge_gethostbyname failed

From: Rayson Ho <no email>
Date: Fri, 27 Dec 2013 19:57:50 -0500

We need to change the SC code for RHEL-based distros. Each distro does
things slightly differently, and that's why you get that behavior.
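
As a rough illustration of the kind of change meant here, the sketch below picks the hostname file by distro family and builds its new contents. It is not StarCluster's actual set_hostname() code: the family labels, the function name, and the sample NETWORKING=yes line are assumptions made up for illustration. Only the three file paths come from this thread.

```python
import re

# Hostname files named in this thread. The family keys are hypothetical labels.
HOSTNAME_FILES = {
    "debian": "/etc/hostname",         # Ubuntu
    "rhel": "/etc/sysconfig/network",  # RHEL, CentOS, Oracle Linux, Scientific Linux
    "suse": "/etc/HOSTNAME",           # SuSE
}


def render_hostname_file(family, hostname, existing=""):
    """Return (path, new_contents) for persisting `hostname` on `family`.

    Nothing is written to disk; the caller decides how to apply the result.
    """
    path = HOSTNAME_FILES[family]
    if family != "rhel":
        # Ubuntu and SuSE keep just the name in the file.
        return path, hostname + "\n"

    # RHEL-style file holds KEY=value lines: replace HOSTNAME= or append it.
    line = "HOSTNAME=" + hostname
    pattern = re.compile(r"^HOSTNAME=.*$", flags=re.M)
    if pattern.search(existing):
        return path, pattern.sub(lambda m: line, existing)
    if existing and not existing.endswith("\n"):
        existing += "\n"
    return path, existing + line + "\n"


if __name__ == "__main__":
    sample = "NETWORKING=yes\nHOSTNAME=centos-ami\n"  # illustrative contents
    print(render_hostname_file("rhel", "master", sample))
```

Keeping the function free of side effects makes it easy to check the RHEL case without touching a real node.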

In the meantime, you might want to go to each node and set the
hostname by editing /etc/sysconfig/network and running hostname <name>
as root, and then restart OGS/GE.
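
Before restarting OGS/GE it may be worth confirming that the name you set actually resolves, since an unresolvable local hostname is what Grid Engine complains about in this thread. A small Python 3 sketch follows. The function names are made up for illustration, and it only approximates what SGE's gethostname utility checks.

```python
import socket


def name_resolves(name):
    """Return True if `name` resolves to an IPv4 address on this machine."""
    try:
        socket.gethostbyname(name)
        return True
    except OSError:
        # socket.gaierror and socket.herror are OSError subclasses in Python 3.
        return False


def local_hostname_resolves():
    """Check the current kernel hostname, i.e. what `hostname` prints."""
    return name_resolves(socket.gethostname())


if __name__ == "__main__":
    name = socket.gethostname()
    status = "resolves" if name_resolves(name) else "does NOT resolve"
    print(name, status)
```

On a node in the broken state described here (hostname reset to centos-ami, with no matching /etc/hosts entry), one would expect the second outcome.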

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html


On Fri, Dec 27, 2013 at 7:47 PM, Lyn Gerner <schedulerqueen_at_gmail.com> wrote:
> Yep, it works again with those changes.
>
> So, how should I stop the regression in a non-kludgy way?
>
> Thanks again,
> Lyn
>
>
> On Fri, Dec 27, 2013 at 2:43 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
>>
>> /etc/sysconfig/network is read during reboot, and maybe after DHCP...
>>
>> To see if it is the issue, set HOSTNAME back to master, and also run
>> "hostname master" as root.
>>
>> Rayson
>>
>>
>> On Fri, Dec 27, 2013 at 7:40 PM, Lyn Gerner <schedulerqueen_at_gmail.com> wrote:
>> > Thanks for digging, Rayson.
>> >
>> > So, /etc/sysconfig/network had HOSTNAME=centos-ami when the problem first
>> > occurred. I tried resetting it to "master" and then retried the SGE
>> > commands (qstat, qsub, etc.). They still failed with the same error at that
>> > point, so I switched them back, not knowing for sure if they'd been set to
>> > master and node001 to begin with.
>> >
>> > Thanks,
>> > Lyn
>> >
>> >
>> > On Fri, Dec 27, 2013 at 2:35 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
>> >>
>> >> (Updating the list...)
>> >>
>> >> The hostname on the master gets reset to centos-ami, which is not
>> >> resolvable. Thus Grid Engine complains about the hostname issue.
>> >>
>> >> Lyn: what is the value of the HOSTNAME key in "/etc/sysconfig/network"
>> >> on your master instance?
>> >>
>> >> Justin & other devs: set_hostname() in node.py works on Ubuntu because
>> >> Ubuntu uses /etc/hostname, but RHEL (and RHEL-based distros like
>> >> CentOS, Oracle Linux, Scientific Linux) uses /etc/sysconfig/network,
>> >> and yet SuSE uses /etc/HOSTNAME!
>> >>
>> >> Rayson
>> >>
>> >>
>> >> On Fri, Dec 27, 2013 at 6:39 PM, Lyn Gerner <schedulerqueen_at_gmail.com> wrote:
>> >> > I used the Scientific Linux AMI (been a long time, but I found it from the
>> >> > SC site), and 0.94.3 is my SC version.
>> >> >
>> >> >
>> >> > On Fri, Dec 27, 2013 at 1:36 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
>> >> >>
>> >> >> Hmm, which AMI did you use, and what's the version of SC?
>> >> >>
>> >> >> Rayson
>> >> >>
>> >> >>
>> >> >> On Fri, Dec 27, 2013 at 6:33 PM, Lyn Gerner <schedulerqueen_at_gmail.com> wrote:
>> >> >> > root_at_AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> > # /opt/sge6/utilbin/linux-x64/gethostname -name
>> >> >> > error resolving local host: can't resolve host name (h_errno = HOST_NOT_FOUND)
>> >> >> >
>> >> >> > root_at_AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> > # hostname
>> >> >> > centos-ami
>> >> >> >
>> >> >> > root_at_AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> > # hostname -f
>> >> >> > hostname: Unknown host
>> >> >> >
>> >> >> > What's weird is that I have never mucked with any of this under StarCluster,
>> >> >> > and have only recently started having problems. Can't pinpoint any specific
>> >> >> > event or thing that changed--except that I started leaving the config up for
>> >> >> > days instead of hours at a stretch.
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Lyn
>> >> >> >
>> >> >> >
>> >> >> > On Fri, Dec 27, 2013 at 1:30 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
>> >> >> >>
>> >> >> >> No problem, and I think that's why it is failing. Can you also send me
>> >> >> >> the output of:
>> >> >> >>
>> >> >> >> 1) gethostname -name
>> >> >> >>
>> >> >> >> 2) hostname
>> >> >> >>
>> >> >> >> 3) hostname -f
>> >> >> >>
>> >> >> >> Rayson
>> >> >> >>
>> >> >> >>
>> >> >> >> On Fri, Dec 27, 2013 at 6:27 PM, Lyn Gerner <schedulerqueen_at_gmail.com> wrote:
>> >> >> >> > My bad:
>> >> >> >> >
>> >> >> >> > root_at_AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> >> > # /opt/sge6/utilbin/linux-x64/gethostname -all
>> >> >> >> > error resolving local host: can't resolve host name (h_errno = HOST_NOT_FOUND)
>> >> >> >> >
>> >> >> >> > Thanks for any insights,
>> >> >> >> > Lyn
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > On Fri, Dec 27, 2013 at 1:25 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
>> >> >> >> >>
>> >> >> >> >> But I need the output of "gethostname", not "gethostbyname"... :-P
>> >> >> >> >>
>> >> >> >> >> Rayson
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On Fri, Dec 27, 2013 at 6:11 PM, Lyn Gerner <schedulerqueen_at_gmail.com> wrote:
>> >> >> >> >> > Thanks for the quick response, Rayson. Output from gethostbyname is in
>> >> >> >> >> > between the ****s below:
>> >> >> >> >> >
>> >> >> >> >> > On Fri, Dec 27, 2013 at 1:04 PM, Rayson Ho <raysonlogin_at_gmail.com> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> What is the output of "gethostname"? (gethostname is shipped with SGE
>> >> >> >> >> >> in the util dir.)
>> >> >> >> >> >>
>> >> >> >> >> >> Rayson
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> On Fri, Dec 27, 2013 at 5:34 PM, Lyn Gerner <schedulerqueen_at_gmail.com> wrote:
>> >> >> >> >> >> > Hi All,
>> >> >> >> >> >> >
>> >> >> >> >> >> > Okay, I'm in the Twilight Zone now. After starting a small cluster on the
>> >> >> >> >> >> > 23rd, and doing minimal reconfig (qmod -d) to disable the sge_execd on the
>> >> >> >> >> >> > master and qconf -mq all.q to change some slot counts -- all of which worked
>> >> >> >> >> >> > fine -- I come back these days later to find an unusable SGE config:
>> >> >> >> >> >> >
>> >> >> >> >> >> > root_at_AWS-VTMXmaster-w2b ~
>> >> >> >> >> >> > # qstat -f
>> >> >> >> >> >> > error: sge_gethostbyname failed
>> >> >> >> >> >> >
>> >> >> >> >> >> > /etc/hosts is correct for all its (internal) host addrs:
>> >> >> >> >> >> >
>> >> >> >> >> >> > root_at_AWS-VTMXmaster-w2b ~
>> >> >> >> >> >> > # cat /etc/hosts
>> >> >> >> >> >> > 127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
>> >> >> >> >> >> > ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
>> >> >> >> >> >> > 10.250.65.204 master
>> >> >> >> >> >> > 10.251.30.12 node001
>> >> >> >> >> >> >
>> >> >> >> >> >> *****
>> >> >> >> >> >>
>> >> >> >> >> >> > The gethostbyname utility works correctly (so does gethostbyaddr):
>> >> >> >> >> >> >
>> >> >> >> >> >> > root_at_AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> >> >> >> > # /opt/sge6/utilbin/linux-x64/gethostbyname master
>> >> >> >> >> >> > Hostname: master
>> >> >> >> >> >> > Aliases:
>> >> >> >> >> >> > Host Address(es): 10.250.65.204
>> >> >> >> >> >> >
>> >> >> >> >> >> > root_at_AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> >> >> >> > # /opt/sge6/utilbin/linux-x64/gethostbyname node001
>> >> >> >> >> >> > Hostname: node001
>> >> >> >> >> >> > Aliases:
>> >> >> >> >> >> > Host Address(es): 10.251.30.12
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > ******
>> >> >> >> >> >>
>> >> >> >> >> >> >
>> >> >> >> >> >> > root_at_AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> >> >> >> > # qstat -f
>> >> >> >> >> >> > error: sge_gethostbyname failed
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> > I went so far as to edit the hostname in /etc/sysconfig/network to contain
>> >> >> >> >> >> > "master" and "node001" on the two nodes. Same error.
>> >> >> >> >> >> >
>> >> >> >> >> >> > I have been all over the 'net looking for solutions, but have found nothing
>> >> >> >> >> >> > with a clear resolution. gridengine.sunsource.net is gone. The follow-on
>> >> >> >> >> >> > at http://gridengine.org/pipermail/users/ doesn't seem to be searchable,
>> >> >> >> >> >> > except on an onerous, month-by-month click-thru basis (which hasn't yielded
>> >> >> >> >> >> > anything useful as I slog thru it).
>> >> >> >> >> >> >
>> >> >> >> >> >> > Short of starcluster restart'ing, I'll appreciate anyone's inputs on what to
>> >> >> >> >> >> > try next.
>> >> >> >> >> >> >
>> >> >> >> >> >> > Thanks much,
>> >> >> >> >> >> > Lyn
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> > _______________________________________________
>> >> >> >> >> >> > StarCluster mailing list
>> >> >> >> >> >> > StarCluster_at_mit.edu
>> >> >> >> >> >> > http://mailman.mit.edu/mailman/listinfo/starcluster
Received on Fri Dec 27 2013 - 19:57:52 EST
This archive was generated by hypermail 2.3.0.
