Box Tech Blog

A Trip Down the DNS Rabbit Hole: Understanding the Role of Kubernetes, Golang, libc, systemd

Illustrated by Jeremy Nguyen / Art directed by Sarah Kislak

Kubernetes powers much of the compute infrastructure at Box. As more and more applications and services have moved to our Kubernetes platform, we’ve invested in tools to make debugging, validating service correctness, and responding to incidents easier for Box developers.

One such tool is Comcast’s Kuberhealthy, a framework for synthetic monitoring in Kubernetes. Kuberhealthy introduces a controller and a CRD to define a Pod spec and a schedule to periodically create the Pod. The Pod can run an arbitrary container image and code, so long as it reports the results of the check back to the Kuberhealthy controller’s API.

Kuberhealthy has example check functionality published in the public Docker registry. Our first step in bringing Kuberhealthy into our Kubernetes platform was to get it working in minikube, our tool of choice for local development. At Box, we use SmartStack for service discovery, so we had to include a SmartStack sidecar alongside the Kuberhealthy check container. This allows us to report the results of the check back to the Kuberhealthy controller.

When we started everything up in minikube, we quickly ran into issues. While the controller started cleanly and the check Pod launched successfully, it was unable to report the results of the check back to the Kuberhealthy controller. Inspecting the error message, we quickly realized we had a DNS issue on our hands.

Get http://kube-health.localhost: dial tcp: lookup kube-health.localhost on no such host

To report results from the check Pod back to the Kuberhealthy controller, we defined the Kuberhealthy controller hostname as kube-health.localhost, which SmartStack would know how to resolve to the correct service.

The lookup process begins with a DNS query for kube-health.localhost from the Kuberhealthy check Pod. How DNS queries are performed depends on a combination of application code, GNU libc configuration, and operating system configuration. For most applications that link against libc, how libc performs DNS resolution is configured in /etc/nsswitch.conf. This applies for ping, curl, etc.

container$ cat /etc/nsswitch.conf
# /etc/nsswitch.conf
# Example configuration of GNU Name Service Switch functionality.
# If you have the `glibc-doc-reference' and `info' packages installed, try:
# `info libc "Name Service Switch"' for information about this file.

passwd: files
group: files
shadow: files
gshadow: files

hosts: files dns
networks: files

protocols: db files
services: db files
ethers: db files
rpc: db files

netgroup: nis

The hosts: files dns line means that getaddrinfo, gethostbyname, and friends will first read from /etc/hosts before performing a DNS query on the network.

Looking at /etc/hosts, we can see there are no entries that match kube-health.localhost, so DNS lookups for that name should result in a DNS query on the network.

# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
kh-7fbbd69ffc-8tr6d

This story is a bit more complicated on systemd operating systems. The example Kuberhealthy deployment-check container is built FROM scratch, so it is not using systemd.

If we rebuild it on a CentOS-based image, FROM centos:latest, we get the following /etc/nsswitch.conf:

container$ cat /etc/nsswitch.conf

passwd: sss files systemd
shadow: files sss
group: sss files systemd
hosts: files dns myhostname
services: files sss
netgroup: sss
automount: files sss
aliases: files
ethers: files
gshadow: files
networks: files dns
protocols: files
publickey: files
rpc: files

We see that the hosts line has a new entry: hosts: files dns myhostname.

myhostname is an NSS plugin that extends how hostname lookup works. It injects the following rules into programs that use libc:

• The local, configured hostname is resolved to all locally configured IP addresses ordered by their scope, or — if none are configured — the IPv4 address 127.0.0.2 (which is on the local loopback) and the IPv6 address ::1 (which is the local host).

• The hostnames "localhost" and "localhost.localdomain" (as well as any hostname ending in ".localhost" or ".localhost.localdomain") are resolved to the IP addresses 127.0.0.1 and ::1.

• The hostname "_gateway" is resolved to all current default routing gateway addresses, ordered by their metric. This assigns a stable hostname to the current gateway, useful for referencing it independently of the current network configuration state.

Note the very important part: any hostname ending in ".localhost" or ".localhost.localdomain" is resolved to the IP addresses 127.0.0.1 and ::1. We will return to this.

Golang programs, on the other hand, do not use the libc DNS resolver automatically. Instead, DNS resolution is implemented natively in Go. The determination of which resolver to use happens at runtime, for every lookup.


From the Go net package documentation:

By default the pure Go resolver is used, because a blocked DNS request consumes only a goroutine, while a blocked C call consumes an operating system thread. When cgo is available, the cgo-based resolver is used instead under a variety of conditions: on systems that do not let programs make direct DNS requests (OS X), when the LOCALDOMAIN environment variable is present (even if empty), when the RES_OPTIONS or HOSTALIASES environment variable is non-empty, when the ASR_CONFIG environment variable is non-empty (OpenBSD only), when /etc/resolv.conf or /etc/nsswitch.conf specify the use of features that the Go resolver does not implement, and when the name being looked up ends in .local or is an mDNS name.

Inspecting the environment on the Kuberhealthy check Pod container:

container$ env

None of the resolver-related environment variables (LOCALDOMAIN, RES_OPTIONS, HOSTALIASES) are set, and nothing in the /etc/nsswitch.conf hosts line looks special. So we expect that the Kuberhealthy code will use the pure Go DNS resolver, which uses /etc/resolv.conf to determine where and how to send the DNS query.

Looking at /etc/resolv.conf, we can see how DNS queries that hit the network will be processed.

container$ cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

Kubernetes manages this file, having the container runtime configure it when the container launches. kubelet's command line arguments --cluster-dns and --cluster-domain configure the nameserver and search lines respectively.

A lookup of kube-health.localhost has fewer than 5 "dots", so the search domains are tried first, and the actual DNS queries made will be:

kube-health.localhost.default.svc.cluster.local
kube-health.localhost.svc.cluster.local
kube-health.localhost.cluster.local
kube-health.localhost

All of these queries will be sent to the nameserver listed in /etc/resolv.conf. So what is that address?
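The ndots/search expansion above can be sketched in Go. This is a simplified model of resolv.conf behavior, not Go's or libc's actual implementation; candidateQueries is a hypothetical helper:

```go
package main

import (
	"fmt"
	"strings"
)

// candidateQueries mimics how a resolver expands a name using the
// search domains and ndots option from /etc/resolv.conf: a name with
// fewer dots than ndots is tried with each search suffix first, and
// as-is last; a name with >= ndots dots is tried as-is first.
func candidateQueries(name string, search []string, ndots int) []string {
	var queries []string
	dots := strings.Count(name, ".")
	if dots >= ndots {
		queries = append(queries, name)
	}
	for _, s := range search {
		queries = append(queries, name+"."+s)
	}
	if dots < ndots {
		queries = append(queries, name)
	}
	return queries
}

func main() {
	search := []string{"default.svc.cluster.local", "svc.cluster.local", "cluster.local"}
	for _, q := range candidateQueries("kube-health.localhost", search, 5) {
		fmt.Println(q)
	}
}
```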

$ kubectl get services -n=kube-system
kube-dns <none> 53/UDP,53/TCP 1h

That is a service cluster IP, so behind the scenes it's a set of iptables or ipvs rules on the underlying Kubernetes Node. The actual queries to that address will be round-robined to the Pod IP addresses behind the kube-dns service. Those IP addresses are owned by coredns Pods.

$ kubectl get services kube-dns -n=kube-system -o json | jq .spec.selector
{
  "k8s-app": "kube-dns"
}
$ kubectl get pods --all-namespaces -l k8s-app=kube-dns -o wide
kube-system   coredns-54ff9cd656-8s8v9   1/1   Running   1   1h   minikube
kube-system   coredns-54ff9cd656-g78p5   1/1   Running   1   1h   minikube

So the queries will land on one of the coredns Pods. When the query is received, coredns will consult its configuration to decide how to perform the lookup. The coredns configuration is refreshingly straightforward.

In Kubernetes, coredns configuration is provided to the coredns Pods via a ConfigMap.

$ kubectl get cm coredns -n=kube-system -o json | jq .data.Corefile -r
.:53 {
    kubernetes cluster.local {
        pods insecure
    }
    prometheus :9153
    proxy . /etc/resolv.conf
    cache 30
}
The main component here is the kubernetes plugin block. This is a coredns plugin that knows how to watch the Kubernetes apiserver for Services and Endpoints and bind the service names to the Pod Endpoint IP addresses. It will answer any queries that end in cluster.local (and some reverse queries, too).

Recall that our queries will look like this:

kube-health.localhost.default.svc.cluster.local
kube-health.localhost.svc.cluster.local
kube-health.localhost.cluster.local
kube-health.localhost
So the cluster.local-suffixed queries match this plugin. coredns will not have any record for kube-health.localhost, so the lookup will fall through out of this plugin (and the next plugin) to the proxy plugin.

proxy . /etc/resolv.conf states that all unanswered queries should be forwarded to the resolvers specified in /etc/resolv.conf.

The contents of /etc/resolv.conf inside any container are controlled by the Pod's dnsPolicy. Normal Pods use the standard value of ClusterFirst, which causes the flow we've been examining thus far (see the beginning of this note for a review of what /etc/resolv.conf looks like in this case). But for coredns, the dnsPolicy is set to Default.
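For contrast, a Pod can request the node's resolver configuration explicitly by setting this field in its spec. A minimal sketch (the Pod name and image here are illustrative, not from the original):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: uses-host-dns        # hypothetical name
spec:
  dnsPolicy: Default         # inherit the node's /etc/resolv.conf
  containers:
    - name: main
      image: busybox
      command: ["sleep", "3600"]
```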

$ kubectl get deploy coredns -n=kube-system -o json | jq .spec.template.spec.dnsPolicy
"Default"
This is called Default because the default behavior of Docker is to use the /etc/resolv.conf (along with /etc/hosts) from the host when Docker launches containers. For coredns containers, the contents of /etc/resolv.conf will match that of the host, minikube. Inspecting that file, we see a simple, single nameserver entry. This is how DNS queries for non-Kubernetes services from inside Pods will be routed.

minikube$ cat /etc/resolv.conf
# This file is managed by man:systemd-resolved(8). Do not edit.
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
# Third party programs must not access this file directly, but only through the
# symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a different way,
# replace this symlink by a static file or a different symlink.
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.


So, what is that nameserver? Looking at the configured routes on minikube:

minikube$ ip route
default via <gateway> dev eth1 proto dhcp src <eth1 addr> metric 1024
<subnet> dev eth1 proto kernel scope link src <eth1 addr>
<gateway> dev eth1 proto dhcp scope link src <eth1 addr> metric 1024
<docker subnet> dev docker0 proto kernel scope link src <docker0 addr>

Packets for the nameserver will be routed via eth1 to the default gateway, which is the virtual NAT gateway created by VirtualBox when configuring a NAT-type NIC. Likewise, the nameserver address is the one VirtualBox creates for NAT interfaces.

From the VirtualBox docs:

The NAT engine by default offers the same DNS servers to the guest that are configured on the host.

So our DNS query packet will be sent to the VirtualBox host’s nameserver. Sure enough, there's nothing registered for the name kube-health.localhost or any of the variants, and we get an NXDOMAIN reply:

$ nslookup kube-health.localhost.cluster.local

** server can't find kube-health.localhost.cluster.local: NXDOMAIN

If you’ve read this far, this is consistent with the original problem: lookups of kube-health.localhost don't work.

But the odd thing was that, when running on a CentOS image, ping did work. As did a manually compiled Golang test program that connects to that name.

container$ cat dns_test.go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	res, err := http.Get("http://kube-health.localhost")
	if err != nil {
		fmt.Println(err)
	} else {
		bodyBytes, _ := ioutil.ReadAll(res.Body)
		fmt.Println(string(bodyBytes))
	}
}

container$ go run dns_test.go
{
  "OK": false,
  "Errors": [
    "Check execution error: kube-health/daemonset-check: timed out waiting for checker pod to report in",
    "Check execution error: kube-health/deployment-check: timed out waiting for checker pod to report in"
  ],
  "CheckDetails": {
    "kube-health/daemonset-check": {
      "OK": false,
      "Errors": [
        "Check execution error: kube-health/daemonset-check: timed out waiting for checker pod to report in"
      ],
      "RunDuration": "",
      "Namespace": "kube-health",
      "AuthoritativePod": "kube-health-b8f9ff5fb-7d629",
      "uuid": "32f736c7-3698-4671-8276-6cc8a8cb8b35"
    },
    "kube-health/deployment-check": {
      "OK": false,
      "Errors": [
        "Check execution error: kube-health/deployment-check: timed out waiting for checker pod to report in"
      ],
      "RunDuration": "",
      "Namespace": "kube-health",
      "AuthoritativePod": "kube-health-b8f9ff5fb-7d629",
      "uuid": "d2ffdb29-de2c-40f9-8e00-62636336c332"
    }
  },
  "CurrentMaster": ""
}

Recall that our CentOS image uses systemd with an /etc/nsswitch.conf that includes a myhostname plugin for hosts.

This causes libc to resolve any hostname ending in ".localhost" or ".localhost.localdomain" to the IP addresses 127.0.0.1 and ::1.

This explains why ping, curl, and other libc-linked programs work. After the /etc/hosts lookup fails and the DNS query to coredns results in NXDOMAIN, libc finally falls back to the myhostname plugin, which resolves the name to the local loopback.

So why does our test program work, but the Kuberhealthy checker does not?

Recall that Golang attempts to use the go resolver, and does not use the cgo, libc-linked resolver by default. So we'd expect resolution to fail in both the test program and Kuberhealthy.

Given certain features in /etc/nsswitch.conf, Golang will fall back to using the libc-linked resolver. It turns out that myhostname is one of these features: its presence causes resolution via cgo when the hostname being resolved is "localhost"-like.

So, on our CentOS image, with myhostname NSS plugin present, we would now expect that the cgo resolver should always be used and that our queries for kube-health.localhost would be handled by myhostname NSS plugin and would succeed. Indeed, this is what we observe for the test app. But how do we explain the failure for the Kuberhealthy check Pod?

Well, it turns out that you can force Go to always use the go resolver and never use the cgo resolver by disabling cgo at build time.

Kuberhealthy does this in their image build by setting CGO_ENABLED=0 when compiling the check binaries.
This forces use of the go resolver, and the DNS query follows the path described in the first part of this article. Removing CGO_ENABLED=0 reverts to dynamic resolver selection, which falls back to the myhostname NSS plugin on CentOS.

We can confirm this behavior by using the very helpful debug statements in the Go DNS package by enabling them with GODEBUG=netdns=2. With this environment variable present, the DNS package will log which resolver is used (and why) and how the DNS lookup will be performed.

Sure enough, with CGO_ENABLED=0, we see the same error as the original problem: Get http://kube-health.localhost: dial tcp: lookup kube-health.localhost on no such host.

container$ CGO_ENABLED=0 GODEBUG=netdns=2 go run dns_test.go
go package net: built with netgo build tag; using Go's DNS resolver
go package net: hostLookupOrder(kube-health.localhost) = files,dns
Get http://kube-health.localhost: dial tcp: lookup kube-health.localhost on no such host
container$ GODEBUG=netdns=2 go run dns_test.go
go package net: dynamic selection of DNS resolver
go package net: hostLookupOrder(kube-health.localhost) = cgo
Get http://kube-health.localhost: dial tcp connect: connection refused

container$ CGO_ENABLED=1 GODEBUG=netdns=2 go run dns_test.go
go package net: dynamic selection of DNS resolver
go package net: hostLookupOrder(kube-health.localhost) = cgo
Get http://kube-health.localhost: dial tcp connect: connection refused

Now that we understand the problem, the fix is trivial. Since our Docker image is based on CentOS, we have a full libc available and simply remove the CGO_ENABLED=0 from the build. Problem solved, all systems go for Kuberhealthy.

If you’re interested in joining us, check out our open opportunities.


