CoreDNS pods DNS resolution issue

Mohan P Edala
Oct 21, 2019


Hey all, I wanted to share the story of a recent DNS issue we faced.

Root cause: A minor network glitch interrupted connectivity between the cluster and the upstream DNS server, which led to DNS resolution timeouts.
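
For a bit of context on why an upstream outage shows up inside the cluster: CoreDNS answers cluster.local names itself and forwards everything else to the upstream nameservers through the forward plugin. A typical Corefile looks roughly like the trimmed sketch below (you can see your actual one with kubectl -n kube-system get configmap coredns -o yaml; yours may differ). When the upstream behind the forward line stops responding, CoreDNS logs the "unreachable backend" errors you see next.

.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}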

Error:

[ERROR] plugin/errors: 2 svc-example.example-prod.svc.cluster.local.exmaple.com. A: unreachable backend: read udp : i/o timeout
[ERROR] plugin/errors: 2 master01.example.com. A: unreachable backend: read udp x.x.x.x:36063->x.x.x.x:53: i/o timeout
[ERROR] plugin/errors: 2 master02.example.com. A: unreachable backend: read udp x.x.x.x:36063->x.x.x.x:53: i/o timeout
[ERROR] plugin/errors: 2 master03.example.com. A: unreachable backend: read udp x.x.x.x:36063->x.x.x.x:53: i/o timeout
[ERROR] plugin/errors: 2 node1.example.com. A: unreachable backend: read udp x.x.x.x:36063->x.x.x.x:53: i/o timeout
[ERROR] plugin/errors: 2 node2.example.com. A: unreachable backend: read udp x.x.x.x:36063->x.x.x.x:53: i/o timeout

CoreDNS failed to resolve both internal names (masters, nodes, services, pods, etc.) and external names (Redis, Splunk, etc.).
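
If you want to pull these errors yourself, the CoreDNS pods typically carry the k8s-app=kube-dns label in the kube-system namespace (adjust the label and namespace if your cluster differs):

# list the CoreDNS pods and the nodes they run on
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# tail their logs for errors like the ones above
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100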

As part of troubleshooting, we started with a busybox pod to check whether DNS resolution was working as expected:

kubectl run -it --rm --restart=Never --image=busybox busybox

Run the following nslookup command:

/ # nslookup kubernetes.default.svc.cluster.local

Output:

Server:    10.96.0.10
Address:   10.96.0.10#53

*** Can't find kubernetes.default.svc.cluster.local: No answer

The output seemed to show that resolution was not working as expected, and we suspected that something had gone wrong with the CoreDNS pods.

Then we came to know that busybox sometimes cannot do proper DNS resolution the way dedicated DNS tools do.

So we spun up a new pod that ships with proper DNS tools:

kubectl run -it --rm --restart=Never --image=infoblox/dnstools:latest dnstools

Run the following nslookup command:

dnstools# nslookup kubernetes.default.svc.cluster.local

Output:

Server:    10.96.0.10
Address:   10.96.0.10#53

Name:      kubernetes.default.svc.cluster.local
Address:   10.96.0.1

The above output shows that DNS resolution was working from this pod.
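
One thing to keep in mind: the service IP 10.96.0.10 load-balances across all CoreDNS replicas, so a lookup that happens to land on a healthy replica can succeed while another replica is struggling. A useful follow-up from the dnstools pod is to query each CoreDNS pod directly (grab the pod IPs with kubectl -n kube-system get endpoints kube-dns from your workstation; <coredns-pod-ip> below is a placeholder):

dnstools# dig @10.96.0.10 kubernetes.default.svc.cluster.local +short
dnstools# dig @<coredns-pod-ip> kubernetes.default.svc.cluster.local +short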

But even after the network issue was resolved, the problem with the CoreDNS pods persisted.

Identifying the Issue

Our investigation into the DNS resolution issue unfolded as follows:

  • Monitoring with Datadog: Upon receiving the alerts, we focused on CPU and memory utilization across the Kubernetes nodes.
  • Detecting Abnormal Compute Usage: Datadog revealed an anomaly in CPU consumption on one of the nodes hosting CoreDNS pods. This unusual spike indicated that CoreDNS was monopolizing CPU resources, potentially starving other critical processes.
  • Verification on the VM: To delve deeper, we accessed the affected VM directly. There we observed that the number of CoreDNS child processes had reached its maximum limit, indicating resource exhaustion (see the commands sketched after this list).
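
Roughly, these checks map to the commands below (kubectl top needs metrics-server; the VM-side check is a sketch rather than the exact commands we ran):

# CPU/memory per node and per CoreDNS pod (requires metrics-server)
kubectl top nodes
kubectl top pods -n kube-system -l k8s-app=kube-dns

# on the affected VM: overall load and how many CoreDNS processes are running
top -b -n 1 | head -n 20
ps -ef | grep [c]oredns | wc -l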

Mitigation Steps:

  • Restarting the Pod: As an immediate measure, we restarted the CoreDNS pod. The scheduler placed the replacement pod on a different node, which resumed DNS resolution in the cluster (a sketch of the commands follows this list).
  • Removing the Starving Node: Subsequently, we decommissioned the node experiencing resource starvation from the Kubernetes cluster. This step prevented further disruptions and ensured other pods continued to operate smoothly.
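
For reference, here is a minimal sketch of the commands behind these two steps (pod and node names are placeholders, and the drain flags may need tuning for your workloads):

# restart: delete the misbehaving pod so the Deployment reschedules it on another node
kubectl -n kube-system delete pod <coredns-pod-name>

# take the starving node out of rotation, then remove it from the cluster
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets
kubectl delete node <node-name>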
