CoreDNS pods DNS resolution issue
Hey all, I wanted to share the story of a recent DNS issue we faced.
Root cause: A transient network glitch interrupted connectivity between the cluster and the upstream DNS server, which led to DNS resolution timeouts.
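For context, CoreDNS is only authoritative for the cluster-internal zones (cluster.local and the reverse zones); everything else is forwarded to the upstream resolvers configured in its Corefile, which is why a broken path to that upstream surfaces as i/o timeouts. A typical default Corefile looks roughly like the sketch below; the exact plugins and forward targets in any given cluster may differ.

kubectl -n kube-system get configmap coredns -o yaml

# Trimmed example Corefile (illustrative, not our exact config)
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}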
Error:
[ERROR] plugin/errors: 2 svc-example.example-prod.svc.cluster.local.example.com. A: unreachable backend: read udp : i/o timeout
[ERROR] plugin/errors: 2 master01.example.com. A: unreachable backend: read udp x.x.x.x:36063->x.x.x.x:53: i/o timeout
[ERROR] plugin/errors: 2 master02.example.com. A: unreachable backend: read udp x.x.x.x:36063->x.x.x.x:53: i/o timeout
[ERROR] plugin/errors: 2 master03.example.com. A: unreachable backend: read udp x.x.x.x:36063->x.x.x.x:53: i/o timeout
[ERROR] plugin/errors: 2 node1.example.com. A: unreachable backend: read udp x.x.x.x:36063->x.x.x.x:53: i/o timeout
[ERROR] plugin/errors: 2 node2.example.com. A: unreachable backend: read udp x.x.x.x:36063->x.x.x.x:53: i/o timeout
CoreDNS failed to resolve both internal names (masters, nodes, services, pods, etc.) and external names (Redis, Splunk, etc.).
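The errors above were taken from the CoreDNS pod logs; assuming the standard k8s-app=kube-dns label that CoreDNS carries in most clusters, they can be pulled with:

kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100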
As part of troubleshooting, we spun up a busybox pod to check whether DNS resolution was working as expected:
kubectl run -it --rm --restart=Never --image=busybox busybox
Run the below nslookup command:
/ # nslookup kubernetes.default.svc.cluster.local

Output:
Server:    10.96.0.10
Address:   10.96.0.10#53

*** Can't find kubernetes.default.svc.cluster.local: No answer
The output suggested that resolution was not happening as expected, so we suspected something had gone wrong with the CoreDNS pods.
We then learned that busybox's built-in nslookup does not always resolve names reliably the way dedicated DNS tools do, so we spun up a new pod that ships with proper DNS tooling:
kubectl run -it --rm --restart=Never --image=infoblox/dnstools:latest dnstools
Run the below nslookup command:
dnstools# nslookup kubernetes.default.svc.cluster.local

Output:
Server:    10.96.0.10
Address:   10.96.0.10#53

Name:      kubernetes.default.svc.cluster.local
Address:   10.96.0.1
The above output shows that DNS resolution was working from this pod.
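Because external names (Redis, Splunk, etc.) were failing as well, it is also worth querying an out-of-cluster name from the same dnstools pod to exercise the upstream forwarders; the hostnames below are placeholders, not our real ones.

dnstools# nslookup splunk.example.com
dnstools# dig +short redis.example.com @10.96.0.10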
However, even after the network glitch was resolved, the issue with the CoreDNS pods persisted.
Identifying the Issue
Our investigation into the DNS resolution issue unfolded as follows:
- Monitoring with Datadog: Upon receiving the alerts, we focused on CPU and memory utilization across the Kubernetes nodes.
- Detecting Abnormal Compute Usage: Datadog revealed an anomaly in CPU consumption on one of the nodes hosting CoreDNS pods. This unusual spike indicated that CoreDNS was monopolizing CPU resources, potentially starving other critical processes.
- Verification on the VM: To dig deeper, we accessed the affected VM directly and observed that the number of CoreDNS child processes had hit its maximum, pointing to resource exhaustion (see the checks sketched after this list).
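A rough sketch of the checks we ran (node names, labels, and tooling are illustrative; kubectl top requires metrics-server):

# Resource usage at the node and pod level
kubectl top nodes
kubectl top pods -n kube-system -l k8s-app=kube-dns

# On the affected VM: top CPU consumers and the CoreDNS process/thread count
top -b -n 1 | head -20
ps -eLf | grep -c '[c]oredns'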
Mitigation Steps:
- Restarting the Pod: As an immediate measure, we restarted the CoreDNS pod. The scheduler placed it on a new node, and DNS calls in the cluster resumed.
- Removing the Starved Node: Subsequently, we decommissioned the resource-starved node from the Kubernetes cluster. This prevented further disruptions and ensured other pods continued to operate smoothly (rough commands for both steps below).
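For reference, the mitigation roughly mapped to the following commands (the deployment and node names are illustrative):

# Restart the CoreDNS pods; the scheduler places the replacements on healthy nodes
kubectl -n kube-system rollout restart deployment coredns

# Cordon, drain, and remove the starved node
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>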