UnknownHostException? A CoreDNS failure? A Journey of Troubleshooting a National Platform Incident

Kangsheng Wong
Government Digital Services, Singapore
5 min read · May 30, 2024

Last month, the MOH’s National Platform encountered significant disruptions in its Kubernetes cluster, which is managed via Amazon EKS. The issues included high latency, timeouts, and connection drops during DNS lookups, which disrupted ongoing business activity. We observed a high rate of UnknownHostException errors being thrown by our applications.

UnknownHostException is a common error in applications. It typically indicates that a hostname could not be resolved, in other words a DNS resolution failure.
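
For context, here is a minimal Java sketch of where this error surfaces (the hostname is a placeholder, not one of our actual services): any call that resolves a hostname, such as InetAddress.getByName, throws UnknownHostException when the resolver cannot return an address.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsLookupDemo {
    public static void main(String[] args) {
        // Placeholder service name for illustration.
        String host = "orders.internal.svc.cluster.local";
        try {
            InetAddress address = InetAddress.getByName(host);
            System.out.println(host + " resolved to " + address.getHostAddress());
        } catch (UnknownHostException e) {
            // Thrown when the DNS lookup fails at the resolver, for example
            // when an overloaded CoreDNS pod drops or delays the query.
            System.err.println("DNS resolution failed for " + host + ": " + e.getMessage());
        }
    }
}
```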

Our platform relies on AWS EKS with Fargate for managed Kubernetes compute and AWS AppMesh for the service mesh layer. Within the K8s cluster, we use the CoreDNS add-on as the DNS resolver. Because we consider CoreDNS a crucial component of the cluster, we set it up with high availability (HA) and auto-scaling.

Challenges in Troubleshooting

Given our setup, we faced several challenges when troubleshooting this incident.

  1. EKS Fargate: There could be resource-allocation issues or performance bottlenecks within the Fargate infrastructure.
  2. AWS AppMesh: As the service mesh layer, AWS AppMesh plays a crucial role in managing communication between services within the cluster. Misconfigurations or issues within AppMesh could lead to the observed latency, timeouts, and connection drops.
  3. CoreDNS: DNS-related issues could also be a source of problems. Misconfigurations or performance issues within CoreDNS could lead to DNS lookup failures or delays.
  4. Microservice Resilience and Configuration: The services might not have been designed with high resilience in mind, and their configuration (endpoints, timeouts, retry policies) also had to be reviewed.

Troubleshooting Journey

First, our DevOps team investigated the incident by inspecting the CoreDNS configuration and logs. We double-checked that the configuration was correctly set up for resolving DNS queries within the Kubernetes cluster, paying special attention to upstream DNS servers, caching settings, and plugin configurations. We also reviewed the CoreDNS logs for errors, warnings, or unusual patterns that could indicate DNS lookup failures or performance issues, looking for any correlation between DNS-related events and the reported disruptions in the cluster.

At the same time, the team began examining all the microservices. We inspected each service's configuration to identify any missing or misconfigured settings that could contribute to the reported disruptions, looking for issues such as incorrect service endpoints, timeouts, retry policies, or dependencies on external services that could impact service communication and resilience.
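
As an illustration of the kind of client-side resilience we were checking for, here is a hedged Java sketch (the hostname and TTL values are illustrative, not our production configuration) that retries a failed lookup with backoff and shortens the JVM's DNS cache TTLs so a transient resolver failure is not cached for long.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

public class ResilientLookup {

    static {
        // Cache successful lookups for 30s and negative results for 5s so a
        // transient resolver failure is not pinned in the JVM's DNS cache.
        // (Illustrative values; must be set before the first lookup.)
        Security.setProperty("networkaddress.cache.ttl", "30");
        Security.setProperty("networkaddress.cache.negative.ttl", "5");
    }

    /** Retries a DNS lookup with simple exponential backoff. */
    static InetAddress resolveWithRetry(String host, int maxAttempts)
            throws UnknownHostException, InterruptedException {
        UnknownHostException last = null;
        long backoffMillis = 100;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return InetAddress.getByName(host);
            } catch (UnknownHostException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis); // back off before retrying
                    backoffMillis *= 2;
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder hostname for illustration.
        System.out.println(resolveWithRetry("payments.internal.svc.cluster.local", 3));
    }
}
```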

We also engaged AWS and raised support tickets, exchanging our findings and troubleshooting notes with the AWS Support Team.

Seeing clues! Thanks to Observability!

Luckily, when we designed our platform, we took into account the importance of observability.

While awaiting AWS’s response, we immediately created a Grafana dashboard for CoreDNS. By monitoring CoreDNS metrics, we found anomalies in the CPU, memory, and networking metrics of the Fargate CoreDNS pods, which gave us insight into potential issues.

From the dashboard, we found that the load on CoreDNS was not balanced: one particular pod received far more requests than the other CoreDNS pods, and its goroutine count and CPU usage were noticeably higher.

Hence, we started experimenting by terminating the affected pod and checking the occurrences of UnknownHostException.

Immediately, CoreDNS performance returned to normal, and there were no exceptions during that period. However, after 2 hours and 30 minutes, the symptoms returned and haunted us.

We immediately shared and escalated this finding with AWS, which triggered an investigation by their internal team. AWS could not isolate the root cause right away, as they were focused on issues within the CoreDNS component. After further investigation, they suspected the issue could be in AppMesh and engaged the AppMesh team for assistance. They eventually identified extraneous IPv6 lookups as a contributing factor and recommended disabling the IPv6 setting on AppMesh, which we applied immediately to remedy the problem.

Finally, DNS requests were distributed across all the CoreDNS pods, and we no longer observed any UnknownHostException.

Root Cause

AWS shared that the root cause of this issue was a latent software defect in AWS AppMesh. Applications running on the platform's Kubernetes cluster are configured to use AWS AppMesh, so their DNS requests go through the AppMesh Proxy. The cluster was configured in IPv4-only mode, with CoreDNS set to return “non-existent domain” (NXDOMAIN) for IPv6 endpoints (AAAA records). However, the AppMesh Proxy's default configuration is to try an IPv6 endpoint and then an IPv4 endpoint for every DNS lookup, which doubled the number of DNS requests made to CoreDNS. This doubling led to a backlog of requests, and as a result the AppMesh Proxy began to reuse an existing UDP socket between itself and a CoreDNS pod. That triggered the underlying defect, which caused most requests to “stick” to that pod, giving it more traffic than it could handle.
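
To make the doubling concrete, here is a hedged sketch, using the third-party dnsjava library (not part of our platform; the hostname is a placeholder), of the two queries the AppMesh Proxy effectively issues for every lookup on an IPv4-only cluster: the AAAA query comes back with no usable answer, and only the A query that follows resolves.

```java
import org.xbill.DNS.Lookup;
import org.xbill.DNS.Record;
import org.xbill.DNS.Type;

public class DualStackLookupDemo {
    public static void main(String[] args) throws Exception {
        String host = "payments.internal.svc.cluster.local"; // placeholder name

        // First query: AAAA (IPv6). On this IPv4-only cluster the answer is
        // not usable (in our incident, CoreDNS returned NXDOMAIN).
        Lookup aaaa = new Lookup(host, Type.AAAA);
        Record[] v6 = aaaa.run();
        System.out.println("AAAA answers: "
                + (v6 == null ? "none (" + aaaa.getErrorString() + ")" : v6.length));

        // Second query: A (IPv4). This is the lookup that actually resolves,
        // so every resolution costs two round trips to CoreDNS.
        Lookup a = new Lookup(host, Type.A);
        Record[] v4 = a.run();
        System.out.println("A answers: "
                + (v4 == null ? "none (" + a.getErrorString() + ")" : v4.length));
    }
}
```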

Aftermath

  1. Due to this incident, our Day2Ops team has a backlog of clean-up tasks and BAU patches to work through.
  2. AWS is following up by updating the AppMesh documentation to guide customers on configuring their IPv4/IPv6 preferences.
  3. AWS will also fix the defect in AWS AppMesh to limit the requests made on a single UDP socket.

Key Takeaway

  1. Comprehensive Troubleshooting Approach: A thorough investigation that examined various layers of the infrastructure, including Kubernetes (EKS Fargate), the service mesh (AWS AppMesh), DNS resolution (CoreDNS), and microservice configurations, helped us identify the root cause of the issue.
  2. Importance of Observability: Our platform's emphasis on observability, with monitoring and dashboarding, played a crucial role in identifying anomalies and guiding our troubleshooting efforts. Observability tools provided insights into resource contention and helped validate the impact of our interventions.
  3. Vendor Support Collaboration: Engaging AWS support enabled us to leverage the expertise of AWS teams in diagnosing and resolving the issue. Collaborating with the vendor expedited the troubleshooting process and led to the identification of the latent software defect in AWS AppMesh.

Finally, thanks to everyone who helped resolve this incident. Kudos to the DevOps Team (Alvin Siew, Peng Hiang Low, Ankit, MingSheng, DJ, and Sella)!!! At last, we can have a good night's sleep after many sleepless nights.
