DNS could be a reason for your application’s degraded performance!

Mohan Kumar B · Rapido Labs · Nov 29, 2021

Networking issues are almost always challenging and they teach us something new every time we come across them. We at Rapido faced one such challenge related to DNS resolutions.

Rapido has always been evolving its processes and offerings to be the best in the mobility space. To meet all of these demands, we introduced more and more microservices, which continued to grow at a pace we never anticipated, and managing them only kept getting harder. To tackle this, we decided to migrate all our applications to Kubernetes. If you are curious to know more about how we managed to scale such massive infrastructure without impacting our users on the other end, check out the article below!

Just as we were celebrating this feat, we came across a daunting yet challenging issue!

During peak hours of operation, a majority of our critical services running on NodeJS threw ETIMEDOUT and EAI_AGAIN errors, time and again! We had not observed such issues while the same services were running on VMs (dedicated servers). The errors were even more evident in services that received high traffic (>500 rps). Response times spiked on the affected services, and this had a cascading effect on other services, hurting our operations during peak hours when we get the bulk of our orders! Yup, that hurt us real bad and was a frequent visitor, which pressured us to find and fix the cause as early as possible.

This article talks in depth about how we went on to identify this daunting issue and how we finally breathed a sigh of relief.

How did the issue surface?

It was first sighted during one of our peak hours of operation, when alerts on HTTP status code 499 (request timeouts) increased. When we looked further into the logs of the corresponding service (which happened to be a high-traffic NodeJS service), we spotted plenty of requests failing with ETIMEDOUT and EAI_AGAIN errors.

Sample error log:
Error: getaddrinfo EAI_AGAIN service-name.namespace.svc.cluster.local
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:67:26)
    at GetAddrInfoReqWrap.callbackTrampoline (internal/async_hooks.js:131:14)

Below is a graph which shows the spike in response times at the time of the issue. While we were looking at this issue, it also exposed some badly configured timeouts on the clients, which we went on to fix later.

The stack trace shared above hinted that something was wrong with DNS, and the observations below confirmed it.

The below graph shows the time it took for the HTTP client library to create a connection: 36.88 seconds! That is a huge amount of time just to establish a connection.

If we observe the above image closely, it is evident that the majority of the time for the HTTP request was spent on creating a connection. And the first step in creating a connection is a DNS lookup to fetch the IP to establish the connection with.

Just as we spotted the above, we also observed a spike in NXDOMAIN errors at the same time which can be seen in the below graph.

NXDOMAIN simply indicates that no DNS record was found for a provided domain name. At the same time we also observed DNS query latencies shoot up.

All these observations made it clear to us that DNS was the culprit, but what caused it was still unclear, and we were on a manhunt to find the cause.

What caused the issue to occur?

While we were hunting for the cause, we came across quite a few articles and Stack Overflow answers which suggested increasing UV_THREADPOOL_SIZE. This seemed like a legit fix, as the official NodeJS documentation hinted at it as well:

Though the call to dns.lookup() will be asynchronous from JavaScript’s perspective, it is implemented as a synchronous call to getaddrinfo(3) that runs on libuv’s threadpool. This can have surprising negative performance implications for some applications, see the UV_THREADPOOL_SIZE documentation for more information.

The documentation referenced in the above suggestion also mentions the following:

Because libuv’s threadpool has a fixed size, it means that if for whatever reason any of these APIs takes a long time, other (seemingly unrelated) APIs that run in libuv’s threadpool will experience degraded performance. In order to mitigate this issue, one potential solution is to increase the size of libuv’s threadpool by setting the ‘UV_THREADPOOL_SIZE’ environment variable to a value greater than 4 (its current default value).

The suggestion seemed reasonable and got us excited that it could help fix our DNS issues, so we went on to increase UV_THREADPOOL_SIZE from the default of 4 to 16. While this helped reduce the issue to an extent, sadly it was not averted!
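For reference, libuv reads UV_THREADPOOL_SIZE from the environment when it initialises its threadpool, so for a containerised NodeJS service the natural place to set it is the container spec of the deployment. A minimal sketch, assuming a plain Kubernetes Deployment (the container name and image below are placeholders):

containers:
  - name: my-node-service               # placeholder
    image: my-node-service:latest       # placeholder
    env:
      - name: UV_THREADPOOL_SIZE
        value: "16"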

ndots config it is!

Our debugging spree continued, and we came across multiple articles suggesting a move away from kube-dns to CoreDNS, but we could not find concrete reasoning that pushed us to make that move. We were still researching this, and after hours of debugging we arrived at what could be a possible cause of the increased DNS lookup latencies and NXDOMAIN errors: the way a config option named ndots is configured by default by the kubelet in the DNS resolver configuration file of every Kubernetes pod (this configuration file is what tells the DNS resolver how it should resolve DNS names).

What was the problem with ndots?

It was simply the way ndots was configured. But first, let's try to understand how DNS resolution works in Kubernetes and what role ndots plays in resolving domain names.

How does DNS resolution work in Kubernetes and what was the problem?

DNS resolution inside a container, like on any Linux system, is driven by the /etc/resolv.conf config file. This configuration file tells the DNS resolver how it should resolve DNS names. The kubelet running on each node configures the pod's /etc/resolv.conf to use the kube-dns service's ClusterIP. Every Service defined in the Kubernetes cluster is assigned a DNS name, and when one service communicates with another it uses these DNS names; kube-dns is the name server that resolves these domain names to the corresponding IPs. The search and options entries in /etc/resolv.conf are simply additional instructions for the resolver to follow when resolving names to IPs. When an application connects to a remote host specified by name, a DNS resolution is performed, typically via a resolver call like getaddrinfo(). If the name is not fully qualified (not ending with a .), will the resolver try the name as an absolute one first, or will it go through the local search domains first? That depends on the ndots option.

Below is an example configuration of /etc/resolv.conf which was present in every pod:

nameserver 10.0.0.10
search default.svc.cluster.local svc.cluster.local cluster.local c.my-project-id.internal google.internal
options ndots:5
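You can check what the kubelet actually wrote into any of your running pods with kubectl exec -it <pod-name> -- cat /etc/resolv.conf (the pod name here is a placeholder).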

Let's try to understand, in short, what each of the above options means:

  • nameserver - IP address (IPv4/IPv6) of a name server that the resolver should query to resolve DNS names. If there are multiple servers, the resolver library queries them in the order listed. If no nameserver entries are present, the default is to use the name server on the local machine.
  • search - Search list for host-name lookup. Resolver queries having fewer than ndots dots in them (the default is configured to 5 in Kubernetes pods) will be attempted using each component of the search path in turn until a match is found.
  • ndots - Sets a threshold for the number of dots which must appear in a domain name before an initial absolute query will be made. With the default of 5 in our case, a name containing 5 or more dots is tried first as an absolute name before any search list elements are appended to it. The value for this option is silently capped at 15.

Below is an example to help understand the above options more clearly. Let us consider the same resolver config (/etc/resolv.conf) that is mentioned above with the nameserver as 10.0.0.10.

  • Let’s say a pod makes a call to a service named entities which is in a namespace named default using the domain name entities.default.svc.cluster.local.
  • The above domain name has 4 dots in the service URL and the value of ndots as per /etc/resolv.conf is 5.
  • Since the number of dots in the domain name (4) is less than the configured ndots threshold (5), the DNS resolver will first go through the local search domains sequentially, appending each element of the search list to the name, and only if none of those succeed will it resolve the name as an absolute one at the end. This means the resolver in the pod has to query the kube-dns server for the domain names below before it queries the absolute name:
    entities.default.svc.cluster.local.default.svc.cluster.local
    entities.default.svc.cluster.local.svc.cluster.local
    entities.default.svc.cluster.local.cluster.local
    entities.default.svc.cluster.local.c.my-project-id.internal
    entities.default.svc.cluster.local.google.internal
  • So, for each domain name resolution request, the pod will have to initiate 6 query requests to the kube-dns server!
  • In the above example, since the first 5 DNS resolutions fail, each of them results in an NXDOMAIN error as there is no DNS record for those names, and this accounts for the surge in NXDOMAIN error count we saw in the graph shared earlier.

Clearly, for each TCP connection established, the pod issues 6 DNS queries before the name is correctly resolved, because it goes through the 5 local search domains first and only then issues an absolute-name resolution query. This adds additional latency and load on the kube-dns pods and degrades application performance during peak hours as the number of requests rises!
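To see this behaviour for yourself, below is a minimal NodeJS sketch (an illustration using the example service name above, not code from our services) that times dns.lookup() for the relative name against the absolute name with a trailing dot. Run inside a pod, the relative lookup walks the search list while the absolute one issues a single query.

// compare search-list resolution vs. absolute-name resolution
const dns = require('dns');

function timedLookup(hostname) {
  return new Promise((resolve) => {
    const start = process.hrtime.bigint();
    dns.lookup(hostname, (err, address) => {
      const ms = Number(process.hrtime.bigint() - start) / 1e6;
      resolve({ hostname, address, error: err && err.code, ms });
    });
  });
}

(async () => {
  // Relative name: with ndots:5 the resolver tries every search domain first.
  console.log(await timedLookup('entities.default.svc.cluster.local'));
  // The trailing dot makes the name absolute: a single query is issued.
  console.log(await timedLookup('entities.default.svc.cluster.local.'));
})();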

How did we fix the issue?

We clearly knew that the problem was that more queries were being issued to kube-dns than necessary. To reduce name resolution to a single query for the absolute name passed, all we had to do was reduce the ndots value to 4 (as our service names have 4 dots in their domain names). We could achieve this by editing the deployment of the pods and adding the below DNS config:

dnsConfig:
  options:
    - name: ndots
      value: "4"
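For context, this dnsConfig block sits under the pod template's spec in the Deployment manifest. A trimmed sketch, with the deployment name, labels, and image as placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: entities                       # placeholder
spec:
  selector:
    matchLabels:
      app: entities                    # placeholder
  template:
    metadata:
      labels:
        app: entities
    spec:
      dnsConfig:
        options:
          - name: ndots
            value: "4"
      containers:
        - name: entities               # placeholder
          image: registry.example.com/entities:latest   # placeholder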

Now, when the domain name of the service (entities.default.svc.cluster.local) is to be resolved, the number of dots in the name is equal to the ndots value (4 >= 4), so the DNS resolver only has to make one query to the kube-dns server to resolve the name to its corresponding IP.

We also ran a load test to prove the above and, as expected, observed a huge improvement in performance! We were then all set to deploy this change to production to see the impact, and the impact was hugely positive. Below are some observations post deployment in production, for a single high-rps NodeJS service.

Response times of the service dropped by 30–40%!

DNS queries & error rate dropped by half!

Once we observed such a huge positive impact from the change in the ndots value for a single service, we confidently went on to make the same change for all other services, and the error rate dropped further, with improved performance observed across multiple services!

We monitored for a few days and no longer observed the daunting EAI_AGAIN errors. We finally breathed a sigh of relief and got our adrenaline pumped, like every other time an issue is identified and resolved!

Why did we observe this issue only in NodeJS services?

While we were debugging to find the cause, one question did strike us: why did we not see the same issue in Java and Golang services, and why was it seen, or aggravated, only in NodeJS services?

On researching, this is what we came across —

  • NodeJS relies on the underlying OS to do the caching and does not take on the additional responsibility of caching DNS queries itself. There are some libraries which support application-level caching (cacheable-lookup, for example, and the http client library Got, which uses cacheable-lookup as a dependency), though we have not tested the efficacy of these caching libraries; a rough sketch of what wiring one in could look like follows this list.
  • Java & Golang handle everything on their own, i.e., they cache DNS responses with TTLs, and as a result applications written in Java/Golang did not show signs of degraded performance.

What are our further action plans?

We do realise that reducing DNS queries to the kube-dns service by reducing ndots to 4 helped mitigate the issue, but the same issue could hit us again at some point when we scale higher a few months or years down the line. Meanwhile, here are a couple of other action items we will be working on:

  • As an immediate action, we will reduce ndots further, down to 2, to avoid multiple search queries for any external APIs which have fewer than 4 dots in their domain names (for example, maps.googleapis.com).
  • Explore NodeLocal DNSCache, which improves cluster DNS performance by running a DNS caching agent on cluster nodes as a DaemonSet. This would help improve the performance of any services written in languages that rely on the OS to do the caching.

Are you interested in solving such interesting problems at scale? If so, we are looking for passionate engineers to join our team in Bangalore. Feel free to reach out to mohan@rapido.bike if you want to chat about any open position available on www.rapido.bike/Careers or if you are seeking a referral.
