If you’ve got here because the title caught your attention, then chances are you’ve struggled with DNS-related issues in Node.js before. These might show up as the infamous EAI_AGAIN or the equally common ETIMEDOUT, which is what I ran into because I had set a timeout on our HTTP requests.
In my case, my company’s service recently experienced a sudden increase in usage, which made these problems occur more often, to the point of causing outages. Our service architecture follows a very common pattern: to fulfill one user request, we call a handful of APIs, process their results and finally get back to the user with a proper response.
With the spike in traffic, we started to see a lot of ETIMEDOUT errors. When we looked closely, we noticed that requests were not reaching the target hosts, meaning they weren’t even being sent by the client. All of the timeouts occurred while trying to establish the connection, more precisely, while trying to resolve the servers’ hostnames to IP addresses.
Whatever symptoms you’re facing or may have come across, you probably know by now that, although HTTP calls in Node are asynchronous, hostname resolution is usually done by calling dns.lookup(), which is asynchronous from JavaScript’s perspective but in turn makes a synchronous call to a low-level function (getaddrinfo) running on a fixed number of threads.
For more information on this, take a look at: https://nodejs.org/api/dns.html#dns_implementation_considerations
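To make the distinction concrete, here is a minimal sketch contrasting the two resolution paths offered by Node’s built-in dns module. The helper names lookupViaThreadpool and resolveViaNetwork are made up for illustration. According to the Node documentation, dns.lookup() goes through getaddrinfo on libuv’s threadpool, while the dns.resolve*() family queries the DNS servers over the network without using the threadpool.

```javascript
const dns = require('dns');

// Resolved via getaddrinfo on libuv's threadpool: honours the OS resolver
// configuration (e.g. /etc/hosts), but each pending call occupies a thread.
function lookupViaThreadpool(hostname) {
  return dns.promises.lookup(hostname); // resolves to { address, family }
}

// Resolved by querying the DNS servers directly, without the threadpool.
// It does not consult /etc/hosts, so it is not a drop-in replacement.
function resolveViaNetwork(hostname) {
  return dns.promises.resolve4(hostname); // resolves to an array of IPv4 strings
}
```

Node’s HTTP clients use the first path by default, which is why a burst of outgoing requests to new hosts ends up queuing on the threadpool.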
Also, from the same document, we can see that:
And that’s where the problem lies. By default, there’ll only be 4 threads available for each Node process, as stated here: https://nodejs.org/api/cli.html#cli_uv_threadpool_size_size
Because libuv’s threadpool has a fixed size, it means that if for whatever reason any of these APIs takes a long time, other (seemingly unrelated) APIs that run in libuv’s threadpool will experience degraded performance. In order to mitigate this issue, one potential solution is to increase the size of libuv’s threadpool by setting the ‘UV_THREADPOOL_SIZE’ environment variable to a value greater than 4 (its current default value).
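As a rough illustration of that mitigation, the variable is normally set in the environment before the Node process starts. The sketch below adds a small, hypothetical helper (expectedThreadpoolSize) that reports which size the pool will most likely use. The value 16, the file name server.js and the upper bound of 1024 are assumptions for the example, not recommendations.

```javascript
// Usually set from the shell, before Node starts (server.js is a placeholder):
//   UV_THREADPOOL_SIZE=16 node server.js

const DEFAULT_POOL_SIZE = 4; // default stated in the Node.js docs
const MAX_POOL_SIZE = 1024;  // assumption: upper bound in recent libuv versions

// Hypothetical helper: estimate the size the threadpool will use.
function expectedThreadpoolSize(env = process.env) {
  const parsed = Number.parseInt(env.UV_THREADPOOL_SIZE, 10);
  if (Number.isNaN(parsed)) return DEFAULT_POOL_SIZE; // unset or not a number
  return Math.min(Math.max(parsed, 1), MAX_POOL_SIZE);
}

console.log(`Expected libuv threadpool size: ${expectedThreadpoolSize()}`);
```

Logging this at startup is a cheap way to confirm that the setting actually reached the process, for example after a deployment or container change.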
The implication is that, as the text says, seemingly unrelated API calls might start to fail because they are all competing for the same few threads during hostname resolution. Here’s a great article that describes this exact problem as faced by Uber: https://eng.uber.com/denial-by-dns/
Resolving localhost to ::1 (which is needed to connect to the local sidecar) involves calling a synchronous getaddrinfo(3). This operation is done in a dedicated thread pool (with a default of size 4 in Node.js). We discovered that these long DNS responses made it impossible for the thread pool to quickly serve localhost to ::1 conversions.
As a result, none of our DNS queries went through (even for localhost), meaning that our login service was not able to communicate with the local sidecar to test username and password combinations, nor call other providers. From the Uber app perspective, none of the login methods worked, and the user was unable to access the app.
That’s why DNS issues are even more aggravating in this situation: the unavailability of one service can snowball and affect other (seemingly) unrelated services.
With only this information, one could settle for the short-term solution of simply increasing the number of threads by setting UV_THREADPOOL_SIZE to a reasonable value. And it might work … in some cases. It didn’t for me, though.
So, I continued my search and found this other great article: https://medium.com/@amirilovic/how-to-fix-node-dns-issues-5d4ec2e12e95
It explains in greater detail all the points addressed so far (and served as inspiration for this article). It also talks about their experience fine-tuning the threadpool size:
While preparing for upcoming ultimate shopping mania sprees, we where running load tests across our whole system composed of bunch of different services. One team came back with report that they had serious issues with latency of dns lookups and error rates with coredns service in our kuberntes cluster even with many replicas and of course they already had UV_THREADPOOL_SIZE magic number fine tuned. Their solution to the problem was to include https://www.npmjs.com/package/lookup-dns-cache package in order to cache dns lookups. In their load tests it showed amazing results by improving performance 2x.
Finally, it offers a great solution to these problems: enabling a DNS caching service at the node level in their Kubernetes cluster.
More on this here: https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/
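Before moving on, the application-level caching idea mentioned in the quote above can be illustrated with a minimal in-memory sketch. This is not how lookup-dns-cache is implemented. The helper createCachedLookup and its parameters (resolver, ttlMs, now) are invented for this example, and a fixed TTL is a simplification because dns.lookup() does not expose the record’s real TTL.

```javascript
const dns = require('dns');

// Hypothetical helper: wraps a promise-returning resolver with a TTL cache.
function createCachedLookup(
  resolver = (hostname) => dns.promises.lookup(hostname),
  ttlMs = 30000, // illustrative value only
  now = Date.now
) {
  const cache = new Map(); // hostname -> { promise, expiresAt }

  return function cachedLookup(hostname) {
    const entry = cache.get(hostname);
    if (entry && entry.expiresAt > now()) {
      return entry.promise; // cache hit: no new threadpool work
    }

    // Cache the promise itself so concurrent callers share a single lookup.
    const promise = resolver(hostname);
    cache.set(hostname, { promise, expiresAt: now() + ttlMs });

    // Do not keep failed lookups around.
    promise.catch(() => {
      const current = cache.get(hostname);
      if (current && current.promise === promise) cache.delete(hostname);
    });

    return promise;
  };
}
```

Even a short TTL collapses a burst of lookups for the same host into one threadpool task, which is the effect the quoted team was after.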
So far, so good. It all made sense. And I happily explained the situation to the Ops Team expecting it to be an easy fix. But they were worried that this change could cause some side effects and even affect other services running on the same cluster. So, it was a no-go.
Back to my search, I stumbled upon another helpful article, not entirely related to DNS issues, whose insights could be applied to this kind of problem and bring other benefits as well. It described the upsides of reusing HTTP connections with the HTTP Keep-Alive functionality.
Here’s the text: https://lob.com/blog/use-http-keep-alive
As it goes:
One of the best ways to minimize HTTP overhead is to reuse connections with HTTP Keep-Alive. This feature is commonly enabled by default for many HTTP clients. These clients will maintain a pool of connections — each connection initializes once and handles multiple requests until the connection is closed. Reusing a connection avoids the overhead of making a DNS lookup, establishing a connection, and performing an SSL handshake. However, not all HTTP clients, including the default client of Node.js, enable HTTP Keep-Alive.
One of Lob’s backend services is heavily dependent on internal and external APIs to verify addresses, dispatch webhooks, start AWS Lambda executions, and more. This Node.js server has a handful of endpoints that make several outgoing HTTP requests per incoming request. Enabling connection reuse for these outgoing requests led to a 50% increase in maximum inbound request throughput, significantly reduced CPU usage, and lowered response latencies. It also eliminated sporadic DNS lookup errors.
The key benefit here was that, by reusing HTTP connections, the number of calls made to the DNS service decreased (a lot), eliminating the threadpool contention that caused all my problems. As an added bonus, it also improves performance by avoiding the cost of establishing a new connection, such as the SSL handshake and TCP slow-start (more on this here: https://hpbn.co/building-blocks-of-tcp/#slow-start).
Following the article’s recommendation, we changed the API calls and started to use the agentkeepalive lib (https://github.com/node-modules/agentkeepalive). The results were amazing. All the DNS issues and timeouts were gone.
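For readers who want to see the general shape of this change, here is a minimal sketch using Node’s built-in http.Agent and https.Agent with keepAlive enabled, instead of agentkeepalive. It is not our production code: the helper getWithKeepAlive and the maxSockets and maxFreeSockets values are assumptions for the example.

```javascript
const http = require('http');
const https = require('https');

// Illustrative pool limits, not recommendations.
const agentOptions = { keepAlive: true, maxSockets: 50, maxFreeSockets: 10 };
const keepAliveHttpAgent = new http.Agent(agentOptions);
const keepAliveHttpsAgent = new https.Agent(agentOptions);

// Hypothetical helper: GET a URL through the shared keep-alive agents, so
// repeated calls to the same host can reuse an already open connection.
function getWithKeepAlive(url) {
  const parsed = new URL(url);
  const isHttps = parsed.protocol === 'https:';
  const client = isHttps ? https : http;
  const agent = isHttps ? keepAliveHttpsAgent : keepAliveHttpAgent;

  return new Promise((resolve, reject) => {
    client
      .get(parsed, { agent }, (res) => {
        const chunks = [];
        res.on('data', (chunk) => chunks.push(chunk));
        res.on('end', () =>
          resolve({ statusCode: res.statusCode, body: Buffer.concat(chunks).toString() })
        );
      })
      .on('error', reject);
  });
}
```

The important part is that the agents are created once and shared. Creating a new agent per request would throw away the pooled sockets and bring back a DNS lookup for every call.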
We were very happy with the results, but then we started to see some side effects (it’s as they say: “no good deed goes unpunished”). But nothing as bad as before. Actually, it was a small price to pay for the improvements we’d made so far and, as it turned out, was for the better. As the article about the HTTP Keep-Alive mentioned (https://lob.com/blog/use-http-keep-alive):
In some cases, reusing connections can lead to hard-to-debug issues. Problems can arise when a client assumes that a connection is alive and well, only to discover that, upon sending a request, the server has terminated the connection. In Node, this problem surfaces as an Error: socket hang up.
To mitigate this, check the idle socket timeouts of both the client and the server. This value represents how long a connection will be kept alive when no data is sent or received. Make sure that the idle socket timeout of the client is shorter than that of the server. This should ensure that the client closes a connection before the server, preventing the client from sending a request down an unknowingly dead connection.
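The rule in the quote can be sketched as follows, assuming the server is also a Node process we control. The helper clientIdleTimeoutMs and the one second margin are made up for this example. server.keepAliveTimeout is a built-in Node setting, and freeSocketTimeout is, to the best of my knowledge, the matching option name in agentkeepalive 4.x, so check it against the version you use.

```javascript
const http = require('http');

// Hypothetical helper: pick a client idle timeout safely below the server's.
function clientIdleTimeoutMs(serverKeepAliveMs, marginMs = 1000) {
  if (serverKeepAliveMs <= marginMs) {
    throw new RangeError('server keep-alive timeout must exceed the margin');
  }
  return serverKeepAliveMs - marginMs;
}

// Server side: how long an idle keep-alive connection is kept open.
const server = http.createServer((req, res) => res.end('ok'));
server.keepAliveTimeout = 5000; // Node's documented default, set explicitly

// Client side: idle sockets should be dropped before the server drops them.
const freeSocketTimeout = clientIdleTimeoutMs(server.keepAliveTimeout);

// Assumption: with agentkeepalive 4.x this would be passed as
//   new Agent({ freeSocketTimeout })
console.log(`client idle timeout: ${freeSocketTimeout} ms`);
```

When the server is a third-party API or sits behind a load balancer, its idle timeout has to be looked up in that provider’s documentation rather than read from code.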
So, all that was left was to check for a possibly closed connection before sending a new request down a reused socket. But while trying to implement this change, we noticed that the HTTP lib we were using at the time didn’t offer an easy way to handle specific errors that could occur during the HTTP request. We were using the broadly known request/request-promise (https://www.npmjs.com/package/request), which is now deprecated. And that’s why we decided to make a change.
The alternative we chose is the also very popular, actively maintained and feature-rich lib called got (https://github.com/sindresorhus/got). Here are a few of the features that appealed to us:
- Retries on failure (https://github.com/sindresorhus/got#retry)
- HTTP Keep-Alive (https://github.com/sindresorhus/got#agent)
- Timeout handling (https://github.com/sindresorhus/got#timeout)
- Caching (https://github.com/sindresorhus/got#cache-1)
- DNS caching (https://github.com/sindresorhus/got#dnscache)
- Hooks (https://github.com/sindresorhus/got#hooks)
Besides swapping the HTTP lib itself, the changes we made involved defining and applying a set of default options to all HTTP calls. Basically, we implemented a custom retry handler function (represented by the option calculateDelay, as explained here: https://github.com/sindresorhus/got#retry), which automatically retries any HTTP request in case of common connection errors (ECONNRESET, EADDRINUSE, ECONNREFUSED, EPIPE, ENOTFOUND, ENETUNREACH, EAI_AGAIN) or certain HTTP status codes (429, 500, 502, 503, 504, 521, 522 and 524).
We also took the opportunity to put in place a set of more reasonable timeout and delay values. For example, with the old request/request-promise, we could only define a single timeout value, which applied to the HTTP request as a whole. With got and a custom retry handler function, we use the error metadata provided by the lib to check whether the timeout happened during the ‘lookup’, ‘connect’, ‘secureConnect’ or ‘socket’ phase. If so, we apply a retry policy (number of retries and delay until the next retry) that differs from the one used for timeouts that occur during the ‘response’ phase.
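A minimal sketch of such a retry handler is shown below. It is not our production code: the attempt limit, the base delay and the exponential backoff are assumptions for the example. The argument shape ({ attemptCount, error }) and the convention that returning 0 stops the retries follow got’s documented calculateDelay option as I understand it, so verify them against the got version you use.

```javascript
// Error codes and status codes taken from the lists in the text above.
const RETRY_ERROR_CODES = new Set([
  'ECONNRESET', 'EADDRINUSE', 'ECONNREFUSED', 'EPIPE',
  'ENOTFOUND', 'ENETUNREACH', 'EAI_AGAIN'
]);
const RETRY_STATUS_CODES = new Set([429, 500, 502, 503, 504, 521, 522, 524]);

const MAX_ATTEMPTS = 3;    // hypothetical limit
const BASE_DELAY_MS = 200; // hypothetical base delay

// Returns the delay in ms before the next retry, or 0 to give up.
function calculateDelay({ attemptCount, error }) {
  if (attemptCount > MAX_ATTEMPTS) return 0;

  const statusCode = error.response ? error.response.statusCode : undefined;
  const retriable =
    RETRY_ERROR_CODES.has(error.code) || RETRY_STATUS_CODES.has(statusCode);
  if (!retriable) return 0;

  // Exponential backoff: 200 ms, 400 ms, 800 ms for attempts 1, 2, 3.
  return BASE_DELAY_MS * 2 ** (attemptCount - 1);
}

// Assumed wiring, based on got's retry option:
//   got.extend({ retry: { limit: MAX_ATTEMPTS, calculateDelay } })
```

Keeping the function free of any got import makes it easy to unit test with plain objects standing in for the errors.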
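The phase-based policy can be sketched like this. The sketch assumes that a timeout error carries code 'ETIMEDOUT' and an event property naming the phase, which is how got’s TimeoutError behaves to my knowledge. The policy numbers and the helper names timeoutRetryPolicy and timeoutRetryDelay are hypothetical.

```javascript
// Phases that happen before the request has reached the server.
const CONNECTION_PHASES = new Set(['lookup', 'connect', 'secureConnect', 'socket']);

// Hypothetical policies: retry connection timeouts quickly and more often,
// retry response timeouts rarely and after a longer pause.
const CONNECTION_POLICY = { maxRetries: 3, delayMs: 100 };
const RESPONSE_POLICY = { maxRetries: 1, delayMs: 1000 };

// Picks the policy for a timeout error, or null if none applies.
function timeoutRetryPolicy(error) {
  if (error.code !== 'ETIMEDOUT') return null;
  if (CONNECTION_PHASES.has(error.event)) return CONNECTION_POLICY;
  if (error.event === 'response') return RESPONSE_POLICY;
  return null;
}

// Same contract as a calculateDelay handler: 0 means "do not retry".
function timeoutRetryDelay({ attemptCount, error }) {
  const policy = timeoutRetryPolicy(error);
  if (!policy || attemptCount > policy.maxRetries) return 0;
  return policy.delayMs;
}
```

The reasoning behind the split is that a timeout before the connection is established means the server never saw the request, so retrying is cheap and safe, while a response timeout may mean the server is already working on it.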
It’s also possible to define different timeout values for each specific phase of the HTTP request, as described here: https://github.com/sindresorhus/got#timeout.
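As an illustration, a per-phase timeout object could look like the sketch below. The phase names are the ones listed in got’s timeout documentation as far as I know, while every value is hypothetical and would need tuning for a real workload. The helper phasesExceedingTotal is also invented for this example.

```javascript
// Hypothetical per-phase timeouts, in milliseconds.
const timeout = {
  lookup: 200,        // DNS resolution
  connect: 500,       // TCP connection
  secureConnect: 500, // TLS handshake
  socket: 2000,       // socket inactivity
  send: 2000,         // writing the request
  response: 5000,     // waiting for the response to start
  request: 10000      // the request as a whole
};

// Hypothetical sanity check: no single phase should outlast the total budget.
function phasesExceedingTotal(timeouts) {
  return Object.keys(timeouts).filter(
    (phase) => phase !== 'request' && timeouts[phase] > timeouts.request
  );
}

console.log(phasesExceedingTotal(timeout)); // prints []
```

Short limits on the early phases make connection problems fail fast, leaving most of the budget for the part of the request where the server is doing real work.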
The final result was better than we expected. We started by looking at what seemed to be a misbehaving service with performance issues, judging by the first timeout errors we saw, and ended up with a more resilient and fault-tolerant system (due to the error handling), the added bonus of better performance (by reusing HTTP connections) and a better user experience (due to the revised and more reasonable timeout values).
It was a journey for us and a first-hand lesson in how a seemingly well-known and much-debated problem can show itself in a different light, and in how the first solution you find is not always the best one. More importantly, the mindset you have when approaching a problem, even if it isn’t a new one, is what makes the difference. A curious mind should never settle for an easy answer.