How to fix Node.js DNS issues?


Networking issues are always the most challenging for me and, at the same time, the most rewarding, as I always learn something new. While working on a big Node e-commerce backend with a lot of traffic, from time to time we found getaddrinfo EAI_AGAIN errors in our logs. Quick googling explains that this is a temporary name resolution failure: our DNS server can’t serve our request right now. One of the suggestions was to increase UV_THREADPOOL_SIZE to 128 (the default value is 4).

This seems legit, since the Node docs also mention this problem:

Though the call to dns.lookup() will be asynchronous from JavaScript's perspective, it is implemented as a synchronous call to getaddrinfo(3) that runs on libuv's threadpool. This can have surprising negative performance implications for some applications, see the UV_THREADPOOL_SIZE documentation for more information.

and from: https://nodejs.org/api/cli.html#cli_uv_threadpool_size_size

Because libuv’s threadpool has a fixed size, it means that if for whatever reason any of these APIs takes a long time, other (seemingly unrelated) APIs that run in libuv’s threadpool will experience degraded performance. In order to mitigate this issue, one potential solution is to increase the size of libuv’s threadpool by setting the 'UV_THREADPOOL_SIZE' environment variable to a value greater than 4 (its current default value)
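
To see why a pool of 4 threads matters, here is a toy model of my own (not libuv code). It assumes that every lookup blocks one thread for a fixed time and that tasks start in arrival order. The numbers (100 lookups, 50 ms each) are made up for illustration.

```javascript
// Toy model of a fixed-size thread pool, not libuv itself.
// Each task goes to whichever thread becomes free first.
function finishTimes(taskCount, poolSize, taskMs) {
  const threads = new Array(poolSize).fill(0); // time at which each thread is free
  const finished = [];
  for (let i = 0; i < taskCount; i++) {
    // Pick the thread that frees up first.
    let idx = 0;
    for (let t = 1; t < threads.length; t++) {
      if (threads[t] < threads[idx]) idx = t;
    }
    threads[idx] += taskMs;
    finished.push(threads[idx]);
  }
  return finished;
}

// 100 concurrent lookups that each block a thread for 50 ms (assumed numbers).
const withDefaultPool = finishTimes(100, 4, 50);
const withLargerPool = finishTimes(100, 128, 50);

console.log(Math.max(...withDefaultPool)); // 1250
console.log(Math.max(...withLargerPool)); // 50
```

The pool size itself is set through the environment before the process starts, for example UV_THREADPOOL_SIZE=128 node server.js (server.js is a placeholder for your entry point). A bigger pool hides the queueing, but it does nothing to make a single lookup faster.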

We tried this option without any deep investigation and it really seemed to help. So we thought we were hitting some limits in the default Node configuration, were happy with the solution, and life was good for a couple of months :)

While preparing for the upcoming ultimate shopping mania sprees, we were running load tests across our whole system, which is composed of a bunch of different services. One team came back with a report that they had serious issues with the latency of DNS lookups and with error rates from the coredns service in our Kubernetes cluster, even with many replicas, and of course they already had the UV_THREADPOOL_SIZE magic number fine-tuned. Their solution was to include the https://www.npmjs.com/package/lookup-dns-cache package in order to cache DNS lookups. In their load tests it showed amazing results, improving performance 2x. At first everybody was happy, but to me this was very weird and completely against how I believed DNS resolution should work in the first place. It was time for some 🕵🏻‍♂️ work.

Why is Node having issues with DNS lookups? How can it be that Node has such bad DNS resolution handling?

My biggest problem with the above solution was that it doesn’t make sense to implement DNS caching at the application level. It should be handled by the Node runtime or at the OS level, so I wanted to figure out how Node deals with it internally and what we hadn’t set up correctly. Also, handling it in the application means that apps written in any other language are potentially affected too, and each of them needs to figure out its own way to cache DNS lookups.

Back to the drawing board, or should I say googling. There are tons of posts and Stack Overflow discussions about this topic, but it is hard to draw any conclusion because everybody has a different setup and there could be a ton of factors causing this behavior.

Going back to the Node docs quoted above: by default the Node net module uses dns.lookup, which calls the native getaddrinfo function. getaddrinfo is synchronous, so Node runs it on the libuv threadpool to make it behave as async. People were suggesting to bypass dns.lookup and use dns.resolve instead, since it performs the DNS query over the network itself through a natively async API and never touches the libuv threadpool. This is exactly what lookup-dns-cache does, so it is just a workaround that doesn’t tell us what the original issue is. Also, getaddrinfo takes configuration from the hosts file, which dns.resolve doesn’t. That can cause unexpected behavior: we do use the hosts file in development, and having the app behave differently across environments is a bomb waiting to go off at any time.

One interesting thing I quickly found is that getaddrinfo does not return a TTL when resolving queries. This was a clear indicator that Node is not handling the DNS cache itself and expects the OS to do it.
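
To make the caching workaround concrete, here is a minimal sketch of the idea, assuming IPv4 only and a single address per name. It is not the lookup-dns-cache source: createCachingLookup is a name I made up, and the resolve4 and now parameters exist only so the sketch can be tried without a network. It relies on dns.promises.resolve4 with { ttl: true }, which does return a TTL, and keeps each answer until that TTL expires.

```javascript
const dns = require('node:dns');

// Hypothetical helper: returns a function with the callback shape of dns.lookup,
// backed by resolve4 and its TTLs instead of getaddrinfo.
function createCachingLookup(resolve4 = dns.promises.resolve4, now = Date.now) {
  const cache = new Map(); // hostname -> { address, expiresAt }

  // Answer in the shape dns.lookup callers expect, including the `all` form.
  function reply(options, callback, address) {
    if (options.all) callback(null, [{ address, family: 4 }]);
    else callback(null, address, 4);
  }

  return function lookup(hostname, options, callback) {
    if (typeof options === 'function') {
      callback = options;
      options = {};
    }
    // dns.lookup also accepts a bare family number; this sketch ignores it.
    if (typeof options !== 'object' || options === null) options = {};

    const hit = cache.get(hostname);
    if (hit && hit.expiresAt > now()) {
      // Still fresh: answer from memory without touching the network.
      process.nextTick(reply, options, callback, hit.address);
      return;
    }

    resolve4(hostname, { ttl: true }).then(
      (records) => {
        const { address, ttl } = records[0]; // ttl is in seconds
        cache.set(hostname, { address, expiresAt: now() + ttl * 1000 });
        reply(options, callback, address);
      },
      (err) => callback(err)
    );
  };
}
```

A function like this can be passed as the lookup option of net.connect or http.request. It never reads the hosts file and it ignores IPv6, so an app using it can behave differently from one that goes through getaddrinfo.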

When trying to figure out how getaddrinfo actually works, I found this amazing article that goes through its source code and explains each step:

This step was the key:

Sorry, nscd is actually the “name service cache daemon”, “a daemon that provides a cache for the most common name service requests”. After installing it, the daemon starts, and your process can dance

Of course, we are running in Docker on Kubernetes, so our pods certainly don’t have nscd installed. After talking to the platform team, we figured out that nothing at the Kubernetes node level was doing caching either, so all requests were hitting the coredns service, which broke under heavy load 🤦‍♂️
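
A rough way to check whether anything between your app and the DNS server is caching is to time repeated lookups of the same name. Below is a minimal sketch, assuming that sequential lookups are representative of your traffic. timeLookups is a hypothetical helper, the hostname in the usage comment is a placeholder, and the lookup parameter is injectable only so the sketch can be run offline.

```javascript
const dns = require('node:dns');

// Times `count` sequential lookups of one hostname, in milliseconds.
// By default it uses dns.promises.lookup, which goes through getaddrinfo.
async function timeLookups(hostname, count, lookup = dns.promises.lookup) {
  const durations = [];
  for (let i = 0; i < count; i++) {
    const start = process.hrtime.bigint();
    await lookup(hostname);
    const elapsedNs = process.hrtime.bigint() - start;
    durations.push(Number(elapsedNs) / 1e6);
  }
  return durations;
}

// Usage (placeholder hostname):
// timeLookups('my-service.example.internal', 20).then(console.log);
```

If every call costs roughly one network round trip, nothing on the host is caching. With nscd or a node-level cache in place, repeated calls should get much cheaper. Timings are noisy, so treat the result as a hint and not as proof.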

The platform team decided to set up caching at the node level by using:

https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/

This will fix the issue for Node.js and any other runtime that relies on the OS to cache DNS queries.

Conclusion

Don’t jump the gun and blindly copy-paste a solution from Stack Overflow. Try to dig deeper and understand how things work under the hood.

There is nothing wrong with how Node handles DNS queries. It just expects the OS to handle the caching part, and you need to be aware of this.

PS. I also wanted to figure out whether Java apps have a similar problem, since we use Java a lot as well. As I expected from Java, it handles everything on its own:

Address Cache
The java.net package, when doing name resolution, uses an address cache for both security and performance reasons. Any address resolution attempt, be it forward (name to IP address) or reverse (IP address to name), will have its result cached, whether it was successful or not, so that subsequent identical requests will not have to access the naming service. These properties allow for some tuning on how the cache is operating.

networkaddress.cache.ttl (default: see below)
Value is an integer corresponding to the number of seconds successful name lookups will be kept in the cache. A value of -1, or any other negative value for that matter, indicates a “cache forever” policy, while a value of 0 (zero) means no caching. The default value is -1 (forever) if a security manager is installed, and implementation specific when no security manager is installed.

networkaddress.cache.negative.ttl (default: 10)
Value is an integer corresponding to the number of seconds an unsuccessful name lookup will be kept in the cache. A value of -1, or any negative value, means “cache forever”, while a value of 0 (zero) means no caching.

Never thought I would say this about Java, but for a cross-platform runtime it actually makes sense to provide a DNS cache instead of relying on the OS. Having a default value of -1 = forever doesn’t make any sense, though, so it’s 1:1 for Java 😀 (I don’t know what a security manager is, so I’m ignoring that part of the docs on purpose 😉)
