Networking issues are always the most challenging for me, and at the same time the most rewarding, because I always learn something new. While working on a big Node.js eCommerce backend that handled a lot of traffic, we would occasionally find getaddrinfo EAI_AGAIN errors in our logs. Quick googling explains that this means our DNS server can't currently serve our request, and one suggestion was to increase UV_THREADPOOL_SIZE to 128 (the default value is 4).
getaddrinfo EAI_AGAIN · Issue #761 · googleapis/google-api-nodejs-client
this seems legit since node docs also mention this problem:
Node.js v13.0.1 Documentation
Though the call will be asynchronous from JavaScript's perspective, it is implemented as a synchronous call to getaddrinfo(3) that runs on libuv's threadpool. This can have surprising negative performance implications for some applications; see the UV_THREADPOOL_SIZE documentation for more information.
Because libuv's threadpool has a fixed size, it means that if for whatever reason any of these APIs takes a long time, other (seemingly unrelated) APIs that run in libuv's threadpool will experience degraded performance. In order to mitigate this issue, one potential solution is to increase the size of libuv's threadpool by setting the UV_THREADPOOL_SIZE environment variable to a value greater than 4 (its current default value).
We tried this option without any deep investigation, and it really seemed to help. So we concluded we were hitting some limits in the default Node configuration, were happy with the solution, and life was good for a couple of months :)
While preparing for the upcoming ultimate shopping mania sprees, we were running load tests across our whole system, composed of a bunch of different services. One team came back with a report that they had serious issues with DNS lookup latency and error rates against the coredns service in our Kubernetes cluster, even with many replicas, and of course they already had the UV_THREADPOOL_SIZE magic number fine-tuned. Their solution to the problem was to include the https://www.npmjs.com/package/lookup-dns-cache package in order to cache DNS lookups. In their load tests it showed amazing results, improving performance 2x. At first everybody was happy, but to me this was very, very weird and completely against how I believed DNS resolution should work in the first place. It was time for some 🕵️‍♂️ work.
Why is node having issues with dns lookups? How can it be that node has such bad dns resolution handling?
My biggest problem with the above solution was that it doesn't make any sense to implement DNS caching at the application level; it should be handled by the Node runtime or at the OS level. So I wanted to figure out how Node deals with it internally and what we hadn't set up correctly. Handling it in the application also means that apps written in any other language need to figure out their own way to cache DNS lookups, and were potentially affected.
Back to the drawing board, or should I say googling. I saw that there are tons of posts and Stack Overflow discussions about this topic, but it is hard to draw any conclusion because everybody has a different setup and there could be tons of factors causing this behavior.
Going back to the original Node docs above: Node's net module uses dns.lookup by default, which calls the native getaddrinfo function. That function is synchronous, so it needs the libuv threadpool to make it behave as async. People were suggesting to bypass dns.lookup and use dns.resolve instead, since it makes a network call through an async native API, so there is no need to hit the libuv threadpool. This is exactly what lookup-dns-cache does, so it is just a workaround and doesn't tell us what the original issue is. Also, getaddrinfo takes configuration from the hosts file, which dns.resolve obviously doesn't; that can cause unexpected behavior. We do use the hosts file in development, and having an app behave differently across environments is a bomb waiting to go off at any time.
One interesting thing I quickly found is that getaddrinfo does not return a TTL when resolving queries. This was a clear indicator that Node is definitely not handling the DNS cache itself and expects the OS to handle it.
While trying to figure out how getaddrinfo actually works, I found this amazing article that goes through its source code and explains each step:
What does `getaddrinfo` do?
This step was the key:
nscd is actually the "name service cache daemon", "a daemon that provides a cache for the most common name service requests". After installing it, the daemon starts, and your process can dance
Of course, we are running in Docker on Kubernetes, so our pods definitely don't have nscd installed. After talking to the platform team, we figured out that there was nothing at the Kubernetes node level doing caching either, so all requests were hitting the coredns service, which broke under heavy load 🤦‍♂️
The platform team decided to set up caching at the Kubernetes node level.
This fixes the issue for Node and any other runtime that relies on the OS to do DNS query caching.
Don't jump the gun and blindly copy-paste a solution from Stack Overflow; try to dig deeper and understand how things work under the hood.
There is nothing wrong with how Node handles DNS queries; it just expects the OS to handle the caching part. You just need to be aware of this.
PS. I also wanted to figure out whether Java apps have a similar problem, since we use Java a lot as well. As I expected nothing less from Java, it handles everything on its own:
The java.net package, when doing name resolution, uses an address cache for both security and performance reasons. Any address resolution attempt, be it forward (name to IP address) or reverse (IP address to name), will have its result cached, whether it was successful or not, so that subsequent identical requests will not have to access the naming service. These properties allow for some tuning on how the cache is operating.
networkaddress.cache.ttl (default: see below)
Value is an integer corresponding to the number of seconds successful name lookups will be kept in the cache. A value of -1, or any other negative value for that matter, indicates a “cache forever” policy, while a value of 0 (zero) means no caching. The default value is -1 (forever) if a security manager is installed, and implementation specific when no security manager is installed.
networkaddress.cache.negative.ttl (default: 10)
Value is an integer corresponding to the number of seconds an unsuccessful name lookup will be kept in the cache. A value of -1, or any negative value, means “cache forever”, while a value of 0 (zero) means no caching.
Never thought I would say this about Java, but for a cross-platform runtime it actually makes sense to provide a DNS cache instead of relying on the OS. Having a default value of -1 = forever doesn't make any sense though, so it's 1:1 for Java 😀 (I don't know what the security manager is, so I'm ignoring that part of the docs on purpose 😉)