Addressing Networking Errors in a Microservice Based System
At SSENSE, high-traffic periods, brought on by highly anticipated product drops or markdown season, have occasionally caused some form of downtime on the website. Ssense.com generates an average of 100 million monthly page views, so a traffic spike over a short period of time can lead not only to downtime but also to a degraded customer experience.
With this in mind, the SSENSE Tech Team set a goal to reduce downtime by achieving a minimum of 99.5% uptime and, in turn, improve the user experience. This was an ambitious goal, given that our high-traffic periods over the past 5+ years had yet to reach such numbers. Still, we felt confident setting this target and knew how important it was to ensuring a seamless customer journey.
One of the issues highest on our target list was intermittent networking errors between microservices.
The usual suspects were present:
- EAI_AGAIN
- ETIMEDOUT
- ECONNRESET
- ESOCKETTIMEDOUT
- ECONNREFUSED
- socket hang-up
Every request to the website can potentially result in dozens of requests to the backend. Add to that traffic spiking to thousands of requests per second, and there is a lot of room for edge cases to occur.
Some services will retry these errors, compounding the issue further and resulting in backend systems being placed under even more pressure. This creates a snowball effect of cascading failures that could even take down an entire website.
Resolving DNS issues (EAI_AGAIN, ETIMEDOUT, ESOCKETTIMEDOUT)
CoreDNS and NodeLocal Cache
With Kubernetes, DNS requests are handled by CoreDNS pods in the cluster. Upon inspection of the metrics, these CoreDNS pods showed no apparent issues. Since we also use NodeLocal DNSCache, each node runs a local pod, deployed as a DaemonSet, that serves as a DNS caching layer. The metrics of these pods, however, told another story when we ran benchmark performance tests across the stack.
While the CoreDNS pods were fine, the NodeLocal DNSCache pods were hitting their maximum CPU usage and being starved for resources. Looking at the Helm charts for these pods, we noticed that the CPU limits were set very low. Armed with this knowledge, we were able to fix the issue easily by raising the resources for these pods in the Helm chart.
While we noticed improvements to the response time and a decrease in DNS related errors, it did not completely eradicate them.
Reducing the number of DNS requests
The easiest way to reduce DNS errors is to simply reduce the number of DNS requests in the first place. Usually every HTTP request does the following:
- Perform a DNS lookup
- Create a TCP connection
- Perform a TLS handshake
- Send the request to the server
- Receive the response from the server
- Close the connection
By enabling HTTP Keep-Alive we maintain persistent connections and can skip the first three steps: the DNS lookup, the TCP connection, and the TLS handshake.
However, not all of our services had Keep-Alive enabled, so we made sure to enable it for all of them. At that point most of the DNS errors disappeared.
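As a rough sketch (axios is used purely for illustration; the post does not name a specific HTTP client), enabling Keep-Alive on the client side can look like this:

```javascript
const http = require('http');
const https = require('https');
const axios = require('axios');

// Reuse TCP connections (and TLS sessions) across requests instead of
// paying for a DNS lookup, TCP handshake, and TLS handshake every time.
const httpAgent = new http.Agent({ keepAlive: true });
const httpsAgent = new https.Agent({ keepAlive: true });

// Any HTTP client that accepts a custom agent works the same way.
const client = axios.create({ httpAgent, httpsAgent });

// client.get('https://internal-service.example.com/health') now reuses
// an existing connection whenever one is available.
```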
This tactic also produced two other unforeseen benefits:
- Less connection-establishment overhead also reduced average response times.
- Fewer DNS requests meant lower CPU consumption in the CoreDNS and NodeLocal DNSCache containers, further decreasing DNS latency.
Resolving socket hang up and ECONNRESET errors in NodeJS
But the battle rages on… Even with fewer DNS related errors after enabling Keep-Alive, we started noticing sharp increases in socket hang up and ECONNRESET errors.
We dug further into the NodeJS source code to see where the socket hang up error is thrown.
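We found it in the HTTP client internals (lib/_http_client.js). Paraphrased, since the exact code differs between NodeJS versions, the relevant check looks roughly like this:

```javascript
// Paraphrased from lib/_http_client.js; not the verbatim NodeJS source.
function socketOnEnd() {
  const req = this._httpMessage;

  if (!req.res && !req.socket._hadError) {
    // The socket ended before any response was received, so emit
    // the "socket hang up" error (its code is ECONNRESET).
    req.socket._hadError = true;
    req.emit('error', connResetException('socket hang up'));
  }
  // ...
}
```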
In other words, it is thrown when a socket is closed before any response was received.
Looking at the NodeJS documentation, we found ECONNRESET described as:
A connection was forcibly closed by a peer.
For both socket hang up and ECONNRESET, the request was sent but the server closed the connection before a response was received. After reviewing the NodeJS HTTP module docs, we began hunting for potential reasons why a NodeJS server would close a connection. We believed the keepAliveTimeout setting to be our best bet because the documentation mentions:
The number of milliseconds of inactivity a server needs to wait for additional incoming data, after it has finished writing the last response, before a socket will be destroyed.
In other words, with the default keepAliveTimeout of 5000ms, if a socket receives no further requests within 5 seconds then NodeJS forcibly destroys it. This makes sense, since we wouldn’t want to keep old unused connections around forever.
But what happens if the socket is still open when the client sends a request, yet the server closes it before the request arrives?
Let’s test it out. We created a test server and sent 30 requests to it with Keep-Alive enabled. Since the server closes idle sockets after 5000ms, we wait 4999ms between each request to try to reproduce the race condition (i.e. the server closing the idle socket while a request is in flight).
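The original test code is not reproduced here, but a minimal sketch of it could look like this (port, logging, and exact timings are illustrative):

```javascript
const http = require('http');

// Test server: NodeJS' default keepAliveTimeout of 5000ms means idle
// Keep-Alive sockets are destroyed by the server after 5 seconds.
const server = http.createServer((req, res) => res.end('ok'));
server.listen(3000);

// Client side: a Keep-Alive agent so sockets are reused between requests.
const agent = new http.Agent({ keepAlive: true });

function makeRequest(i) {
  const req = http.request({ port: 3000, agent }, (res) => {
    res.resume();
    res.on('end', () => console.log(`request ${i}: ok`));
  });
  // Expected failure mode: "socket hang up" / ECONNRESET when the
  // server closes the idle socket just as the request goes out.
  req.on('error', (err) => console.log(`request ${i}: ${err.message}`));
  req.end();
}

// Send 30 requests, waiting 4999ms between each one to race the
// server's 5000ms idle timeout.
let i = 0;
const timer = setInterval(() => {
  i += 1;
  makeRequest(i);
  if (i >= 30) clearInterval(timer);
}, 4999);
```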
Running this test, some of the requests do indeed fail with socket hang up errors, which reproduces our issue exactly!
One approach to fixing this would be to make sure that the client closes idle sockets before the server does. The default NodeJS Agent does not offer this option yet, but agentkeepalive offers a freeSocketTimeout parameter that mirrors the NodeJS server-side keepAliveTimeout. It lets us set a maximum idle time before the socket is destroyed by the client. The trick is simply to make sure the client always destroys the socket before the server does, so we set this value to be less than the NodeJS keepAliveTimeout of 5000ms.
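A minimal sketch, assuming agentkeepalive v4 (where the option is named freeSocketTimeout) and an illustrative value of 4000ms:

```javascript
const Agent = require('agentkeepalive');

// Destroy idle sockets on the client after 4000ms, i.e. before the
// NodeJS server's default keepAliveTimeout of 5000ms fires.
const keepAliveAgent = new Agent({
  freeSocketTimeout: 4000,
});

// The agent is then passed to the HTTP client, e.g.:
// http.request({ port: 3000, agent: keepAliveAgent }, ...);
```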
With this change in place, the issue disappears when the test is rerun.
Interestingly, agentkeepalive’s out-of-the-box default of 15000ms is still affected by the race condition, since it is higher than the NodeJS server’s 5000ms keepAliveTimeout.
Another solution, as outlined in the NodeJS documentation, is to simply retry the request if it was made on a reused socket.
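A sketch of this approach, based on the request.reusedSocket example in the NodeJS documentation (the retriableGet wrapper name is ours):

```javascript
const http = require('http');
const agent = new http.Agent({ keepAlive: true });

// If a request made on a reused (kept-alive) socket fails with
// ECONNRESET, the socket was likely closed by the server while idle,
// so it is reasonable to retry the request on a fresh socket.
function retriableGet(options, callback) {
  const req = http.get({ ...options, agent }, callback);
  req.on('error', (err) => {
    if (req.reusedSocket && err.code === 'ECONNRESET') {
      retriableGet(options, callback);
    }
  });
}
```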
We opted to change the freeSocketTimeout setting on the clients of all our microservices, and the majority of the ECONNRESET and socket hang up errors were eliminated.
Armed with this newfound knowledge, we were able to surface other areas affected by the same race condition.
Resolving socket hang-up/ECONNRESET errors when behind an AWS ELB/ALB
When a server has Keep-Alive enabled, the AWS load balancer keeps a persistent connection to the server as long as the connection is active. When the connection is idle for longer than the configured Idle Timeout (60 seconds by default), the connection is dropped. Following the same reasoning as above, we simply need to ensure that the ALB’s Idle Timeout is less than the server’s keepAliveTimeout.
This can be done in one of two ways:
- Changing the idle timeout in AWS, or
- Changing the NodeJS keepAliveTimeout to be greater than the AWS default of 60 seconds.
If we go with the second option and change things on the NodeJS side, it is important to note that there was a regression in how the headersTimeout was calculated in certain NodeJS versions. The issue has been fixed in newer versions; for the versions with the bug, the workaround is to set the headersTimeout to be larger than the keepAliveTimeout.
Resolving socket hang-up/ECONNRESET errors in aws-sdk
This issue also surfaced while using aws-sdk to upload files to S3. By default, Keep-Alive is enabled but idle sockets are never closed. The issue is easy to miss because aws-sdk retries these errors silently.
We were able to spot the socket hang-ups by setting a custom retryStrategy and logging every error message before a request is retried. Our application ran a large number of concurrent multi-part uploads, which made the race condition far more likely to occur. Thankfully, all aws-sdk clients allow the user to provide a custom Keep-Alive agent, and by once again using agentkeepalive and setting a lower value for freeSocketTimeout, the socket hang-ups stopped.
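A sketch of this setup, assuming the AWS SDK for JavaScript v3, where a custom agent can be supplied through NodeHttpHandler (the 4000ms value is illustrative):

```javascript
const { S3Client } = require('@aws-sdk/client-s3');
const { NodeHttpHandler } = require('@aws-sdk/node-http-handler');
const { HttpsAgent } = require('agentkeepalive');

const s3 = new S3Client({
  requestHandler: new NodeHttpHandler({
    // Close idle sockets on the client side before the server does.
    httpsAgent: new HttpsAgent({ freeSocketTimeout: 4000 }),
  }),
});
```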
Resolving ECONNREFUSED / ECONNABORTED / EHOSTUNREACH errors
Investigating Node Resources
With the socket hang up/ECONNRESET errors taken care of, we turned to investigating the ECONNREFUSED errors using Datadog logs from a previous incident. We noticed that between 10:52am and 11:05am there were error spikes across several different microservices.
All these services appeared to share a common denominator that was causing them to throw errors at the same time. Reviewing the logs emitted, aggregated by node, we saw that between 10:52am and 11:05am the host i-0272 began to emit logs irregularly, and that the requests that did complete on that node all had very high response times.
We found the memory usage metrics of the host to be even more telling.
After more digging, we found that a specific machine cronjob had started running on this node during that exact time frame and was using up all the available resources. This explained why only some requests across several different microservices started to fail: the busy node they were all hosted on was the common denominator.
This was easily fixed by ensuring that all these types of cronjobs were run in their own dedicated node group instead.
Graceful Shutdown
Graceful Shutdown race condition when using Kubernetes
When Kubernetes wants to shut down a pod it goes through several steps:
- The pod’s state is set to Terminating in Kubernetes
Then, two sequences run in parallel. On the kubelet side:
- kubelet sends a SIGTERM signal to the pod
- kubelet waits for terminationGracePeriodSeconds
- kubelet sends a SIGKILL signal to the pod
And on the networking side:
- The Endpoints Controller removes the pod’s IP from the Service’s Endpoints object
- kube-proxy receives a notification that the Endpoints object has changed and removes the IP from the iptables on every node
Since the SIGTERM signal and the IP removal happen in parallel, what happens if the server receives the SIGTERM signal and stops accepting connections before the IP has actually been removed? Another race condition. In that case some traffic will still be routed to the pod even though it is not accepting any new connections (or no longer exists at all), causing clients to receive ECONNREFUSED/ECONNABORTED errors.
On our server, we must therefore intercept SIGTERM and add a delay before starting the regular graceful shutdown procedure. This gives kube-proxy enough time to remove the pod’s IP from the iptables on every node and ensures that no more requests are routed to the terminating server.
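A minimal sketch of such a handler (the 10 second delay is illustrative and must stay below the pod’s terminationGracePeriodSeconds):

```javascript
const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));
server.listen(8080);

process.on('SIGTERM', () => {
  // Delay the shutdown so kube-proxy has time to remove this pod's IP
  // from the iptables on every node before we stop accepting traffic.
  setTimeout(() => {
    // Stop accepting new connections, let in-flight requests finish,
    // then exit.
    server.close(() => process.exit(0));
  }, 10000);
});
```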
Graceful Shutdown when starting a server using npm
If we start our server using npm, the SIGTERM signal is actually never received! For example, let’s say our package.json has a start script like the following.
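The example below is representative; the original script is not shown, but the ./build/server.js path matches the command referenced further down:

```json
{
  "scripts": {
    "start": "node ./build/server.js"
  }
}
```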
Now if we start the app using the npm run start command, when the container receives the SIGTERM signal, that signal is only delivered to npm and not to our server itself: npm does not forward SIGTERM to its child processes. To ensure that our server receives the SIGTERM signal, we should start it directly with the node ./build/server.js command instead.
Wrapping up
As you can see, our journey involved a series of steps and required us to look at the situation from different angles. By breaking the problem down into manageable parts, we were able to tackle each issue one by one. As is often the case, no single issue was the cause of our problems; rather, it was the compounding effect of several factors.
Thanks to the SSENSE team’s robust monitoring and logging system (and a lot of Googling), finding the root cause of each issue was possible. Ultimately, we are now able to sustain high-traffic periods with close to zero downtime!
Editorial reviews by Liela Touré and Mario Bittencourt.