Lessons from AWS NLB Timeouts
Jonathan Lynch, Alan Ning
This post covers a timeout issue discovered during migration from AWS ELB to NLB. For a summary, please skip to the Lessons Learned section.
In order to keep up with rapid growth, the SRE team at Tenable is modernizing the Tenable.io infrastructure. We are moving from monolithic regional Kubernetes (K8s) clusters managed by Ansible playbooks to rapidly deployable micro-sites with Terraform, Rancher, and Helm. Internally, the SRE team calls this effort Tenable.io Platform 2.0.
Elasticsearch is one of the core data engines within Platform 2.0. In the past, our Elasticsearch cluster was behind an AWS Classic Load Balancer (ELB). We chose to switch to an AWS Network Load Balancer (NLB) since it offers higher performance and lower latency at layer 4.
This small change introduced an unexpected amount of instability to our system and taught us valuable lessons about the NLB.
The Hunt for 500s
In order to gather system performance metrics, we deployed prototypes in our development environment and began refining them through extensive testing. One of these tests, which consisted of handling reports from 100,000 Nessus agents, exposed sporadic 500s coming from the platform and leaking into our user interface.
By investigating the logs from our web frontend, we determined that the 500s were coming from service-query, one of the microservices that make up the platform. The service-query application occasionally had trouble establishing HTTPS connections to Elasticsearch. Since the cluster wasn’t raising any errors, the next logical step was to scrutinize the layer that stood between them: the load balancers.
Narrowing Down to TCP Resets
Since the NLB was one of the new technologies we had introduced to this stack, we first investigated its CloudWatch metrics, pictured below.
The load balancer reset count metric was higher than we expected it to be. We suspected the resets were directly causing the 500s in the system. Since we had never seen this instability on our production platform, we decided to swap the NLB with a layer 4 ELB with the same Target Group. Soon after the swap, the problem went away. At this point we had a workaround, but we wanted to determine why we saw problems with the NLB that we didn’t see with the ELB. We hoped to gain this insight by looking into the service-query code.
In the Elasticsearch connection module that is shared among all microservices, we observed that the connection TTL was set to infinite. With no TTL, connections in the pool never expired on the client side, so idle connections were instead being timed out by the AWS load balancer.
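To illustrate what a finite connection TTL would have changed, here is a minimal sketch of a pool that skips connections idle for longer than a limit. The names PooledConnection and usable_connections are hypothetical and are not taken from the service-query code; the limit is whatever value sits safely below the load balancer's idle timeout.

```python
import time
from dataclasses import dataclass, field


@dataclass
class PooledConnection:
    """Hypothetical stand-in for a pooled client connection."""
    name: str
    # Monotonic timestamp of the last request sent on this connection.
    last_used: float = field(default_factory=time.monotonic)


def usable_connections(pool, max_idle_seconds, now=None):
    """Return only the connections that have been idle for less than the limit.

    With an infinite TTL every connection passes this filter, so a
    connection the load balancer has already dropped can still be handed out.
    """
    now = time.monotonic() if now is None else now
    return [c for c in pool if now - c.last_used < max_idle_seconds]
```

A pool that applies a filter like this before handing out a connection would reopen stale connections itself instead of discovering them through a failed write.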
NLB vs. ELB Timeout
When analyzing the 500 events in the service-query log files, we saw that sockets were being closed abruptly after data was written to them. The connection was dead, but we hadn’t closed it, so we suspected that it had been terminated by an idle timeout. It appeared that Platform 2.0 was not aware of connections being terminated this way. The difference in timeout behavior between ELB and NLB was likely the culprit. After digging deeper into the AWS NLB documentation, we found that the documented timeout behavior matched our experience.
For each request that a client makes through a Network Load Balancer, the state of that connection is tracked. The connection is terminated by the target. If no data is sent through the connection by either the client or target for longer than the idle timeout, the connection is closed. If a client sends data after the idle timeout period elapses, it receives a TCP RST packet to indicate that the connection is no longer valid.
In other words, AWS NLBs silently terminate your connection upon idle timeout. If an application tries to send data on the socket after idle timeout, it receives an RST packet.
Since we were unable to find any documentation on the ELB’s timeout behavior, we decided to run a timeout simulation and use tcpdump to look into the traffic.
The timeout simulation was simple. We started an ELB and an NLB and attached an EC2 instance to each balancer. Within each EC2 instance, we ran an HTTP server. We then ran a TCP client to create an idle connection to each server. Upon idle timeout, we observed how each load balancer removed the expired connections.
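The client side of such a simulation could look roughly like the sketch below. It is not the client we used; the function name probe_idle_connection, the HEAD request, and the result labels are illustrative assumptions. It opens a connection, stays silent for a chosen period, and then reports whether the connection ended with a clean close (FIN), a reset (RST), or is still alive.

```python
import socket
import time


def probe_idle_connection(host: str, port: int, idle_seconds: float) -> str:
    """Open a TCP connection, leave it idle, then report how it ended."""
    with socket.create_connection((host, port), timeout=10) as sock:
        # Send nothing, so the load balancer's idle timer keeps running.
        time.sleep(idle_seconds)

        # A FIN that arrived while we were idle shows up as an empty read.
        sock.settimeout(0.5)
        try:
            if sock.recv(1) == b"":
                return "fin"
        except socket.timeout:
            pass  # Nothing arrived: the connection still looks open from here.
        except ConnectionResetError:
            return "rst"

        # A silently dropped connection only reveals itself once we write.
        sock.settimeout(5)
        try:
            sock.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
            reply = sock.recv(1)
        except (ConnectionResetError, BrokenPipeError):
            return "rst"
        except socket.timeout:
            return "no-response"
        return "alive" if reply else "fin"
```

To reproduce the comparison, idle_seconds would need to exceed the idle timeout of the balancer under test, and tcpdump remains the authoritative view of which packets were actually exchanged.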
ELB Timeout Behavior
An ELB’s idle timeout setting is adjustable, and defaults to 60 seconds. We used the default settings in our environments.
When connections expire through idle timeout, ELBs send a FIN packet to each connected party. This translates to a socket closed event in the application layer.
NLB Timeout Behavior
NLBs have an idle timeout of 350 seconds which cannot be changed. When connections expire through idle timeout, NLBs terminate the connections silently. An application that is not aware of this timeout would attempt to send data to the same socket. At that point, the NLB would notify the application that the connection has been terminated by sending it an RST packet.
Our test results confirmed our suspicions. Upon idle timeout, the ELB sent FIN packets to the client and its targets. On the other hand, the NLB only sent RST packets to its clients when it received traffic following idle timeout.
In the end, we modified the Linux base system to set tcp_keepalive_time to 120 seconds. Since Elasticsearch is already configured to use TCP keepalives by default via the network.tcp.keep_alive setting, no changes were made to the Elasticsearch configuration. When we switched back to NLBs after this change, the sporadic 500 errors we originally observed did not recur.
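A quick way to confirm that a host picked up the change is to read the kernel setting back and compare it with the NLB idle timeout. The sketch below assumes a Linux host; the helper names and the NLB_IDLE_TIMEOUT_S constant are illustrative and not part of our tooling.

```python
from pathlib import Path

# Fixed NLB idle timeout described above, in seconds.
NLB_IDLE_TIMEOUT_S = 350


def read_keepalive_time(path: str = "/proc/sys/net/ipv4/tcp_keepalive_time") -> int:
    """Return the idle time, in seconds, before the kernel sends the first keepalive probe."""
    return int(Path(path).read_text().strip())


def keepalive_is_safe(keepalive_time_s: int, idle_timeout_s: int = NLB_IDLE_TIMEOUT_S) -> bool:
    """True if the first keepalive probe is sent before the idle timeout would expire."""
    return keepalive_time_s < idle_timeout_s
```

With the Linux default of 7200 seconds, keepalive_is_safe returns False; with the 120 seconds we configured, it returns True.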
Lessons Learned
If you are migrating from AWS ELB to NLB and you rely on idle timeout, here are some recommendations:
1. Pay extra attention to the NLB Load Balancer Reset Count metric.
If this metric reports anything above zero, you may be losing connections silently. If this is expected, verify that your application is handling RST packets correctly. If this is not expected, see recommendation #2.
2. Consider enabling (and tuning) TCP keepalive in your Target.
Enabling TCP keepalive avoids silent connection failures. The default /proc/sys/net/ipv4/tcp_keepalive_time in Linux is 7200 (2 hours). Make sure you tune this parameter to well under 350s to avoid NLB timeouts.
3. Validate your timeout values across all OSI layers.
Layer 4 and layer 7 timeout inconsistencies across proxies and load balancers are often overlooked and hide subtle bugs in your system. Pay attention to these values and make sure they are set such that the timeout behavior is consistent and predictable.
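For recommendation 2, keepalive can also be enabled per socket when changing the kernel default is not an option. The following is a minimal sketch, not the change we made (we tuned the system-wide setting): the enable_keepalive helper and its interval and probe-count defaults are assumptions, and the TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT options are platform-specific, so each is applied only where Python's socket module exposes it.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 120,
                     interval_s: int = 30, probes: int = 3) -> None:
    """Turn on TCP keepalive for one socket and tune its timers where supported."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of the kernel defaults; not available on every platform.
    tuning = {
        "TCP_KEEPIDLE": idle_s,       # idle seconds before the first probe
        "TCP_KEEPINTVL": interval_s,  # seconds between unanswered probes
        "TCP_KEEPCNT": probes,        # unanswered probes before giving up
    }
    for option_name, value in tuning.items():
        option = getattr(socket, option_name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

Calling enable_keepalive(sock) before connecting keeps traffic flowing on an otherwise idle connection well inside the 350 second window, provided the client library gives access to the underlying socket.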