NLB Connection Resets
Using/consuming AWS’s NLB or NAT Gateway? There’s something you should know.
I was load testing our next-generation platform, trying to get us to performance parity with our existing platform, and kept getting connection resets at very low load.
To be clear, I didn’t find a bug. This is one of those issues AWS probably closed as “By design”.
The setup
The load generator is a single Locust.io agent (I was able to reproduce the connection resets with just one) in a private subnet of a VPC, so it communicates with the outside world through a NAT Gateway.
We have a 2-tier proxy setup: an NLB at the edge and, behind it, a reverse proxy where we're able to do more intelligent L7 routing, traffic shaping, etc. Behind the reverse proxy is some service, shown just for completeness; it's irrelevant for this post.
The load generation cluster and the target cluster are in separate VPCs (not diagrammed). In the diagram you can see two connections, and that cross-zone load balancing is enabled.
(Note: there's not an actual connection to the Logical NLB; it just makes the issue easier to diagram.)
Some background
Interestingly enough, cross-zone load balancing is disabled by default for NLB, and I'm pretty sure I now know why.
With Application Load Balancers, cross-zone load balancing is always enabled.
With Network Load Balancers, cross-zone load balancing is disabled by default. After you create a Network Load Balancer, you can enable or disable cross-zone load balancing at any time.
NLB preserves the client's source IP:port when targets are specified by instance ID, which is how we specify ours.
If you specify targets using an instance ID, the source IP addresses of the clients are preserved and provided to your applications. If you specify targets by IP address, the source IP addresses are the private IP addresses of the load balancer nodes.
From the NAT Gateway docs:
A NAT gateway can support up to 55,000 simultaneous connections to each unique destination.
Finally, one last bit of background: RFC 5961 section 4.2. I encourage you to click the link on this one.
The analysis of the reset attack using the RST bit highlights another possible avenue for a blind attacker using a similar set of sequence number guessing. Instead of using the RST bit, an attacker can use the SYN bit with the exact same semantics to tear down a connection.
How I got to RFC 5961
If you’re confused, I don’t blame you. I was too. This implied NLB was attacking us.
The journey to the RFC started with a tcpdump on the reverse proxy, which showed the connection resets (the proxy listens on port 6000). There were thousands of successful request/response pairs in this dump, and segments 1–55 on this TCP stream were fine too. You can see the RST our proxy sent in segment 60.
Right before the reset, however, was this SYN in segment 56 with a sequence number of 0. It looked like NLB was trying to establish a new connection on top of an already established one.
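For completeness, a capture along these lines is all it takes to reproduce that kind of dump (the interface name and output path are illustrative; 6000 is the proxy's listener port):

# Grab full segments to and from the proxy listener for inspection in Wireshark.
sudo tcpdump -i eth0 -s 0 'tcp port 6000' -w /tmp/proxy-resets.pcap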
I had read that a full SYN backlog could cause connection resets (not true, it turns out), so I explored increasing the SYN backlog and the accept() queue. When that didn't work, I tried Brendan Gregg's Drunk Man Anti-method with some other kernel parameters, followed by the "Google It" method, which led me to some other kernel tuning around TCP idle timeouts that didn't help either.
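For reference, the kind of knobs I was poking at looked something like this (values are illustrative, and none of them made the resets go away):

# Illustrative tuning only -- raising these did not fix the resets.
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=4096   # SYN (half-open) backlog
sudo sysctl -w net.core.somaxconn=4096             # cap on the accept() queue depth
sudo sysctl -w net.ipv4.tcp_keepalive_time=300     # one of the idle-timeout-style knobs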
I decided I needed a real method, starting with the RTFM method. I bought TCP/IP Illustrated Volume 1 but, while wonderfully informative, it didn’t end up helping.
Knowing that diving into the TCP implementation in the Linux kernel (13,000+ lines of extremely terse C) was not going to yield results quickly, I went with the Divide and Conquer method: cut the data path in half, look for issues on one side, then repeat along the half where I could still reproduce the issue. That would narrow down where to look more closely (I was even suspicious of Locust.io by then) in log2(hops in data path) steps.
When I bypassed NLB the issue went away, so I opened a support ticket with AWS.
Meanwhile, I pored over the kernel source and found two places where an ACK is sent in response to a SYN:
- https://github.com/torvalds/linux/blob/v4.19/net/ipv4/tcp_input.c#L5428
- https://github.com/torvalds/linux/blob/v4.19/net/ipv4/tcp_input.c#L5435-L5445
Both were the same tcp_send_challenge_ack function, called from tcp_validate_incoming. Note the comment about RFC 5961 in the second link.
To confirm this was the source, I wrote a SystemTap script. Big thanks to Danny Kulchinsky for his post on SystemTap for CoreOS Container Linux; without it, getting SystemTap running would have taken a lot longer.
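The script itself isn't the interesting part, but its core was simply a probe on that function. A much-reduced sketch (it needs kernel debuginfo; the real script also pulled the local and remote endpoints out of the struct sock argument, which is what produced the output below):

# Minimal sketch: log every time the kernel sends a challenge ACK.
sudo stap -e 'probe kernel.function("tcp_send_challenge_ack") {
  printf("%s tcp_send_challenge_ack fired\n", ctime(gettimeofday_s()))
}'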
I ran my load tests, and on the first connection reset I came back to my terminal to find:
tcp_send_challenge_ack local=10.10.12.186:6000 remote=34.217.253.255:47442
Only one of the two call sites increments the SYN challenge counter via NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNCHALLENGE); and netstat -ts showed that counter increase by 1 for each connection reset, so I had 100% confirmed tcp_input.c line 5443 as the source of the RST. The kernel was challenging the SYN per the RFC.
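If you want to watch the same counters on your own hosts, they live in the TcpExt MIB; either of these should surface them (the exact label netstat prints depends on your net-tools version):

netstat -ts | grep -i challenge
nstat -az | grep -iE 'TCPSYNChallenge|TCPChallengeACK'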
I thought I’d found a bug in NLB, and updated my support ticket accordingly.
The issue
So what actually happened?
A picture’s worth a thousand words (and mine even has words in it).
Follow connection 1 down the left side first:
The NAT Gateway happened to use port 47442 on two separate requests to what it saw as two unique destinations, as designed. Both "physical" NLBs preserved 34.217.253.255:47442, so our reverse proxy saw a second SYN arrive on an already-established connection, and RFC 5961 section 4.2 kicked in on the kernel we run.
Closed: by design
In my support ticket, AWS acknowledged the issue and said they don't have a timeline for a fix; and why would they? Both products are working as designed.
I strongly suspect ELB and ALB are traditional reverse proxies, where you would see their private IP and an ephemeral port on their outbound connections to an upstream service. If so, they wouldn't suffer from this issue, which is why we never saw it before.
The workaround? Just disable cross-zone load balancing. If you're running stateless compute with more than a few instances per AZ, or a second proxy tier like ours, you don't need it anyway.
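For NLB, cross-zone load balancing is just an attribute on the load balancer, so turning it off is a single CLI call (the ARN below is a placeholder):

aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-west-2:123456789012:loadbalancer/net/my-nlb/0123456789abcdef \
  --attributes Key=load_balancing.cross_zone.enabled,Value=false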