Beware of Transit Gateways Bearing Large Packets
In May 2021, seek.com.au had an outage that resulted in some users seeing errors when searching for jobs. The outage lasted 16 minutes. Not great, but certainly not as impactful as recent high-profile internet company outages.
What was interesting about this outage wasn’t how big or impactful it was. What was interesting was the root cause of the downtime. The investigation took us to a place that many developers fear: the guts of the TCP networking protocol. In this post, I’ll walk you through some important features of the TCP protocol, how AWS’s managed networking devices handle packets, and how all of these things combined to cause the outage mentioned above.
How big can a packet get?
In order to understand the root cause of this outage, it’s important to have an understanding of MTU and MSS, and how they interact with layer 3 routers.
The Maximum Transmission Unit (MTU) is the size (in bytes) of the largest IP packet that a particular network device is able to handle. There are related concepts at lower levels of the networking stack, but AWS handles these for us.
There is a separate but related concept in the TCP protocol called the Maximum Segment Size (MSS). MSS is an optional field in the TCP header, set by each side of a TCP connection, that advertises the largest TCP segment that the sender wishes to receive. Cloudflare have a great page, What is MSS?, with more details.
In order to get our bearings, it’s useful to understand ‘typical’ values of these parameters. The standard MTU on the internet is 1500 bytes (largely for historical reasons, see How 1500 bytes became the MTU of the internet). This means that MSS values are typically 1460 bytes, because the MSS doesn’t include the size of the TCP header or the IP header. These are both 20 bytes, so 1500 - 20 - 20 = 1460.
In general, being able to transmit using a larger MTU is better. Packets have overhead, namely the headers that are necessary to ensure they end up where they need to be. For every packet you send, you ‘waste’ a bit of the packet’s capacity on the headers. If you can stuff more data into a packet, your transmission is more efficient, because the ratio of headers to data (the overhead of the packets) is better: you spend the same number of bytes on headers, but you’ve sent more data for the cost of those headers. To enable this, some networking devices support an MTU of 9001. This is usually called ‘jumbo frames’ support, in reference to the Ethernet feature of the same name.
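To put some rough numbers on that, assume 40 bytes of IP and TCP headers per full-sized packet and ignore lower-layer framing (a back-of-the-envelope sketch, not an exact accounting):

```
$ awk 'BEGIN { printf "MTU 1500: %.1f%% header overhead\n", 40 / 1500 * 100 }'
MTU 1500: 2.7% header overhead
$ awk 'BEGIN { printf "MTU 9001: %.1f%% header overhead\n", 40 / 9001 * 100 }'
MTU 9001: 0.4% header overhead
```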
In AWS, Elastic Network Interfaces (ENIs) seem to default to an MTU of 9001. This makes sense, as in many contexts the MTU of virtual networking elements is 9001. For example, the MTU of a peering connection between Virtual Private Clouds (VPCs) in the same region is 9001. You want to default to the highest MTU possible, because that allows you to send packets with the greatest efficiency. As we’ll see below, sending larger packets than the intermediate routers can handle is usually no problem.
What happens when a packet is bigger than a routing device can handle?
Scenarios where devices along the path of a packet are not able to handle large IP packets are common. To deal with such scenarios, there are two main options: fragmentation and MSS clamping.
One option is for routers along the path to fragment the packet themselves. This means taking the IP packet and splitting it up into smaller packets that fit within the MTU of a given device. This has a number of issues associated with it, and in modern networking, fragmentation is considered fragile and potentially harmful. For a thorough treatment, this Stack Overflow question provides good reasons: How bad is ip fragmentation? What is relevant to us is that senders of IP packets can set a ‘Don’t Fragment’ flag in the IP header, which signals to routers that this packet shouldn’t be fragmented.
The ‘Don’t Fragment’ flag needs an escape hatch from the router’s perspective. If the router can’t handle a particularly large packet, and the ‘Don’t Fragment’ flag is set, the router can drop the packet and return an ICMP ‘Fragmentation Needed’ message to the sender to indicate that it is unable to handle the packet. The sender is then supposed to fragment the packet itself and resend. This allows for something called Path MTU Discovery (PMTUD), where the networking stack keeps track of any ‘Fragmentation Needed’ messages it has received, and splits future packets up as appropriate to keep packet sizes small. This is the default behaviour of TCP, in that it always sets the ‘Don’t Fragment’ flag and uses PMTUD to discover the size of packet that will fit through all network devices on the path.
The diagram below shows a packet flow between a server and a router, where the router’s MTU is smaller than the server’s. The router responds with a ‘Fragmentation Needed’ message to the server’s initial large packet, and then the server responds by fragmenting the large packet into two smaller ones.
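If you want to see PMTUD in action from a Linux host, a quick way is to send pings with the ‘Don’t Fragment’ flag set and an oversized payload, or to let tracepath do the probing for you. A rough sketch (the host name is a placeholder):

```
# 1472 bytes of ICMP payload + 8 bytes of ICMP header + 20 bytes of IP header = 1500,
# the largest packet that fits through a 1500-byte MTU.
ping -M do -s 1472 host.example.internal

# One byte more and the packet no longer fits; with 'Don't Fragment' set ('-M do'),
# expect a 'message too long' or 'Fragmentation Needed' error instead of a reply.
ping -M do -s 1473 host.example.internal

# tracepath probes the path hop by hop and reports the path MTU it discovers.
tracepath host.example.internal
```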
Another method of resolving small MTUs on a path is called MSS clamping. This method is specific to TCP connections. Recall that MSS is a field in the TCP header that indicates how large a packet the sender would like to receive (measured in bytes, excluding the IP and TCP headers). In MSS clamping, a device that sees a TCP SYN or SYN ACK containing an MSS larger than the MTU it can handle rewrites that MSS to be smaller. This signals to the receiver of the packet that it should send smaller packets which fit within the MTU of the intermediate device. But importantly, this is done transparently to the sender. The sender may be able to handle, say, 9001-byte packets, so it sets the MSS in the TCP SYN it sends accordingly. But a router along the way notices this, and knows that if the receiver of that SYN does try to send 9001-byte packets, the router will not be able to handle them, because its MTU is, say, 1500. To avoid issues, it ‘clamps’ the MSS in the SYN packet by rewriting the MSS value to 1500 (minus headers). The receiver then knows to only send packets of at most 1500 bytes.
In the diagram below, two servers, both with MTUs of 9001, are establishing a TCP connection with each other via a router that only has an MTU of 8000. However, the router is configured with MSS clamping. Note that the SYN and SYN ACK packets that transit the router have their MSS changed to match the MTU of the router. When each end of the connection begins sending packets, they are 8000-byte packets as signalled by the clamped MSS, and so the router can handle them.
(Note: the above diagram is simplified, in that the MSS values are directly comparable to the MTU values, to show how the clamping works. As described above, the MSS value is actually 40 bytes smaller because it doesn’t include TCP or IP headers.)
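AWS does this clamping for us on its managed devices, but for a concrete picture of the mechanism, here is roughly how MSS clamping is often configured on a Linux router using iptables. This is a sketch of the general technique, not what AWS runs internally:

```
# Rewrite the MSS option on forwarded SYN packets to match the MTU of the
# outgoing route (i.e. clamp it to the path MTU).
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

# Or clamp to an explicit value: a router with an 8000-byte MTU would allow
# 8000 - 40 = 7960 bytes of TCP payload per packet.
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 7960
```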
Migrating from peering connections to transit gateways
Now that we’ve explained the background of MTU and MSS, we’ll get down to explaining what went wrong.
In the lead-up to the outage, SEEK were in the process of performing a migration. This migration changed the way that traffic flowed between VPCs in our AWS environment. Previously, SEEK had set up VPC peering connections between VPCs. As many AWS users discover, this works fine for a while, but at some point, the number of peering connections explodes. Once you add more VPCs, you want to connect them up to the existing ones. This requires a new peering connection for every pair of VPCs. Now we’ve got a problem that scales like n², which is a classic bad situation.
The solution to this problem is to use AWS transit gateways. These allow you to solve the n² scaling by connecting all of your VPCs to a single transit gateway (n connections), and then configuring the transit gateway to send traffic between the various VPCs.
We wanted to migrate a number of peering connections to use our transit gateway instead. So we set about designing and testing a cutover process to allow the traffic between these VPCs to smoothly transition from using the peering connection to using the transit gateway, with zero downtime.
We came up with the following migration process:
- Connect both VPCs to the transit gateway, but without any routes. This effectively attaches the transit gateway to the VPCs, but leaves the connection unused, and traffic flows unchanged.
- Update the routing table in the first VPC to use the transit gateway (see the sketch after this list).
- Update the routing table in the second VPC to use the transit gateway.
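For concreteness, steps 2 and 3 amount to replacing the route for the other VPC’s CIDR so that it points at the transit gateway attachment instead of the peering connection. A rough sketch with the AWS CLI (the IDs and CIDR below are placeholders, and in practice your tooling may be CloudFormation or Terraform rather than ad-hoc CLI calls):

```
# Step 2 (sketch): in the first VPC's route table, repoint the route for the
# other VPC's CIDR from the peering connection to the transit gateway.
aws ec2 replace-route \
  --route-table-id rtb-0123456789abcdef0 \
  --destination-cidr-block 10.1.0.0/16 \
  --transit-gateway-id tgw-0123456789abcdef0

# Step 3 is the same command run against the second VPC's route table,
# using that VPC's route table ID and the first VPC's CIDR.
```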
After step 2, but before step 3, traffic takes a different path out of the VPC than it takes coming back in. This is shown in the diagram below, and is crucial to understanding this outage.
How do peering connections and transit gateways handle large packets?
The important piece of information we didn’t have before performing this migration is that the MTUs of a peering connection and a transit gateway are different.
Peering connections within the same region have an MTU of 9001. However, Transit Gateways have an MTU of 8500.
There are also a few features of transit gateways that make them different to peering connections. As per the AWS Quotas for your transit gateway page, Transit Gateways:
- Drop packets larger than 8500 bytes (this is standard behaviour, and is equivalent to saying they have an MTU of 8500)
- Do not generate FRAG_NEEDED ICMP responses when packets are too large, meaning path MTU discovery is not supported
- Enforce MSS clamping for all packets.
With this important info out of the way, we’re ready to explain what went wrong.
What went wrong?
The outage happened because all of the MTU resolution mechanisms we have mentioned failed to work as expected when faced with large packets and a specific set of networking conditions.
Firstly, TCP always sets the ‘Don’t Fragment’ flag, so having a router fragment the packet for us cannot happen.
Secondly, Transit Gateways don’t support PMTUD, as discussed above. If a packet larger than the 8500-byte MTU arrives at a Transit Gateway, it is dropped silently, and no ICMP ‘Fragmentation Needed’ message is returned. There is no metric for these dropped packets. The only way we discovered that packets were being dropped was by comparing metrics on the transit gateway attachments: the ‘packets in’ count at the attachment on one VPC was roughly 2x the ‘packets out’ count at the attachment on the other VPC. This behaviour is documented on the Quotas for your transit gateway page. While not entirely out of place there, ‘this device will drop packets and doesn’t follow the normal methods for notifying you about it’ feels less like a ‘quota’ and more like a fundamental piece of information that should be easier to find.
Thirdly, MSS clamping only works if both sides of the TCP connection are using it. This is because MSS is not a negotiation, but an announcement. Both sides of the TCP connection add the MSS header to indicate the largest packet they can accept. Normally this works fine, because both packets in the TCP handshake will be clamped to the same MSS which will fit inside the MTU of the device doing the clamping. But what happens if only one side of the TCP connection has its MSS clamped?
After executing step 2 of the migration process above, when VPC A sends packets to VPC B, it uses the peering connection. Therefore, its packets are not MSS clamped. This indicates to VPC B that it can send packets of size 9001. When VPC B sends packets back to VPC A, its packets go via the transit gateway, and so its MSS is clamped to 8500. But we’ve effectively clamped the wrong packet.
- The VPC A side of the connection thinks it should only send 8500-byte packets (because the MSS of the packet it received from VPC B was clamped by the transit gateway). However, these packets take the peering connection, which has an MTU of 9001, so they get through fine.
- The VPC B side of the connection thinks it can still send 9001-byte packets (because the MSS of the packet it received from VPC A was not clamped by the peering connection), but the MTU of the transit gateway is 8500. When the packets get there, they are dropped.
- Normally, the device dropping the packets would send back an ICMP ‘Fragmentation Needed’ message, but transit gateways don’t do this.
In the diagram below, VPC A sends packets of size 8500 bytes (at most) to VPC B, because it received the SYN ACK packet with the clamped MSS above. These arrive fine at the machine in VPC B. However, VPC B received the un-clamped SYN packet with an MSS of 9001. The result is that it tries to send a 9001-byte packet, which is then dropped by the transit gateway with its MTU of 8500.
When the incident occurred, we were migrating the peering connection that handled traffic for searches on seek.com.au. As you can imagine, a search for ‘all jobs in Melbourne’ returns a large number of results, producing responses big enough to be split into packets of up to 9001 bytes. We noticed fairly quickly that search wasn’t working, and so rolled back the migration.
We were able to validate this by replicating the setup of EC2 instances, transit gateways and peering connections. Requesting a large file served by the machine in VPC B from a machine in VPC A over HTTP caused the request to hang. Packet captures from both sides of the connection revealed what we expected: the machine in VPC B sends the large 9001-byte packet, but it never arrives at the machine in VPC A, leading to endless TCP retransmissions.
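If you ever need to check what MSS each side is actually advertising (and whether something along the path has rewritten it), capturing just the handshake is enough. Something like the following, where the interface name is an assumption, prints the TCP options of SYN and SYN ACK packets, including the (possibly clamped) MSS:

```
# Capture only packets with the SYN flag set (SYNs and SYN ACKs) and print
# their TCP options verbosely; look for the 'mss' value in the options list.
sudo tcpdump -ni eth0 -v 'tcp[tcpflags] & tcp-syn != 0'
```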
Mitigation
Mitigation for this is fairly straightforward when you know what’s happening. Don’t send packets larger than 8500 bytes! For us, this meant lowering the MTU of EC2 instances in our environments to 8500. How to do this will vary based on your operating system, but for Amazon Linux 2 you can accomplish this by putting a script at /sbin/ifup-local, which will be called every time a new network interface is brought up, with the name of the interface (e.g. eth0) as an argument. You can use this script to override the default MTU of the instance, and set it to 8500.
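As a sketch, assuming the standard iproute2 tooling is available on the instance, the script can be as small as this:

```
#!/bin/bash
# /sbin/ifup-local is called with the name of the interface that was just
# brought up (e.g. eth0) as its first argument. Cap its MTU at 8500 so we
# never send packets larger than the transit gateway can handle.
ip link set dev "$1" mtu 8500
```

The script needs to be executable (chmod +x /sbin/ifup-local) for the network scripts to run it.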
Wrap up
This was a really interesting issue that taught us a lot about MTUs, MSS, and some of the finer points of managed AWS networking appliances. It’s not necessarily the severity of the issue that determines how much you learn from it, but often, it’s the curiosity and the drive to understand the root cause that leads to greater knowledge.
If these sorts of problems excite you, SEEK is hiring! We’re hiring developers across a wide range of roles, from delivering customer-facing features to internal-facing developer productivity, tooling, and shared platforms. Get in touch at nskoufis (at) seek.com.au