Operation Jumbo Drop: How sending large packets broke our AWS network

Aaron Kalair · Published in In The Hudl · Mar 23, 2022 · 12 min read

Introduction

Hudl recently ran into some really strange networking failures in our testing environment — we tracked them down to a pretty obscure network setting in AWS EC2.

If you’re interested in Chef or Linux debugging, AWS networking troubleshooting, or just love a good rabbit hole, read on.

Background

Every weekend at Hudl we run a process to restore a sanitised version of our production databases to our staging environment, to give our developers realistic data to test with before going to production.

This Monday we woke up to a Slack room full of errors from the restoration process, telling us that the Chef run which configures the databases had failed on every single EC2 instance.

The error we were presented with wasn’t super helpful, “EOFError, end of file reached.”

It wasn’t clear what file it was talking about. The recipe it referenced does write out several configuration files, but those hadn’t changed in years, and there was no obvious reason for them to suddenly break.

When we logged on to the server ourselves and ran chef-client, we were able to replicate the exact error. So at least we had a starting point.

Is it an access key error?

Our first suspicions were that the access keys the server was using to talk to the Chef Server were invalid. We’d recently rotated the Chef keys associated with this process and could have broken them.

It seemed unlikely as chef-client had clearly been able to pull cookbooks down and start running, but it was the only thing we’d changed recently.

Knife is a CLI that administrators can use to interact with the Chef server. In theory it uses the same API calls as chef-client, so we attempted to use knife to communicate with the Chef server to see if the key was valid.

knife node list -c /etc/chef/client.rb

We were able to retrieve a list of all our nodes registered with Chef and the keys seemed fine.

What do the network hangs mean?

Running Chef again, we noticed that it hung for around two minutes before printing the EOFError. If we killed the chef-client run with Ctrl-C, the stack trace indicated it was stuck trying to write a node attribute back to the Chef Server.

5: node.normal[:aws][:ebs_volume] = Hash[node[:volumes].map { |v| [
6: v.fetch(:name, v[:device][:path].split('/')[-1]),
7: {:volume_id => v[:device].fetch(:volume_id, nil)}]}].reject { |k, v| v[:volume_id].nil? }
8>> node.save

We repeated this a few times and it was consistently stuck at the same point.

Was this some sort of networking issue talking to the Chef Server?

It seemed odd that we’d have a networking issue talking to the Chef Server: we’d successfully pulled down all the cookbooks from it, and had been able to talk to it with knife.

We decided to use knife to try and save a node attribute; this was, after all, what the chef-client seemed to get stuck on.

We ran:

EDITOR=vim knife node edit <NODE_NAME> -c /etc/chef/client.rb

And edited a random attribute, which worked fine.

What does the Chef Server see?

We were slightly confused as to why chef-client was failing to save a node attribute to the server, as we could do this using knife. We decided to run chef-client with debug logs enabled, hoping that it would give us some context around the error.

chef-client -l debug

[2022-02-28T11:18:53+00:00] DEBUG: Initiating POST to https://chef12-server.app.hudl.com/organizations/hudl/reports/nodes/<node name>/runs/<run_id>
[2022-02-28T11:18:53+00:00] DEBUG: Content-Length: 301
[2022-02-28T11:18:53+00:00] DEBUG: ---- End HTTP Request Header Data ----
[2022-02-28T11:18:53+00:00] DEBUG: ---- HTTP Status and Header Data: ----
[2022-02-28T11:18:53+00:00] DEBUG: HTTP 1.1 200 OK
[2022-02-28T11:18:53+00:00] DEBUG: server: openresty/1.7.10.1
[2022-02-28T11:18:53+00:00] DEBUG: date: Mon, 28 Feb 2022 11:18:53 GMT
[2022-02-28T11:18:53+00:00] DEBUG: content-type: application/json
[2022-02-28T11:18:53+00:00] DEBUG: content-length: 2
[2022-02-28T11:18:53+00:00] DEBUG: connection: close
[2022-02-28T11:18:53+00:00] DEBUG: ---- End HTTP Status/Header Data ----
[2022-02-28T11:18:53+00:00] DEBUG: EOFError: end of file reached

The logs seemed consistent with the stack trace we’d seen before. The HTTP POST was presumably the update to the node attribute.

We’d come as far as we could on this server, and decided to see if the logs on the other side of the HTTP connection had any more information.

The Nginx log on the Chef Server had more information on the request: it had returned an HTTP 408, which is a request timeout.

172.27.75.108 - - [28/Feb/2022:11:57:20 +0000] "PUT /organizations/hudl/nodes/<node name> HTTP/1.1" 408 "60.058" 0 "-" "Chef Client/12.14.89 (ruby-2.3.1-p112; ohai-8.20.0; x86_64-linux; +https://chef.io)" "-" "-" "-" "12.14.89" "algorithm=sha1;version=1.1;" "<node name>" "2022-02-28T11:55:20Z" "<random id>" 1164
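
For anyone retracing this: on a standard Chef Server 12 install the Nginx access log should live under /var/log/opscode/nginx/ (the path here is an assumption, so adjust it for your install), and a quick grep surfaces the timeouts:

# Assumed log location for a standalone Chef Server 12 install
grep ' 408 ' /var/log/opscode/nginx/access.log | tail -n 5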

We now seemed to be getting somewhere, but what on the Chef Server was causing the requests to time out?

Is the Chef Server healthy?

The Chef Server runs a lot of different processes, from RabbitMQ, Solr, and Redis to the Chef Server services themselves. This makes it hard to judge the overall “health” of the Chef Server, so we had to check that each process was running and read through its logs, one by one.
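
Chef Server ships with an omnibus-ctl wrapper that makes this slightly less painful; something like the following gives a quick overview of which services are up and lets you tail all of their logs in one stream:

# Show the run state of every Chef Server service (erchef, nginx, postgresql, rabbitmq, solr, redis, ...)
chef-server-ctl status

# Tail the logs of all services together
chef-server-ctl tail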

There were a few red herrings in the log files, errors that are apparently normal, but overall nothing seemed obviously broken.

Is chef-client broken everywhere?

Confused about what was happening, we started to look at whether the problem was limited to our newly restored databases or was something bigger.

All our production servers could still run Chef fine, but we could replicate the same error on all our servers inside our staging VPC.

Has anyone else had this issue?

Even more confused, we turned to Google to see if others had this oddly specific error where some servers couldn’t write to the Chef Server.

It turns out others had experienced almost the exact same error: https://github.com/chef/chef/issues/1937.

Worryingly, the issue seemed to be related to low-level networking settings: https://github.com/chef/chef/issues/1937#issuecomment-53661631.

It seemed that, in some environments, a large MTU setting could leave servers unable to send requests over a certain size to the Chef Server.

What is an MTU?

The MTU setting on a network interface specifies the maximum size of the packets the interface sends.

Due to the overhead of things like headers on the packets, it’s more efficient to send fewer but larger packets.

But the internet is littered with an assortment of networking devices, some of which can only handle smaller packets.

This means you may be forced to send smaller packets to ensure all the devices they pass through can handle them.

A generally accepted MTU is 1500 bytes. Some networks, such as those inside an AWS VPC, support an MTU of 9001 (referred to as jumbo frames), and it appears the AMIs AWS provides launch EC2 instances with the MTU set to 9001, which is what our EC2 instances had.
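
If you want to check this on your own instances, the interface MTU is visible with ifconfig (as we show later) or with the more modern ip tool; a quick sketch, assuming the interface is eth0:

# Show the current MTU on eth0 (interface names will vary)
ip link show dev eth0

# Show the route towards the Chef Server (10.110.78.16 in our case);
# a cached path MTU, if one exists, shows up in this output too
ip route get 10.110.78.16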

Can we replicate the issue reported on Github?

This seemed like a plausible explanation for why our tests with knife worked (those requests are small) but the chef-client run, which saves a much larger node object, still failed. We were still slightly sceptical, though, and wanted to confirm this really was the issue.

Both our Chef Server and Database had MTUs of 9001.

[ec2-user@mongo ~]$ ifconfig
eth0 Link encap:Ethernet HWaddr 0E:36:F3:9C:96:6D
inet addr:172.27.75.108 Bcast:172.27.79.255 Mask:255.255.248.0
inet6 addr: fe80::c36:f3ff:fe9c:966d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9001 Metric:1
[root@chef-server ~]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet 10.110.78.16 netmask 255.255.224.0 broadcast

On the Chef Server we ran netcat in listening mode, printing out the number of bytes it received.

[root@chef-server ~]# nc -l 9234 | wc --bytes

On the Mongo server we used dd to generate fixed-size chunks of data and send them to our netcat process.

[ec2-user@t-augmentation-mongo-rs1-use1d-520nmi ~]$ dd if=/dev/urandom count=1 bs=9002 | nc chef12-server.app.hudl.com 9234

We couldn’t replicate the exact findings from the GitHub issue thread: payloads of 9,002 bytes, 1,500 bytes, and every size in between arrived fine.

We tried some larger sizes at random, and 20,000-byte payloads did replicate the issue: they were never received by the Chef Server.

Some binary searching led us to the tipping point: 10,497 bytes. Anything at or above that size failed; anything smaller worked.
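
If you want to reproduce this kind of sweep, here is a rough sketch of what we were doing by hand. The hostname and port are from our setup, send_bytes is just a name we picked for the helper, and nc flag behaviour varies slightly between netcat variants:

# On the Chef Server: listen and count the bytes that arrive.
# Re-arm this between tests, since nc exits after each connection.
nc -l 9234 | wc --bytes

# On the database server: send a payload of a given size and give up after 5 idle seconds
send_bytes() {
  dd if=/dev/urandom bs="$1" count=1 2>/dev/null | nc -w 5 chef12-server.app.hudl.com 9234
}

send_bytes 9002    # arrives fine
send_bytes 10496   # arrives fine
send_bytes 10497   # never arrives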

It still isn’t clear why 10,497 is the tipping point, and not 9002, but presumably it’s related to headers on the packets or something like that.

What’s actually happening to the packets?

Curious about what was actually happening, we used tcpdump to capture the traffic on the database instance (we tried on the Chef Server too, but installing tcpdump there meant upgrading a bunch of other packages we weren’t sure were safe to touch), and the results were interesting.
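
A capture along these lines is enough to grab the test traffic for later inspection in Wireshark (the interface name will vary; eth0 on our instances):

# Capture full packets to and from the Chef Server into a pcap file
sudo tcpdump -i eth0 -s 0 -w /tmp/chef-traffic.pcap host chef12-server.app.hudl.com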

Working requests were unexciting. The TCP connection was established, the packets sent, and the connection torn down.

On the failed requests, the TCP connection was established, but after the first packet of data was sent no ACK ever came back, and the sender just went into a constant TCP retransmission loop.

So the size of the packet did seem to determine whether it reached the destination.

Does a lower MTU fix the issue?

Next, we lowered the MTU on the database server from 9001 to 1500 and repeated the tests to see if it made any difference.
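
Lowering the MTU on a Linux interface is a one-liner, though note it won’t survive a reboot unless you also change the interface configuration:

# Drop eth0 from jumbo frames (9001) to the standard 1500-byte MTU
sudo ip link set dev eth0 mtu 1500

# Confirm the change took effect
ip link show dev eth0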

It fixed the issue! Our chef-client run succeeded and by repeating the test with netcat, we confirmed data of all sizes made it to the server.

We’d narrowed down the issue, but questions still remained:

  1. What caused this to break? It worked fine last week.
  2. How were we going to fix this? Do we really have to go and change our MTU everywhere in staging? Why does it still work in Prod?

Can tracepath help?

Some Googling suggested that tracepath might be able to show us what the MTU was on the path between our database and Chef Server.

[root@t-augmentation-mongo-rs1-use1d-520nmi ~]# tracepath 10.110.78.16
 1?: [LOCALHOST]     pmtu 1500
 1:  no reply
 2:  ip-10-110-78-16.ec2.internal    1.262ms reached
     Resume: pmtu 1500 hops 2 back 1

As expected, it confirmed a path MTU of 1500.

To experiment, we raised the database server’s MTU back up to 9100 and ran the tool again.

[root@t-augmentation-mongo-rs1-use1d-520nmi ~]# tracepath 10.110.78.16
 1?: [LOCALHOST]     pmtu 9100
 1:  no reply
 2:  no reply
<snip>
31:  no reply
     Too many hops: pmtu 9100
     Resume: pmtu 9100

This time it didn’t work at all. We were under the impression that tracepath would try different MTUs until it found one that worked, but it just seemed broken.

We played around with tracepath some more but were never able to get it to work unless we manually lowered the server’s MTU to 1500.

What about Path MTU Discovery?

This wasn’t getting us any closer to a fix though, so next we turned our attention to Path MTU Discovery.

Path MTU Discovery is supposed to be a way for servers with differing MTUs to negotiate a mutually compatible MTU setting for their packets.

If a device between the servers, or the server itself, receives a message that is too large for it to process, it drops the packet and sends back an ICMP Fragmentation Needed packet. This signals to the client that it must reduce its MTU and send the packet again.
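
One way to watch this in action is to ping with the Don’t Fragment bit set and a payload just under the local MTU; on a path where PMTUD works you get an ICMP “Fragmentation Needed” reply telling you the real limit, whereas on our path we’d expect the probes to simply time out. A rough sketch:

# 8972 bytes of ICMP payload + 8-byte ICMP header + 20-byte IP header = 9000-byte packets
# -M do sets the Don't Fragment bit, so routers must reject rather than fragment oversized packets
ping -M do -s 8972 -c 3 10.110.78.16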

Why wasn’t this working for us, and allowing our servers to negotiate a compatible MTU?

We suspected that maybe the security groups were blocking the ICMP messages. As an experiment we altered the security groups for both the Chef Server and Database to allow ICMP packets in from anywhere, but it made no difference.
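
For reference, opening a security group to ICMP can be done in the console or with the AWS CLI; the group ID below is a placeholder:

# Allow all ICMP types and codes in from anywhere (placeholder group ID)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol icmp --port -1 \
  --cidr 0.0.0.0/0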

How do the Prod and Staging networks differ?

Running out of ideas, we turned our attention to working out why the Prod servers could reach the Chef Server but the staging servers could not. What were the differences in their networks?

Our Chef Server lives in its own VPC for “internal tools,” whilst Prod and Staging are separated into their own VPCs.

During this investigation, we discovered that just before the weekend, the staging VPC had been altered by another team to send its packets to the internal tools VPC via a Transit Gateway. However, the return packets still went down a VPC Peering connection.

The Prod VPC was untouched, and used a separate proxy for communicating with the Chef Server.
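
In hindsight, the asymmetry is easy to spot by dumping the routes for each VPC and comparing the targets in the two directions; a sketch, with a placeholder VPC ID:

# Dump destination, transit gateway and peering targets for a VPC's route tables
# (run once for the staging VPC and once for internal tools, then compare)
aws ec2 describe-route-tables \
  --filters Name=vpc-id,Values=vpc-0123456789abcdef0 \
  --query 'RouteTables[].Routes[].[DestinationCidrBlock,TransitGatewayId,VpcPeeringConnectionId]' \
  --output table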

This seemed like it must be the issue, but it was still unclear why routing the traffic through the Transit Gateway in only one direction would break things.

How did the Transit Gateway cause the issue?

Reading the AWS documentation on MTUs, there’s a small note about MTUs and Transit Gateways:

“For more information about supported MTU sizes for transit gateways, see MTU in Amazon VPC Transit Gateways.”

Buried in the “Quotas” section of the MTU docs are these useful points:

A transit gateway supports an MTU of 8500 bytes for traffic between VPCs, Direct Connect gateway, and peering attachments. Traffic over VPN connections can have an MTU of 1500 bytes.

When migrating from VPC peering to use a transit gateway, an MTU size mismatch between VPC peering and the transit gateway might result in some asymmetric traffic packets dropping. Update both VPCs at the same time to avoid jumbo packets dropping due to a size mismatch.

Packets with a size larger than 8500 bytes that arrive at the transit gateway are dropped.

The transit gateway does not generate the FRAG_NEEDED for ICMPv4 packet, or the Packet Too Big (PTB) for ICMPv6 packet. Therefore, the Path MTU Discovery (PMTUD) is not supported.

We had essentially just reverse engineered all of this information.

Transit Gateways don’t support MTUs of 9001, they don’t support Path MTU Discovery, and asymmetric paths can cause issues, all of which we’d been experiencing.

Some more Googling revealed this fantastic blog post, where they’d experienced the exact same issue we had: https://medium.com/seek-blog/beware-of-transit-gateways-bearing-large-packets-77702c4c1b20.

We won’t repeat all of what they’ve explained excellently. Go and read that post for the full technical details on what went wrong here and then come back.

The short version, when using TCP connections, is:

  • Networking devices such as Transit Gateways can rewrite the value of a field in the TCP header during connection establishment, called the Maximum Segment Size (MSS)
  • This field tells the receiver the maximum size of the segments it should send in response
  • Packets sent from the database to the Chef Server had their MSS clamped to fit the Transit Gateway’s 8,500-byte limit as they passed through it
  • Packets sent from the Chef Server to the database did NOT have their MSS clamped, as they didn’t pass through the Transit Gateway, so the database server never learnt that it shouldn’t send packets larger than 8,500 bytes
  • The initial connection could still be established over these asymmetric paths because the SYN / SYN-ACK packets are naturally much smaller than 8,500 bytes

How did we verify that asymmetric network paths were indeed the issue?

We altered the route tables to send the traffic in both directions through the Transit Gateway, and everything started working again with the MTU still set to 9001.
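
The change itself was a route-table update along these lines: point the return route for the staging range at the Transit Gateway instead of the old peering connection. The IDs are placeholders, and the CIDR is just the database subnet from the earlier ifconfig output, used as an example:

# Send return traffic for the staging range via the transit gateway
# instead of the VPC peering connection (placeholder IDs)
aws ec2 replace-route \
  --route-table-id rtb-0123456789abcdef0 \
  --destination-cidr-block 172.27.72.0/21 \
  --transit-gateway-id tgw-0123456789abcdef0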

We took packet captures again whilst sending 10,497-byte payloads from the database to the Chef Server, to verify that the MSS was being altered now the fix was in place.
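
You don’t strictly need Wireshark to see the MSS: tcpdump prints the TCP options for SYN packets directly, so a filter on the handshake is enough (port 9234 is the netcat test port from earlier):

# Show only SYN and SYN-ACK packets on the test port; the TCP options,
# including "mss ...", are printed for each one
sudo tcpdump -n -i eth0 'port 9234 and tcp[tcpflags] & tcp-syn != 0'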

Let’s look at what we have from the database client before the fix.

On the TCP SYN packet the database sent to the Chef Server, the MSS is 8961.

On the TCP SYN-ACK the Chef Server sent back to the database, the MSS is also 8961.

Now compare that to the database’s perspective after the issue was fixed.

The SYN packet the database sends to the Chef Server still has an MSS of 8961.

The SYN-ACK the Chef Server sends back to the database has a smaller MSS of 8460.

There we have it: confirmation that the MSS was not being clamped to the smaller supported maximum while we had the issue.

After the fix, with traffic in both directions going through the Transit Gateway, the gateway rewrites the MSS to the correct lower value, so the clients send appropriately sized packets through it.

Conclusion

This investigation took around a day and a half to complete, and took us deep into the intricacies of how networking inside AWS works.

Without the GitHub issue and the blog post from others who had experienced similar problems, it could have taken significantly longer to fix this.

We hope this post proves similarly useful, as it highlights some of the network-debugging tools and techniques you can use to troubleshoot issues like this.

It was an interesting dive into an area we don’t normally have to touch. Most importantly, we got everything working again.
