Linux kernel bug delivers corrupt TCP/IP data to Mesos, Kubernetes, Docker containers

Vijay Pandurangan
Feb 11, 2016

(Edit: This article is now up on Hacker News; feel free to discuss there.)

The Linux kernel has a bug that causes containers that use veth devices for network routing (such as Docker on IPv6, Kubernetes, Google Container Engine, and Mesos) to skip TCP checksum validation. As a result, applications can receive corrupt data in a number of situations, for instance when networking hardware is faulty. The bug dates back at least three years and is present in kernels as far back as we’ve tested. Our patch has been reviewed and accepted into the kernel, and is currently being backported to -stable releases back to 3.14 in various distributions (such as SUSE and Canonical). If you use containers in your setup, I recommend you apply this patch or deploy a kernel with this patch when it becomes available. Note: Docker’s default NAT networking is not affected and, in practice, Google Container Engine is likely protected from hardware errors by its virtualized network.

Edit: Jake Bower points out that this bug is similar to one PagerDuty discovered a while back. Interesting!

How it all started

One weekend in November, a group of Twitter engineers responsible for a wide variety of services got paged. Each affected application showed “impossible” errors, like weird characters appearing in strings or missing required fields. Because of the distributed nature of Twitter’s architecture, the source of these errors was not apparent. Exacerbating the problem: in any distributed system, corrupted data can cause errors long after the original corruption, because it gets stored in caches, written to logs on disk, and so on. After a day of working around the clock to troubleshoot at the application layer, the team was able to isolate the problem to certain racks of machines. The team investigated and noticed that incoming TCP checksum errors had increased significantly just before the first impacts began. This result seemed to absolve the applications from blame: an application can cause network congestion, but not packet corruption!

(Edit: talking about “the team” might obscure how many people worked on this. A ton of engineers across the company worked to diagnose these strange failures. It’s hard to list everyone, but major contributors include: Brian Martin, David Robinson, Ken Kawamoto, Mahak Patidar, Manuel Cabalquinto, Sandy Strong, Zach Kiehl, Will Campbell, Ramin Khatibi, Yao Yue, Berk Demir, David Barr, Gopal Rajpurohit, Joseph Smith, Rohith Menon, Alex Lambert, Ian Downes, and Cong Wang.)

Once those racks were removed, the application failures vanished. Of course, network-layer corruption can happen for many reasons: switch hardware can fail in bizarre ways, wires can be faulty, power can be flaky, etc. Checksums in TCP and IP were designed to protect against exactly these failures, and indeed, statistics collected on the machines showed that errors were being detected. So how did a diverse set of applications start to fail?

After isolating the specific switches, attempts were made to reproduce these errors (mostly through a ton of work by SRE Brian Martin). It was relatively easy to cause corrupt data to be received by sending a lot of data to those racks. On some of the switches, as many as ~10% of packets were corrupted. However, the corruption was always caught by the TCP checksums in the kernel (reported as TcpInCsumErrors by netstat -s) and never delivered to the application. (On Linux, IPv4 UDP packets can be sent with checksums disabled using the undocumented SO_NO_CHECK option, so in that case we could see that data was being corrupted.)
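
As an aside, sending checksum-free UDP traffic is straightforward. The sketch below shows roughly how an IPv4 UDP socket can be told to skip checksum generation with SO_NO_CHECK; the destination address and port are placeholders, and the constant may need to be defined by hand because not all libc headers expose it.

/* Minimal sketch: send an IPv4 UDP datagram with kernel checksum
 * generation disabled via SO_NO_CHECK.  Address and port are placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_NO_CHECK
#define SO_NO_CHECK 11   /* Linux-specific; not always exposed by libc headers */
#endif

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int one = 1;

    /* Ask the kernel not to compute a UDP checksum on transmit. */
    if (setsockopt(fd, SOL_SOCKET, SO_NO_CHECK, &one, sizeof(one)) < 0)
        perror("setsockopt(SO_NO_CHECK)");

    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9999);                       /* placeholder port */
    inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);   /* placeholder address */

    const char msg[] = "checksum-free datagram";
    sendto(fd, msg, sizeof(msg), 0, (struct sockaddr *)&dst, sizeof(dst));
    close(fd);
    return 0;
}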

Evan Jones (@epcjones) had a theory that the corrupted data happened to have a valid TCP checksum. If two bits are flipped in opposite directions (one 0→1 and one 1→0) at the same position within their respective 16-bit words, their effects on the TCP checksum cancel one another out (TCP’s checksum is essentially a ones’ complement sum of 16-bit words). While the corrupted data was always at a fixed bit position in the message (mod 32 bytes), the fact that it was a stuck bit (always 0→1) eliminated that possibility. Alternatively, since the checksum itself is negated before it is stored, a bit flip in the checksum combined with a bit flip in the data could cancel each other out. However, the bit position that we observed being corrupted could not touch the TCP checksum, so this seemed impossible.
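
To make the cancellation argument concrete: the Internet checksum is the negation of a ones’ complement sum of 16-bit words, so a 0→1 flip in one word and a 1→0 flip in the same bit column of another word leave the sum, and therefore the checksum, unchanged. The following is a toy illustration of that property, not the kernel’s implementation.

/* Toy illustration of the Internet checksum: a ones' complement sum of
 * 16-bit words, negated.  Not the kernel's implementation. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

static uint16_t inet_checksum(const uint16_t *words, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += words[i];
        sum = (sum & 0xFFFF) + (sum >> 16);   /* fold the end-around carry */
    }
    return (uint16_t)~sum;
}

int main(void) {
    uint16_t a[] = {0x1230, 0x5674, 0x9abc};   /* "original" payload words */
    uint16_t b[] = {0x1230, 0x5674, 0x9abc};   /* "corrupted" copy */

    /* Flip the same bit column in opposite directions in two different
     * words: one goes 0 -> 1, the other 1 -> 0, and the +4/-4 cancel. */
    b[0] ^= 0x0004;   /* bit 2 of word 0: 0 -> 1 */
    b[1] ^= 0x0004;   /* bit 2 of word 1: 1 -> 0 */

    printf("original  checksum: 0x%04x\n", inet_checksum(a, 3));
    printf("corrupted checksum: 0x%04x\n", inet_checksum(b, 3));  /* identical */
    return 0;
}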

Soon, the team realized that while our tests were on a normal Linux system, most services at Twitter run on Mesos, which uses Linux containers to isolate different applications. In particular, Twitter’s configuration creates virtual ethernet (veth) devices, and routes all of the application’s packets through these devices. Sure enough, running our test application inside a Mesos container immediately showed corrupt data being delivered on a TCP connection, even though the TCP checksums were invalid (and were detected as invalid: TcpInCsumErrors was increasing). Someone suggested toggling the “checksum offloading” setting on the virtual ethernet device, which fixed the problem, causing corrupted data to be correctly dropped.
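
(For the record, the command-line version of that workaround is just ethtool -K <veth device> rx off. Below is a rough C sketch of the same operation using the legacy ETHTOOL_SRXCSUM ioctl; the interface name is a placeholder, and newer kernels also expose this through the feature-flags ethtool interface.)

/* Sketch: disable RX checksum offload on a device, roughly what
 * "ethtool -K <dev> rx off" does, via the legacy ETHTOOL_SRXCSUM ioctl.
 * The interface name is a placeholder. */
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

static int disable_rx_csum_offload(const char *ifname) {
    struct ethtool_value ev = { .cmd = ETHTOOL_SRXCSUM, .data = 0 };
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int rc;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_data = (void *)&ev;

    rc = ioctl(fd, SIOCETHTOOL, &ifr);   /* 0 on success, -1 on error */
    close(fd);
    return rc;
}

int main(void) {
    if (disable_rx_csum_offload("veth0") != 0)   /* hypothetical device name */
        perror("ETHTOOL_SRXCSUM");
    return 0;
}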

At this point we had a valid workaround. Twitter’s Mesos team quickly pushed a fix to the open source Mesos project and deployed the setting to all of Twitter’s production containers.

The plot thickens

When Evan and I discussed the bug, we decided that since the TCP/IP contract was being broken by the OS, this could not have been the result of a Mesos misconfiguration, but must have been the result of a previously undetected bug in the kernel networking stack.

To continue our investigation of the bug, we created the simplest test harness possible:

  1. a simple client that opens a socket and sends a very simple, long message once a second (a minimal sketch of such a client appears after this list).
  2. a simple server (we actually used nc in listen mode) which listens on a socket and prints the output to the screen as it arrives.
  3. a networking tool, tc, which allows the user to randomly corrupt packets just before they are sent over the wire.
  4. Once the client and server were connected, we used the networking tool to corrupt all outbound packets for 10 seconds.
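
In code, the client from item (1) can be as small as the sketch below; the server address and port are placeholders, and the real program was no more sophisticated. (For item (3), tc’s netem queueing discipline is one way to do the corrupting: its corrupt option introduces a single-bit error at a random offset in a configurable percentage of packets.)

/* Sketch of the test client: connect over TCP and send one long, easily
 * recognizable message every second.  Address and port are placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in srv = {0};
    srv.sin_family = AF_INET;
    srv.sin_port = htons(12345);                      /* e.g. nc -l 12345 on the server */
    inet_pton(AF_INET, "192.0.2.10", &srv.sin_addr);  /* placeholder server address */

    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect");
        return 1;
    }

    char msg[1024];
    memset(msg, 'A', sizeof(msg) - 1);   /* a long, repetitive payload makes corruption easy to spot */
    msg[sizeof(msg) - 1] = '\n';

    for (;;) {
        if (write(fd, msg, sizeof(msg)) < 0) {
            perror("write");
            return 1;
        }
        sleep(1);   /* one message per second */
    }
}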

We ran the client on one desktop machine and the server on another, connected via an ethernet switch. When we ran the test harness in the absence of containers, it behaved exactly as expected: no corrupt messages were displayed, and in fact we got no messages at all during the 10 seconds when packets were being corrupted. Soon after we stopped corrupting packets, all 10 messages (which had not been delivered) arrived at once. This confirmed that TCP on Linux without containers was working as expected: bad TCP packets were being dropped and continually retransmitted until they could be received without error.

The way it should work: corrupt data are not delivered; TCP retransmits data

Linux and containers

At this point it’s helpful (as it was for us while diagnosing the issue) to take a step back and quickly describe how the networking stack works in Linux containerized environments. Containers were developed to allow user-space applications to coexist on a machine, delivering many of the benefits of virtualized environments (reducing or eliminating interference between applications, and allowing applications to run in different environments or with different libraries) while reducing the overhead of virtualization. Ideally, anything subject to contention ought to be isolated; examples include disk request queues, caches, and networking.

In Linux, veth devices are used to isolate containers from other containers running on a machine. The Linux networking stack is quite complicated, but a veth device essentially presents an interface that should look exactly like a “regular” ethernet device from the user’s perspective.

In order to construct a container with a virtual ethernet device, one must (1) create a container, (2) create a veth pair, (3) bind one end of the veth to the container, (4) assign an IP address to the veth, and (5) set up routing, usually using Linux Traffic Control, so that packets can get in and out of the container.
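
As a rough illustration of those five steps (not how Mesos or Docker actually implement them), here is a sketch that uses a network namespace to stand in for the container and shells out to the iproute2 tools; the names and addresses are made up, and real container runtimes do the equivalent work over netlink.

/* Rough sketch of steps (1)-(5), using a network namespace as the "container"
 * and the iproute2 tools for the plumbing.  Names and addresses are made up. */
#include <stdlib.h>

int main(void) {
    /* (1) create a "container" (here, just a network namespace) */
    system("ip netns add demo");

    /* (2) create a veth pair: two connected virtual ethernet endpoints */
    system("ip link add veth-host type veth peer name veth-ctr");

    /* (3) bind one end of the pair to the container's namespace */
    system("ip link set veth-ctr netns demo");

    /* (4) assign IP addresses and bring both ends up */
    system("ip addr add 10.0.0.1/24 dev veth-host");
    system("ip link set veth-host up");
    system("ip netns exec demo ip addr add 10.0.0.2/24 dev veth-ctr");
    system("ip netns exec demo ip link set veth-ctr up");

    /* (5) set up routing so packets can get in and out of the container;
     *     a default route via the host end is the simplest possible version */
    system("ip netns exec demo ip route add default via 10.0.0.1");
    return 0;
}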

Why, it’s virtual all the way down

We recreated the test harness described above, except that the server was run inside a container. When we ran the harness, we saw something completely different: corrupt data were not being dropped, but were being delivered to the application! We had reproduced the bug with a very simple test harness (two machines on our desk and two very simple programs).

Corrupted data are delivered to the application: see the window on the left!

We were able to reproduce this behavior on cloud computing platforms as well. Kubernetes’ default configuration triggers it (e.g. as used in Google Container Engine). Docker’s default configuration (NAT) is safe, but Docker’s IPv6 configuration is not.

Making things right

After reading through the kernel networking code, it became apparent that the bug was in the veth kernel module. In the kernel, packets that arrive from real hardware devices have ip_summed set to CHECKSUM_UNNECESSARY if the hardware verified the checksums, or CHECKSUM_NONE if the packet is bad or the hardware was unable to verify it.

Code in veth.c replaced CHECKSUM_NONE with CHECKSUM_UNNECESSARY; as a result, checksums that should have been verified in software were silently skipped, and corrupt packets that should have been rejected were delivered to applications. After removing this code, packets are forwarded from one stack to the other unmodified (tcpdump shows invalid checksums on both sides, as expected), and they are delivered to, or dropped by, applications correctly. We didn’t test every possible network configuration, but we tried a few common ones, such as bridging containers, using NAT between the host and a container, and routing from hardware devices to containers. We have effectively deployed this fix in production at Twitter (by disabling RX checksum offloading on veth devices).

We’re not certain why the code was written that way, but we believe it was an attempt at optimization. Often, veth devices are used to connect containers on the same physical machine. Logically, packets transmitted between containers (or between virtual machines) on the same physical host do not need to have checksums calculated or verified: the only possible source of corruption is the host’s own RAM, since the packets never go over a wire. Unfortunately, this optimization doesn’t even work as intended: locally generated packets have ip_summed set to CHECKSUM_PARTIAL, not CHECKSUM_NONE.
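
To summarize the states involved, here is a simplified model of what ip_summed communicates to the receive path and why rewriting CHECKSUM_NONE into CHECKSUM_UNNECESSARY is fatal; it is an illustration, not the kernel’s actual definitions or logic.

/* Simplified model (not kernel code) of what skb->ip_summed tells the
 * receive path, and why rewriting CHECKSUM_NONE to CHECKSUM_UNNECESSARY
 * causes corrupt packets to be delivered. */
#include <stdbool.h>
#include <stdio.h>

enum ip_summed {
    CHECKSUM_NONE,        /* nobody has verified the checksum yet */
    CHECKSUM_UNNECESSARY, /* someone trustworthy (e.g. the NIC) already verified it */
    CHECKSUM_COMPLETE,    /* hardware supplied a raw sum; software finishes the check */
    CHECKSUM_PARTIAL      /* locally generated; checksum is filled in on transmit */
};

/* Must the stack still validate this packet's checksum in software? */
static bool must_validate_in_software(enum ip_summed summed) {
    return summed == CHECKSUM_NONE || summed == CHECKSUM_COMPLETE;
}

int main(void) {
    /* A corrupt packet arrives from hardware that could not verify it. */
    enum ip_summed as_received = CHECKSUM_NONE;

    /* The buggy veth code effectively did this before forwarding it on: */
    enum ip_summed after_buggy_veth = CHECKSUM_UNNECESSARY;

    printf("as received:      validate in software? %d\n",
           must_validate_in_software(as_received));       /* 1: corruption caught */
    printf("after buggy veth: validate in software? %d\n",
           must_validate_in_software(after_buggy_veth));  /* 0: corruption delivered */
    return 0;
}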

This code dates back to the first commit of the driver (commit e314dbdc1c0dc6a548ecf, “[NET]: Virtual ethernet device driver”). Commit 0b7967503dc97864f283a (“net/veth: Fix packet checksumming”), from December 2010, fixed this for packets that get created locally and sent to hardware devices, by not changing CHECKSUM_PARTIAL. However, the same issue still occurs for packets coming in from hardware devices.

The kernel patch is included below:

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 0ef4a5a..ba21d07 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -117,12 +117,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		kfree_skb(skb);
 		goto drop;
 	}
-	/* don't change ip_summed == CHECKSUM_PARTIAL, as that
-	 * will cause bad checksum on forwarded packets
-	 */
-	if (skb->ip_summed == CHECKSUM_NONE &&
-	    rcv->features & NETIF_F_RXCSUM)
-		skb->ip_summed = CHECKSUM_UNNECESSARY;
 
 	if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
 		struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);

Conclusions

I’m really impressed with the Linux netdev group and kernel maintainers in general: code reviews were quite prompt, our patch was merged within a few weeks, and it was backported to older (3.14+) -stable queues in various distributions (Canonical, SUSE) within a month. Given how widespread containerized environments are, it is actually really surprising that this bug existed for years without being detected. I wonder how many application crashes and other instances of unexpected behaviour could be attributed to this bug! Please reach out if you believe this has affected you.

Vijay Pandurangan
EIR @Benchmark. Formerly: Eng Director & NY Eng Site Lead @Twitter. Founder @MitroCo, TL/M @Google. www.vijayp.ca