Diagnosing networking issues in the Linux Kernel

Mixpanel Eng
Mar 26, 2015

A few weeks ago we started noticing a dramatic change in the pattern of network traffic hitting our tracking API servers in Washington DC. From a fairly stable daily pattern, we started seeing spikes of 300–400 Mbps, but our rate of legitimate traffic (events and people updates) was unchanged.

Pinning down the source of this spurious traffic was a top priority, as some of these spikes were pushing our upstream routers into a DDoS mitigation mode in which traffic was throttled.

Several good built-in Linux tools help in diagnosing networking issues.

  • ifconfig will show you your interfaces and how many packets are moving across them.
  • ethtool -S will show you more detailed information on packet flow, with counters for things like dropped packets at the NIC level.
  • iptables -L -v -n will show you the counts of packets being processed by your various firewall rules.
  • netstat -s will show you the values of a bunch of counters maintained by the kernel network stack, e.g. the number of ACKs, the number of retransmits, etc.
  • sysctl -a | grep net.ip will show you all your kernel network-related settings.
  • tcpdump will show you the content of the packets going back and forth.

The clue to our problem was in the output of netstat -s. Unfortunately, when you look at the output of this command, it can be hard to tell what the numbers mean, what they should be, and how they are changing. To see how they were changing, we created a small program that shows the numeric deltas between successive runs of a command, which let us see how fast the various counters were ticking. One of the output lines looked particularly worrying.

[Image: the worrying counter line from the command output]
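The delta tool mentioned above is not reproduced in this post. As a rough illustration only, here is a minimal Python sketch of the same idea. The names line_deltas and watch are hypothetical, and the sketch assumes the command prints the same lines in the same order on every run, so that lines can be paired by position.

```python
import re
import subprocess
import time

NUMBER = re.compile(r"\d+")


def line_deltas(previous, current):
    """Pair up the lines of two outputs and report how much each number grew."""
    changes = []
    for old_line, new_line in zip(previous.splitlines(), current.splitlines()):
        old_nums = [int(n) for n in NUMBER.findall(old_line)]
        new_nums = [int(n) for n in NUMBER.findall(new_line)]
        # Skip lines whose shape changed; we cannot pair their numbers safely.
        if len(old_nums) != len(new_nums):
            continue
        deltas = [new - old for old, new in zip(old_nums, new_nums)]
        # Only report lines where at least one counter moved.
        if any(deltas):
            changes.append((new_line.strip(), deltas))
    return changes


def watch(command, interval=1.0):
    """Run `command` repeatedly and print the per-interval counter deltas."""
    previous = subprocess.run(command, capture_output=True, text=True).stdout
    while True:
        time.sleep(interval)
        current = subprocess.run(command, capture_output=True, text=True).stdout
        for line, deltas in line_deltas(previous, current):
            print(f"{deltas}  {line}")
        print("-" * 40)
        previous = current


# Example (runs until interrupted): watch(["netstat", "-s"])
```

With a one-second interval, each printed delta is effectively a per-second rate, which makes a counter that is ticking far faster than normal easy to spot.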

The usual rate of this counter on an unaffected server of ours is more like 30–40 per second, so we knew something was wrong here. The counter suggested that we were rejecting a large number of packets because they had invalid TCP timestamp values. The short-term fix to quickly mitigate the issue was to turn off TCP timestamps with the following command:

sysctl -w net.ipv4.tcp_timestamps=0

This immediately stopped the packet storm. It isn’t a permanent solution, though, as TCP timestamps are useful for measuring round-trip time and for placing delayed packets at the right point in the stream. The latter matters on high-speed connections, where TCP sequence numbers can wrap around in a matter of seconds. For more information on TCP timestamps and performance, take a look at RFC 1323 (since superseded by RFC 7323).
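To put a number on the wrap-around claim, here is a small back-of-the-envelope calculation in Python. It assumes a single connection saturating the link, which is a simplification; seconds_to_wrap is just an illustrative helper name.

```python
# TCP sequence numbers are 32 bits wide and count bytes.
SEQUENCE_SPACE_BYTES = 2 ** 32


def seconds_to_wrap(link_bits_per_second):
    """Time for one connection to cycle through the whole sequence space,
    assuming it uses the full link rate (a simplifying assumption)."""
    bytes_per_second = link_bits_per_second / 8
    return SEQUENCE_SPACE_BYTES / bytes_per_second


for label, rate in [("100 Mbps", 100e6), ("1 Gbps", 1e9), ("10 Gbps", 10e9)]:
    print(f"{label}: sequence space wraps in about {seconds_to_wrap(rate):.1f} s")
```

At 1 Gbps the sequence space wraps in roughly 34 seconds, and at 10 Gbps in roughly 3.4 seconds, short enough that a delayed packet from the previous cycle could be mistaken for current data without a timestamp to tell them apart.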

At Mixpanel we generally run a tcpdump as well whenever we see abnormal traffic patterns, so that we can analyze the traffic afterward to try to determine a root cause. What we found was a huge number of TCP ACK packets being sent back and forth between our API server and a particular IP address. Effectively, our server was stuck in an infinite loop with another server, exchanging TCP ACK packets. Each host was continually acknowledging a TCP timestamp that the other end did not recognize as being valid.
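One way to spot this kind of loop in a capture is to count pure ACKs (no flags other than ACK, zero payload) per pair of endpoints. The sketch below is an illustration, not the analysis we actually ran: it assumes the one-line-per-packet text that tcpdump prints for IPv4 TCP when run with -nn, and count_pure_acks is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines such as:
#   15:04:05.000001 IP 192.0.2.10.443 > 198.51.100.7.51234: Flags [.], ack 1, win 229, length 0
# "Flags [.]" means only the ACK flag is set; "length 0" means no payload.
PURE_ACK = re.compile(
    r"IP (?P<src>\S+) > (?P<dst>\S+): Flags \[\.\],.* length 0$"
)


def count_pure_acks(lines):
    """Count pure ACK packets per conversation, ignoring direction."""
    counts = Counter()
    for line in lines:
        match = PURE_ACK.search(line.strip())
        if match:
            # Treat A > B and B > A as the same conversation.
            pair = tuple(sorted((match["src"], match["dst"])))
            counts[pair] += 1
    return counts


# Example: feed it the text of a saved capture (capture.pcap is a placeholder),
# e.g. the output of `tcpdump -nn -r capture.pcap`, and inspect
# count_pure_acks(lines).most_common(5)
```

A conversation in an ACK loop should dominate the top of that list, with a count far out of proportion to any data it actually carried.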

At this point we realized we were dealing with an issue that could only be solved in the Linux kernel TCP stack, so our CTO went to the linux-netdev mailing list to see if we could find a solution. Thankfully, the issue had been encountered before, and a fix was available. It turns out this type of packet storm can be initiated by faulty hardware or a third party changing the TCP SEQ, ACK, or timestamp values in a connection to the point where each host thinks the other is sending out-of-window packets. The way to keep this from turning into a packet storm is to limit the rate at which Linux sends duplicate ACK packets to only one or two per second. Here is a great explanation on the topic.
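The rate-limiting idea can be illustrated with a toy model. This is not the kernel code; it is a minimal Python sketch in which DupAckLimiter, simulate, and the 500 ms interval are all assumptions chosen to match the "one or two per second" figure above.

```python
class DupAckLimiter:
    """Allow at most one duplicate ACK reply per min_interval_ms milliseconds."""

    def __init__(self, min_interval_ms=500):
        self.min_interval_ms = min_interval_ms
        self.last_sent_ms = None

    def should_send(self, now_ms):
        # Suppress the reply if we already answered too recently.
        if (
            self.last_sent_ms is not None
            and now_ms - self.last_sent_ms < self.min_interval_ms
        ):
            return False
        self.last_sent_ms = now_ms
        return True


def simulate(arrival_times_ms, limiter):
    """Count how many out-of-window packets actually get a duplicate ACK reply."""
    return sum(1 for t in arrival_times_ms if limiter.should_send(t))


# One out-of-window packet per millisecond for a full second.
storm = range(1000)
print(simulate(storm, DupAckLimiter()))  # 2 replies instead of 1000
```

Because each suppressed reply is one the other host never gets to answer, the loop starves itself: the exchange drops from a flood to a trickle rather than sustaining itself at line rate.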

We were able to take this patch and backport it to the current Ubuntu (Trusty) kernel that we use. Thankfully, Ubuntu makes this pretty simple: recompiling the patched kernel was just a matter of running the following commands, installing the resulting .deb file, and rebooting.

# Get the kernel source and build dependencies
apt-get build-dep linux-image-3.13.0-45-generic
apt-get source linux-image-3.13.0-45-generic
# Apply the patch file.
cd linux-lts-trusty-3.13.0/
patch -p1 < Mitigate-TCP-ACK-Loops.patch
# Build the kernel
fakeroot ./debian/rules clean
fakeroot ./debian/rules binary-headers binary-generic

Originally published at https://engineering.mixpanel.com on March 26, 2015.