Case study: Network bottlenecks on a Linux server: Part 1 — The NIC
Not long ago I had to track down why websocket connection attempts from our clients to one of our services started being refused soon after our peak time window began. I’m not a system administrator nor a networking expert, but I’ve been using Linux systems and administering smaller Linux servers for a long time, so at least I knew what to investigate and which topics I had to read up on. The following is a case study of what was investigated, and which actions were taken and why.
The series is divided into (likely) these four parts:
- Part 1: The NIC (this article)
- Part 2: The Kernel
- Part 3: Interrupts
- Part 4: Going further
During peak time, we could see that the number of open websockets reached a sort of “cliff,” where no new connections were being accepted, and the number of connections rapidly dropped:
We noticed that we were getting a buildup of connections in the SYN-RECV state right before this cliff happened. The image below illustrates it, though the timestamps of this graph are not from the same period as the graph above:
A correlation was obvious, since the SYN-RECV buildup consistently started about 30 minutes before the connection drop-off. We now had the first clue about where to start looking.
SYN-RECV is a state for a TCP connection. TCP is a session-based protocol (compared to UDP, which is fire-and-forget), so starting a TCP session requires a 3-way handshake:
- Client sends a SYN packet,
- Server responds with a SYN+ACK packet,
- Client responds with an ACK packet,
- The TCP connection enters the “established” state.
For a more in-depth explanation of TCP states, the book TCP/IP Illustrated is a very good source.
A TCP handshake looks something like the following:
If the server is too busy (more on this later), it will usually (on Linux systems) delay sending the SYN-ACK response until it has a guarantee from the kernel that the connection can be handled. After the SYN is received, and until the handshake completes, the connection is in the SYN-RECV state. The current number of connections in the SYN-RECV state can be checked like this:
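The original command isn’t shown here; on modern Linux systems one way to count SYN-RECV connections is with `ss` from iproute2 (`netstat -ant | grep SYN_RECV` works similarly on older setups):

```shell
# Count sockets currently in the SYN-RECV state.
# `-n` skips name resolution; the state filter is built into ss.
ss -n state syn-recv | tail -n +2 | wc -l
```

A healthy, lightly loaded server will usually print 0 here.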
In most cases, you want this count to be 0; otherwise it likely means there is a bottleneck somewhere, and connections are coming in faster than you can handle them.
This number is also what will skyrocket if someone is performing an attack called a “SYN flood,” which basically means an attacker sends many SYN packets but never sends the final ACK packet. The server will keep all those connections in a waiting state, and before long they will occupy all available resources, so legitimate connections can’t be initiated:
More information about SYN floods can be found here.
Knowing what the SYN-RECV buildup is, we’ll continue with how to monitor and tweak it.
NIC Ring Buffer
The network card is the first responder to an incoming packet. The network card is also called the NIC, an acronym for Network Interface Card. The NIC has a small amount of memory on it that is accessed via DMA (Direct Memory Access), which is where incoming packets are stored until the kernel can copy them into the system. This temporary storage for packets is called the NIC Ring Buffer, or DMA Ring Buffer. It’s a circular buffer, so if it fills up, packets will be dropped (a bit more about this later; some garbage collection happens first to try to avoid dropping them). Visually it looks like this:
The size of the ring buffer is often something the kernel driver can adjust, if the NIC firmware allows it.
In our case, we started by looking at the size of the ring buffer. It might have been that during peak time, new packets were coming in faster than the application could handle them. Theoretically it should have been able to keep up, since it’s far from being a network-intensive application. So, we wanted to increase the ring buffer size. First we checked the current values:
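The command output isn’t reproduced here; ring buffer sizes are typically inspected with `ethtool` (the interface name `eth0` is an assumption, substitute your own):

```shell
# Show the pre-set maximums and current ring buffer sizes for eth0.
ethtool -g eth0
```

The output lists the driver’s “Pre-set maximums” first, then the “Current hardware settings” for the RX and TX rings.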
“Hm, that seems odd,” was my initial thought. Most of the examples and articles I had found while reading up on ring buffers pointed towards 4096 being the average, and a more expensive card was usually expected to have a max size limit of 8192. How come this card, which sits in a blade server and apparently costs over $200 USD each (we have four identical NICs in these machines), has such a low max value? Even my cheap workstation and our lab servers support 4096.
Before changing the limit I searched around for an explanation of these “pre-set maximums” and where the restriction came from. Some resources mentioned that it is possible to exceed the maximums if the firmware allows it. It also turns out that the limits printed in the console come from the driver in the kernel; so let’s start digging in Torvalds’ GitHub! First, find the name of the driver being used:
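The command isn’t shown here; one common way to find the kernel driver behind an interface is `ethtool -i` (interface name assumed):

```shell
# Print the driver name, version, and firmware info for eth0.
ethtool -i eth0
```

The first line of the output (`driver: ...`) is the name to search for in the kernel source tree.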
So now we knew the driver was called “tg3,” and the source code is on GitHub, where the following comment is written where the limits are set:
/* These numbers seem to be hard coded in the NIC firmware somehow.
 * You can't change the ring sizes, but you can change where you place
 * them in the NIC onboard memory.
 */
#define TG3_RX_STD_RING_SIZE(tp) \
	(tg3_flag(tp, LRG_PROD_RING_CAP) ? \
	 TG3_RX_STD_MAX_SIZE_5717 : TG3_RX_STD_MAX_SIZE_5700)
Oh well, then just set the max value:
This still did not solve the issue, so in the next post I’ll continue our case study by investigating how packets are moved from the NIC to the kernel, and how to monitor and tweak that.
Thanks for reading, and feel free to comment and ask questions!