Schrödinger’s Packet Drops
Or How My Hypothetical Cat Ate My Very Real Packets
100GbE Speed Bumps
The demand for bandwidth is pushing network software vendors like Niometrics to support 100GbE Network Interface Controllers (NICs). Our network probe uses the Data Plane Development Kit (DPDK) to bypass the Linux kernel and deliver traffic directly to user space, eliminating the overhead of NIC interrupts and enabling high-bandwidth processing rates. The 100GbE NICs, however, brought new challenges. One of the issues we faced was that Interprocessor Interrupts (IPIs) were causing sporadic packet drops at lower bandwidths with no apparent congestion. In the first part of this post we briefly discuss the different types of interrupts, the interrupt handling hardware and the memory management unit. In the second part we offer a glimpse into troubleshooting, debugging and fixing issues caused by interrupt handling in multi-socket systems.
Before we start, a description of the hardware and software environment involved in this article:
- HPE ProLiant DL560 Gen10 Server with 4x Intel(R) Xeon(R) Platinum 8180M CPU @2.50GHz
- 6 TB of memory (1 TB reserved for huge pages via the hugepagesz=1G and hugepages=1024 kernel command line arguments, i.e. 256 1G huge pages per socket)
- 2x Mellanox Technologies 100GbE single-port QSFP28 ConnectX®-5 EN network interface cards
- CentOS 7 with kernel 3.10.0-957.27.2.el7.x86_64
- DPDK 19.11.0
- All critical cores are isolated and made (almost) jitter-free using the isolcpus, nohz_full, rcu_nocbs and rcu_nocb_poll kernel command line parameters
The Problem With Interrupts
In a network probe using DPDK, packet reception is handled by RX threads in polling mode. These threads constantly poll the NIC RX rings for packets in a tight loop. Interrupts “interrupt” the execution of user space tasks by forcing a context switch to kernel space to run the interrupt handler. The resulting jitter is too small to affect average throughput, but at 100GbE even a short stall is enough to overflow the RX rings and drop packets. Increasing the NIC RX ring size (or the number of RX rings) can mitigate this, but not always.
For the sake of clarity we divide interrupts into the following types:
- External I/O interrupts
- Local Timer Interrupts (LOC)
- Interprocessor interrupts (IPI)
Before we describe how each type affects a high performance system, let’s take a look under the hood of the APIC.
Crash Course in APIC Architecture
APIC stands for Advanced Programmable Interrupt Controller. In x86-64 systems, interrupts are handled by:
- A Local APIC (per CPU)
- An External I/O APIC
Local APIC (per CPU)
- The per CPU local APIC receives interrupts and sends these to the processor core for handling. Interrupts can come from the processor’s interrupt pins, internal sources and an external I/O APIC (or other external interrupt controller).
- In Symmetric Multiprocessing (SMP) systems, the local APIC sends and receives IPI messages to and from other logical processors on the system bus. IPI messages can be used to distribute interrupts among the processors in the system or to execute system-wide functions (such as booting up processors or distributing work among a group of processors).
External I/O APIC
- The I/O APIC is part of Intel’s system chip set which receives external interrupt events from the system and its associated I/O devices and relays them to the local APIC as interrupt messages.
- In SMP systems, the I/O APIC also provides a mechanism for distributing external interrupts to the local APICs of selected processors or groups of processors on the system bus.
External I/O Interrupts
We use the irqbalance daemon to move all external interrupt handling away from the isolated cores participating in packet processing, onto a set of cores reserved for interrupt handling. This ensures external interrupts do not cause jitter for tasks on the isolated cores. It also ensures IRQ affinity is assigned to cores “closest” to the NUMA node of the device. The IRQBALANCE_BANNED_CPUS environment variable must be set to the mask of the isolated cores to ensure no interrupts are assigned to them. Following are some examples to illustrate this.
NIC eno1 uses vectors 188–196.
NIC eno1 interrupts are handled by CPU cores 0 and 112.
Local Timer Interrupts
Local Timer Interrupts appear as the LOC line in /proc/interrupts.
Prior to kernel 2.6.21, the timer tick ran on every core at the rate of CONFIG_HZ (by default, 1000/sec).
The tickless kernel (2.6.21+) disables the timer tick on idle cores: the scheduling-clock interrupt exists to force a busy core to switch between multiple runnable tasks, and an idle core has no tasks to schedule, so it does not need the interrupt.
The nohz_full kernel command line parameter (introduced in kernel 3.9) is an optimisation on top of the tickless kernel that extends the tickless behaviour to cores which have only one running task. There is still a need to schedule a tick every second for process management operations like calculating core load, maintaining the scheduler load average, etc.
The nohz_full option is used for cores dedicated to threads performing packet processing functions.
Interprocessor Interrupts
Interprocessor interrupts allow a CPU to send interrupt signals to any other CPU in the system. They are defined by the BUILD_INTERRUPT macro in arch/x86/include/asm/entry_arch.h. For SMP kernels, “smp_” is prepended to the handler name, e.g. for BUILD_INTERRUPT(call_function_interrupt, CALL_FUNCTION_VECTOR) the call function interrupt vector is handled by smp_call_function_interrupt(). An IPI is raised by calling one of the functions pointed to by the send_IPI_all(), send_IPI_allbutself(), send_IPI_self(), send_IPI_mask_allbutself() and send_IPI_mask() function pointers.
Common types of IPI:
- The Call Function (Single) Interrupt (CAL)
- TLB shootdown Interrupts (TLB)
Call Function (Single) Interrupt
These are accounted in the CAL row of /proc/interrupts. They are used by perf, trace, kvm and others.
TLB Shootdown Interrupts
The TLB shootdown is a special type of CAL interrupt which has its own counter: the TLB line in /proc/interrupts. The CAL counter does not include TLB shootdowns.
TLB shootdown interrupts exist to support paging on multiprocessor systems, so to understand their purpose we need to briefly discuss how paging works.
Paging in x86-64
There exist three kinds of addresses in x86-64:
- Logical addresses — Included in machine language instructions; each consists of a segment and an offset.
- Linear addresses (virtual addresses) — A single 48-bit unsigned integer that can be used to address up to 256TB.
- Physical addresses (48-bit unsigned integers) — Used to address memory cells in memory chips.
The paging unit of the MMU (Memory Management Unit) converts linear addresses to physical addresses. The diagrams below illustrate the linear address translation to a 4KB page vs a 1GB page.
In brief, translating a linear address to a 4KB physical page involves accessing 4 paging structures in memory (PML4E, PDPTE, PDE and PTE), while a 1GB page translation accesses only 2 (PML4E and PDPTE). This is why a TLB miss on a 1GB huge page is cheaper to resolve than one on a 4KB page, and why DPDK recommends 1GB huge pages.
Translation Lookaside Buffers
Translation Lookaside Buffers (TLB) are caches used to speed up linear address translation. When a linear address is used for the first time, the corresponding physical address is computed through slow accesses to the paging tables in main memory. The physical address is then stored in a TLB entry so that further references to the same linear address can be quickly translated. In a multiprocessor system, each CPU has its own local TLB. Contrary to the hardware cache, the corresponding entries of the TLB need not be synchronised, because processes running on different CPUs may associate the same linear address with different physical ones.
A TLB flush occurs when switching between two processes on the same CPU. It does not occur when the two processes share the same page tables (e.g. threads of the same process), nor when switching from a regular process to a kernel thread.
On multiprocessor systems, when flushing the TLB on a CPU, the kernel must also flush the same TLB entry on the CPUs that are using the same set of page tables. This is done by TLB shootdown interrupts.
We now retrace the steps we went through while troubleshooting our interrupt-related performance problems.
Step 1 — Find Interrupts Delivered to Cores Running Critical Threads
We had to find the interrupts delivered to the cores running the critical RX threads. This can be done by monitoring the per-core interrupt counters: a simple script polling /proc/interrupts once a second and printing the deltas is enough.
From the output of the script, we see that LOC is delivered once every second on a nohz_full core running a single user thread. This is expected. The TLB interrupts being delivered (6 interrupts/second) are the ones that need to be debugged further.
Step 2 — Tracing the Kernel
We need to trace the kernel to figure out:
- What handles the interrupt
- What raises it
What Handles the Interrupt?
To determine the CPU+kernel function which processes the interrupt, we have to figure out where the interrupt counter is incremented. The interrupt counter symbols can be found in arch/x86/kernel/irq.c: arch_show_interrupts(). The handler is the function which increments the counter using inc_irq_stat(). In some cases there is only one interrupt handler, e.g. the TLB shootdown handler is flush_tlb_func(), called by the interrupt handler of CAL. In other cases we need to look further up the call stack to figure out what work is being done in the interrupt, e.g. CAL interrupts can call one of many different functions.
To see the cost of the interrupt handlers on the core, we can use trace-cmd as follows.
From the above output we can see that every second we get 7 context switches, 6 of them for the TLB shootdown IPI handler, together taking roughly 30–35 microseconds away from the critical RX thread. NOTE: the TLB IPI is a form of CAL IPI, hence its handler starts from smp_call_function_interrupt().
What Raised the Interrupt?
Of the three possible sources of interrupts:
- The timer period, in the case of LOC: 1/second on an isolated, nohz_full core executing only one thread.
- An I/O request, in the case of I/O interrupts: we shouldn't see these thanks to irqbalance.
- A CPU+kernel function raising the interrupt, in the case of IPIs.
We are interested in the third one: look up the symbol of the interrupt vector and find the function that raises it via a call to one of the send_IPI_* functions.
From the above output, we can see that the munmap(2) system call is responsible for the TLB shootdown interrupt.
Step 3 — Debug Code to Find Out What the Thread or Task Responsible for the Interrupt Is Doing
From the above output, the DPDK call for getting Mellanox NIC stats (using FILE * operations) is responsible for the TLB shootdown interrupt!
Schrödinger’s Packet Drops
From the debug output above we discovered that munmap(2) was being called by fclose(3) while reading the NIC's out_of_buffer counter, which backs the MLX5 PMD's imiss statistic. The sporadic packet drops were caused by reading the packet drop counter itself. To complicate matters further, the resulting drops were recorded by a different counter, rx_discards_phy. Like Schrödinger's famous thought experiment, the drops did not exist until the drop counters were read.
The fscanf(3) call leads to the file being mmap'ed. The subsequent fclose(3) causes munmap(2) to be called, which releases the linear address associated with the physical address of the file buffer by means of a TLB flush on the core calling fclose(3), and sends TLB shootdowns to all cores executing threads of the same process, including the critical RX threads. The fix was to replace the file stream operations with open(2)/read(2)/close(2) calls. The bug and its fix were reported to DPDK and Mellanox and the patch was accepted.
Lessons Learned
- Avoid using non-hugepage memory.
- Avoid calls to mmap(2)/munmap(2)/madvise(2)/mprotect(2) with non-hugepages.
- Avoid C library functions which call any of the above system calls or figure out a workaround to continue using those C library calls. E.g. FILE * C library calls can be used with user buffer set using setvbuf(3).
- The impact of TLB shootdowns increases with the number of cores and sockets, because the initiator core of the shootdown has to wait for acknowledgment from all the cores to which it sent the IPI. We noticed fewer packet drops due to TLB shootdowns on a 2-socket machine with half the cores of the setup described in this post.
- “Who watches the watchmen?” Be careful of your measurement tools and code introducing more interrupts and jitter to critical threads. E.g. Tools like trace-cmd and perf themselves use IPIs for tracing and measurement.