Trade-off Between Features & Performance

Mode · 10 min read · Jul 23, 2019

At Mode.net, we use Linux as part of our dataplane for packet processing. Our customers expect private network performance delivered with Internet affordability. Efficiency is key, realized in part via optimal use of our underlying system’s computing resources. Picking the right tool for the right job enables us to provide seamless services to our customers.

Bandwidth is a critical, limited resource, so it's important to make sure it gets allocated carefully. To do this, we need to be able to enforce tenant-specific traffic profiles (e.g. bandwidth sharing, rate limiting, early drops, etc.). We rely heavily on Linux TC (traffic control) to implement such requirements. Since Linux TC comes with a plethora of algorithms (aka tools) for solving such requirements, it's important to understand how each of them works and impacts the underlying system.

In this article we offer examples of two services to highlight when and how we strike that features/performance balance. Both of these services have their own set of traffic control requirements. First, we will look at how we use Linux TC’s HTB algorithm to solve each of these scenarios. Second, we will explain why the fully-featured HTB is not ideal for all use cases. Finally, we will explore the use of the TC policing algorithm to solve a subset of these requirements, improving overall performance.

Problem Statement

Imagine a Service A that is accessible via Network N, which can only be reached through its Gateway Router G. In order to provide this service, Gateway Router G must support the following requirements:

  1. Clients should be able to specify their guaranteed service rate (committed rate).
  2. Clients should be able to use any available bandwidth (i.e. bandwidth that is not being used by others).
Illustration of a network's gateway with its service link capacity.

Let’s assume we have clients A-D who want to join with the following criteria:

  1. Client A requires a min of 5Gbps, but can use extra if available.
  2. Client B requires a min of 2Gbps, but can use extra if available.
  3. Client C requires a min of 4Gbps, but can use extra if available.
  4. Client D requires a min of 3Gbps, but can use extra if available.
Illustration showing the link capacity of a multi-tenant gateway.

Problem Analysis

We begin by dissecting Gateway Router G’s requirements to better understand how to solve them.

  1. Clients should be able to specify their guaranteed service rate (committed rate). This requirement states that whenever a client wants to access Service A, it should be guaranteed its minimum required bandwidth, irrespective of bandwidth utilization by other clients. In other words, we implement a rate limit policy per client.
  2. Clients should be able to use any available bandwidth (i.e. bandwidth that is not being used by others). This requirement states that if there is any unused bandwidth (e.g. clients are not using their committed bandwidth), other clients can use that unused bandwidth. In other words, we implement a bandwidth sharing policy among the clients.

Linux TC provides one such tool (algorithm) that satisfies both of these requirements (rate limiting and bandwidth sharing): Hierarchical Token Bucket (HTB).

Solution Overview

Now let’s take a high-level look at Hierarchical Token Bucket (HTB) and see how it can help with our scenario.

HTB allows us to shape our outgoing traffic in the Linux kernel. It does this by keeping a set of tokens in a bucket that is refilled at the configured rate. These tokens are then used to transmit packets. In other words, when the Linux kernel needs to transmit a packet, it consumes a number of tokens based on the size of the packet. If there are enough tokens available, the packet can be sent (i.e. placed on the physical interface to be transmitted). If there aren't enough tokens, the packet waits until the bucket contains the required number.

Ok, so how does this help fulfill our requirements of rate limiting and bandwidth sharing?

Let's say we define a bucket with enough tokens to transmit packets at line rate (i.e. 40Gbps). When Client A decides to send traffic, we create a policy in Linux TC that reserves at least 5Gbps worth of tokens for Client A's traffic. We do the same for the other clients. These policies essentially define the rate limit (committed rate) for each client's traffic.

Tokens left over in the bucket after satisfying these committed rates can be shared across all clients that have requested additional bandwidth, thus implementing bandwidth sharing. In our example, the committed rates add up to 14Gbps (5 + 2 + 4 + 3), leaving up to 26Gbps of headroom that can be shared.

Now we know HTB can help us, but we also have a small problem. HTB’s description says it works only on outgoing traffic, and in this case we are trying to shape incoming traffic. We need a way to apply HTB to incoming traffic to be able to satisfy our requirements.

For exactly this reason, Linux provides a virtual interface construct called the Intermediate Functional Block (IFB), which can be used to queue incoming traffic for shaping purposes.

So, to make use of HTB on incoming traffic, we need to redirect all incoming traffic to an IFB device (e.g. ifb0) and apply HTB as traffic leaves ifb0 (since at that point it is outgoing traffic).

Illustration of TC HTB shaping in the Linux kernel, from the NIC through the TC subsystem.
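
Note: the steps below assume that the IFB device and a root HTB qdisc already exist. Here is a minimal setup sketch for that prerequisite; the interface names, handle numbers, and the "default 1" choice are assumptions picked to match the example commands that follow:

# Load the IFB module and bring the virtual device up.
modprobe ifb numifbs=1
ip link set dev ifb0 up

# Attach an ingress qdisc to eth0 so that incoming traffic can be redirected.
tc qdisc add dev eth0 handle ffff: ingress

# Create the root HTB qdisc on ifb0; the classes below (1:1, 1:101, ...) hang off of it.
# "default 1" sends otherwise-unclassified traffic to class 1:1 (an assumption, adjust as needed).
tc qdisc add dev ifb0 root handle 1: htb default 1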

Now that we have all the pieces to configure our traffic profiles, here is how the high-level implementation will look:

  1. Forward all incoming traffic to ifb0.

E.g.:

tc filter add dev eth0 parent ffff: protocol ip u32
match ip src 0.0.0.0/0 match ip dst 0.0.0.0/0 flowid 1:1
action mirred egress redirect dev ifb0

2. Create a parent policy to limit max rate for all of the traffic.

E.g.:

tc class add dev ifb0 parent 1:0 classid 1:1 htb rate 40gbit
burst 4gbit

3. Create a child policy per client specifying their respective guaranteed rate.

E.g.:

tc class add dev ifb0 parent 1:1 classid 1:101 htb rate 5gbit
ceil 40gbit burst 800mbit cburst 1gbit

4. Create respective filters to direct each client’s traffic to their own child policy so that the traffic can be enforced appropriately.

E.g.:

tc filter add dev ifb0 parent 1:0 protocol ip prio 1 u32
match ip src <ClientA_ip> match ip dst <ServiceA_ip> flowid 1:101
Illustration of the resulting HTB tree.

With these in place, Gateway Router G is now able to enforce rate limiting and bandwidth sharing for its incoming traffic, thus meeting all its requirements and keeping all its clients happy! Every client is guaranteed its requested bandwidth and can use more bandwidth when it is available.
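
Once these commands are in place, the resulting hierarchy can be sanity-checked with standard tc show commands (purely for verification; this is not part of the configuration itself):

# Inspect the qdisc, class tree, and filters installed on the IFB device.
tc qdisc show dev ifb0
tc class show dev ifb0
tc filter show dev ifb0 parent 1: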

Problem Statement Revisited

Still, there are a number of lingering questions. Is this setup satisfactory in terms of performance? What happens if one of the clients has a strong requirement for high throughput? Let's look into these.

Say we introduce a Service B in the network, and Service B is more resource intensive.

In order to provide this service, the Gateway Router G should support the following requirement:

  1. Clients should be able to specify their guaranteed service rate (committed rate).

Assume we have clients E-G who want to join with the following criteria:

  1. Client E requires a min of 8Gbps.
  2. Client F requires a min of 6Gbps.
  3. Client G requires a min of 5Gbps.

These requirements look similar to the rate limit policy we saw earlier with Service A. That’s great, since we can use HTB for this as well!

Performance Analysis

Even though HTB works for both of these services, it comes with a significant performance tax. For instance, with 40G of incoming traffic, the maximum throughput with the HTB implementation is about 6Gbps.

In order to understand what's causing this limitation, we need to look under the hood (i.e. into the Linux kernel). For that we can use perf, the Linux performance analysis tool built on the kernel's perf_events subsystem.

Packet stats at incoming (eth0) and outgoing (eth1) interfaces:

mode@server:~$ sudo ./pkts_stats.sh eth0
rx eth0: 1352284 pkts/s tx eth0: 0 pkts/s
rx eth0: 1316514 pkts/s tx eth0: 0 pkts/s
rx eth0: 1300302 pkts/s tx eth0: 0 pkts/s
rx eth0: 1308288 pkts/s tx eth0: 0 pkts/s
rx eth0: 1322538 pkts/s tx eth0: 0 pkts/s
rx eth0: 1341731 pkts/s tx eth0: 0 pkts/s
rx eth0: 1328913 pkts/s tx eth0: 0 pkts/s
mode@server:~$ sudo ./pkts_stats.sh eth1
rx eth1: 0 pkts/s tx eth1: 553430 pkts/s
rx eth1: 0 pkts/s tx eth1: 555665 pkts/s
rx eth1: 0 pkts/s tx eth1: 552845 pkts/s
rx eth1: 0 pkts/s tx eth1: 559890 pkts/s
rx eth1: 0 pkts/s tx eth1: 559707 pkts/s
rx eth1: 0 pkts/s tx eth1: 558648 pkts/s
rx eth1: 0 pkts/s tx eth1: 557958 pkts/s

P.S. pkts_stats.sh essentially reads rx_packets and tx_packets from /sys/class/net/{interface}/statistics
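
For reference, here is a minimal sketch of what such a script could look like. The actual pkts_stats.sh is not shown in this article, so treat this as an assumption: it simply diffs those counters once per second.

#!/bin/bash
# Hypothetical re-creation of pkts_stats.sh: prints per-second rx/tx packet
# rates for the given interface by diffing the kernel's interface counters.
IFACE="$1"
STATS="/sys/class/net/${IFACE}/statistics"

prev_rx=$(cat "${STATS}/rx_packets")
prev_tx=$(cat "${STATS}/tx_packets")

while true; do
    sleep 1
    rx=$(cat "${STATS}/rx_packets")
    tx=$(cat "${STATS}/tx_packets")
    echo "rx ${IFACE}: $((rx - prev_rx)) pkts/s tx ${IFACE}: $((tx - prev_tx)) pkts/s"
    prev_rx=$rx
    prev_tx=$tx
done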

Perf’s output:

$ sudo perf top
Samples: 2K of event 'cpu-clock', Event count (approx): 561312500
Overhead  Shared Object  Symbol
  68.75%  [kernel]       [k] native_queued_spin_lock_slowpath
  12.33%  [kernel]       [k] _raw_spin_lock
   5.12%  [kernel]       [k] mlx4_en_process_rx_cq

Whoa! That's a lot of CPU cycles being spent on spinlocks.

Why is spinlock topping the charts here? There must be a single point of contention that all the other threads are waiting on.

To find out what’s going on, let’s revisit HTB’s algorithm. The following visual should help:

Illustration of HTB’s algorithm.

As per our configuration, this is how our HTB tree looks. The root class defines the max rate and ceiling. The leaf classes represent the traffic profile per client.

In order to support the bandwidth sharing feature, HTB implements a concept called borrowing.

Let's assume Client C has exhausted all of its configured bandwidth (defined by rate) but still has more data to transmit (bounded by its ceil rate). Client C can then request more tokens from the root. Clients with unused tokens lend them to the root, and the root makes those tokens available for Client C to borrow. To do this, the root needs to take a lock on the requested resources (tokens) to ensure their availability. In this case, the root class acts as the sole arbiter of every token transaction, a single point of contention, and this token sharing makes the operation very expensive!

P.S. As a packet enters and leaves the TC subsystem, there are a few more places where it hits spinlocks, which adds further to the performance degradation.
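
One way to observe this borrowing in practice is through HTB's per-class statistics, which include counters for how often each class lent and borrowed tokens (exact field names vary with the iproute2 version):

# Dump per-class HTB statistics (rates, lended/borrowed counters, tokens) on the IFB device.
tc -s class show dev ifb0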

How do we solve this?

Solution Revisited

If we examine our requirements closely, we find that even though they look very similar (i.e. both services have rate limiting requirements), there is a key difference: Service B's clients do not need the bandwidth sharing that Service A's clients require. If we can decouple the traffic control implementation of these two services, we may be able to avoid some of the resource contention for selective traffic and achieve better performance.

It turns out we can, by using another popular rate limiting mechanism called policing for Service B's clients. Unlike shaping (the cornerstone of the HTB algorithm), policing drops excess traffic whenever the traffic rate exceeds the configured maximum, thus avoiding any costly token sharing operations. Even better, TC policing can be applied directly to ingress traffic, which means there is no need to mirror Service B's incoming traffic to ifb0.

Let’s look at how to configure TC policing for Service B.

  1. Forward ONLY Service A’s traffic to ifb0.

E.g.:

tc filter add dev eth0 parent ffff: protocol ip u32
match ip src 0.0.0.0/0 match ip dst <ServiceA_ip> flowid 1:1
action mirred egress redirect dev ifb0

2. Remove Service B's clients from HTB. This entails removing the TC classes and TC filters for Clients E, F, and G.

3. Create respective policing filters to enforce the desired traffic profile on each client's traffic.

E.g.:

tc filter add dev eth0 parent ffff: protocol ip u32
match ip src <ClientE_ip> match ip dst <ServiceB_ip>
police rate 8gbit burst 250k drop
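
To confirm that the policer is attached and to see how much traffic it is dropping, the ingress filter statistics can be inspected (a quick check, not part of the configuration itself):

# Show the ingress filters on eth0 with statistics; police actions report packets sent and dropped.
tc -s filter show dev eth0 parent ffff: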

Packet stats after the change:

mode@server:~$ sudo ./pkts_stats.sh eth0
rx eth0: 1855971 pkts/s tx eth0: 0 pkts/s
rx eth0: 1854922 pkts/s tx eth0: 0 pkts/s
rx eth0: 1858721 pkts/s tx eth0: 0 pkts/s
rx eth0: 1854601 pkts/s tx eth0: 0 pkts/s
rx eth0: 1855566 pkts/s tx eth0: 0 pkts/s
rx eth0: 1850848 pkts/s tx eth0: 0 pkts/s
rx eth0: 1857871 pkts/s tx eth0: 0 pkts/s
mode@server:~$ sudo ./pkts_stats.sh eth1
rx eth1: 0 pkts/s tx eth1: 1727129 pkts/s
rx eth1: 0 pkts/s tx eth1: 1722336 pkts/s
rx eth1: 0 pkts/s tx eth1: 1720049 pkts/s
rx eth1: 0 pkts/s tx eth1: 1725564 pkts/s
rx eth1: 0 pkts/s tx eth1: 1727829 pkts/s
rx eth1: 0 pkts/s tx eth1: 1726499 pkts/s
rx eth1: 0 pkts/s tx eth1: 1727806 pkts/s

Perf’s output after the change:

$ sudo perf top
Samples: 2K of event 'cpu-clock', Event count (approx): 601623640
Overhead  Shared Object  Symbol
  22.14%  [kernel]       [k] native_queued_spin_lock_slowpath
   7.89%  [kernel]       [k] _raw_spin_lock
   5.19%  [kernel]       [k] mlx4_en_process_rx_cq

Much better! Now we see about 19Gbps of throughput, roughly a 3x increase!

Concluding remarks

As the networking industry continues its transition from dedicated, specialized packet processing hardware to general-purpose commodity hardware, it is more critical than ever to maintain a balance between features and performance.

We must work within the reality that computing resources are limited per server (i.e. there are only so many CPU cycles available per server) and that each additional dataplane feature (e.g. traffic control, NAT, etc.) consumes more of them. This inevitably leads to resource contention among features, starving them of cycles and hurting overall system performance. Careful planning is vital to achieving optimal performance when designing these next-generation solutions.

About the Author:
Ari Saha is a Staff Software Engineer at Mode.net with experience working on technologies such as networks and systems engineering, cloud/virtualization, Linux kernel development and machine learning.

Mode

Mode partners with leading SD-WAN providers to keep cloud-era businesses always-on and connected through Mode Core — a private global network as a service.