Forwarding over 100 Mpps with FD.io VPP on x86

Federico Iezzi
Google Cloud - Community
19 min read · Apr 21, 2024

Behind the curtain of every website load and social media feed lies a vast, unseen network of Ethernet packets. These tiny data carriers, generated in staggering quantities, form the backbone of our online experiences. It’s not just human-driven; machine-to-machine communication contributes significantly to this digital symphony. Network devices, often overlooked, are the unacknowledged heroes, managing this traffic at its core. Today’s enterprise-grade switches handle terabits per second, but the cutting edge is far beyond. Google’s Jupiter fabric, for instance, was already managing petabit-scale networks back in 2013, showcasing the incredible pace of technological advancement.

1.3Pb/s cluster in 2013?! That’s 🤯

While it’s fascinating to imagine the evolution of network technologies, this Medium series will focus on the present, exploring the powerful packet processing capabilities available on GCP today. We’ll dive into technologies like DPDK, gVNIC, and the Intel IPU E2000 (Titanium) that enable high-performance networking in the cloud.

· A Bit of History
· What’s NFV and DPDK?
· How about FD.io VPP?!
· The Technology Enabling DPDK on GCP
· The Technology Enabling DPDK on Linux
​ ​ ​​​ ​ ​​​ ​ ∘ The Role of isolcpus Kernel Argument
· Our Objectives
· Testing Topology
​ ​ ​​​ ​ ​​​ ​ ∘ 1st testing topology — TestPMD and VPP
​ ​ ​​​ ​ ​​​ ​ ∘ 2nd testing topology — Pktgen and VPP
​ ​ ​​​ ​ ​​​ ​ ∘ 3rd testing topology — VPC Peering and TestPMD
· The GCP Platform
​ ​ ​​​ ​ ​​​ ​ ∘ Traffic Generator and Network Receiver
​ ​ ​​​ ​ ​​​ ​ ∘ FD.io VPP
· VPP Performance on the smallest GCP C3 Machine
· Pushing Beyond 100 Mpps
· Introducing AMD Genoa to FD.io VPP
· Pktgen-DPDK vs. VPP
· Sending and Receiving without VPP
· Conclusions and Next Steps

A Bit of History

Before my Google journey, I spent nearly a decade as an NFV Architect (first at HPE, then Red Hat), immersed in the world of 4G and 5G telco mobile network modernization. One challenge dominated those years: how to maximize packet processing rates with minimal x86 resources, all while keeping packet loss near zero.

Back then, the CPU landscape was vastly different. AMD’s Zen architecture hadn’t yet sparked the multi-core revolution, so systems were typically limited to dual 18-core CPUs. Anything beyond that was either painfully slow or prohibitively expensive. This scarcity of cores necessitated innovative software approaches.

AMD and Intel 2P Server Core Count Growth 2010–2022

Enter DPDK (Data Plane Development Kit) and strong system partitioning, both built on Linux. DPDK bypassed the kernel’s network stack for direct hardware access, boosting packet processing speeds. Strong partitioning, meanwhile, isolated CPU cores from noisy neighbors.

These technologies, born out of necessity, played a pivotal role in shaping the modern telco landscape, paving the way for the high-performance networks we rely on today.

My NFV Telco Lab in Amsterdam — many remarkable memories

What’s NFV and DPDK?

In the early days of 2010, Intel made a significant move by open-sourcing the DPDK framework, revolutionizing packet processing with its capability for extremely low-latency operations at massive rates. During this period, the datacenter standard hovered around 10 Gbps. Remarkably, even basic DPDK applications like TestPMD, l2fwd, and l3fwd demonstrated the ability to saturate such links with packets as small as 64B, achieving an impressive 14 million packets per second (Mpps).

Although the raw arithmetic limit for 64-byte packets on a 10 Gbps link is roughly 19 Mpps, practical constraints, such as PCIe encoding/decoding efficiency and the Layer 1 framing overhead on the physical media, meant that initially only approximately 7 Gbps could be attained.
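
As a quick sanity check (my back-of-the-envelope arithmetic, not a figure from the original material): dividing 10 Gbps by the 64-byte frame alone gives roughly 19.5 Mpps, while accounting for the 20 bytes of Ethernet preamble and inter-frame gap per frame yields the well-known 14.88 Mpps line rate that the 14 Mpps figure above was approaching:

\frac{10 \times 10^{9}\ \text{b/s}}{64\ \text{B} \times 8} \approx 19.5\ \text{Mpps}
\qquad
\frac{10 \times 10^{9}\ \text{b/s}}{(64 + 20)\ \text{B} \times 8} \approx 14.88\ \text{Mpps}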

In response to this challenge, Intel introduced an innovative solution: the Poll Mode Driver (PMD). The PMD concept revolutionized packet processing by implementing a polling mechanism that allowed for the processing of packets within a fixed timeframe, circumventing the need for interrupts. This approach significantly enhanced the efficiency of packet handling. A PMD comprises APIs that enable the configuration of network devices and their corresponding queues. By directly accessing RX (Receive) and TX (Transmit) descriptors, the PMD swiftly receives, processes, and delivers packets within the user’s application, all without the interruption overhead typically associated with traditional interrupt-driven processing.

DPDK and device abstraction simplified representation from the HW NIC to the Network Function (NF)

This came at a time when telco carriers were in an open fight with the traditional Network Equipment Providers (the NEPs): 4G mobile network rollout costs were simply too high and the ATM-to-IP migration was a pain. Telcos were at the forefront of replacing as much proprietary equipment as possible with open-source software and readily available commodity hardware.

Telcos have built among the most complex systems made of decades of stratifications to support human communication

It was an unprecedented pitch that fundamentally disrupted an entire industry, resulting in the birth of Network Functions Virtualization (NFV). Intel seized the moment with precisely the right solution: a fresh software paradigm, free of legacy constraints, meticulously engineered from scratch to optimize packet processing speed while minimizing resource overhead. This innovative framework was not only exceptionally efficient but also remarkably versatile, capable of accommodating any feature or requirement. Crucially, it was built upon commodity x86 hardware rather than opaque, proprietary black boxes made of customized hardware. Development on x86 meant standardization, resulting in faster and more aggressive software releases.

Example of the level of strong system partitioning required by NFV for deterministic performance

How about FD.io VPP?!

While the hardware landscape primarily offered a singular choice in Intel, the software realm witnessed a proliferation of bespoke implementations from various NEPs catering to the diverse needs of 4G mobile networks. This involved breaking down the entire Evolved Packet Core (EPC) into numerous Virtual Network Functions (VNFs) such as S-GW, P-GW, MME, PCRF, and HSS, alongside elements like eNB, RNC, and gNB on the radio side.

Among these, Cisco emerged as a significant player, capitalizing on the rapidly growing trend towards telco open-source initiatives. Leveraging their expertise, they unveiled the core of NX-OS known as Vector Packet Processing (VPP), a never-sleeping project with over two decades of development under its belt. VPP boasts a vast feature set of L2 to L4 networking functionalities, including switching, ACL-based forwarding, routing, SDN gateways, firewalls, NAT, load balancers, and more.

A distinguishing feature of VPP lies in its integration of DPDK at its core, resulting in unparalleled optimization of packet vectors. While conventional open-source DPDK applications often showcase limited feature sets, VPP stands out as the crown jewel within this ecosystem. In the early stages, the community mantra revolved around “14 Mpps on a 3.5 GHz CPU = 250 cycles per packet [SNIP] but memory is ~70+ ns away (i.e. 2.0 GHz = 140+ cycles)”, showcasing the deep focus the community put into writing superbly optimized code.

The Technology Behind FD.io

Why just take my word for it when you can delve into VPP’s Continuous System Integration and Testing (CSIT) methodology and its exhaustive reports? Unlike the AppMod’s canary deployment approach, VPP undergoes rigorous testing not just once per release, but multiple times for each commit. Quite a difference, isn’t it? 😆

The Technology Enabling DPDK on GCP

As previously outlined, a key component of DPDK is the PMD; as the definition says, “a PMD consists of APIs to configure the network devices and their respective queues”, which essentially describes the functionality of a network driver. DPDK reimagines OS NIC drivers, making it essential to have PMD drivers tailored for specific network interfaces. A few years ago Google contributed the gve driver upstream to DPDK, and on the official DPDK git you can see the latest patches submitted.

The gVNIC driver is purpose-built to deliver superior throughput and lower latency. Notably, third-generation GCE machines and beyond exclusively support gVNIC, leveraging the advanced E2000 IPU architecture to its full potential. And for anyone requiring maximum bandwidth, TIER_1 networking emerges as another compelling feature worth considering.
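
For reference, here is a minimal sketch of how a gVNIC plus TIER_1 instance with two interfaces (one per VPC) can be created with gcloud; the instance name, zone, image, and subnet names are placeholders, and TIER_1 networking has machine-size requirements that not every shape satisfies:

# Hypothetical example: a C3 instance with gVNIC NICs on both VPCs and TIER_1 networking
gcloud compute instances create vpp-fwd \
  --zone=us-central1-a \
  --machine-type=c3-highcpu-88 \
  --image-family=rhel-9 --image-project=rhel-cloud \
  --network-interface=nic-type=GVNIC,subnet=left-subnet \
  --network-interface=nic-type=GVNIC,subnet=right-subnet \
  --network-performance-configs=total-egress-bandwidth-tier=TIER_1 \
  --can-ip-forward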

The Technology Enabling DPDK on Linux

When delving into the realm of running DPDK applications on Linux, whether in the context of GCP or any other platform, a set of system requirements on the Linux side becomes imperative.

  • Foremost among these requirements is the need for robust CPU isolation, defined as strong system partitioning. It ensures that the cores dedicated to DPDK applications remain isolated from both user space and kernel threads. This isolation is crucial for maintaining optimal and deterministic performance.
  • DPDK relies on HugePages for its packet buffers. Leveraging 1GB HugePages significantly enhances TLB efficiency, mitigating the risk of page address misses that can lead to high TLB miss rates with standard 2MB page sizes, thereby safeguarding performance.
  • Allowing user-space access to network devices — an essential aspect of DPDK — is achieved through the VFIO framework. VFIO ideally operates with IOMMU hardware to provide DMA isolation for devices, enhancing system security and stability. In scenarios where IOMMU is unavailable (like inside a VM), VFIO can function in No-IOMMU mode, albeit with reduced isolation. While the UIO framework serves as an alternative for user-space drivers, it lacks the comprehensive feature set and flexibility of VFIO.
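
A minimal sketch of how these prerequisites are typically satisfied on a Linux host follows; the PCI address and HugePages count are illustrative, and inside a GCE VM (where no vIOMMU is exposed) the unsafe No-IOMMU toggle is what allows vfio-pci to bind the gVNIC device:

# Reserve 1G HugePages at runtime (more reliably done via the kernel command line)
echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
mkdir -p /dev/hugepages1G
mount -t hugetlbfs -o pagesize=1G none /dev/hugepages1G

# Load VFIO and allow No-IOMMU mode (needed when no IOMMU is available to the guest)
modprobe vfio enable_unsafe_noiommu_mode=1
modprobe vfio-pci

# Bind the gVNIC device (illustrative PCI address) to vfio-pci for DPDK use
dpdk-devbind.py --bind=vfio-pci 0000:00:04.0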

To date, only Red Hat has seamlessly implemented all these requirements within RHEL:

  • Tuned offers a cpu-partitioning profile that simplifies OS tuning to a single command (an example follows the list of optimizations below). It is built for RHEL/Fedora; while Tuned can run on other Linux distros, it fails to apply several of the tunings there.
  • Since RHEL 7.3, features such as VFIO No-IOMMU and 1G HugePages are natively integrated.

Tuned’s CPU Partitioning profile offers a comprehensive array of optimizations, including:

  • SystemD CPUAffinity.
  • IRQBALANCE_BANNED_CPUS.
  • Enabling the Linux Tickless Kernel and Adaptive-Ticks CPU pinning (from the real-time preempt_rt patches).
  • RCU Callbacks Offload.
  • Several kernel-level tunings, such as disabling soft-lockup detection (nosoftlockup), the Machine Check Exception, and Transparent HugePages, plus strong affinity for the dirty-pages writeback workers and the kernel workqueue threads.
  • Several sysctl tunables, such as increasing hung_task_timeout_secs, disabling nmi_watchdog, numa_balancing, and stat_interval, and enabling IRQ timer_migration.
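
Applying the profile boils down to a couple of commands. A minimal sketch, assuming RHEL and the same core split used later in this post (core 0 for housekeeping, cores 1-87 isolated):

# Install Tuned and the cpu-partitioning profile
dnf install -y tuned tuned-profiles-cpu-partitioning

# Declare which cores to isolate for the DPDK/VPP worker threads
echo "isolated_cores=1-87" > /etc/tuned/cpu-partitioning-variables.conf

# Activate the profile; a reboot is needed for the kernel command-line changes
tuned-adm profile cpu-partitioning
reboot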

Before Tuned’s cpu-partitioning profile, all of this tuning was manual; about 10 years ago I was spending nights awake discovering the next kernel tunable or config to enable or disable 😂. Nonetheless, these experiences, while demanding, were undeniably memorable. In fact, I extensively documented this journey during my tenure at Red Hat in a series of three blog posts, offering insights and guidance to fellow enthusiasts navigating the intricacies of system optimization.

The rule of thumb is that the first core of each NUMA node should be reserved, not just for kernel threads and userland, but more importantly for the hardware-wired IRQs that always land on those cores and would otherwise preempt the DPDK PMD threads.

Strong System Partitioning Breakdown

Throughout this Medium post, several machine types were used, yet the same strong system partitioning rule was always applied.

When all this tuning is applied, the RHEL OS on GCE boots up with the following kernel command-line parameters (wrapped for easier reading):

BOOT_IMAGE=(hd0,gpt2)/boot/vmlinuz-5.14.0-362.24.1.el9_3.x86_64 \
root=UUID=7a5b4a05-070b-4975-8f5d-40485e04d55a ro net.ifnames=0 biosdevname=0 \
scsi_mod.use_blk_mq=Y crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M \
console=ttyS0,115200 \
skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 \
nohz=on nohz_full=1-87 rcu_nocbs=1-87 tuned.non_isolcpus=00000001 \
intel_pstate=disable nosoftlockup \
default_hugepagesz=1G hugepagesz=1G hugepages=64

Before proceeding to the next chapter, I want to extend an enormous shout-out to Red Hat, the Kernel and Tuned team, and the NFV group (with special mention to my good friend Franck Baudin). Their tireless efforts over the years have been monumental, unmatched by any other entity in the industry to date, resulting in something beautiful.

The Role of isolcpus Kernel Argument

One critical aspect not yet addressed is the role of the isolcpus kernel argument, which carries a somewhat infamous reputation. Historically, isolcpus was a commonly used setting for deterministic performance, but its relevance has diminished over the years and it is now largely deprecated. Nowadays, isolcpus is reserved almost exclusively for scenarios requiring real-time capabilities.

Enabling isolcpus effectively disables Completely Fair Scheduler (CFS) load balancing for the isolated CPU cores, but it doesn’t prevent kernel threads from running on those cores. So while process balancing is disabled, kernel threads still land on the isolated cores, where they can preempt the DPDK threads and increase the likelihood of packet drops. In the field, I’ve observed packet drops caused precisely by isolcpus being enabled.

Our Objectives

This Medium series was conceived with the promise of running FD.io VPP on GCP. While our initial chapters laid the groundwork by elucidating the theories underpinning NFV and DPDK, our journey is far from over. Here’s what lies ahead:

  • A detailed blueprint for running DPDK applications on GCP.
  • The best way to deploy FD.io VPP, TestPMD, and Pktgen.
  • The performance Titanium-enabled machines can deliver.
  • What one can expect with C3 and H3 (both based on Intel Sapphire Rapids) and C3D (based on AMD Genoa).
  • As for the future, another promise: running GCP C4 instances, based on Intel Emerald Rapids. Emerald Rapids’ IPC improvements alone are somewhat boring, but the overall GCP package could make it a different story.

Testing Topology

1st testing topology — TestPMD and VPP

1st testing topology

Let’s begin with the first testing topology, depicted in the illustration above. At the heart of this diagram lies VPP, strategically positioned between a traffic generator (situated on the left) and a generic network function serving as the receiver (positioned on the right).

Sender and receiver entities are situated on distinct VPC networks, with VPP seamlessly bridging the two domains. Leveraging routing directives, the GCP Andromeda SDN directs traffic from VPC Left to the corresponding VPP interface on VPC Left. Subsequently, VPP relays this traffic through its interface on VPC Right, where the Andromeda SDN undertakes the crucial last-mile delivery.

In this setup, TestPMD assumes both roles of traffic generator and network receiver. However, it’s important to note that communication is strictly unidirectional. Furthermore, owing to the limitations of TestPMD’s traffic generation capabilities, a single destination IP address is defined.
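
For illustration, here is a hedged sketch of how the TestPMD sender can be launched in tx-only mode; the PCI address, core list, and IP addresses are placeholders, and options such as --tx-ip and --txpkts depend on the DPDK release in use:

# Flood 64B packets towards a single (illustrative) destination IP in the 48.0.0.0/8 range
dpdk-testpmd -l 1-3 -n 4 -a 0000:00:04.0 -- \
  --forward-mode=txonly --txpkts=64 \
  --tx-ip=10.10.1.20,48.0.0.10 \
  --nb-cores=2 --stats-period=1

# The receiver side runs the same binary with --forward-mode=rxonly and simply counts what arrives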

The GCP routing configuration follows:

gcloud compute routes create range-48-to-vpp-left \
--network=left \
--destination-range=48.0.0.0/8 \
--next-hop-address=10.10.1.10

gcloud compute routes create range-48-from-vpp-right \
--network=right \
--destination-range=48.0.0.0/8 \
--next-hop-address=10.10.2.40

The VPP configuration is the following:

set interface ip address VirtualFunctionEthernet0/4/0 10.10.1.10/32
set interface ip address VirtualFunctionEthernet0/5/0 10.10.2.10/32

ip neighbor VirtualFunctionEthernet0/4/0 10.10.1.1 42:01:0a:0a:01:01
ip neighbor VirtualFunctionEthernet0/5/0 10.10.2.1 42:01:0a:0a:02:01

ip route add 48.0.0.0/8 via 10.10.2.1

Inside the VPP FIB (Forwarding Information Base) table:

show ip fib 48.0.0.0/8

2nd testing topology — Pktgen and VPP

Heavily inspired by my upstream work at Red Hat — 2nd testing topology

This second topology closely mirrors traditional Traffic Generator environments found on-premises, akin to setups utilizing IXIA or Spirent.

Within this framework, Pktgen takes center stage, assuming responsibility for both transmitting and receiving traffic. Unlike the previous topology, this arrangement features the utilization of multiple destination IP addresses, fostering bi-directional communication pathways.

From a GCP routing perspective, the setup remains largely unchanged, with one notable exception: the --next-hop-address now directs traffic to the respective IP addresses of the Pktgen GCE instance.

gcloud compute routes create range-17-to-vpp-right \
--network=right \
--priority=1000 \
--destination-range=17.0.0.0/8 \
--next-hop-address=10.10.2.10

gcloud compute routes create range-17-from-vpp-left \
--network=left \
--priority=1000 \
--destination-range=17.0.0.0/8 \
--next-hop-address=10.10.1.30

gcloud compute routes create range-49-to-vpp-left \
--network=left \
--destination-range=49.0.0.0/8 \
--next-hop-address=10.10.1.10

gcloud compute routes create range-49-from-vpp-right \
--network=right \
--destination-range=49.0.0.0/8 \
--next-hop-address=10.10.2.30

The corresponding VPP configuration is the following:

set interface ip address VirtualFunctionEthernet0/4/0 10.10.1.10/32
set interface ip address VirtualFunctionEthernet0/5/0 10.10.2.10/32

ip neighbor VirtualFunctionEthernet0/4/0 10.10.1.1 42:01:0a:0a:01:01
ip neighbor VirtualFunctionEthernet0/5/0 10.10.2.1 42:01:0a:0a:02:01

ip route add 17.0.0.0/8 via 10.10.1.1
ip route add 49.0.0.0/8 via 10.10.2.1

3rd testing topology — VPC Peering and TestPMD

3rd testing topology

Last but not least, we have a third, double-checking topology, aimed at discovering what the underlying Andromeda fabric can handle and figuring out whether VPP is bottlenecking the routing performance. We do that by replacing the two VPC static routes with VPC Peering between VPC Right and VPC Left.
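
For completeness, the peering replacing those static routes can be sketched as follows (VPC names as used earlier in this post, both directions required, same project assumed):

gcloud compute networks peerings create left-to-right \
  --network=left --peer-network=right

gcloud compute networks peerings create right-to-left \
  --network=right --peer-network=left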

The GCP Platform

Regarding the GCP Platform, while the configurations for the traffic generator, receiver, and VPP share similarities, subtle yet significant underlying differences exist.

Traffic Generator and Network Receiver

To begin with, both the traffic generator and the receiving function boast identical setups — a characteristic that remains consistent throughout this series.

The topology of the traffic generator and network receiver machines
  • Machine Type: h3-standard-88, made of 88 physical cores of Intel Sapphire Rapids and 352 GB of DDR5 memory. SMT (Simultaneous Multithreading) is disabled by default, ensuring consistently high computation and network performance.
  • vNUMA Topology: Automatically configured with 4 vNUMA nodes, each consisting of 22 cores, closely mapping the underlying hardware setup.
  • Network Interfaces: Equipped with gVNIC and TIER_1 Networking enabled by default allowing for an upper limit of 200 Gbps.
  • IP Forwarding: Allowed.
  • Operating System: Running RHEL 9.3, leveraging the latest stable RHEL Kernel version 5.14.0–362.24.1 available at the time of writing.
  • Compact Placement Policy (CCP): Utilizing a CCP with a max-distance of 2. This policy aims to position VMs in adjacent racks to minimize network latency. While achieving the lowest latency would necessitate a maximum distance of 1, indicating placement within the same rack, resource constraints made it challenging to consistently procure available resources for VM deployment.
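
A sketch of how such a placement policy can be created and attached follows; the policy name and region are placeholders, and the --max-distance flag may require a recent gcloud release:

# Compact placement policy allowing at most two racks of distance between VMs
gcloud compute resource-policies create group-placement nfv-placement \
  --collocation=COLLOCATED --max-distance=2 --region=us-central1

# Attach the policy at instance creation time
gcloud compute instances create testpmd-tx \
  --zone=us-central1-a --machine-type=h3-standard-88 \
  --resource-policies=nfv-placement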

FD.io VPP

For the VPP machine, here are the key specifications and configurations:

machine types used for VPP
  • Machine Types: Various machine types were tested (C3, C3D, and H3), all with SMT deactivated.
  • Network Interfaces: Equipped with gVNIC and TIER_1 Networking enabled by default allowing for an upper limit of 200 Gbps.
  • IP Forwarding: Allowed.
  • Compact Placement Policy: whenever possible (C3 and H3), in the same placement group as the traffic generator and network receiver.

VPP Performance on the smallest GCP C3 Machine

1st topology — VPP on c3-highcpu-4 — 1 RX/1 TX queue — 1 PMD thread

During a 120-second testing period, VPP exhibited exceptional performance, successfully forwarding 2,173,395,993 packets and averaging an impressive 18.1 million packets per second (Mpps). Remarkably, VPP encountered minimal forwarding challenges, with only 9 packets unable to be pushed out during the recorded Polling Cycle, resulting in a vanishingly small drop rate (indicated by the Tx packet drops (dpdk tx failure) counter).

While the traffic generator demonstrated the capability to push well over 130 Mpps, the limitations (CPU and bandwidth) of the smallest c3-highcpu-4 machine restricted its internal throughput to 10 Gbps. Considering a packet size of 64 bytes, this translates to approximately 19 Mpps — a rate nearly matched by the observed average.

With a packet drop rate of roughly 0.0000004% (9 packets out of about 2.17 billion), this initial testing showcases an exceptional level of performance achievable right out of the box. While the system is finely tuned, it’s essential to acknowledge that these remarkable results are attainable by any GCP user. This outcome underscores the astonishing capabilities of VPP within the GCP environment, setting a new benchmark for network performance.

Pushing Beyond 100 Mpps

1st topology — VPP on h3-standard-88 — 6 RX/6 TX queue — 19 PMD threads

108 million packets per second!! To be fair, there are carriers whose entire networks process fewer packets than this, let alone through a single CNF.

To put that into perspective, 108 Mpps, on a standard IMIX, equate to approximately 300 Gbps 💀.
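
As a rough cross-check (my arithmetic, assuming the classic 7:4:1 simple IMIX of 64, 576, and 1500-byte packets), the average packet is about 354 bytes, which at 108 Mpps indeed lands around 300 Gbps of L2 throughput:

\bar{S} = \frac{7 \cdot 64 + 4 \cdot 576 + 1 \cdot 1500}{12} \approx 354\ \text{B}
\qquad
108 \times 10^{6}\ \text{pps} \times 354\ \text{B} \times 8 \approx 306\ \text{Gb/s}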

However, it’s essential to interpret the 62 packets dropped on VPP with caution. While this suggests a very low drop rate, there could be other contributing factors, or drops happening in places we cannot observe. One thing is sure: TestPMD is able to generate a staggering 130 Mpps, while the VPP setup can “only” push out 108 Mpps.

1st topology — VPP on h3-standard-88 — 6 RX/6 TX queue — 19 PMD threads

Turning our attention to the bandwidth achievable with larger packet sizes, such as 1024 bytes, we observed a throughput of approximately 175 Gbps (or around 21 Mpps). Although I occasionally witnessed throughput exceeding 180 Gbps, under standard 1500-byte MTU conditions achieving around 175 Gbps is a noteworthy accomplishment. The fairly large drop in TX packets may be attributed to fabric overload — an indication that we were nearing the limits of the underlying infrastructure.

These results are obtainable on the 1st Testing Topology using:

  • VPP Instance: Utilizing an h3-standard-88 instance for VPP, ensuring no system limit would hold us back.
  • RX and TX Queue Configuration: Configuring VPP with 6 RX queues and 1024 RX descriptors, alongside 6 TX queues and 1024 TX descriptors. With more queues the system would become unstable; with more descriptors, latency would take a hit (see the startup.conf sketch after this list).
  • NUMA Affinity: Leveraging NUMA 0 in its entirety for VPP, with a dedicated main-core worker on CPU1 and nineteen PMD cores spanning CPU2 to CPU21. Effectively leaving NUMA 1 to NUMA 3 unutilized.
  • Optimization Techniques: Disabling features such as TSO (TCP Segmentation Offload), TX Checksum Offload, and Multi-Segmentation (required for Jumbo frames yet not supported by the DPDK GVE driver), thereby eliminating potential overhead and maximizing performance.
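
Translated into VPP’s startup.conf, that setup looks roughly like the sketch below; the PCI addresses are placeholders and the exact worker core list must match your own NUMA 0 layout:

cpu {
  main-core 1
  # worker/PMD threads pinned to the remaining NUMA 0 cores (19 workers in this setup)
  corelist-workers 2-20
}

dpdk {
  dev default {
    num-rx-queues 6
    num-tx-queues 6
    num-rx-desc 1024
    num-tx-desc 1024
  }
  # gVNIC devices exposed to the guest (illustrative PCI addresses)
  dev 0000:00:04.0
  dev 0000:00:05.0
  # disable multi-segment buffers and TX checksum offload, as described above
  no-multi-seg
  no-tx-checksum-offload
}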

Introducing AMD Genoa to FD.io VPP

It’s been over five years since AMD seized the x86 IPC crown, while Intel maintained supremacy in memory and I/O latencies. However, just a few years ago, with AMD Rome (based on the Zen 2 architecture), the memory, cache, and NUMA topology, as exposed through the Nodes Per Socket (NPS) configuration, posed challenges for latency-sensitive telco applications. Rome was effectively relegated primarily to IT applications. Let’s revisit this scenario with GCP and Genoa.

1st topology — VPP on c3d-highcpu-360 — 6 RX/6 TX queue — 90 PMD threads

While not quite hitting the 100 Mpps mark, the performance is impressive. Achieving this configuration required utilizing a c3d-highcpu-360 instance. Unfortunately, the VPP Compact Placement Policy had to be disabled for this setup. Over a 120-second interval, TestPMD received an average throughput of 98 Mpps, slightly lower than the C3 setup. However, due to the absence of the Compact Placement Policy, the comparison between the two setups isn’t entirely apples-to-apples. Notably, the drop rate, as indicated by Tx packet drops (dpdk tx failure), is higher than that observed with Intel Sapphire Rapids, showcasing Intel’s continued edge in low-latency and I/O aspects.

1st topology — VPP on c3d-highcpu-4 — 1 RX/1 TX queue — 1 PMD thread

Additionally, it’s essential to assess the performance using the smallest available C3D machine configuration: c3d-highcpu-4. It’s made of 2 physical cores and 8 GB of DDR5 memory. Here, VPP achieved a throughput of approximately 16.5 Mpps, slightly lower than the C3 result of 18.1 Mpps. However, the drop rate concluded the 120-second run at 84 packets, compared to the mere 9 drops observed with C3. Yet again this highlights the need to closely monitor performance metrics on AMD EPYC.

Pktgen-DPDK vs. VPP

Pktgen stands as a well-known open-source traffic generation tool, renowned for its DPDK acceleration and nuanced control over traffic profiles. While TestPMD excels at pushing packets at maximum speed, Pktgen offers greater flexibility in shaping traffic patterns.

However, my exploration revealed that Pktgen encounters challenges on GCP, frequently throwing stack traces and struggling to achieve stability beyond a combined throughput of 20 Mpps. This exploration falls under the second testing topology.

We simulated a truly multi-flow, bi-directional traffic pattern with UDP packets using source port 2123, commonly used in telco for GTP-C tunneling. Destination ports ranged from 10,000 to 50,000, with destination IPs spanning 49.0.0.0/8 and 17.0.0.0/8 for a total of over 32 million possible IPs. Packet sizes ranged from 64 to 256 bytes.
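
For reference, a hedged sketch of how Pktgen-DPDK is typically launched for a two-port, bi-directional run follows; the core list and PCI addresses are placeholders, and the actual ranges of ports, IPs, and packet sizes described above are configured afterwards at the Pktgen prompt (or via a script passed with -f):

# EAL options before '--', Pktgen options after:
#   -P enables promiscuous mode on all ports, -m maps lcores to ports as [rx:tx].port
pktgen -l 0-8 -n 4 -a 0000:00:04.0 -a 0000:00:05.0 -- \
  -P -T -m "[1-2:3-4].0, [5-6:7-8].1"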

VPP, configured with a c3-highcpu-8 instance, featuring 4 physical cores and 16GB of memory, rose to the challenge admirably. Despite the demanding scenario, VPP ran for about 4 hours with minimal packet drops.

vpp on the left — pktgen main page on the right

For those interested in latency stats (network and packet processing), see the following:

pktgen latency page

Originally, I had intended to utilize NFVBench, a traffic generator developed by the OPNFV community, which leverages Cisco’s TRex traffic generator under the hood. However, TRex relies on DPDK and necessitates a specific integration layer to utilize new PMD NIC drivers. Ultimately, GVE support is lacking, and there are no immediate plans from the community to address this limitation. While Pktgen is indeed a formidable traffic generator, TRex outshines it in several aspects, particularly in its comprehensive traffic profile feature set and advanced reporting capabilities, especially regarding crucial metrics like drop rate, CRC verification, packet reordering etc. Without such a tool, accurately assessing the performance of Titanium and C3 instances, and their suitability for Telco network applications, becomes challenging.

Sending and Receiving without VPP

In our final experimental test, we examined the network’s capabilities without VPP intervention, as indicated by the third topology where routing is managed via VPC Peering.

TestPMD sending data to another TestPMD at 125 Mpps

Over the standard 120-second duration, the network delivered an average throughput of approximately 120 Mpps. However, it’s worth noting that the traffic exhibited some burstiness, with frequent throughput fluctuations of several million packets per second.

In contrast, the observed 108 Mpps throughput achieved by VPP is truly remarkable and serves as a testament to the platform’s robustness and capability to support highly demanding Telco applications.

Conclusions and Next Steps

In conclusion, our exploration of DPDK, VPP, and high-performance networking on GCP reveals the impressive capabilities of these technologies for handling demanding workloads. The testing results, particularly with VPP on C3 and H3 instances, demonstrate the potential for achieving exceptional throughput and low latency. While challenges remain, such as the need for more robust traffic generation tools like TRex, the future of cloud-based packet processing looks promising.

I leave you with the promise of returning soon with a comprehensive guide, should you decide to run VPP on GCP. Additionally, I look forward to exploring the capabilities of C4 instances in the future. Until then, stay tuned for more updates and insights.
