Forwarding over 100 Mpps with FD.io VPP on x86: Part 2
In Part 1, we established the foundational concepts by delving into the theories supporting NFV and DPDK, along with the observed throughput achieved across a series of experiments. Now, in this second part, I will delve into the optimal approach for deploying DPDK applications on GCP, utilizing FD.io VPP and TestPMD as prime examples.
Furthermore, I’d like to address an additional point that was overlooked in the previous article: the attainable metrics when utilizing the largest available machine for FD.io VPP, specifically the h3-standard-88, with just a single PMD thread. In Part 1, a similar yet distinct experiment was conducted, employing VPP on a c3-highcpu-4 instance restricted to 10 Gbps of traffic. While VPP achieved the 10 Gbps threshold, the extent of further potential remained ambiguous. This experiment serves a dual purpose:
- Presently, 200Gbps / 100 Mpps of throughput is solely achievable on the largest available machines. However, in the future, Google might reassess this spec, and understanding the maximum capacity of the underlying infrastructure with minimal resources is crucial for making informed decisions.
- Secondly, it delves into the optimization of DPDK and FD.io VPP. While the initial post indicated the robust development and scalability of these technologies, their true efficacy can only be discovered when operating under minimal resource constraints.
· Maximum throughput with least resources
∘ Analyzing packet drops in depth
· Rolling out the GCP environment
∘ VPC and Routing
∘ Setting Up GCE Instances with Compact Placement Policy
· OS config and deterministic setup
∘ VPP Startup Configuration
∘ VPP Base Configuration
· Configuring DPDK TestPMD
· Conclusions
Maximum throughput with least resources
Let’s delve into the numbers. The setup mirrors that of Part 1, with no OS or kernel updates and no newer DPDK, VPP, or Podman versions. Employing the same network topology, VPP operates on an h3-standard-88 instance, with a single PMD thread running on CPU1, while CPU0 handles both OS housekeeping and VPP’s main-core tasks.
A snapshot captured just 5 seconds before the test’s conclusion reveals a strikingly minimal packet drop rate by VPP. Surprisingly, the throughput remains capped at 10 Gbps / 18.6 Mpps due to the utilization of a single gVNIC queue. While additional queues could potentially enhance throughput, this would necessitate more CPUs, rendering the test invalid. (Indeed, VPP and DPDK can serve multiple queues from a single PMD thread, but pragmatically, more queues means more CPUs.)
Analyzing packet drops in depth
The screenshot above provides a comprehensive summary of all packets forwarded by VPP over a 120-second window and subsequently received by the Network Receiver. With TestPMD transmitting at 130 Mpps, any calculation of packet drops on that end becomes futile, given the billions of TX drops. In the physical world, analyzing what happens on a single leg (here a VPC) would be the correct thing to do.
During this period, VPP successfully processed and transmitted a staggering 2,241,375,624 packets. Only 12 were dropped in the forwarding process. The Network Receiver, however, received 2,241,173,874 packets, leaving 201,750 packets unaccounted for. It is plausible that these were lost in transit within the GCP underlay, and the transparency of the VPP and TestPMD error counters is what allows us to make that assumption. Calculating the packet drop rate, we determine that approximately 0.009% of packets were lost (201,750 lost packets out of 2,241,375,624 sent, a ratio of about 0.00009).
Having extensive experience with numerous Telco carriers across Europe and North America, I can attest that a 0.009% drop rate is nothing short of remarkable, particularly in the context of the public cloud industry.
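The drop-rate arithmetic above can be reproduced with a small shell sketch. The function name drop_stats is an assumption made for illustration; only the two packet counters come from the test itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of VPP or TestPMD): given the packets sent by
# VPP and the packets seen by the receiver, print the loss count and percentage.
drop_stats() {
  local sent=$1 received=$2
  local lost=$(( sent - received ))
  # awk handles the floating-point division that shell arithmetic lacks
  awk -v s="$sent" -v l="$lost" 'BEGIN { printf "%d lost, %.4f%%\n", l, (l / s) * 100 }'
}

# Counters from the 120-second run described above
drop_stats 2241375624 2241173874
```

With the counters from this run, the helper reports 201750 lost packets, which rounds to 0.0090%.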
Rolling out the GCP environment
With this latest experiment done, we now have a comprehensive understanding of the performance capabilities of FD.io VPP on GCP. Let’s turn our focus to the practical implementation, starting with the deployment of the GCP components.
The instructions provided here are tailored to the first testing topology, focusing on TestPMD and VPP. While the details for the second topology are not explicitly outlined, following these instructions should facilitate its reproduction.
VPC and Routing
First and foremost, let’s address the fundamental networking infrastructure by creating the management, left, and right VPCs along with their associated subnets in the europe-west4 region:
gcloud compute networks create mgmt \
--subnet-mode=custom \
--mtu=1500 \
--bgp-routing-mode=regional
gcloud compute networks subnets create mgmt \
--network=mgmt \
--range=172.16.128.0/27 \
--stack-type=IPV4_ONLY \
--enable-private-ip-google-access \
--region=europe-west4
gcloud compute networks create left \
--subnet-mode=custom \
--mtu=8896 \
--bgp-routing-mode=regional
gcloud compute networks subnets create left \
--network=left \
--range=10.10.1.0/24 \
--stack-type=IPV4_ONLY \
--region=europe-west4
gcloud compute networks create right \
--subnet-mode=custom \
--mtu=8896 \
--bgp-routing-mode=regional
gcloud compute networks subnets create right \
--network=right \
--range=10.10.2.0/24 \
--stack-type=IPV4_ONLY \
--region=europe-west4
Next, we configure the routing between the VPCs:
gcloud compute routes create range-48-to-vpp-left \
--network=left \
--priority=1000 \
--destination-range=48.0.0.0/8 \
--next-hop-address=10.10.1.10
gcloud compute routes create range-48-from-vpp-right \
--network=right \
--priority=1000 \
--destination-range=48.0.0.0/8 \
--next-hop-address=10.10.2.40
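The two custom routes steer anything destined for 48.0.0.0/8 first to VPP's left interface (10.10.1.10) and then, on the right VPC, to the Network Receiver (10.10.2.40). To see which addresses fall under that prefix, a prefix check can be sketched in shell. The helper names ip_to_int and in_cidr are hypothetical; the addresses are the ones used in this topology.

```shell
# Hypothetical helpers for illustration; they are not gcloud features.
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  # Unquoted on purpose: split the address on dots into $1..$4
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# in_cidr ADDRESS PREFIX/LEN -> exit status 0 when ADDRESS falls inside the prefix.
in_cidr() {
  local ip net bits
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=$(( 32 - ${2#*/} ))
  # Compare only the network part of both addresses
  [ $(( ip >> bits )) -eq $(( net >> bits )) ]
}

# TestPMD's destination 48.0.0.1 matches the 48.0.0.0/8 custom routes,
# while the left subnet address 10.10.1.40 does not.
in_cidr 48.0.0.1 48.0.0.0/8 && echo "48.0.0.1 follows the custom route"
in_cidr 10.10.1.40 48.0.0.0/8 || echo "10.10.1.40 stays on subnet routing"
```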
Finally, we set up the VPC Cloud Firewall rules to allow traffic to flow through:
gcloud compute firewall-rules create left-allow-right \
--direction=INGRESS --priority=1000 --network=left \
--action=ALLOW --rules=all --source-ranges=10.10.2.0/24
gcloud compute firewall-rules create right-allow-left \
--direction=INGRESS --priority=1000 --network=right \
--action=ALLOW --rules=all --source-ranges=10.10.1.0/24
gcloud compute firewall-rules create left-allow-itself \
--direction=INGRESS --priority=1000 --network=left \
--action=ALLOW --rules=all --source-ranges=10.10.1.0/24
gcloud compute firewall-rules create right-allow-itself \
--direction=INGRESS --priority=1000 --network=right \
--action=ALLOW --rules=all --source-ranges=10.10.2.0/24
gcloud compute firewall-rules create mgmt-allow-iap \
--direction=INGRESS --priority=1000 --network=mgmt \
--action=ALLOW --rules=tcp:22 --source-ranges=35.235.240.0/20
Setting Up GCE Instances with Compact Placement Policy
Next, we set up the GCE instances. They are all configured similarly, with the primary differences being the number of gVNICs and, potentially, the machine type. The first step is to deploy a Compact Placement Policy (CPP) in the europe-west4 region. As explained in Part 1, a max-distance setting of 1 would yield better results and lower latency, but it is more challenging to achieve. Therefore, we use a max-distance setting of 2, which is readily available on GCP, and name the policy dpdk-compact:
gcloud beta compute resource-policies create group-placement dpdk-compact \
--collocation=collocated \
--max-distance=2 \
--region=europe-west4
Next, we reserve the IP addresses for the left and right gVNIC interfaces of VPP:
gcloud compute addresses create vpp-left \
--addresses=10.10.1.10 --region=europe-west4 \
--subnet=left --purpose=GCE_ENDPOINT
gcloud compute addresses create vpp-right \
--addresses=10.10.2.10 --region=europe-west4 \
--subnet=right --purpose=GCE_ENDPOINT
Finally, we set up the VPP GCE instance. The following gcloud CLI command will create an h3-standard-88 instance in the europe-west4-b zone with three gVNIC network interfaces (management, left, and right, in that order; the order specified through the API is enforced by the GCE hypervisor). It will also enable packet forwarding (essential for a router 😆), use the RHEL9 OS, and utilize the dpdk-compact CPP:
gcloud compute instances create vpp \
--zone=europe-west4-b \
--machine-type=h3-standard-88 \
--min-cpu-platform=Intel\ Sapphire\ Rapids --threads-per-core=1 \
--network-interface=network-tier=PREMIUM,nic-type=GVNIC,stack-type=IPV4_ONLY,subnet=mgmt \
--network-interface=nic-type=GVNIC,private-network-ip=10.10.1.10,stack-type=IPV4_ONLY,subnet=left,no-address \
--network-interface=nic-type=GVNIC,private-network-ip=10.10.2.10,stack-type=IPV4_ONLY,subnet=right,no-address \
--can-ip-forward \
--maintenance-policy=TERMINATE --provisioning-model=STANDARD \
--service-account=335681811111-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform \
--create-disk=boot=yes,device-name=vpp,image=projects/rhel-cloud/global/images/rhel-9-v20240312,mode=rw,size=100,type=projects/vpp-on-gcp-68333/zones/europe-west4-b/diskTypes/pd-balanced \
--no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring \
--labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any \
--resource-policies=dpdk-compact
Additionally, here is a single script for both the Traffic Generator and Network Receiver:
gcloud compute addresses create testpmd-tx \
--addresses=10.10.1.40 --region=europe-west4 \
--subnet=left --purpose=GCE_ENDPOINT
gcloud compute addresses create testpmd-rx \
--addresses=10.10.2.40 --region=europe-west4 \
--subnet=right --purpose=GCE_ENDPOINT
gcloud compute instances create testpmd-tx \
--zone=europe-west4-b \
--machine-type=h3-standard-88 \
--min-cpu-platform=Intel\ Sapphire\ Rapids --threads-per-core=1 \
--network-interface=network-tier=PREMIUM,nic-type=GVNIC,stack-type=IPV4_ONLY,subnet=mgmt \
--network-interface=nic-type=GVNIC,private-network-ip=10.10.1.40,stack-type=IPV4_ONLY,subnet=left,no-address \
--can-ip-forward \
--maintenance-policy=TERMINATE --provisioning-model=STANDARD \
--service-account=335681811111-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform \
--create-disk=boot=yes,device-name=testpmd-tx,image=projects/rhel-cloud/global/images/rhel-9-v20240312,mode=rw,size=100,type=projects/vpp-on-gcp-68333/zones/europe-west4-b/diskTypes/pd-balanced \
--no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring \
--labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any \
--resource-policies=dpdk-compact
gcloud compute instances create testpmd-rx \
--zone=europe-west4-b \
--machine-type=h3-standard-88 \
--min-cpu-platform=Intel\ Sapphire\ Rapids --threads-per-core=1 \
--network-interface=network-tier=PREMIUM,nic-type=GVNIC,stack-type=IPV4_ONLY,subnet=mgmt \
--network-interface=nic-type=GVNIC,private-network-ip=10.10.2.40,stack-type=IPV4_ONLY,subnet=right,no-address \
--can-ip-forward \
--maintenance-policy=TERMINATE --provisioning-model=STANDARD \
--service-account=335681811111-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform \
--create-disk=boot=yes,device-name=testpmd-rx,image=projects/rhel-cloud/global/images/rhel-9-v20240312,mode=rw,size=100,type=projects/vpp-on-gcp-68333/zones/europe-west4-b/diskTypes/pd-balanced \
--no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring \
--labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any \
--resource-policies=dpdk-compact
OS config and deterministic setup
At this stage, we have set up the VPC, routing, Cloud Firewall rules, CPP, and the three GCE instances used in the first network topology. The final step is to configure the operating system. Below is the script, followed by an explanation:
#!/bin/bash
dnf config-manager --set-enabled rhui-codeready-builder-for-rhel-9-x86_64-rhui-rpms
dnf config-manager --set-enabled rhui-codeready-builder-for-rhel-9-x86_64-rhui-debug-rpms
dnf config-manager --set-enabled rhui-codeready-builder-for-rhel-9-x86_64-rhui-source-rpms
dnf makecache
dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
dnf upgrade -y
dnf install -y tuned-profiles-cpu-partitioning tuned driverctl \
screen sysstat pciutils irqbalance container-tools numactl
systemctl enable --now irqbalance
echo "isolated_cores=1-87" | tee -a /etc/tuned/cpu-partitioning-variables.conf
# echo "no_balance_cores=1-87" | tee -a /etc/tuned/cpu-partitioning-variables.conf
systemctl enable --now tuned
tuned-adm profile cpu-partitioning
grubby --update-kernel=ALL --args="default_hugepagesz=1G hugepagesz=1G hugepages=32"
cat > /etc/modprobe.d/vfio.conf << EOF
options vfio enable_unsafe_noiommu_mode=Y
options vfio_iommu_type1 allow_unsafe_interrupts=Y
EOF
driverctl set-override 0000:00:04.0 vfio-pci
driverctl set-override 0000:00:05.0 vfio-pci
dracut -f
sed -e "s/^SELINUX=.*$/SELINUX=disabled/g" -i /etc/selinux/config
podman pull docker.io/ligato/vpp-base:24.02-release
mkdir -p {/etc/vpp,/var/log/vpp}
cat > /etc/vpp/startup.conf << EOF
unix {
nodaemon
log /var/log/vpp/vpp.log
full-coredump
cli-listen /run/vpp/cli.sock
startup-config /etc/vpp/base.conf
gid vpp
}
api-trace { on }
api-segment { gid vpp }
socksvr { default }
memory {
main-heap-size 2G
main-heap-page-size 1G
default-hugepage-size 1G
}
cpu {
main-core 1
corelist-workers 2-21
}
buffers {
buffers-per-numa 128000
default data-size 2048
}
dpdk {
dev default {
num-rx-queues 6
num-tx-queues 6
tso off
num-rx-desc 1024
num-tx-desc 1024
}
uio-driver vfio-pci
socket-mem 4096
no-multi-seg
no-tx-checksum-offload
max-simd-bitwidth 512
}
EOF
cat > /etc/vpp/base.conf << EOF
set interface ip address VirtualFunctionEthernet0/4/0 10.10.1.10/32
set interface ip address VirtualFunctionEthernet0/5/0 10.10.2.10/32
set interface mtu 1500 VirtualFunctionEthernet0/4/0
set interface mtu 1500 VirtualFunctionEthernet0/5/0
set interface state VirtualFunctionEthernet0/4/0 up
set interface state VirtualFunctionEthernet0/5/0 up
ip neighbor VirtualFunctionEthernet0/4/0 10.10.1.1 42:01:0a:0a:01:01
ip neighbor VirtualFunctionEthernet0/5/0 10.10.2.1 42:01:0a:0a:02:01
ip route add 10.10.1.0/24 via 10.10.1.1 VirtualFunctionEthernet0/4/0
ip route add 10.10.2.0/24 via 10.10.2.1 VirtualFunctionEthernet0/5/0
ip route add 16.0.0.0/8 via 10.10.1.1
ip route add 48.0.0.0/8 via 10.10.2.1
EOF
podman create \
--name vpp \
-it \
--privileged \
--pid=host \
--cap-add=ALL \
--network host \
--volume /etc/vpp:/etc/vpp \
--volume /var/log/vpp:/var/log/vpp \
--volume /dev:/dev \
--volume /sys/bus/pci/drivers:/sys/bus/pci/drivers \
--volume /sys/kernel/mm/hugepages:/sys/kernel/mm/hugepages \
--volume /sys/devices/system/node:/sys/devices/system/node \
docker.io/ligato/vpp-base:24.02-release
podman generate systemd --name vpp --new --restart-policy on-failure > /etc/systemd/system/container-vpp.service
systemctl daemon-reload
systemctl enable container-vpp.service
echo "## Please reboot"
- Additional repositories, specifically CodeReady Linux Builder and EPEL, are enabled to install packages not available in BaseOS and AppStream.
- The system is upgraded; all tests were executed on RHEL 9.3 using kernel 5.14.0-362.24.1.
- Several utilities are installed, including:
  – tuned and the CPU Partitioning profile for system partitioning.
  – driverctl to remap the gVNIC interfaces to userland.
  – irqbalance, required by tuned for IRQ affinity.
  – container-tools for Podman (Yes, VPP is containerized!!).
- irqbalance is configured to run at boot.
- Tuned’s CPU Partitioning profile is set to isolate all cores except CPU0 (the no_balance_cores line, which governs isolcpus, is commented out).
- Hugepages are statically allocated at boot, which, although somewhat archaic, works fine despite lacking flexibility in NUMA node page allocation.
- VFIO is set up to operate in No-IOMMU mode.
- The gVNIC interfaces (PCI IDs 0000:00:04.0 and 0000:00:05.0) are remapped to userland using the vfio-pci driver.
- initramfs is regenerated so that all new modules (vfio and those from tuned) are included in the image loaded with vmlinuz.
- Daniel Walsh will complain: SELinux is disabled. SORRY Dan! Despite being a long-time SELinux advocate, due to VPP compatibility issues, I had no choice. This is partly due to Red Hat’s discontinuation of CentOS, leading many open-source communities to cease development and testing for CentOS and thus RHEL. VPP no longer ships, builds, or tests its custom SELinux policy 😭.
- The Ligato VPP docker image is pulled locally, and the necessary VPP configuration and log directories are created.
- VPP startup and base configurations are placed in the system so that VPP is ready to go once it is up.
- Finally, the VPP container is created with privileged status and all capabilities allowed. Following Podman’s way-of-doing-things 🙃 a Systemd service is created to ensure VPP runs at startup.
At this point, make sure to reboot the system.
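After the reboot, it is worth confirming that the hugepage reservation took effect. The sketch below parses /proc/meminfo; the function name hugepages_gib is hypothetical, and the expected value of 32 assumes the hugepages=32 kernel argument from the script above.

```shell
# Hypothetical helper: report the total hugepage memory in GiB by parsing
# /proc/meminfo (or a file in the same format passed as the first argument).
hugepages_gib() {
  awk '/^HugePages_Total:/ { pages = $2 }
       /^Hugepagesize:/    { size_kb = $2 }
       END { printf "%d\n", pages * size_kb / 1048576 }' "${1:-/proc/meminfo}"
}

# On the VPP instance this should print 32 if all 1G pages were reserved
hugepages_gib
```

In the same spirit, tuned-adm active and driverctl list-overrides report the active tuned profile and the PCI driver overrides, and /sys/devices/system/cpu/isolated lists the isolated cores.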
Below is the script to deploy TestPMD on both the Traffic Generator and Network Receiver systems:
#!/bin/bash
dnf config-manager --set-enabled rhui-codeready-builder-for-rhel-9-x86_64-rhui-rpms
dnf config-manager --set-enabled rhui-codeready-builder-for-rhel-9-x86_64-rhui-debug-rpms
dnf config-manager --set-enabled rhui-codeready-builder-for-rhel-9-x86_64-rhui-source-rpms
dnf makecache
dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
dnf upgrade -y
dnf install -y tuned-profiles-cpu-partitioning tuned driverctl \
screen sysstat pciutils irqbalance container-tools numactl \
kernel-headers-$(uname -r) libpcap-devel meson ninja-build python3-pyelftools numactl-devel
dnf group install -y "Development Tools"
DPDK_VER="24.03"
export RTE_TARGET="/root/dpdk-${DPDK_VER}"
export RTE_SDK="${RTE_TARGET}"
# Download and unpack in /root so the extracted tree matches RTE_TARGET
cd /root
curl -O -L https://github.com/DPDK/dpdk/archive/refs/tags/v${DPDK_VER}.tar.gz
tar xf "v${DPDK_VER}.tar.gz"
cd "${RTE_TARGET}"
meson -Dexamples=all build
ninja -C build
ninja -C build install
ldconfig
systemctl enable --now irqbalance
echo "isolated_cores=1-87" | tee -a /etc/tuned/cpu-partitioning-variables.conf
# echo "no_balance_cores=1-87" | tee -a /etc/tuned/cpu-partitioning-variables.conf
systemctl enable --now tuned
tuned-adm profile cpu-partitioning
grubby --update-kernel=ALL --args="default_hugepagesz=1G hugepagesz=1G hugepages=128"
# https://bugzilla.redhat.com/show_bug.cgi?id=1762087
# vIOMMU not supported
cat > /etc/modprobe.d/vfio.conf << EOF
options vfio enable_unsafe_noiommu_mode=Y
options vfio_iommu_type1 allow_unsafe_interrupts=Y
EOF
driverctl set-override 0000:00:04.0 vfio-pci
dracut -f
sed -e "s/^SELINUX=.*$/SELINUX=disabled/g" -i /etc/selinux/config
echo "## Please reboot"
The only significant difference is that we need to compile DPDK 24.03, which involves a few additional packages and steps. Otherwise, everything else remains the same.
VPP Startup Configuration
In the VPP startup configuration, we define how VPP interacts with the system, including the daemon configuration. Let’s break it down:
- Container-Specific Configuration: Settings optimized for running VPP in a container.
- Tracing API: Enabled for debugging and performance monitoring.
- Memory Configuration:
  – Main heap of 2GB, backed by 1GB HugePages, which is also the default hugepage size.
- CPU Configuration:
  – Main core assigned to CPU1.
  – DPDK PMD Threads (defined as workers in VPP) assigned to CPU 2 through 21 (all on NUMA node 0).
- Buffer Allocation:
  – 128,000 buffers allocated per NUMA node.
- DPDK Interfaces:
  – Using the vfio-pci driver.
  – Allocating 4 HugePages of 1GB each.
  – Configuring 6 RX and 6 TX queues per interface.
  – Setting 1024 descriptors for both the RX and TX rings.
- Checksum Offload:
  – TX checksum offload is disabled due to the dependency on multi-segmentation.
  – Multi-segmentation, required by Jumbo Frames, is not supported by the gVNIC PMD. Disabling multi-segmentation can improve performance, hence the settings no-multi-seg and no-tx-checksum-offload.
- Instruction Set:
  – AVX512 is used for enhanced performance. Intel has a great white paper on the subject.
VPP Base Configuration
The base configuration defines the network settings VPP will use at startup:
- IP Address Configuration: each interface is assigned an IP address in accordance with the GCP standard of a /32 on every interface.
- MTU Setting: set to 1500 bytes.
- Interface State: interfaces are brought up (equivalent to no sh in Cisco IOS).
- MAC Address Configuration: since broadcast does not exist on GCP, and the interface has a /32, the MAC address of the VPC Gateway must be statically defined.
- Static Routing: implements the logic of the first network topology.
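The two neighbor entries in base.conf suggest a pattern: 42:01 followed by the gateway's IPv4 address in hexadecimal. Assuming that pattern holds (it is inferred from the addresses used here, not from a documented guarantee), the MAC can be derived with a short helper. The name gw_mac is hypothetical.

```shell
# Hypothetical helper, based on the pattern observed in base.conf:
# 42:01 followed by the four octets of the gateway IPv4 address in hex.
gw_mac() {
  echo "$1" | awk -F. '{ printf "42:01:%02x:%02x:%02x:%02x\n", $1, $2, $3, $4 }'
}

gw_mac 10.10.1.1   # left VPC gateway
gw_mac 10.10.2.1   # right VPC gateway
```

For the two gateways above, this yields the same values used in the ip neighbor lines of base.conf. If the addresses differ in your environment, verify the real gateway MAC before relying on the derived one.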
Configuring DPDK TestPMD
When compiling DPDK, TestPMD is also compiled. To run the traffic generator, use the following command:
dpdk-testpmd \
-a 00:04.0 \
-l 1-21 -- \
--forward-mode=txonly \
--txpkts=64 \
--txq=16 \
--rxq=16 \
--nb-cores=20 \
--stats-period 5 \
--txonly-multi-flow \
--tx-ip=10.10.1.40,48.0.0.1 \
--eth-peer=0,42:01:0a:0a:01:01
TestPMD will start in TX-only mode, generating packets of 64 bytes. It utilizes 16 TX and RX queues and is given 21 cores (all from NUMA node 0), 20 of which act as forwarding cores. TestPMD automatically allocates the necessary resources. The packets are sent from the source IP address 10.10.1.40 to the destination IP address 48.0.0.1, with Andromeda handling the routing to VPP.
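The relationship between the -l core list and --nb-cores can be checked with a small helper. This is a sketch: count_lcores is a hypothetical name, and it only understands comma-separated entries and simple ranges.

```shell
# Hypothetical helper: count the cores in a DPDK-style core list
# such as "1-21" or "0,2-5".
count_lcores() {
  echo "$1" | tr ',' '\n' |
    awk -F- '{ n += (NF == 2 ? $2 - $1 + 1 : 1) } END { print n }'
}

total=$(count_lcores 1-21)
# One lcore is kept as the main lcore, the rest can forward packets
echo "lcores: $total, forwarding cores available: $(( total - 1 ))"
```

The same helper applied to VPP's corelist-workers value of 2-21 gives 20 worker threads.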
Now, let’s focus on the Network Receiver. This process is straightforward:
dpdk-testpmd \
-a 00:04.0 \
-l 1-21 -- \
--forward-mode=rxonly \
--txq=16 \
--rxq=16 \
--nb-cores=20 \
--stats-period 5
The configuration for the Network Receiver differs only slightly from that of the traffic generator. Instead of operating in TX-only mode, it runs in RX-only mode to receive all incoming traffic. The same number of queues and cores is defined, and the settings related to traffic generation are omitted.
Conclusions
In this two-part exploration of forwarding performance using FD.io VPP on x86, we’ve journeyed from foundational concepts to practical deployment on GCP. Part 1 laid the groundwork, discussing NFV and DPDK theories, culminating in a series of experiments that demonstrated mind-blowing throughput capabilities. In Part 2, we dove deeper into the implementation on GCP, focusing on optimal deployment strategies for DPDK applications utilizing FD.io VPP and TestPMD. Through several experiments, I’ve demonstrated that achieving over 100 Mpps throughput is possible but currently limited to the largest available machine configuration, the h3-standard-88 instance.
Our deep dive into packet drops revealed a strikingly low drop rate of 0.009%, a metric that far exceeds industry expectations, especially in public cloud environments. This success highlights not only the efficiency of VPP in handling high packet rates but is also a testament to the fine work done at Google with GCP.
The detailed setup process, from configuring VPCs and routing to deploying and fine-tuning GCE instances, has provided a clear blueprint for replication. Configuring the OS for deterministic performance and containerizing VPP ensures that high performance is maintained consistently. The VPP startup and base configurations, along with the deployment of TestPMD for traffic generation and network receiver, have been carefully crafted to demonstrate the practical application of these technologies.
In conclusion, our exploration underscores the viability of using FD.io VPP on GCP for high-performance networking. By optimizing configurations and understanding the underlying infrastructure, network engineers and architects can harness the full potential of VPP and DPDK. As technology evolves, these insights will become increasingly valuable in pushing the boundaries of what’s achievable in cloud-based networking.