Introducing Walmart’s L3AF Project: Control plane, chaining eBPF programs, and open-source plans.

Karan Dalal
Walmart Global Tech Blog
Aug 24, 2021

This is the third blog in a three-part series introducing the L3AF project, which provides Kernel Function as a Service using eBPF and related technologies. Though this blog can be read independently, we recommend reading the first and second blogs before this one.

The popularity of eBPF is rapidly growing. There are more and more eBPF programs being written to solve a wide variety of problems. Several startups are building technologies around eBPF, and large technology companies like Facebook, Netflix, and even Microsoft are embracing eBPF to solve large-scale problems. At Walmart, we too are embracing eBPF and using it to solve similar problems.

A challenge we faced when first adopting eBPF was how to manage and orchestrate multiple eBPF programs at scale. We need to run numerous eBPF programs on a given node, and we have thousands of nodes across many data centers in a hybrid cloud environment that spans multiple cloud providers. Due to the lack of an enterprise-ready solution, we decided to develop our own control plane. This control plane orchestrates and composes independent eBPF programs across our network infrastructure to solve crucial business problems. Our control plane is a vital component of L3AF.

Enabling Kernel Function as a Service

Our control plane consists of multiple components that work together to orchestrate eBPF programs:

  • L3AF Daemon (L3AFD), which runs on each node where Kernel Functions (KFs) run. L3AFD reads configuration data and manages the execution and monitoring of the KFs running on the node. L3AFD is written in Go.
  • Deployment APIs, which a user calls to generate configuration data. This configuration data includes which KFs will run, their execution order, and the configuration arguments for each KF.
  • A database and KV store that store the configuration data.
  • A datastore that stores the KF byte code.

The control plane is shown graphically here:

L3AF Ecosystem

As seen in the above diagram, eBPF KFs can be developed by the community, supplied by third-party vendors, or provided by L3AF. The L3AF build engine pulls the Kernel Function source code, compiles it against different kernel versions, and pushes the bytecode to an artifact management solution.

When users want to deploy a KF, they call the L3AFD API with the appropriate parameters. This request generates a new config (a KV pair) that is saved in a database and distributed to all the hosts using a config distribution mechanism.

Once L3AFD reads this new config, it orchestrates kernel functions on the Linux host according to the defined parameters. If the user specifies a set of kernel functions, L3AFD orchestrates all of them in the sequence the user requested.
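To make the idea of a config that carries an execution order concrete, here is a minimal sketch in Go. The KFConfig type, its field names, the JSON layout, and the KF names are all assumptions made for illustration. They are not L3AFD's actual schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// KFConfig is a hypothetical shape for one Kernel Function entry in the
// config that L3AFD reads. The field names are assumptions for illustration.
type KFConfig struct {
	Name      string   `json:"name"`
	SeqID     int      `json:"seq_id"`    // position of the KF in the chain
	ProgType  string   `json:"prog_type"` // e.g. "xdp", "tc-ingress", "tc-egress"
	Iface     string   `json:"iface"`
	StartArgs []string `json:"start_args"`
}

// ParseChain decodes a JSON config value and returns the KFs sorted by
// sequence number. It rejects configs that reuse a sequence number.
func ParseChain(raw []byte) ([]KFConfig, error) {
	var kfs []KFConfig
	if err := json.Unmarshal(raw, &kfs); err != nil {
		return nil, fmt.Errorf("decode config: %w", err)
	}
	sort.Slice(kfs, func(i, j int) bool { return kfs[i].SeqID < kfs[j].SeqID })
	for i := 1; i < len(kfs); i++ {
		if kfs[i].SeqID == kfs[i-1].SeqID {
			return nil, fmt.Errorf("duplicate sequence number %d", kfs[i].SeqID)
		}
	}
	return kfs, nil
}

func main() {
	// Made-up KF names; the entries arrive out of order on purpose.
	raw := []byte(`[
	  {"name":"rate-limiter","seq_id":2,"prog_type":"xdp","iface":"eth0","start_args":["--rate=1000"]},
	  {"name":"flow-exporter","seq_id":1,"prog_type":"xdp","iface":"eth0"}
	]`)
	chain, err := ParseChain(raw)
	if err != nil {
		panic(err)
	}
	for _, kf := range chain {
		fmt.Printf("%d: %s on %s (%s)\n", kf.SeqID, kf.Name, kf.Iface, kf.ProgType)
	}
}
```

Sorting by sequence number up front means the rest of the daemon can treat the config as an ordered chain, whatever order the entries were written in.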

Executing eBPF programs in a sequence is called “chaining,” and it is quite complex (at least for the kernel versions we use in production). Let us take a deep dive into how L3AFD makes this possible. Below is a diagram followed by an explanation:

L3AFD Orchestration

At a high level, this diagram shows that L3AFD chains eBPF KFs by leveraging eBPF maps. Chaining is achieved by storing the next program’s file descriptor (fd) in a map created by the previous program. A given eBPF kernel program then calls the next eBPF kernel program by using the bpf_tail_call helper. This repeats until the end of the chain is reached.
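The mechanism can be modeled in plain Go without touching the kernel. The sketch below is a user-space simulation only. Prog, Verdict, and Run are hypothetical names, and the lookup of slot 0 stands in for what bpf_tail_call would do against the program's map in a real eBPF program.

```go
package main

import "fmt"

// Verdict is a simplified stand-in for an XDP/TC return code.
type Verdict int

const (
	Pass Verdict = iota
	Drop
)

// Prog models one eBPF kernel program in user space. Next plays the role of
// the map created by this program: slot 0 holds the next program in the
// chain, and is absent when this program is the last one.
type Prog struct {
	Name   string
	Handle func(pkt []byte) Verdict
	Next   map[uint32]*Prog
}

// Run walks the chain starting at root. Where a real program would call
// bpf_tail_call on its map, this simulation looks up slot 0 and jumps to it.
// It returns the final verdict and the names of the programs that ran.
func Run(root *Prog, pkt []byte) (Verdict, []string) {
	var visited []string
	for p := root; p != nil; p = p.Next[0] {
		visited = append(visited, p.Name)
		if p.Handle(pkt) == Drop {
			return Drop, visited // a drop ends the chain early
		}
	}
	return Pass, visited
}

func main() {
	pass := func([]byte) Verdict { return Pass }
	dropEmpty := func(pkt []byte) Verdict {
		if len(pkt) == 0 {
			return Drop
		}
		return Pass
	}

	kf2 := &Prog{Name: "kf2", Handle: pass}
	kf1 := &Prog{Name: "kf1", Handle: dropEmpty, Next: map[uint32]*Prog{0: kf2}}
	root := &Prog{Name: "root", Handle: pass, Next: map[uint32]*Prog{0: kf1}}

	v, path := Run(root, []byte{0x45})
	fmt.Println(v == Pass, path) // true [root kf1 kf2]

	v, path = Run(root, nil)
	fmt.Println(v == Drop, path) // true [root kf1]
}
```

Only the root is ever handed to Run here, which mirrors the idea that one program is the entry point and every later program is reached through a map lookup.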

In practice, for each chain, only the first eBPF kernel program is attached to the network interface. Subsequent eBPF programs in the chain are essentially called on behalf of this first program. This is why a “root” passthrough program is required. The root program allows chains to be rebuilt without detaching and reattaching an eBPF program at the interface.

Below is the list of steps involved in creating the chain:

  • The config that L3AFD receives includes the network interface and the program type (XDP, TC ingress, or TC egress). It also includes a sequence number that indicates the position of a KF in the chain. Based on the information in the config, L3AFD downloads artifacts from the KF datastore onto the node.
  • If the sequence number is 1 (a KF is the first eBPF program), then L3AFD will perform the following:
1. Start the appropriate type of root program (i.e., XDP or TC) depending on the program type. This root program uses libbpf APIs or TC hooks to attach the root kernel bytecode to the network interface.
2. Start the “user prog1” of the first KF with its start arguments. This “user prog1” loads the bytecode of “kernel func1” and stores the func1 fd in the root map using eBPF APIs.
  • If the sequence number is X (somewhere in the middle of the chain), L3AFD will perform the following:
1. Start the KF and store the next KF’s program fd (X+1) in the progX map.
2. Update the previous KF’s program map (X-1) to reference progX, like an insertion into a linked list, using Cilium’s eBPF library APIs.
  • If the sequence number is Z (last KF in the chain), then L3AFD will perform the following:
1. Start the KF and update the previous KF’s program map (Z-1) to reference progZ.
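The steps above resemble linked-list insertion, and the ordering of the two map updates matters. The following Go sketch models the per-program maps as a single in-memory table. Chain, InsertAfter, Walk, and the fd numbers are hypothetical, and no real eBPF maps are involved.

```go
package main

import "fmt"

// rootFD is a made-up file descriptor for the root passthrough program.
const rootFD = 3

// Chain models the per-program maps: next[fd] is the fd stored in the map
// owned by program fd, that is, the program it would tail-call into.
type Chain struct {
	next map[int]int
}

// NewChain returns a chain that contains only the root program.
func NewChain() *Chain { return &Chain{next: make(map[int]int)} }

// InsertAfter places newFD directly after prevFD in the chain.
func (c *Chain) InsertAfter(prevFD, newFD int) {
	// Step 1: point the new program's map at the program that used to
	// follow prevFD (nothing to do if prevFD was the end of the chain).
	if nextFD, ok := c.next[prevFD]; ok {
		c.next[newFD] = nextFD
	}
	// Step 2: only now point the previous program's map at the new program,
	// so a packet never reaches a program whose map is not yet filled in.
	c.next[prevFD] = newFD
}

// Walk returns the fds in execution order, starting from the root.
func (c *Chain) Walk() []int {
	fds := []int{rootFD}
	for fd, ok := c.next[rootFD]; ok; fd, ok = c.next[fd] {
		fds = append(fds, fd)
	}
	return fds
}

func main() {
	c := NewChain()
	c.InsertAfter(rootFD, 11) // sequence 1: the first KF goes into the root map
	c.InsertAfter(11, 13)     // last KF: linked from the previous KF's map
	c.InsertAfter(11, 12)     // middle KF: inserted between 11 and 13
	fmt.Println(c.Walk())     // [3 11 12 13]
}
```

Because the root entry stays in place throughout, the chain behind it can be rearranged while the program attached to the interface is left alone.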

L3AF adheres to the “build once, deploy everywhere” philosophy: we build the deployment package once for any environment (i.e., multiple kernel versions) and set the configuration at deploy time.

L3AFD has other duties, too. It monitors KF health, gathers configurable KF metrics, and manages KF resource utilization. This health and metrics data is exported in a Prometheus-compatible format, so it can be queried with PromQL.

L3AFD provides an API for configuration, so users may use their existing systems to distribute configuration to L3AFD nodes. For example, users may use etcd, Consul, or a custom in-house solution to distribute configuration data to the L3AF nodes. A small service can then be implemented to convert the configuration data into L3AFD API calls. The L3AF team is interested in various projects that aim to standardize configuration distribution for hybrid cloud environments, and future versions of L3AF may move in that direction.

This is just the beginning for L3AF, and we have many exciting ideas for what comes next.

Future Plans

eBPF is cutting-edge technology. eBPF features are frequently added in new kernel versions. Also, new userspace tools and libraries are emerging, which weren’t available when we started developing L3AF a couple of years ago.

At present, as shown in the diagram above, L3AFD executes separate C userspace programs, which load corresponding eBPF Kernel Functions. However, a pure Go eBPF userspace library has been created (by Cilium). We plan to leverage that library to port all of our C userspace code to Go, which should simplify our userspace code and expand the capabilities of what the userspace eBPF programs can do. We plan to implement the new Go userspace programs as RPC-based plugins for improved process control and communication.

As mentioned, KF chaining is quite complex. However, it can be simplified in Linux kernel versions 5.10 and higher. With these kernel versions, libxdp and its XDP Dispatcher functionality make things much more straightforward (but we will keep the support for chaining with older kernels). TC chaining can also be simplified using TC’s userspace tools.

Another crucial part of our L3AF project is to establish a “Kernel Function Marketplace,” where KF developers and users can share their own signed KFs and download KFs from others. L3AF can then be used to orchestrate and compose selected KFs from the marketplace to meet a variety of business needs. A vital prerequisite for the Kernel Function Marketplace is to open source the L3AF project, which is a top priority for Walmart and our team.

This concludes our three-part blog series on eBPF networking solutions at Walmart. We’ve discussed network visibility, XDP packet processing, and developing our own control plane. Together, these parts form the L3AF model that we’re using to solve difficult problems in innovative ways. Before we end, we’d like to give an overview of the benefits of our L3AF model.

Benefits of L3AF Model

Many commercial solutions can aggregate and route traffic to relevant monitoring and analysis tools. However, these solutions are proprietary, which limits their offerings to the features and functions developed by their own engineers. The idea behind L3AF is to provide an open and extensible platform that offers certain Kernel Functions out of the box and also enables users to add offerings dynamically to our KF ecosystem as per their use-cases and requirements.

Additionally, L3AF can support use-cases where action needs to be taken in the direct path of traffic. A few examples of such use-cases are packet tagging, rate limiting, load-balancing, and traffic direction. Such use-cases are not possible to achieve with traditional agent-based solutions, as most of them still run their programs in the TCP/IP stack. L3AF leverages eBPF, which allows us to run these in the kernel with ultra-low overhead. It also offers capabilities that other commercially available tools will not be able to match unless they go through major design and architecture-level changes. All of this gives L3AF a first-mover advantage in this space.

To summarize, below are the key benefits of our L3AF model:

Technical Benefits are as follows:

  • One-Stop-Shop for all Kernel Functions, thereby avoiding vendor and cloud lock-ins.
  • Distributes Kernel Functions across hosts eliminating appliance and centralized choke points in the network (and acts as close to the source as possible).
  • Supports Kernel Function chaining to achieve desired workflows.
  • Leverages eBPF for the data path, which gives very high performance.
  • Reduces any additional hops that may be otherwise necessary to perform the Kernel Functions managed by L3AFD.

Business Benefits are as follows:

  • Provides a flexible platform to configure, customize, and monitor Kernel Functions according to user requirements.
  • Reduces licensing expense and the overhead of managing Kernel Functions across numerous vendor products and making them work together.
  • Reduces the integration costs of implementing new Kernel Functions across platforms and avoids being limited by the lowest common denominator across platforms.
  • Reduces network hops, avoiding additional public cloud traffic costs and bottlenecks.
  • Offers cutting-edge technology solutions with capabilities that are not yet available in commercial tools.

The L3AF project is also developing solutions in the realm of observability, monitoring, and tracing. We will continue to share our work through blogs as we work towards open-sourcing this project. Thanks for tuning in!

This blog was written with input from Santhosh Fernandes and Brian, who are engineers on the L3AF Project.

Building traffic platforms for the world’s largest retailer. Passionate about systems engineering and reliability.