Bottom-up eBPF
With all the articles on eBPF it’s starting to remind me of Monads: people seem to either understand them fully or not at all, and will happily write articles about them either way. I’m probably still in the latter group even after years using them in Haskell. Ask me to invent a Monad? Forget about it.
The issue I’ve had with a lot of the bpf material out there is that the majority is introductory. The remainder seems to be for higher level frameworks such as BCC but, of course, I haven’t clicked on every link I’ve found. The higher level frameworks are going to take a top-down approach because the whole point is to teach the tool that makes bpf easier to write and/or interact with. They also bring in more dependencies and are arguably heavier weight.
I wanted to write a lightweight, always-on, runs-everywhere program which meant learning bpf from the bottom up.
Note: I’m not a bpf or kernel expert by any means, I just want to share some hard-won knowledge.
This article is based on Linux 4.19
Kernel and bpf program headers
include/linux/bpf.h
This is a header for the kernel implementation of bpf. Not much to look at if you’re trying to write a bpf program, however.
include/uapi/linux/bpf.h
This is the header to include as <linux/bpf.h> in a bpf program.
You should definitely have a look at the bpf/3 (link; the /3 is function arity) system call man page to get familiar with some of the concepts. It disappoints in the eBPF program types section with, as of this writing, only BPF_PROG_TYPE_SOCKET_FILTER is documented. Time to look at some more source code.
bpf enums
The first parameter of the bpf/3 system call is an int cmd. Those are found on enum bpf_cmd (link). So rather than a bpf_map_create() system call, there’s just the single bpf/3 system call with a int cmd of BPF_MAP_CREATE
The next two enums are bpf_map_type (link) and, missing from the bpf man page, the full list of program types in bpf_prog_type (link).
Not all program types are loaded in the same way. How you load a bpf program is typically with a “frontend” loader. BCC and the bpf samples in the kernel are frontends which use libbpf as a common “backend”. The iproute2 package provides the ip command frontend which is used to attach XDP and tc programs to network interfaces. The Cilium docs provide great examples of this. I’m not sure if ip uses libbpf as a backend. Ultimately, all roads lead to the bpf/3 system call.
bpf helper functions
A 1730 line comment documents various bpf helper functions that are available. Not all of them are defined in this file, nor are all the available helper functions documented here. Some of them are only available if you set the license of your bpf program to GPL.
For example, bpf_getsockopt (link), is available to non-GPL bpf programs. Notice .gpl_only = false
bpf_skb_event_output (link), by contrast, is only available to GPL-licensed bpf programs.
The helper functions used, as well as the header files included have important licensing implications for bpf programs you might write. As I understand it the Linux Kernel stayed with GPL 2.0 so the bpf programs licensed “GPL” are implicitly the 2.0 variety, not 2.x or 3.0 or others. I refer you to the GPL FAQ, COPYING, and the Linux-syscall-note for more info. tl;drLegal is a good source for general licensing info as well. None of the above should be interpreted as legal advice and is not a substitute for legal advice.
struct __sk_buff
Notably, you can’t use struct sk_buff (defined in include/linux/skbuff.h) in bpf programs, but there’s a struct __sk_buff (link) alternative. Other structs in headers in the very same include path can be pulled in, such as struct sock defined in include/net/sock.h. Again, I’m not a kernel developer or it would probably be obvious why sk_buff is special in this regard. More context can be found in Alexei Starovoitov’s patch that first introduced struct __sk_buff
Most importantly, __sk_buff can be substituted for sk_buff in kernel function signatures that we’d like to probe, such as
int dev_queue_xmit(struct sk_buff *skb);A bpf kprobe of dev_queue_xmit loaded by libbpf would look like this
SEC("kprobe/dev_queue_xmit")
int any_name(struct __sk_buff *skb); // __sk_buff, not sk_buffSEC is covered next.
tools/testing/selftests/bpf/bpf_helpers.h
SEC (link) is a macro that expands to create an ELF section which bpf loaders such as libbpf parse.
For XDP and tc programs, ip looks for the ELF section called “prog” by default. You can optionally specify a section name when running ip
libbpf defined the “kprobe/” convention you saw above, and as you’ll see below.
Other SECtions include “maps” which is included in struct bpf_map_def variables, “license”, and “version” which is used to compare the running kernel version with the kernel version compiled against to reject mismatches since kprobes don’t guarantee ABI compatibility.
The other notable macros in this file are the PT_REGS_* (link) for portably accessing registers on the architecture-dependent struct pt_regs which can be the context for BPF_PROG_TYPE_KPROBE and BPF_PROG_TYPE_TRACEPOINT.
tools/testing/selftests/bpf/bpf_endian.h
bpf_endian.h is worth a mention because I’ve used it in nearly every bpf program I’ve written to convert CPU architecture endianness to network byte order and back.
User space headers and tools
The previous section was specific to the Kernel and bpf programs themselves. Most bpf programs I’ve seen or imagine writing will have a user space counterpart to modify the bpf program behavior, extract data, etc. All of this is done via maps which we’ll see in the examples section.
tools/lib/bpf/libbpf.h
libbpf provides a level of abstraction over the bpf/3 system call. No abstraction means using the union bpf_attr, enum bpf_cmd, and more found in include/uapi/linux/bpf.h directly. I certainly don’t want to define individual struct bpf_insn; it’s like writing assembly by hand, which some have done.
The user space distinction is important because bpf_map_def (kernel, user space) and bpf_map (kernel, user space) have “duplicate” definitions in kernel headers we’ve already seen. They won’t collide if you use them in their respective kernel/user space but they can be easy to mix up in your head at first.
I mentioned that libbpf defines the “kprobe/” convention. That happens here along with all the other program types.
samples/bpf/bpf_load.h
I’d characterize this as both an abstraction on libbpf and a minimal framework which has some really nice features. Notice the very straightforward int load_bpf_file(char *path) (link). Like all frameworks it comes with some opinions built in. If you agree with those opinions then you have a great time with the framework; if not you can find an alternative, drop down to libbpf, or call bpf/3 directly. I was really happy when I found this because it let me focus on the bpf program I was writing while getting past compilation hurdles.
extern int map_fd[MAX_MAPS] is especially nice because bpf_load.c automatically pins maps defined in the bpf program to the bpf filesystem so they’re easily accessible in a user space program in the order they appear (e.g. the first map definition is in map_fd[0]).
I’m not sure what the relationship between kprobes and the perf system is but trying to load a kprobe bpf program with libbpf by itself didn’t work for me. I found that all the sample bpf programs were using bpf_load.h, the implementation of which ultimately calls sys_perf_event_open (link) and my program loaded and ran as I intended.
tools/bpftool/
The Cilium docs already have great coverage on bpftool so I won’t repeat any of that here.
Example programs
Build prerequisites
A compiled Kernel is required to compile the bpf program. Here’s a Guide to building the Linux Kernel.
On Fedora I installed the following packages (not all of them are build requirements):
git gcc ncurses-devel elfutils-libelf-devel bc wget bzip2 vim
openssl-devel libcap-devel clang llvm graphviz bison flex make
glibc-static glibc-devel.i686 iproute-devel binutils-devel diffutils
findutils net-tools lsof iprouteXDP bpf example
This simple program drops traffic from a particular IP address. Figure out from which IP address you want to drop traffic and turn it into its decimal form. The example drops traffic from 16909060
xdp.c
// This program is free software: you can redistribute it and/or
// modify it under the terms of the GNU General Public License as
// published by the Free Software Foundation, either version 2 of
// the License, or (at your option) any later version.// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// General Public License for more details.// Since this file was originally published in a blog post, a copy
// of the GNU General Public License version 2 was not included but
// can be found at
// https://www.gnu.org/licenses/old-licenses/gpl-2.0.html
#include <linux/bpf.h>
#include <linux/version.h>
#include <linux/ip.h>
#include <linux/if_ether.h>
#include <bpf_helpers.h>
#include <bpf_endian.h>// iproute specifically looks for this ELF section
SEC("prog")
int xdp_drop(struct xdp_md *xdp)
{
void *data_end = (void *)(unsigned long)xdp->data_end;
void *data = (void *)(unsigned long)xdp->data; // include/uapi/linux/if_ether.h
struct ethhdr *eth = data; // needed to pass the bpf verifier
if ((void*)(eth + 1) > data_end)
return XDP_PASS; // we're only interested in IPv4 traffic
if (eth->h_proto != bpf_htons(ETH_P_IP))
return XDP_PASS; // include/uapi/linux/ip.h
struct iphdr *ipv4_hdr = data + sizeof(struct ethhdr); // needed to pass the bpf verifier
if ((void *)(ipv4_hdr + 1) > data_end)
return XDP_PASS; // check if the saddr matches
if (ipv4_hdr->saddr == bpf_htonl(16909060))
return XDP_DROP; return XDP_PASS;
}char _license[] SEC("license") = "GPL";
__u32 _version SEC("version") = LINUX_VERSION_CODE;
Compile with
clang -O2 -Wall -target bpf -c -o xdp.o \
-I <path to kernel source>/tools/testing/selftests/bpf \
-I <path to kernel source>/tools/include/uapi \
xdp.cLoad the program into a network interface (in this case eth0) with
ip link set dev eth0 xdp obj xdp.oYou can set ip -force link … to reload the program. Unload it with
ip link set dev eth0 xdp offkprobe bpf example
This program times how long it takes for TCP connections to establish by probing tcp_connect (link) and tcp_finish_connect (link)
The average of the deltas between CPU times is calculated in the user space program. I did this for program clarity, not because it’s optimal.
The internal struct sock *sk doesn’t change between the two function calls so we can take its address and use it as the key to a LRU hash. The value is the struct timings defined in kp.h
kp.h
// This program is free software: you can redistribute it and/or
// modify it under the terms of the GNU General Public License as
// published by the Free Software Foundation, either version 2 of
// the License, or (at your option) any later version.// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// General Public License for more details.// Since this file was originally published in a blog post, a copy
// of the GNU General Public License version 2 was not included but
// can be found at
// https://www.gnu.org/licenses/old-licenses/gpl-2.0.html
#ifndef KP_EXAMPLE_H_
#define KP_EXAMPLE_H_struct timings {
__u64 t0;
__u64 t1;
};#endif // KP_EXAMPLE_H_
kp_kern.c
This is the bpf program. Compiling it is probably the biggest challenge. With the right dependencies installed and the Kernel built I was able to run
clang -v -c -O2 -emit-llvm -o kp_kern.ir -D__KERNEL__ \
-Wall \
-Wno-gnu-variable-sized-type-not-at-end \
-Wno-address-of-packed-member \
-Wno-tautological-compare \
-Wno-unknown-warning-option \
-include <path to kernel source>/include/linux/kconfig.h \
-I <path to kernel source>/arch/x86/include \
-I <path to kernel source>/arch/x86/include/generated \
-I <path to kernel source>/arch/x86/include/uapi \
-I <path to kernel source>/include \
-I <path to kernel source>/include/generated \
-I <path to kernel source>/include/uapi \
-I <path to kernel source>/tools/testing/selftests/bpf \
kp_kern.cllc -march=bpf -filetype=obj -o kp_kern.o kp_kern.ir
The generated folders and include/linux/kconfig.h were both produced during Kernel compilation.
Notice the two-step compilation, first with clang then with llc. Passing -target bpf to clang fails when pulling in headers that transitively include headers with inline assembly. In this case it’s include/linux/bpf.h that transitively uses arch/x86/include/asm/msr.h
In file included from kp_kern.c:15:
In file included from include/linux/bpf.h:12:
In file included from include/linux/workqueue.h:9:
In file included from include/linux/timer.h:6:
In file included from include/linux/ktime.h:24:
In file included from include/linux/time.h:6:
In file included from include/linux/seqlock.h:36:
In file included from include/linux/spinlock.h:51:
In file included from include/linux/preempt.h:81:
In file included from arch/x86/include/asm/preempt.h:7:
In file included from include/linux/thread_info.h:38:
In file included from arch/x86/include/asm/thread_info.h:53:
In file included from arch/x86/include/asm/cpufeature.h:5:
In file included from arch/x86/include/asm/processor.h:21:
In file included from arch/x86/include/asm/msr.h:67:
/arch/x86/include/asm/atomic.h:167:13: error: invalid output
constraint '+q' in asm
return i + xadd(&v->counter, i);
^
/arch/x86/include/asm/cmpxchg.h:234:25: note: expanded from macro
'xadd'#define xadd(ptr, inc) __xadd((ptr), (inc), LOCK_PREFIX)
^/arch/x86/include/asm/cmpxchg.h:233:32: note: expanded from macro
'__xadd'#define __xadd(ptr, inc, lock) __xchg_op((ptr), (inc), xadd, lock)
^
/arch/x86/include/asm/cmpxchg.h:48:13: note: expanded from macro
'__xchg_op'
: "+q" (__ret), "+m" (*(ptr))
^
That’s a pretty gnarly error that wallpapered my terminal and took me 2 long days to figure out — in part because I had first worked with the XDP program above. The additional include paths introduced the error but I had no reason to think -target bpf was the culprit. This issue also appears on clang versions lower than 9.0 even without -target bpf
Ok, on to the source:
// This program is free software: you can redistribute it and/or
// modify it under the terms of the GNU General Public License as
// published by the Free Software Foundation, either version 2 of
// the License, or (at your option) any later version.// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// General Public License for more details.// Since this file was originally published in a blog post, a copy
// of the GNU General Public License version 2 was not included but
// can be found at
// https://www.gnu.org/licenses/old-licenses/gpl-2.0.html
#include <linux/bpf.h>
#include <linux/version.h>
#include <net/sock.h>
#include <bpf_helpers.h>#include "kp.h"struct bpf_map_def SEC("maps") sock_est = {
.type = BPF_MAP_TYPE_LRU_HASH,
.key_size = sizeof(__u64),
.value_size = sizeof(struct timings),
.max_entries = 1024,
};SEC("kprobe/tcp_connect")
int connect(struct sock *sk)
{
int err; struct timings ts = {
.t0 = bpf_ktime_get_ns(),
.t1 = 0
}; // sk pointer
__u64 skp;
if ((err = bpf_probe_read(&skp, sizeof(__u64), &sk))) {
char log[] = "bpf_probe_read %d\n";
bpf_trace_printk(log, sizeof(log), err);
return 1;
} // note: map access is via the pointer to sock_est
if ((err = bpf_map_update_elem(&sock_est, &skp, &ts, BPF_ANY))) {
char log[] = "bpf_map_update_elem %d\n";
bpf_trace_printk(log, sizeof(log), err);
return 1;
} return 0;
}SEC("kprobe/tcp_finish_connect")
int finish_connect(struct sock *sk)
{
int err; __u64 skp;
if ((err = bpf_probe_read(&skp, sizeof(__u64), &sk))) {
char log[] = "bpf_probe_read %d\n";
bpf_trace_printk(log, sizeof(log), err);
return 1;
} struct timings* t = bpf_map_lookup_elem(&sock_est, &skp);
if (t && t->t0) {
t->t1 = bpf_ktime_get_ns();
bpf_map_update_elem(&sock_est, &skp, t, BPF_ANY);
}
return 0;
}char _license[] SEC("license") = "GPL";
__u32 _version SEC("version") = LINUX_VERSION_CODE;
kp_user.c
This is the user space program that will load the bpf program and interact with it via the map defined in kp_kern.c
When compiled I did it in a Docker container. The program and dependencies would be moved so I put libbpf in the same folder as kp_user.c and set rpath to ./. Adjust all of this as needed
clang -O2 -Wall -Wno-unused-variable -o kp_user \
-I <path to kernel source>/tools/lib \
-I <path to kernel source>/tools/perf \
-I <path to kernel source>/tools/include \
-I <path to kernel source>/samples/bpf \
-Wl,--library-path=./,-rpath=./,--library=bpf,--library=elf \
kp_user.c <path to kernel source>/samples/bpf/bpf_load.cThe map’s file descriptor is in maps[0] which was automatically populated as part of load_bpf_file
// This program is free software: you can redistribute it and/or
// modify it under the terms of the GNU General Public License as
// published by the Free Software Foundation, either version 2 of
// the License, or (at your option) any later version.// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// General Public License for more details.// Since this file was originally published in a blog post, a copy
// of the GNU General Public License version 2 was not included but
// can be found at
// https://www.gnu.org/licenses/old-licenses/gpl-2.0.html
#include <stdlib.h>
#include <unistd.h>
#include <sys/resource.h>#include <bpf/libbpf.h>
#include <bpf_load.h>#include "kp.h"int main(int argc, char const *argv[]) { // raise the rlimit or see
// failed to create a map: 1 Operation not permitted
// when load_bpf_file is run
int ret;
struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
if ((ret = setrlimit(RLIMIT_MEMLOCK, &r))) {
printf("setrlimit %d\n", ret);
return ret;
} if (load_bpf_file("kp_kern.o")) {
printf("%s\n", bpf_log_buf);
return 1;
} while (1) {
__u64 next_key = 0;
__u64 key = 0; unsigned long long sum = 0;
int total = 0;
int err = 0;
while (bpf_map_get_next_key(map_fd[0], &key, &next_key) == 0) {
struct timings t = {}; if ((err = bpf_map_lookup_elem(map_fd[0], &next_key, &t))) {
printf("bpf_map_lookup_elem failed %d\n", err);
break;
} if (t.t1) {
sum += t.t1 - t.t0;
++total;
} key = next_key;
} if (sum) {
printf("avg: %llu ns\n", sum/total);
} sleep(2);
} return 0;
}
