BPF and Go: Modern forms of introspection in Linux

Marko Kevac
Dec 15, 2020 · 25 min read

Everyone has their favourite book about magic. For one person it is Tolkien, for another Pratchett, for a third, like me, it is Max Frei. Today I am going to tell you about my favourite IT magic: BPF and the modern infrastructure surrounding it.

BPF is currently at the height of its popularity. The technology is striding ahead by leaps and bounds, reaching into unexpected places, and becoming ever more accessible to the ordinary user. Virtually every popular conference nowadays has a talk on this subject, and back in August I was invited to give a talk at GopherCon Russia.

I had such a good experience there that I wanted to share about it with as many as possible. This article will give you the context for why we need something like BPF, help you understand when and how to use it and how it can help you as an engineer and improve the projects you are working on. We will also look at its specifics in relation to Go.

What I really want is for your eyes to start glowing after reading this article like those of a small child after reading Harry Potter for the first time, and for you to go and try out this new “toy” for yourself.

· A little bit of context
· What is eBPF?
· Where is BPF used?
· API: How to use it
Basic usage
bcc
· Examples of the use of BPF
And what about Go?
· Analysing programs in Go
Sending arguments
Unique thread identifier
Extending the stack
Examples
· Conclusion

A little bit of context

Okay, what is this magic you are going to be told about by a 34-year-old bearded guy with a burning look in his eyes?

We are living in the year 2020. Open Twitter and you can read tweets by angry techies all saying that the quality of the software being written today is so awful that it all needs throwing away and that we need to start all over again. Some people are even threatening to leave the profession altogether because they just can’t stand how everything breaks and is inconvenient and slow.

They may be right: there is no way of defining why without consulting a thousand comments. But one thing I definitely agree on is that a modern software stack is more complex than ever: we’ve got BIOS, EFI, the operating system, drivers, modules, libraries, network interaction, databases, caches, orchestrators (like K8s), Docker containers, and, finally, our own software with runtimes and garbage collection.

A true professional could spend days answering the question as to what happens after you type google.com into your browser.

To understand what is going on in your system is very complicated, especially if, at the present time, things are going wrong and you are losing money. This problem is what led to the emergence of businesses offering to help you work out what is going on inside your system. At large companies, there are entire departments of Sherlock Holmes-type detectives who know just where to tap the proverbial hammer, and what proverbial bolt to tighten in order to save millions of dollars.

I like to ask people how they would debug unexpected problems in the shortest amount of time. Most often, the first approach that comes to mind is to analyse logs. But the problem is that the only logs accessible are the ones which the developer put in their system, and that is not flexible.

The second most popular approach is to study metrics. The three most popular systems for working with metrics are written in Go. Metrics are very helpful, however, while they do allow you to see the symptoms, they do not always help you to define underlying causes.

The third is what is called ‘observability’: where you can ask as many complicated questions as you like about the behaviour of the system, and to obtain answers to those questions. Since the questions can be very complex, the answers may require the widest variety of information and, until the question has been asked, we don’t know what that information is. And this means that observability absolutely demands flexibility.

What about providing an opportunity to change the level of logging ‘on the fly’? What about, with a debugger, connecting up to a program, while it is operating, and doing something, without interrupting the program as it is working? What about understanding what queries are sent to the system, visualising the sources of slow queries, seeing via pprof what is using up memory, and obtaining a graph of its change over time? What about measuring the latency of one function and the latency’s dependency on arguments? I would group all these approaches under the umbrella term observability. This is a set of utilities, approaches, knowledge and experience which, together, give us the chance, if not to do whatever we want, then at least to do a lot ‘live’, right there in a system, while it is working. It’s the modern-day IT equivalent of a Swiss Army knife.

But how can we achieve this? There have been, and still are, lots of tools available on the market: simple, complex, dangerous and slow ones. But today’s article is about BPF.

Linux kernel is an event-driven system. Practically everything that goes on in the kernel, and in the system overall, can be thought of as a set of events. An interrupt is an event; receiving a packet over the network is an event; transferring control of a processor to another process is an event; running a function is an event.

Right, so BPF is a subsystem of the Linux kernel which gives you the opportunity to write small programs which will be run by the kernel in response to events. These programs can both cast light on what is going on in your system and can control it.

Now let’s get down to the nitty-gritty.

What is eBPF?

The first version of BPF came out in 1994. Some of you will probably have come across it when writing simple rules for the tcpdump utility intended for viewing or ‘sniffing’ network packets. You could set filters for tcpdump, so you didn’t have to view everything — just the packets you were interested in. For example, “only tcp protocol and only port 80”. For each passing packet a function would run to determine whether or not you needed to save the particular packet in question. There can be very many packets, and so our function has to be fast. Indeed, our tcpdump filters were transformed into BPF functions. Here’s an example.

The original BPF represented a very simple virtual machine with several registers. But, nevertheless, BPF significantly sped up the filtering of network packets. In its time this was a major step forward.

In 2014 Alexei Starovoitov, a very prominent kernel hacker, extended the functionality of BPF. He increased the number of registers and the permissible size of the program, added JIT compilation and created a checker which checked the program was secure. However, the most impressive thing was that the new BPF programs were able to run not only when processing packets, but also in response to other kernel events, and passed information back and forth between the kernel and the user space.

These changes opened up opportunities for new ways of using BPF. Some things which used to be implemented by writing complicated and dangerous kernel modules, could now be done relatively simply via BPF. Why was this so great? Because any error when writing a module often led to panic, not to ‘fluffy’ Go-style panic, but to kernel panic, after which the only thing to do is reboot.

The ordinary Linux user suddenly had a new superpower: being able to look ‘under the bonnet’ — something previously only available to hardcore kernel developers, or not available to anyone at all. This option can be compared with the ability to write a program for iOS or Android with very little effort: on old phones, it was either impossible or significantly more complicated.

The new version of BPF from Alexei Starovoitov was called eBPF (e for extended). But now it has replaced all the old usages of BPF and has become so popular that for simplicity’s sake it is just called BPF.

Where is BPF used?

Okay, what are the events or triggers to which you can attach BPF programs, and how did people start to use their newly-acquired power?

At present, there are two major groups of triggers.

The first group is used for processing network packets and for managing network traffic. These are XDP, traffic control events and several others.

These events are required for:

  • Creating simple but very effective firewalls. Companies such as Cloudflare and Facebook use BPF programs to filter out a huge volume of parasitic traffic and to combat the largest-scale DDoS attacks. Since processing takes place at the earliest stage of a packet’s life, directly in the kernel (a BPF program is sometimes even pushed straight to the network card for processing), colossal volumes of traffic can be processed in this way. These things used to be done on specialised network hardware.
  • Creating smarter, targeted, but nevertheless very performant firewalls — ones that can check passing traffic for compliance with company rules, for vulnerability patterns and so on. Facebook, for example, carries out this audit out in-house, while some projects sell products of this kind externally.
  • Creating smart balancers. The most outstanding example is the Cilium project, which is most often used as a mesh network in a K8s cluster. Cilium manages traffic, balancing, redirecting and analysing it. And all this is done with the help of small BPF programs run by the kernel in response to this or that event related to network packets or sockets.

This was the first group of triggers connected with network issues and capable of influencing behaviour. The second group is related to more general observability; most often programs in this group are incapable of influencing anything but rather are only able to ‘observe’. This is what interests me far more.

In this group, there are triggers such as:

  • perf events — events connected with performance and with the perf Linux profiler: hardware processor counters, processing of interrupts, interception of minor/major memory exceptions, and the like. For example, we can set up a handler which will run every time the kernel needs to read a memory page from swap. Imagine, for example, a utility which displays the programs currently using swap.
  • tracepoints — static (defined by the developer) locations in the kernel’s source code, from which you can extract static information (information prepared earlier by the developer) by attaching to them. It might seem that in this case being static is a bad thing, since I said that one of the drawbacks of logs is that they only contain what the programmer originally put there. In one sense that is right, but tracepoints have three important advantages:
    - There are quite a lot of them strewn around the kernel in the most interesting locations
    - When they are not ‘on’, they use no resources
    - They are part of API, they are stable and they do not change.This is very important, since other triggers we will mention lack stable API.
    For example, imagine a utility displaying programs, which the kernel for one reason or another does not give time to execute. You sit and wondering why it is being so slow, while pprof has nothing interesting to show.
  • USDT — is the same thing as tracepoints, but for user space programs. That is to say, as a programmer, you can add these locations to your program. And lots of large-scale and well-known programs and programming languages have already adopted these traces: MySQL, for example, or the languages PHP and Python. Often their default setting is ‘off’ and to switch them on you need to rebuild the interpreter with the — enable-dtrace parameter or similar. Yes, we also have the option of registering these kinds of trace in Go. You might have recognised the word DTrace in the name of the parameter. The point is that these kinds of static traces were popularised by the system bearing the same name, which was born in the Solaris OS. As an example, imagine being able to tell when a new thread is created, when a GC, or something else related to a specific language or system, is launched.

This is where another level of magic begins:

  • Ftrace triggers give us the option of running a BPF program at the start of practically any function of the kernel. Entirely dynamically. This means that the kernel will call your BPF function before any kernel function you select begins to execute, or all the kernel functions — whichever. You can get attached to all the kernel functions and obtain, at output, an attractive visualisation of all the calls.
  • kprobes/uprobes give you almost the same thing as ftrace, but you have the option of attaching to any location when executing a function in both the kernel and the user space. If, in the middle of the function, there is an ‘if’ on a variable, and you need to build a histogram of values for this variable, that is not a problem.
  • kretprobes/uretprobes — here everything is analogous to the previous triggers but can be triggered when the kernel function or function in user space returns. These kinds of triggers are convenient for viewing what the function returns, and for measuring the time it takes to execute. For example, you can find out which PID was returned by the ‘fork’ system call.

The most wonderful thing about all this, I repeat, is that having been called in response to any of these triggers our BPF program can have a good ´look around´: reading the function’s arguments, recording the time, reading variables, reading global variables, taking a stack trace, saving something for later, sending data to the user space for processing, and/or obtaining data or some other control commands from the user space for filtering. Marvellous!

I don’t know about you, but for me, this new infrastructure is like a toy I have wanted to get my hands on for a long time.

API: How to use it

Okay, Marko, you have convinced us to take a look at BPF. Now how can we take a closer look?

Let’s see what the BPF program comprises, and how to interact with it.

Firstly, we have a BPF program which, if it passes verification, will be loaded into the kernel. There it will be compiled by JIT compiler into machine code and run in kernel mode, when the trigger that is attached, will activate.

The BPF program has the option of interacting with the second part i.e. with the user space program. There are two ways this can happen. We can write to the circular buffer, and the user space part can read from it. We can also write and read to the key-value map, which is called the BPF map, and the user space part, correspondingly, can do the same thing, and, correspondingly, they can pass information to each other.

Basic usage

The simplest way of working with BPF, which you should never under any circumstances start with, consists of writing BPF programs with the C language, and compiling the code in question, with the Clang compiler, into code for the virtual machine. We then load this code, using the BPF system call directly, and interact with our BPF program, also using the BPF system call.

The first simplification available is to use the libbpf library. This is supplied along with the kernel’s source code and allows you to work directly with the BPF system call. Basically, it provides convenient wrappers for loading code, and for working with BPF maps for sending data from the kernel to the user space and back.

bcc

Clearly, this is far from convenient for people. Fortunately, under the brand name of iovizor, the BCC project has appeared, and this is making our life easier.

Basically, it prepares the whole build environment and allows us to write single BPF programs, where the С-part will be built and loaded into the kernel automatically, while the user space part can be made in Python, which is simple and clear.

bpftrace

However, BCC seems complicated for lots of things. For some reason, people particularly do not like writing parts in С.

Those same people from iovizor have provided a tool, bpftrace, which allows you to write BPF scripts in simple script language similar to AWK (or even one-liners).

Brendan Gregg, a well-known specialist in the field of productivity and observability, has produced the following visualisation of the available ways of working with BPF:

The vertical axis shows the ease with which a given tool can be used, while the horizontal axis shows its capability. You can see that BCC is a very powerful tool but it is not super simple. bpftrace is much simpler, but at the same time, being a bit less powerful.

Examples of the use of BPF

However, let’s take a look at some specific examples of the magical powers which have become available to us.

BCC and bpftrace both contain a “tools” directory containing a huge number of interesting and useful, ready-to-use scripts. They can also be used as local Stack Overflow, from which you can copy chunks of code for your own scripts.

Here, for example, is the script which shows latency for DNS queries:

A utility in real time shows the completion time for DNS queries, so, for example, you can catch some unexpected outliers.

And this is a script which ‘spies’ on what others are typing into their terminals:

This kind of script can be used to catch a ‘bad neighbour’, or to perform a security audit of the company’s servers.

A script for viewing the flow of calls in high-level languages:

Here is an example showing the program’s call stack in Python.

The same Brendan Gregg has produced an image which brings together all the relevant scripts with arrows pointing to the subsystems which each utility allows you to observe. As you can see, we already have a huge number of ready-to-use utilities at our disposal — for practically any eventuality.

And what about Go?

Now let’s talk about Go. We have two basic questions:

  • Can you write BPF programs in Go?
  • Can you analyse programs written in Go?

Let’s do things in order.

Currently, the only compiler able to compile to a format understood by a BPF machine is Clang. The other popular compiler, GСС, still doesn’t have BPF backend. And the only programming language which can be compiled to BPF is a very limited version of C.

However, the BPF program also has a second part which is in the user space. And that can be written in Go.

As I already mentioned above, BCC allows you to write this part in Python, which is the primary language for the tool. At the same time, in the main repository BCC also supports Lua and C++, and, in the side repository, it also supports Go.

This program looks exactly the same as a program in Python. At the start, it has a string in which the BPF program is in C, and then we communicate where to attach a given program, and we interact with it in some way, for example, extracting data from the BPF map.

And that’s basically it. Check out the example in more detail on Github.

The main drawback is probably the fact that we are using a C library, libbcc or libbpf and building a Go program with C library is far from an easy ‘walk in the park’.

Besides iovisor/gobpf, I also found three other up-to-date projects which allow you to write the userland part in Go.

The version from Dropbox does not require any C libraries, but you will need to build the kernel part of the BPF yourself using Clang, and then load it into the kernel using Go program.

The version from Cilium has the same specifics as to the Dropbox version. But it is worth mentioning, not least because it is made by the people from the Cilium project, and that means that it can only succeed.

The third project I listed for the sake of completeness. As in the case of the two previous projects, it doesn’t have external C dependencies, requires the BPF program to be built manually in C but, it would seem, is not particularly promising looking ahead.

In fact, there is one more question we should be asking: why write BPF programs in Go at all? Because if you look at BCC or bpftrace, then BPF programs occupy less than 500 lines of code. But wouldn’t it be simpler to just write a little script in the bpftrace language, or use a little bit of Python? I can see two reasons not to.

The first is this. You do really like Go and would prefer to do everything in Go. Besides, it is potentially simpler to move the Go programs from machine to machine: static linking, simple binary, and all that. But things are far from being so straightforward since we are tied to a specific kernel. I’ll stop here, otherwise, my article is going to be another 50 pages longer.

The second reason is this. You are not writing a simple script, but a large-scale system, which also uses BPF internally. I even have an example of such a system in Go:

The Scope project looks like a binary, which, when run in the infrastructure of K8s or another cloud, analyses everything that happens roundabout, and shows what containers and services there are, how they interact and so on. And lots of this is done using BPF. An interesting project.

Analysing programs in Go

If you remember, we had another question: can we analyse programs written in Go using BPF? Our first thought was, “Yes, of course, we can!” What difference does it make what language a program is written in? After all, it is just compiled code that, like other programs, calculates something in the processor, uses up memory like crazy and interacts with the hardware through the kernel, and with the kernel through system calls. In principle this is correct, but there are specifics — and these are of varying degrees of complication.

Sending arguments

One of the specifics is the fact that Go does not use the ABI which most other languages do. The way it worked out was that the ‘founding fathers’ decided to take ABI from the Plan 9 system, a system they were very familiar with.

ABI is, like API, an interface convention — only at the level of bits, bytes, and machine code.

The main element of ABI we are interested in is how its arguments are sent to the function, and how the response comes back from the function. If in a standard ABI x86–64 the processor’s registers are used for sending arguments and a response, in Plan 9 ABI the stack is used for this purpose.

Rob Pike and his team did not set out to make yet another standard; they already had an almost ready-to-use compiler for C for the Plan 9 system — as simple as 2 x 2 — which, within very short lead time, they remade into a compiler for Go. An engineer’s approach in action.

However, in actual fact this is not such a critical issue. For one thing, we may soon see arguments being sent via registers in Go, and, secondly, obtaining stack arguments from BPF is not complicated: the sargX alias has already been added to bpftrace, and another one will most likely appear in BCC in the very near future.

Update: since I gave the talk, a detailed official proposal has even come out on moving over to using registers in ABI.

Unique thread identifier

The second specific is related to a beloved feature of Go, namely goroutines. One of the ways of measuring the latency of functions is to save the time the function was called, get the function exit time, and calculate the difference. We need to save the start time along with a key which will contain the name of the function and the TID (thread ID). The thread ID is needed, since same function can be called by different programs, or different threads of a single program, at the same time.

But, in Go, goroutines move between system threads: one minute a goroutine is being executed on one tread, the next minute, on another. And, in the case of Go, it would be good for us, instead of putting TID into the key, to put GID i.e. the ID for the goroutine — but, unfortunately, we cannot obtain it. From a purely technical point of view this ID does exist. You could even extract it using dirty hacking since it is to be found somewhere in the stack, but to do so is strictly forbidden by the recommendations of the key group of developers. They consider this to be information we will never need. The same is true for goroutine local storage — but this is going off on a tangent.

Extending the stack

The third problem is the most serious one. It is so serious that, even if we somehow solve the second problem, that is not going to help us measure the latency of Go functions.

Most readers probably have a good understanding of what a stack is. This is the same stack where, unlike heap, you can allocate memory for variables, and not have to think about releasing them.

But for C, in that case, the stack has a fixed size. If we go beyond that fixed size, the well-known phenomenon of stack overflow will occur.

In Go, the stack is dynamic. In old versions, it was implemented by linked list of memory chunks. Nowadays, it is a continuous chunk of dynamic size. This means that, if the allocated chunk is insufficient for us, we extend the current one. And if we can’t extend it, we allocate a larger one and move all the data from the old location to the new one. This is thoroughly fascinating and touches on issues of security guarantee, cgo and garbage collection, but it is a subject for a separate article.

It is important to know that in order for Go to be able to move a stack, it has to go through the call stack, and through all the pointers in the stack.

And that is where the basic problem lies: uretprobes, used for attaching BPF probe to the function return, dynamically change the stack in order to integrate a call to its handler — what is called a ‘trampoline’. And, in most cases, this changes the stack, which Go does not expect, and it leads to the program crashing. Oops!

By the way, this story is not unique to Go. The C++ stack unwinder also crashes every other time when handling exceptions.

There is no solution to this problem. As usual in these cases, the respective sides hurl entirely well-founded arguments at one another in accusation.

But if you really need to set a uretprobe, there is a way to get around the problem. How? Don’t set a uretprobe. You can set a uprobe in all the locations where we exit the function. There might be one such location — or fifty.

And this is where Go’s uniqueness plays into our hands.

Ordinarily, this sort of trickery would not work. A sufficiently smart compiler knows how to perform what is called tail call optimisation, when, instead of returning from the function and returning down the call stack, we simply jump to the start of the next function. This kind of optimisation is critically important for functional languages such as Haskell. Without it you couldn’t move an inch without stack overflow. However, with this optimisation it simply won’t be possible to find all the locations where we return from the function.

The specific is that the compiler, Go version 1.14, is not yet able to perform tail call optimisation. This means that the trick of attaching to all the explicit exits from the function works, even if it is very cumberstone.

Examples

Don’t think that BPF is useless for Go. Far from it. We can do all the other things which do not touch on the issues mentioned above. And we will.

Let’s look at some examples.

First, let’s take a simple program. Basically, it is a web server which listens on 8080 port and has a handler for HTTP queries. The handler obtains a name parameter and a year parameter from the URL, performs a check, and then sends all three variables (name, year and check status) to the prepareAnswer() function, which then prepares an answer in the form of a string.

Site check is an HTTP query, checking whether the conference site works, with the help of a channel and goroutines. The prepare function simply transforms all that into a readable string.

We will trigger our program by means of a simple query via curl:

For our first example, using bpftrace we are going to print all our program’s function calls. In this case, we will be attached to all the functions coming under ‘main’. In Go, all your functions have a symbol which takes the following form: name of packet-dot-name of function. Our packet is ‘main’ and function’s runtime would be ‘runtime’.

When I use curl, the handler, site check function and goroutine subfunction are executed, followed by the prepare answer function. Excellent!

Next, I not only want to export those functions which are executing but also their arguments. Let’s take the function prepareAnswer(). It has three arguments. Let’s try to print two ints.

Let’s take bpftrace, only this time not a one-liner, but a script. Let’s get attached to our function, and let’s use aliases for the stack arguments, as I said.

In the output, we see that we sent 2020, obtained the status 200, and once sent 2021.

But the function has three arguments. The first of these is a string. What about that one?

Let’s simply export all stack arguments from 0 to 3. And what do we see? A large number, a slightly smaller number, and our old figures 2021 and 200. What are these strange numbers at the start?

This is where it is helpful to be familiar with Go internals. If in C a string is simply an array of bytes ending in zero, in Go a string is an actual structure, consisting of a pointer to an array of bytes (incidentally, this doesn’t end in a zero) and the length.

But the Go compiler, when it sends a string in the form of an argument, unwinds this structure, and sends it as two arguments. And, so, the first strange number is indeed a pointer to our array, and the second is the length.

And sure enough: the expected string length is 22.

Correspondingly, we fix our script a bit in order to obtain the two values in question via the stack pointer register, and the correct offset, and, with the help of the integrated str() function, we export it as a string. It all works:

Let’s also take a look at the runtime. For example, I would like to know what goroutines our program launches. I know that goroutines are launched by the functions newproc() and newproc1(). Let’s get attached to them. The pointer to the funcval structure is the first argument of the newproc1() function. This has only one field, the pointer to the function:

In this case, we will use the capability for defining structures directly in the script. This is a bit simpler than playing around with offsets. We have exported all the goroutines which launch when our handler is called. And if, after that, we want to obtain the names of the symbols for our offsets, then we will be able to see our checkSite function among them. Hurray!

These examples are just a drop in the ocean in terms of the capabilities of BPF, BCC and bpftrace. With sufficient knowledge of the inner workings and experience, you can obtain almost any information from a working program without stopping it or changing it.

Conclusion

That is all I wanted to tell you and I hope it has inspired you.

BPF is one of the trendiest and most promising areas in Linux. And I am sure that in the coming years we will see lots more interesting things — not only in terms of the technology itself but also in terms of the tools and of its spread.

It isn’t too late and not everyone knows about BPF, so get learning, become magicians, solve problems, and help your colleagues. They say that magic tricks only work once.

When it comes to Go, as usual, we ended up being quite unique. We are always having some quirks, whether it’s a different compiler, or ABI, needing GOPATH, having a name you can’t Google. But I think it’s true to say that we have become a force to be reckoned with and, as I see it, things can only get better.

Bumble Tech

This is the Bumble tech team blog focused on technology and…