eBPF and its capabilities

Margulan
Published in Exness Tech Blog
11 min read · Dec 16, 2022

Discover modern GNU/Linux kernel capabilities useful for monitoring, observability, security, performance engineering, and profiling using eBPF. You will find a few use cases and information related to the internals of BPF.

1. Context and motivation

IT technologies evolve daily, becoming ever more complex. The challenges we faced a decade ago remain today, but are now burdened by layers of abstraction: what was relatively doable yesterday comes with implementation difficulties today. Given this, we at Exness are also concerned about modern technical and security-related challenges. In this article, we provide a brief explanation of eBPF and Linux kernel internals, along with insights from our R&D activities.

Since GNU/Linux is widely adopted, we believe our experience is relatable and universally applicable, from enterprises down to SOHO environments.

1.1 What is eBPF

eBPF stands for ‘extended Berkeley Packet Filter’, i.e. an extended version of classic BPF. What is BPF then? BPF was originally written in 1992 to analyze network traffic. It remains in active use and provides a raw interface to process network packets across network devices and kernels. BPF is a virtual machine inside the Linux kernel, which runs BPF code in kernel mode. If you’ve ever used tcpdump, you’ve already used BPF as well.

1.2 How it all started

A long time ago there was a company named Sun Microsystems, which had an extremely advanced operating system for its time, called Solaris (SunOS). It brought lots of technologies to the table, such as zones, ZFS, per-file auditing, secure sandboxes, and DTrace.

DTrace(1) was a versatile and flexible technology that provided system observability, performance profiling, and security monitoring. Unfortunately, after Sun Microsystems’ acquisition by Oracle, the license of the source code was changed, so no one could use Sun’s technologies without consent from Oracle Corp.

After the loss of Solaris, the community tried to figure out how to rebuild the tooling, which was very popular among enterprises and high-load systems. There have been several attempts(2) which, frankly, did not succeed.

eBPF was first included in the mainline kernel in Linux 3.18, which can be considered the birth of eBPF.

2. eBPF and its capabilities

eBPF is a bleeding-edge technology embedded into the Linux kernel that can run sandboxed programs in kernel space (i.e. ring 0). It is used to enhance and extend kernel capabilities in a safe and secure manner, without loading additional kernel modules and without any kernel recompilation.

If there is any place inside a computer to implement security, networking, monitoring, and profiling functionality — it would be an OS kernel. On the other hand, kernels are conservative, since they are a mission-critical part of any working system. This slows down the development of new features inside of any kernel, not to mention the security implications and risks that newer features could potentially bring to the stage.

eBPF changes the game by bringing the ability to run sandboxed programs within the kernel. Developers can now easily extend kernel capabilities without writing kernel drivers and modules. The eBPF subsystem guarantees safety and stability by using a bytecode verification engine and a JIT (Just-in-Time) compiler.

Figure 1: the common architecture of eBPF. (eBPF.io, 2022)

eBPF brought with it a vast amount of software, including software-defined networking (SDNs), observability projects, and security-based software. It also covered lots of domains and use cases: providing high-performance packet processing, load-balancing, hooking to critical system calls, debugging running software, and more. Basically, eBPF can be referred to as a Linux superpower that allows you to unleash your endless creativity to solve tasks of any level of complexity.

2.1 General usages

Use cases can be summarized in a few groups:

  • Security
  • Software profiling & tracing
  • Networking (XDP)
  • Monitoring

Let’s cover some of them in more detail.

2.2 Security

eBPF allows the user to intercept any system call (syscall), process every network packet, and gather socket-level information on all network operations. This enables a very high-performance and revolutionary way of building security and monitoring systems. Yet another selling point is having information on syscalls, network events, hardware IRQs, etc. available to the user in one place, so there is no need to bring in a zoo of completely different technologies to cover each security domain.
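To make in-kernel syscall filtering concrete, here is a minimal, hypothetical sketch using seccomp. Note that seccomp predates eBPF and uses classic BPF, but the principle of filtering syscalls inside the kernel is the same. The filter below denies getpid(2) with EPERM and allows everything else; a production filter must also validate the architecture field of seccomp_data.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Install a classic-BPF seccomp filter: deny getpid(2) with EPERM,
 * allow everything else. Sketch only: a real filter must also check
 * seccomp_data.arch before trusting the syscall number. */
int install_getpid_block(void) {
    struct sock_filter filter[] = {
        /* load the syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* if nr == __NR_getpid, fall through to ERRNO; else skip one */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };
    /* required so an unprivileged process may install a filter */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

After the filter is installed, a direct `syscall(__NR_getpid)` returns -1 with errno set to EPERM, while all other syscalls proceed normally.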

2.3 Software profiling and tracing

eBPF has the ability to attach to any running process in the system, gaining access to the stack, heap, and even function variables inside the running OS. There are also kernel probes, user probes, and tracepoints, which allow the user to hook into them and provide an unobstructed view of what happens at runtime. eBPF also has built-in statistical data structures which can be used to extract debugging and tracing data without the need to export that data anywhere else.

2.4 Networking

BPF was originally designed to process network packets in a streaming fashion. eBPF extends these possibilities and makes a natural fit for packet processing in any kind of networking solution. eBPF allows the user to build a processing pipeline in almost no time, which can be further extended with packet parsers and network logic. The cherry on top is that with XDP, processing happens even before the network packet reaches the Linux network subsystem. For instance, that’s how Cloudflare mitigates DDoS attacks and how Meta routes its traffic.

2.5 Monitoring

Old-fashioned monitoring solutions relied heavily on static tracepoints and the few OS hooks that were exposed to the public. Instead, eBPF allows in-kernel aggregation and collection of custom metrics based on various event sources. This enables the end-user to better understand activity that stays hidden from human eyes and from typical monitoring. For instance, the generic way to monitor an IO system is to gather IOPS metrics and, based on the output, understand whether your disk arrays are in a degraded state. A better way is to monitor multiple levels of abstraction, from the application level down to hardware interrupts on your physical disk device. And this is what eBPF is for.

3 How does it work?

In this chapter, we will briefly cover eBPF basics. If you would like to learn more about eBPF, please see eBPF & XDP Reference Guide.

3.1 What is a hook?

eBPF uses an event-driven approach: an eBPF program gains execution flow on some defined event (a hook), such as syscalls, kernel tracepoints, function calls, network events, and so on.

Figure 2: Hooking to execve() syscall

The set of hooks is extensible: the end-user can create custom kernel probes (kprobes) or user probes (uprobes) to attach almost anywhere in the kernel or in a userland application.

3.2 eBPF architecture

There are multiple ways to invoke and run eBPF. Most eBPF use cases are indirect: running software that uses eBPF under the hood, such as Teleport, bcc or bpftrace, tcpdump, etc. Nonetheless, end-users can write eBPF programs directly in BPF assembly, in restricted (pseudo-)C, or through high-level language bindings. There is also an LLVM target that produces BPF bytecode from C-like source.

Figure 3: Clang can compile eBPF bytecode

Once the target hook is defined, the eBPF program is loaded into the kernel using the dedicated bpf() system call. Usually, this is done via one of the available eBPF libraries.

Figure 4: An overview of how eBPF program goes from source to a bytecode

As you can see in Figure 4, before loading BPF bytecode into the kernel virtual machine, it has to go through the following steps:

  1. Verification
  2. Just-in-Time compilation

Verification

The universal in-kernel BPF virtual machine guarantees memory safety and system stability. To achieve this, the verifier checks that the compiled bytecode:

  • doesn’t halt or crash the kernel;
  • has an exit condition (e.g. no unbounded while (true) loops);
  • is loaded by a process with the required capabilities and privileges.

Just-in-Time compilation

This step compiles the generic BPF bytecode into machine-specific assembly to speed up execution. JIT compilation makes it possible to run eBPF applications as fast as natively compiled kernel code, or as code loaded as a kernel module.

4 Usage experience

We at Exness run an extensive infrastructure, and we need to ensure it is safe and properly monitored. We have heard of many successful eBPF deployments at various companies, from Cloudflare to Meta. So let’s discover what is possible with eBPF in a production environment. Figure 5 highlights what is possible with eBPF, from the hardware level up to the application level.

Figure 5. eBPF tooling overview (Gregg, 2019)

The lists you see pointing to each block are mostly ready-to-use bpftrace(3) or bcc(4) programs. They are a good starting point to try eBPF on your own cases: just install bpftrace and bcc in your Linux distribution, then look into /usr/share/{bcc, bpftrace}/.

Let’s say we need to monitor outgoing TCP connections. We can do it by hooking the tcp_connect kernel function: run either tcpconnect.bt from bpftrace or tcpconnect from bcc.

Figure 6: tcpconnect is monitoring outgoing TCP connections

As another example, let’s sniff the SSL-encrypted traffic without altering certificates. This can be achieved by hooking to encryption/decryption functions.

Figure 7: MitM-ing traffic without compromising CA-chain

Since eBPF is very efficient, it generally outperforms security tools that rely on other mechanisms to log, intercept, and monitor system activity.

Figure 8: Monitoring file executions

Our goal is not to highlight every capability eBPF has, so let’s move to the practical part, where we will highlight some of our use cases.

5 eBPF @ Exness

Exness uses a very modern technology stack, which brings versatility and flexibility in terms of operations. We use Kubernetes, HA clusters, and distributed databases: all the perks we can thank the IT industry for in 2022. However, everything comes at a price. We pay for these advantages with complex infrastructure and many abstraction layers.

As a security team, we need to monitor system activity on multiple layers, from the OS level up to container activity. Each layer has its own quirks and nuances. Given the constraints, after extensive market research, we came to realize that there is no perfect match for our goals.

The team has decided to move on with the closest and easiest-to-maintain solution. It was Tetragon(5). The announcement post says:

Tetragon is a powerful eBPF-based security observability and runtime enforcement platform…

Which is true. Tetragon is self-sufficient software grown in the depths of the Cilium project(6). It was our solution of choice because of its extensibility, the power of eBPF, being written in Go, and being container-aware out of the box.

We’ve forked Tetragon to extend its functionality to fit it into our internal security requirements.

On every process_exec event we calculate a binary hash and cache it (see Figure 9) to avoid repetitive hash calculations and extra CPU load.
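The caching idea can be sketched in C. This is illustrative only, not Tetragon's actual code: the cache key, hash function (FNV-1a), and cache shape are assumptions, and real code would key on something like device+inode and use a cryptographic hash such as SHA-256.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: hash a binary once, then reuse the cached
 * result for subsequent executions of the same file. */

#define CACHE_SIZE 64

struct cache_entry { uint64_t key; uint64_t hash; int used; };
static struct cache_entry cache[CACHE_SIZE];

/* FNV-1a: a small non-cryptographic hash, used here for brevity */
static uint64_t fnv1a(const uint8_t *data, size_t len) {
    uint64_t h = 1469598103934665603ULL;   /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 1099511628211ULL;             /* FNV prime */
    }
    return h;
}

/* Returns the binary hash, computing it only on a cache miss;
 * *miss reports whether a computation actually happened. */
uint64_t binary_hash(uint64_t file_key, const uint8_t *contents,
                     size_t len, int *miss) {
    struct cache_entry *e = &cache[file_key % CACHE_SIZE];
    if (e->used && e->key == file_key) {
        *miss = 0;
        return e->hash;
    }
    *miss = 1;
    e->key = file_key;
    e->hash = fnv1a(contents, len);
    e->used = 1;
    return e->hash;
}
```

A repeated execution of the same binary then hits the cache and skips the hash computation entirely, which is where the CPU savings come from.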

Figure 9: Event on process execution. A binary hash is shown.

Tetragon is container-aware out of the box; however, it wasn’t designed to get container metadata from the host machine. Since we planned to run Tetragon both on host machines and as a sidecar in Kubernetes deployments, we implemented container metadata gathering and enriched the events coming from those containers.

We’ve used the Go client for the Docker engine. There is a method called ContainerInspect, which returns the container information.

After getting container metadata, we can walk through the container file system and get any useful information, such as a uid, system defaults, software bill of materials, and so on.

Figure 10: Container metadata collected by Tetragon on the container host machine.

Exness infrastructure generates hundreds of gigabytes of events daily. Tetragon can filter either at the BPF program level or at the output filter level. We rejected both options, since we deal with gigabytes per minute or even per second, and brought our own kernel event filter instead. Now we filter out the bulk of events inside Tetragon, right before they are sent to our event collection server.

Figure 11: Customizable event filter.

One of the issues we faced during implementation is that on several probes, some of the data received from eBPF was not human-readable. For instance, do_sys_open accepts the file open flags as a plain int in its third argument, so we needed to convert them into their symbolic names, such as O_CREAT, O_APPEND, etc.

Figure 12: Linux do_sys_open call’s source code.

For this purpose, we convert such flags during event processing in Tetragon, as illustrated in Figure 13.

Figure 13: Converting file open flags to their symbolic representation.
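The conversion can be sketched in C along these lines. This is an illustration, not Tetragon's actual code; the function name and the subset of flags handled are our assumptions.

```c
#include <fcntl.h>
#include <string.h>

/* Sketch: turn the raw int flags from do_sys_open into a readable
 * string such as "O_WRONLY|O_CREAT|O_APPEND". */
void open_flags_to_str(int flags, char *out, size_t outlen) {
    out[0] = '\0';
    /* the access mode lives in the low bits and is an enum, not a bitmask */
    switch (flags & O_ACCMODE) {
        case O_RDONLY: strncat(out, "O_RDONLY", outlen - 1); break;
        case O_WRONLY: strncat(out, "O_WRONLY", outlen - 1); break;
        case O_RDWR:   strncat(out, "O_RDWR",   outlen - 1); break;
    }
    /* the remaining flags are genuine bits and can be tested one by one */
    struct { int bit; const char *name; } known[] = {
        { O_CREAT,  "O_CREAT"  }, { O_APPEND, "O_APPEND" },
        { O_TRUNC,  "O_TRUNC"  }, { O_EXCL,   "O_EXCL"   },
    };
    for (size_t i = 0; i < sizeof(known) / sizeof(known[0]); i++) {
        if (flags & known[i].bit) {
            strncat(out, "|", outlen - strlen(out) - 1);
            strncat(out, known[i].name, outlen - strlen(out) - 1);
        }
    }
}
```

Note the O_ACCMODE quirk: O_RDONLY is 0, so the access mode must be compared, not bit-tested, which is easy to get wrong when decoding these flags.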

In the same fashion, we implemented processing of TCP states for the tcp_set_state probe and collection of access permission bits through the sys_fchmodat probe.
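The TCP state conversion follows the same pattern; a sketch is below. The numeric values follow the kernel's enum in include/net/tcp_states.h, which is what tcp_set_state receives.

```c
/* Sketch: map the kernel's numeric TCP states (as passed to
 * tcp_set_state) to their names; values match include/net/tcp_states.h. */
static const char *tcp_state_name(int state) {
    static const char *names[] = {
        [1]  = "TCP_ESTABLISHED", [2]  = "TCP_SYN_SENT",
        [3]  = "TCP_SYN_RECV",    [4]  = "TCP_FIN_WAIT1",
        [5]  = "TCP_FIN_WAIT2",   [6]  = "TCP_TIME_WAIT",
        [7]  = "TCP_CLOSE",       [8]  = "TCP_CLOSE_WAIT",
        [9]  = "TCP_LAST_ACK",    [10] = "TCP_LISTEN",
        [11] = "TCP_CLOSING",     [12] = "TCP_NEW_SYN_RECV",
    };
    if (state < 1 || state > 12 || names[state] == 0)
        return "UNKNOWN";
    return names[state];
}
```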

NB: The destination port arrives in network byte order (big-endian), so it must be byte-swapped before processing. This can be done by: dport = (dport >> 8) | ((dport << 8) & 0xFF00)
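The swap in the note is a plain 16-bit byte swap, which is exactly what ntohs(3) computes on a little-endian host:

```c
#include <stdint.h>

/* 16-bit byte swap: converts a network-byte-order port to host order
 * on a little-endian machine (equivalent to ntohs() there). */
uint16_t swap16(uint16_t dport) {
    return (uint16_t)((dport >> 8) | ((dport << 8) & 0xFF00));
}
```

For example, port 8080 (0x1F90) read raw from a network-order field on a little-endian host appears as 0x901F, and swap16 restores it. The swap is its own inverse, so applying it twice returns the original value.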

We can’t disclose some of the changes and additions we’ve made to Tetragon yet, but we hope that we were able to present to you the common approach and availability of eBPF technology for everyone.

Conclusion

It is barely possible to fit every aspect of a technology as nuanced as eBPF into an article of this scale. Nonetheless, we do believe that it could be a marvelous starting point for your enthralling journey into the depths of the Linux kernel and its internals. Drop us a line if you feel enthusiastic and want some help with your further dive-in.

Contacts

Do not hesitate to message us at the following email: security@exness.com.

Acknowledgments

This work would not have been possible without the support of the Exness Security Team. We are especially indebted to colleagues who have been supportive of our goals and who worked actively to provide us with the protected R&D time to reach those goals.

We are grateful to all of those who we had the pleasure of working with on this and other related projects. Each of our team members has provided us with extensive personal and professional guidance, and taught us a great deal about both professional life and personal life in general.

We would like to express special gratitude to Daniil Orlov, our talented Infrastructure Security Engineer, whose passion drove us to where we are all now.

References

  1. Bryan Cantrill and Brendan Gregg measuring IO latency using DTrace
  2. Red Hat’s SystemTap
  3. IO Visor bpftrace project
  4. IO Visor BCC project
  5. Tetragon repository
  6. Cilium project
