Getting Started with Kubernetes Observability Using eBPF
From Netflix extracting network-flow insights in near real time, to Cloudflare live-patching security vulnerabilities inside the Linux kernel and building “Magic Firewall” to mitigate DDoS attacks, eBPF has become a crucial part of maintaining modern infrastructure.
In this guide, we will go through all the steps required to build and deploy our own eBPF probe into a Kubernetes cluster to collect insights about TCP data transfers, which can later be used to dynamically build a service throughput and dependency graph.
What is eBPF?
Linux eBPF is a set of kernel features that allow the creation and execution of BPF programs within the Linux kernel at runtime. BPF programs can be used for a variety of purposes, such as filtering network traffic, tracing kernel function calls, and more.
eBPF does to the Linux kernel what JavaScript does to HTML.
There are many projects, such as Cilium and Pixie, that fundamentally change how we think about networking, security, and observability by using eBPF, along with libraries like BCC and libbpf that make it easy to leverage this technology for our own benefit.
Why eBPF for observability?
One of the main challenges in monitoring backend applications is that we are often at the mercy of the developer of the application to provide visibility into its key vitals such as throughput and latencies. If it’s a third-party application, we have little to no say in its observability capabilities.
With eBPF, we can win back this power by building a wall around the application and carefully inspecting everything that goes in and out of the wall’s gates. With that data, we can reconstruct some of the crucial observability capabilities required to operate a service reliably, in a language-agnostic way. At the same time, this frees application developers to focus on building a better application for end users rather than spending time instrumenting it.
Architecture
In a typical Kubernetes setup, we have multiple nodes, each running a separate set of pods. Within a node, all the pods share a single kernel instance. This design is one of the main differences between virtual machines and containerized applications.
Because of this, whenever a pod within a node needs to make a network call, it has to go through the socket API of the shared kernel on that node. Using eBPF, we can trace all these calls to the socket API and get full network visibility on a single node. This can easily be scaled to the entire cluster with a Kubernetes DaemonSet, which deploys the probe to every node and captures flow data across the whole cluster.
One limitation of this approach is that the socket API only contains TCP data such as local and remote IPs and bytes transferred. This data alone won’t make much sense, especially since IP addresses within Kubernetes change frequently. To overcome this, we can cross-reference pod metadata from the Kubernetes API with the TCP data extracted via eBPF.
Finally, we can expose a Prometheus metrics endpoint on the eBPF agent deployed on each node and publish per-pod metrics to it. A central Prometheus server can then scrape these endpoints periodically and calculate trends over time for all the pods within the cluster.
Writing the eBPF probe
For this tutorial, we will use the BPF Compiler Collection (BCC), which gives us an easy-to-use Python API to connect to the eBPF subsystem. The setup process is described in BCC’s installation guide.
The goal of this probe is to trace the life cycle of a TCP data transfer from start to finish and export aggregated data about that transfer to userspace. According to the Linux kernel documentation, TCP send events can be triggered from either tcp_sendmsg or tcp_sendpage.
By attaching kprobes to these two functions, we get a reference to the underlying socket data structure, which contains all the TCP connection metadata such as the destination IP and port. We can also record the time of this call, which can later be used to calculate the TCP transfer time.
Since kprobes hook into the start of a function call, we don’t have access to the number of bytes that will be sent; that value is only calculated during the execution of the function and returned to the caller. Luckily, the eBPF API has one more trick up its sleeve: kretprobes work exactly like kprobes but hook into the end of a function call and can read its return value.
By combining data from both the kprobe and the kretprobe, we are able to trace the start of the TCP transfer. Since we don’t know when the transfer will complete, we store the collected data in a hash map using BPF_HASH.
Once the data transfer is completed, tcp_cleanup_rbuf will be called with an argument named copied, which contains the number of bytes received. By attaching a kprobe to it, we can get access to that value. We can then query the hash map created earlier to retrieve the data from the start of the transfer, and calculate the transfer duration by subtracting the recorded start time from the current time.
Finally, all the collected data about the TCP data transfer is written to a ring buffer, which can be read from userspace.
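To make this flow concrete, here is a minimal BCC sketch of how these pieces could fit together. It is not the actual prober.c: the function names, the event struct layout, and the use of BPF_PERF_OUTPUT as the buffer to userspace are illustrative assumptions, and error handling is kept to a minimum.

from bcc import BPF

# Illustrative BPF program (C) embedded as a string, BCC-style.
bpf_text = r"""
#include <uapi/linux/ptrace.h>
#include <net/sock.h>

struct transfer_t {
    u64 ts;        // when the first tcp_sendmsg for this socket was seen
    u64 tx_bytes;  // bytes sent so far, accumulated by the kretprobe
};

struct event_t {
    u32 saddr;
    u32 daddr;
    u16 dport;
    u64 tx_bytes;
    u64 rx_bytes;
    u64 duration_ns;
};

BPF_HASH(transfers, struct sock *, struct transfer_t);  // in-flight transfers
BPF_HASH(entry_sk, u64, struct sock *);                 // kprobe -> kretprobe correlation
BPF_PERF_OUTPUT(events);                                // buffer read from userspace

// kprobe: start of tcp_sendmsg, remember the socket and the start time
int trace_sendmsg_entry(struct pt_regs *ctx, struct sock *sk) {
    u64 pid = bpf_get_current_pid_tgid();
    entry_sk.update(&pid, &sk);

    struct transfer_t *t = transfers.lookup(&sk);
    if (!t) {
        struct transfer_t fresh = {};
        fresh.ts = bpf_ktime_get_ns();
        transfers.update(&sk, &fresh);
    }
    return 0;
}

// kretprobe: end of tcp_sendmsg, the return value is the number of bytes sent
int trace_sendmsg_return(struct pt_regs *ctx) {
    u64 pid = bpf_get_current_pid_tgid();
    struct sock **skp = entry_sk.lookup(&pid);
    if (!skp)
        return 0;
    struct sock *sk = *skp;
    int sent = PT_REGS_RC(ctx);
    struct transfer_t *t = transfers.lookup(&sk);
    if (t && sent > 0)
        t->tx_bytes += sent;
    entry_sk.delete(&pid);
    return 0;
}

// kprobe on tcp_cleanup_rbuf: `copied` holds the number of bytes received
int trace_cleanup_rbuf(struct pt_regs *ctx, struct sock *sk, int copied) {
    if (copied <= 0)
        return 0;
    struct transfer_t *t = transfers.lookup(&sk);
    if (!t)
        return 0;

    struct event_t e = {};
    e.saddr = sk->__sk_common.skc_rcv_saddr;
    e.daddr = sk->__sk_common.skc_daddr;
    e.dport = sk->__sk_common.skc_dport;
    e.tx_bytes = t->tx_bytes;
    e.rx_bytes = copied;
    e.duration_ns = bpf_ktime_get_ns() - t->ts;
    events.perf_submit(ctx, &e, sizeof(e));

    transfers.delete(&sk);
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="tcp_sendmsg", fn_name="trace_sendmsg_entry")
b.attach_kretprobe(event="tcp_sendmsg", fn_name="trace_sendmsg_return")
b.attach_kprobe(event="tcp_cleanup_rbuf", fn_name="trace_cleanup_rbuf")

A production probe would also attach to tcp_sendpage and handle IPv6 addresses, but the overall structure stays the same.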
Code for the completed probe can be found at https://github.com/MrSupiri/kube-ebpf/blob/v0.0.1/prober.c
eBPF + Kubernetes
Now that we have the raw TCP data, we need to enrich it with Kubernetes metadata, such as pod names, so we can make sense of it in the context of Kubernetes.
Accessing Kubernetes metadata
kube-apiserver is the central place where all the cluster components connect to sync their state. Lucky for us, it exposes a powerful REST API that we can use to control pretty much anything within the cluster, as long as we have the correct permissions.
For this project, we only need read access to the pods running in the cluster. Since we need to query metadata about pods across the entire cluster, we will create a ClusterRole and a ServiceAccount with just enough permission, and link them using a ClusterRoleBinding. Finally, for each agent, we provide the newly created service account in the pod specification, and the kubelet will mount a file containing a JWT token at /var/run/secrets/kubernetes.io/serviceaccount/token. With that token, we can easily poll the Kubernetes API and get the IPs and names of pods, which in turn can be mapped to the eBPF data.
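As a rough sketch of what that polling could look like, the agent can read the mounted token, list the pods, and build an IP-to-pod lookup table. The paths and the in-cluster API address are standard Kubernetes conventions; the pod_by_ip map and refresh_pod_map() name are assumptions, while get_metadata() is the lookup used by the metrics code later on. The real logic lives in kube_crawler.py, linked below.

import requests

# Standard in-cluster paths and service address (Kubernetes conventions)
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
API_SERVER = "https://kubernetes.default.svc"

pod_by_ip = {}  # pod IP -> {"namespace": ..., "name": ...}

def refresh_pod_map():
    with open(TOKEN_PATH) as f:
        token = f.read().strip()
    resp = requests.get(
        f"{API_SERVER}/api/v1/pods",
        headers={"Authorization": f"Bearer {token}"},
        verify=CA_PATH,
    )
    resp.raise_for_status()
    for pod in resp.json()["items"]:
        pod_ip = pod["status"].get("podIP")
        if pod_ip:
            pod_by_ip[pod_ip] = {
                "namespace": pod["metadata"]["namespace"],
                "name": pod["metadata"]["name"],
            }

def get_metadata(ip):
    # Returns None for IPs that don't belong to a Kubernetes-managed pod
    return pod_by_ip.get(ip)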
A proper implementation of this, which can be integrated with the eBPF probe, can be found at https://github.com/MrSupiri/kube-ebpf/blob/v0.0.1/kube_crawler.py
Annotating TCP data
After parsing the raw data returned from the eBPF ring buffer, we end up with something like this:
{
"source_ip":"192.168.1.203",
"source_port":56266,
"destination_ip":"162.159.153.4",
"destination_port":443,
"transmit_bytes":145,
"receive_bytes":39,
"duration":54.274
}
Even though this data makes sense in a VM-type setup, IP addresses change very often once we move to Kubernetes, so between the time of extraction and the time of reading, the same IPs might be pointing to completely different services. To prevent this, we can query the kube-apiserver at the time of extraction and annotate the IP data with Kubernetes metadata to get something like this:
{
"source":"default/pod-1-39cb975d71-twjbk",
"source_port":56266,
"destination":"kube-system/kube-dns-k8976dfcl-acyjn",
"destination_port":443,
"transmit_bytes":145,
"receive_bytes":39,
"duration":54.274
}
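A tiny helper along these lines (illustrative, not copied from the repo) is enough to perform that rewrite, using the get_metadata() lookup from the metadata sketch above:

def annotate(data):
    # Replace Kubernetes-managed IPs with "namespace/pod-name" labels
    annotated = dict(data)
    for side in ("source", "destination"):
        pod = get_metadata(data[f"{side}_ip"])
        if pod is not None:
            annotated[side] = f"{pod['namespace']}/{pod['name']}"
            del annotated[f"{side}_ip"]
    return annotated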
Exporting to Prometheus
The prometheus-client library for Python offers four types of metrics that can be exported from a service. For this project, we only need Counters and a Histogram.
from prometheus_client import Counter, Histogram

# Define prometheus metrics
ms = Histogram("kube_ebpf_request_duration_seconds", "TCP event latency", ["namespace", "name", "port"])
tx_kb = Counter("kube_ebpf_transmitted_bytes", "Number of sent bytes during TCP event", ["namespace", "name"])
rx_kb = Counter("kube_ebpf_acknowledged_bytes", "Number of received bytes during TCP event", ["namespace", "name", "port"])
request_sent = Counter("kube_ebpf_requests_sent", "Total request sent", ["namespace", "name"])
request_received = Counter("kube_ebpf_requests_received", "Total request received", ["namespace", "name", "port"])
request_exchanged = Counter("kube_ebpf_request_exchanged", "Total request exchanged between pods", ["source_namespace", "source_name", "destination_namespace", "destination_name", "destination_port"])
Once the metrics are defined, we can update them for each record written to the eBPF ring buffer.
def update_metrics(data):
    # Get kubernetes pod metadata for source and destination IPs
    source = get_metadata(data['source_ip'])
    destination = get_metadata(data['destination_ip'])

    # Request didn't happen through kubernetes-managed IPs
    if source is None and destination is None:
        return

    # TCP source was a kubernetes-managed IP
    if source is not None:
        request_sent.labels(source['namespace'], source['name']).inc()
        tx_kb.labels(source['namespace'], source['name']).inc(data['transmit_bytes'])

    # TCP destination was a kubernetes-managed IP
    if destination is not None:
        request_received.labels(destination['namespace'], destination['name'], data['destination_port']).inc()
        rx_kb.labels(destination['namespace'], destination['name'], data['destination_port']).inc(data['receive_bytes'])
        ms.labels(destination['namespace'], destination['name'], data['destination_port']).observe(data['duration'])

    # TCP request happened between two kubernetes-managed pods
    if source is not None and destination is not None:
        request_exchanged.labels(source['namespace'], source['name'], destination['namespace'], destination['name'], data['destination_port']).inc()
The Prometheus client library will aggregate these data points and expose them in a format that Prometheus can easily scrape and persist for long-term analysis.
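Putting the pieces together, the agent's main loop might look something like the sketch below. It assumes the BPF object b and the event struct from the probe sketch above, reuses update_metrics() from the previous snippet, and the choice of port 8080 is an arbitrary assumption rather than what the repo uses.

import socket
import struct
from prometheus_client import start_http_server

def handle_event(cpu, data, size):
    # Decode the ctypes event generated by BCC into the dict update_metrics expects
    event = b["events"].event(data)
    update_metrics({
        "source_ip": socket.inet_ntoa(struct.pack("I", event.saddr)),
        "destination_ip": socket.inet_ntoa(struct.pack("I", event.daddr)),
        "destination_port": socket.ntohs(event.dport),
        "transmit_bytes": event.tx_bytes,
        "receive_bytes": event.rx_bytes,
        "duration": event.duration_ns / 1e6,  # nanoseconds -> milliseconds, as in the example payload
    })

start_http_server(8080)                      # expose /metrics for Prometheus to scrape
b["events"].open_perf_buffer(handle_event)   # matches BPF_PERF_OUTPUT(events) in the probe sketch
while True:
    b.perf_buffer_poll()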
The complete implementation of this process can be found at https://github.com/MrSupiri/kube-ebpf/blob/v0.0.1/prober.py
Deploying in Kubernetes
As mentioned in the architecture section, this agent must be deployed as a DaemonSet. But in order for the agent to hook into the Linux kernel, we need to make sure the node it runs on has kernel headers installed. To achieve this, we can use an init container that performs the pre-checks and, if the headers aren’t found, downloads and installs them before booting up the agent. The fetch-linux-headers script, adapted from mclenhard’s ebpf-summit project, does exactly that.
Finally, we can combine all of this into a single deployment YAML file and submit it to the Kubernetes API. This will spawn a DaemonSet that hooks into the kernel of each node and starts a scrapeable Prometheus endpoint.
kubectl apply -f https://github.com/MrSupiri/kube-ebpf/blob/v0.0.1/deployment.yaml
Observe
Finally, we can deploy a Prometheus server configured to scrape any pod that has the prometheus.io/scrape: 'true' annotation.
kubectl apply -f https://github.com/MrSupiri/kube-ebpf/blob/v0.0.1/prometheus-deployment.yaml
If we let it run for a few minutes and log on to its dashboard using a port-forward, we can observe the TCP data flow within the cluster.
kubectl port-forward prometheus-<POD-ID> -n monitoring 9090:9090
Conclusion
This article describes how you can leverage eBPF to monitor applications running within a Kubernetes cluster in a language-agnostic way. For now, this only captures and stores the raw telemetry data, but that data can be further processed to derive service throughput and dependency graphs, which will make the lives of SREs much easier.
A working prototype of the application can be found by following the link given below. In a later article, we will explore how we can integrate AIOps on top of eBPF to achieve real-time anomaly detection.