Tutorial: Profiling CPU and RAM usage of Rust micro-services running on Kubernetes

Erwan de Lépinau
Lumen Engineering Blog
12 min read · May 24, 2022
Thumbnail image

We love working with Rust in my team, where we develop the backend infrastructure for the Lumen® Mesh Delivery and Lumen® CDN Load Balancer technologies: it is a language that boasts great performance, brings valuable compile-time guarantees and is very expressive and feature-complete. However, when we recently needed to investigate CPU performance issues and memory leaks in one of the backend services powering our Mesh Delivery network, we were faced with a major limitation of the language: Rust does not offer built-in profiling tools, unlike, for instance, Go and its amazing net/http/pprof package, which leverages the Go runtime to make profiling web applications a breeze.

Although language-agnostic profiling tools such as the Valgrind suite have existed for ages, they usually run the profiled executable inside a fully sandboxed environment (from start to finish) and incur a massive performance overhead on your app. By contrast, we needed a tool that could easily jack into a long-running web application, collect profiling data without too much overhead, and be stopped without impacting the profiled process. Due to the specificity of our use case, we soon realized that Valgrind was a no-go.

Another bump on the road comes from the environment in which we are running our micro-services: a cloud-hosted Kubernetes cluster. This raises its own set of challenges — both in terms of security and practicality — compared to profiling an app on your local machine or a regular VM. Nonetheless, we found a way to overcome these difficulties and devised an easy-to-follow step-by-step process for CPU and RAM profiling of services running on Kubernetes. Buckle up and get ready to run some profiles!

(Note 1: this tutorial still applies even if you are running your services on bare metal or on cloud-hosted VMs such as in EC2 or GCE; simply disregard the few instructions specific to Kubernetes.)

(Note 2: we are taking the example of Rust in this tutorial, but it can be applied to any compiled language that produces standalone binaries, including C and C++.)

Part 1: CPU profiling using perf

Prerequisite: enabling debug symbols

First of all, you need to turn on debug symbols when building your application. This means that the compiler will embed the function and file names in the binary executable, which will be essential when you want to inspect and analyse the results of the profile. When compiling a C or C++ program with GCC or Clang, this is achieved by passing the -g flag to the compiler. In Rust, this is accomplished by adding the following section to your Cargo.toml file:

[profile.release]
debug = true

Note that this won’t affect the performance of your final binary in any way: it is not the same thing as making an unoptimized debug build (i.e. compiling without --release)! The only negative impact on your binary will be an increased size, because the debug symbols will be embedded in it, but this should be a non-issue in most cases.
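
If you want to sanity-check that the debug symbols actually made it into the binary, you can inspect the compiled artifact before shipping it (the path below assumes a standard Cargo layout and a hypothetical crate named my-app):

$ file target/release/my-app
# The output should mention "with debug_info, not stripped"
$ objdump -h target/release/my-app | grep debug
# Alternatively, this lists the .debug_* ELF sections explicitly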

Connecting to the container that you want to profile

We now assume the binary you want to profile has been compiled with debug symbols and is up and running on your Kubernetes cluster. Spin up a terminal and open a shell in the container that you want to profile:

$ kubectl exec -it <POD_NAME> -c <CONTAINER_NAME> -n <NAMESPACE> -- /bin/bash

If bash is not installed by default in your container (if you use Alpine, for example), you can always replace /bin/bash with /bin/sh.

Installing perf

The profiler that you are going to use is perf: it is a standard and battle-tested CPU profiler for Linux that is easy to use. An interesting feature of perf is that you can attach it to a running process, which is going to be very practical in our use case.

perf is not installed by default on most distros, so here is how to install it:

# DEBIAN
$ apt-get update && apt-get install linux-perf
$ ln -fs /usr/bin/perf_5.10 /usr/bin/perf
# Replace "5.10" by the version that was installed for you, if different
# UBUNTU
$ apt-get update && apt-get install linux-tools-generic
$ ln -s /usr/lib/linux-tools/5.4.0–104-generic/perf /usr/bin/perf
# Replace "5.4.0-104" by the version that was installed for you, if different
# ALPINE
$ apk add --update perf
# For Alpine the symlink should be done automatically

WARNING: You need to be wary of mismatches between the version of perf that you install and the version of the Linux kernel your container runs on (which you can inspect with the command uname -r). Ideally, you would want to use a version of perf that exactly matches your kernel version, or if that is really not possible then try to match at least the major version number. In our case, our base image was Debian Buster for which the latest published version of perf on the package repositories was perf_4.19. This caused issues when trying to run it on our 5.4 Linux kernel: the profiling results were buggy and missing most symbols. Therefore we decided to switch to Debian Bullseye which allowed us to install perf_5.10, which worked like a charm with our 5.4 kernel!
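
A quick way to check for such a mismatch from inside the container is to print both versions side by side and compare them:

$ uname -r
# Kernel version of the node this container runs on
$ perf --version
# Version of the perf binary you just installed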

Retrieving your application’s PID

You will be using the ps command for this. On most distros (including Ubuntu and Alpine) ps is installed by default, but not on Debian. To install it on Debian, run:

# DEBIAN
$ apt-get update && apt-get install procps

Now that you made sure ps is installed, you can use it to retrieve the PID of your application. This will often be PID 1, but in rare cases it may be different depending on how you defined your Dockerfile or how you started your container.

$ ps -e
  PID TTY          TIME CMD
    1 ?        00:00:04 my-app   # Indeed, our app has PID 1.
 1842 pts/0    00:00:00 bash
 1849 pts/0    00:00:00 ps

Running perf

You are going to use the perf record subcommand to gather a profile of your application, with the -p parameter, which attaches perf to a running process using its PID.

$ perf record -p <PID> --call-graph dwarf -F <SAMPLING_FREQUENCY> -- sleep <N_SECONDS>

There are 2 parameters you can play with here:

  • Sampling frequency: a higher frequency means more data points but at the cost of more CPU interruptions, which results in decreased performance of the application being profiled. The default value is 4000 Hz.
  • Duration of the profile: the longer the profile runs, the more data points you will collect.

You want to collect at least a few hundred data points to minimize statistical variance: >1000 samples would be ideal.
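
As a rough illustration (the PID and numbers here are only an example), sampling a process at 99 Hz for 60 seconds yields at most 99 × 60 ≈ 5940 samples, comfortably above the 1000-sample target; the actual count will be lower if the process is not on-CPU the whole time:

$ perf record -p 1 --call-graph dwarf -F 99 -- sleep 60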

When the profiling ends, you should see a message similar to this:

[ perf record: Woken up 21 times to write data ]
[ perf record: Captured and wrote 5.416 MB perf.data (672 samples) ]

You can then export the profile to a file by running:

$ perf script --no-inline > out.perf

The --no-inline flag is optional (it skips expanding the call stacks of inlined functions) but greatly reduces the duration of the export, which can otherwise take a long time if your profile gathered a large volume of data.

Fetching the profile for analysis on your local machine

Use kubectl cp to copy the file from the profiled container to your local environment:

$ kubectl cp <NAMESPACE>/<POD_NAME>:out.perf -c <CONTAINER_NAME> ./out.perf

The resulting file can then be used for analysis, for example in the form of a FlameGraph. Clone the FlameGraph repo and run the following commands to convert your perf profile to a FlameGraph:

$ <PATH_TO_FLAMEGRAPH_REPO>/stackcollapse-perf.pl out.perf > out.folded
$ <PATH_TO_FLAMEGRAPH_REPO>/flamegraph.pl out.folded > out.svg
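
If you do not already have the FlameGraph repository locally, it can be cloned from Brendan Gregg's GitHub (any clone location works as <PATH_TO_FLAMEGRAPH_REPO> in the commands above):

$ git clone https://github.com/brendangregg/FlameGraph.git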

And… ta-da! Here is the wonderful FlameGraph that was generated in our case:

Screenshot of the resulting FlameGraph
Resulting FlameGraph of the CPU usage of our application

For more information on how to read and interpret FlameGraphs, see the original author’s tutorial. In our case we realized thanks to the FlameGraph that a bug in the UninitSlice::write_byte function from the bytes crate was causing abnormally excessive CPU usage.

Part 2: RAM profiling using heaptrack

Prerequisite: enabling debug symbols

Once again, you will need to enable debug symbols in your binary: refer to the section “Prerequisite: enabling debug symbols” of Part 1 of this tutorial if you need a reminder on what this means and how to accomplish it.

Prerequisite: elevating necessary security permissions

The memory profiler is going to require access to the ptrace() system call of Linux, which Kubernetes will not allow by default. Therefore you need to add the following securityContext to your container in your Pod’s (or Deployment or DaemonSet or StatefulSet, whatever you use) config YAML:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app
    image: my-app:latest
    securityContext:
      capabilities:
        add:
        - "SYS_PTRACE"

Connecting to the container that you want to profile

We now assume the binary you want to profile has been compiled with debug symbols and is up and running on your Kubernetes cluster with the appropriate securityContext set up. Spin up a terminal and open a shell in the container that you want to profile:

$ kubectl exec -it <POD_NAME> -c <CONTAINER_NAME> -n <NAMESPACE> -- /bin/bash

Installing heaptrack

The RAM profiler you are going to use is Heaptrack. It is a modern Linux memory profiler that is part of the well-known KDE project. You may wonder why we are not opting for an even better-known profiler such as the famous Massif profiler from the Valgrind tool suite. The main reason is that, like perf, Heaptrack supports attaching to a running process, which is mandatory for our use case of profiling long-running web services, whereas Massif does not support this feature. A secondary reason is that Heaptrack is relatively fast and lightweight. While it still incurs a noticeable performance overhead while it runs, this overhead is much lower than that of Massif (which is known for slowing down applications by at least an order of magnitude). This makes it realistic to run Heaptrack even in resource-constrained production Kubernetes environments.

Heaptrack itself relies on GDB (the GNU debugger) at runtime, so you will need to install both in your container. If you are using Ubuntu or Debian, this is as easy as:

# UBUNTU and DEBIAN
$ sudo apt-get update && sudo apt-get install gdb heaptrack

If you are using Alpine, heaptrack is not officially distributed via apk, so you have to build it manually:

# ALPINE
$ apk add --update gdb git g++ make cmake zlib-dev boost-dev libunwind-dev
$ git clone https://github.com/KDE/heaptrack.git
$ cd heaptrack && mkdir build && cd build
$ cmake -DCMAKE_BUILD_TYPE=Release .. # In theory we have installed all required dependencies previously and this should not return any error
$ make
$ ln -s /heaptrack/build/bin/heaptrack /usr/local/bin/heaptrack

You can then check that heaptrack was installed correctly:

$ heaptrack --version
heaptrack 1.3.80 # Your version number may be different

Retrieving your application’s PID

Once again you need the ps util for this (if you are on Debian, you will need to install the procps package; see section “Retrieving your application’s PID” of Part 1). Just like in Part 1, you can retrieve your application’s PID from the output of the ps -e command.

Running Heaptrack

You can now attach Heaptrack to your running executable, which will make it intercept and register all memory allocation and deallocation calls. Beware that while Heaptrack is running it will incur a performance overhead (although a less significant one than Massif would). In our experiments we found that CPU throughput was approximately halved during profiling and that an extra ~180 MB of RAM was required upfront for the Heaptrack process itself. If your container does not have enough spare RAM available, it may get killed when you try to run Heaptrack.
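
If you want a rough idea of the headroom available before launching Heaptrack, you can compare your container's memory limit with its current usage from inside the container; the exact file paths depend on whether the node uses cgroup v1 or v2, so treat the following as a sketch:

# cgroup v1
$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes /sys/fs/cgroup/memory/memory.usage_in_bytes
# cgroup v2
$ cat /sys/fs/cgroup/memory.max /sys/fs/cgroup/memory.current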

You can start the profile by running:

$ heaptrack -p <PID>
heaptrack output will be written to "//heaptrack.my-app.17789.gz"
injecting heaptrack into application via GDB, this might take some time…
# You may see some warning logs here, but they are not necessarily critical.
# In our case the following warnings appeared but did not impede the profiling:
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts of file /my-app.
Use `info auto-load python-scripts [REGEXP]' to list them.
# Make sure the following line appears after any such warnings:
injection finished.

If Heaptrack refuses to launch and errors with "Cannot runtime-attach, you need to set /proc/sys/kernel/yama/ptrace_scope to 0" (which can typically happen on Alpine, a distro that is more locked down by default than Ubuntu or Debian), do the following:

  • add a field privileged: true to your container’s securityContext in its Kubernetes YAML file and redeploy your application
  • execute the command echo 0 > /proc/sys/kernel/yama/ptrace_scope in your container before reattempting to run Heaptrack
  • once you are done with profiling, revert the previous operation by executing echo 1 > /proc/sys/kernel/yama/ptrace_scope for security reasons (this kernel setting affects the whole node your container is on, not just its pod!)

Unlike the CPU profile you conducted with perf, note that here you are not specifying the duration upfront: you have to stop the operation manually with <Ctrl-C> when you deem it has run for long enough. In our case we let it run for a few minutes before stopping it. Heaptrack stores the output in a file called heaptrack.<PROCESS_NAME>.<SOME_NUMBER>.gz.

Note: the output file is written to progressively during the profiling operation, which means you can fetch and analyze an intermediate output without stopping the profiling operation if you want it to keep running for longer.
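
Incidentally, if you just want a quick textual summary without copying anything out of the container, the heaptrack_print command-line tool (installed alongside heaptrack, or available under build/bin if you built it manually as on Alpine) can read the same output file; the file name below is just the example from above:

$ heaptrack_print heaptrack.my-app.17789.gz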

Fetching the profile for analysis on your local machine

Once again, use kubectl cp to copy the file from the profiled container to your local environment:

$ kubectl cp <NAMESPACE>/<POD_NAME>:heaptrack.<PROCESS_NAME>.<SOME_NUMBER>.gz -c <CONTAINER_NAME> ./heaptrack.<PROCESS_NAME>.<SOME_NUMBER>.gz

This time you are going to use the heaptrack_gui tool on your local machine to open and read the profile output. If you are on Linux, here is how to install it for a few popular desktop distros:

# DEBIAN & UBUNTU
$ apt-get update && apt-get install heaptrack-gui
# ARCH LINUX
$ pacman -Syy && pacman -S heaptrack

If you are on macOS, unfortunately no pre-built binary is available. Nonetheless, you can follow the instructions from Heaptrack’s README on how to compile heaptrack_gui.

heaptrack_gui on Windows is not officially supported to my knowledge, but you may be able to get it working through WSL2 (I have not personally tested this method and cannot assure you that it will work).

On execution, heaptrack_gui will prompt you to select a file containing profile data: this is where you want to select the heaptrack.<PROCESS_NAME>.<SOME_NUMBER>.gz file you previously retrieved.

heaptrack_gui file selection prompt

Note: You will not need the “Compare to” and “Suppressions” functionalities of heaptrack_gui.

By default, the “Summary” tab opens, and presents to you a few interesting metrics that can already help you investigate the behaviour of your app. You can also switch between the tabs for different insights and modes of visualization. Because we worked with FlameGraphs in Part 1 of the tutorial, let’s click on the “Flame Graph” tab:

heaptrack_gui FlameGraph visualization mode
Resulting FlameGraph of the memory allocations of our application

Note: As you can see, there are still a few unresolved functions in our FlameGraph: this is because the application we profiled relies on several dynamically linked C libraries that were not compiled with debug symbols. If your application only relies on statically linked libraries and was properly compiled with debug symbols, all symbols should be properly resolved.

Conclusion

As you can see, CPU and RAM profiling of long-running Rust services in a Kubernetes environment is not terribly complicated; it simply requires a few tricks to work properly. Nonetheless, it is a fantastic capability to have when you need to debug applications that exhibit performance issues, memory leaks, abnormal RAM consumption, etc. In our case, we were able to quickly identify that our CPU performance issues and memory leaks were caused by a specific version of one of our dependencies, and we solved these problems by updating that dependency.

That’s it for today! I hope this tutorial will have helped you debug those performance problems you may have encountered. If you have any questions or comments regarding this post feel free to reach out to me via Twitter, LinkedIn or by email (erwan.delepinau@lumen.com).

The Ferris (Rust’s mascot) logo and the Kubernetes logo used in the thumbnail image are distributed under the Creative Commons attribution license. All other screenshots and images used in this article are ours. This document is provided for informational purposes only and may require additional research and substantiation by the end user. In addition, the information is provided “as is” without any warranty or condition of any kind, either express or implied. Use of this information is at the end user’s own risk. Lumen does not warrant that the information will meet the end user’s requirements or that the implementation or usage of this information will result in the desired outcome of the end user. © 2022 Lumen Technologies
