Halving the TezEdge node’s memory usage with an eBPF-based memory profiler

Juraj Selep
Published in TezEdge
Jul 14, 2021 · 12 min read

As developers, we want to see how much memory is used in each piece of code so that we can evaluate whether a particular function in code costs us too much memory. For instance, if an error causes an increase in memory consumption, we need to be able to identify it as soon as possible. To accomplish that, we need a tool that records and displays the consumption of memory in real time. This also allows us to reduce memory consumption, which is beneficial for running software in a low-memory environment.

There are already many tools available for profiling memory. However, the existing memory profilers do not provide the details that we specifically need for the rapid debugging of the TezEdge node.

Additionally, they may consume a lot of storage and memory space. In this implementation of the profiler, we focused on using as little memory as possible and avoiding any storage use. The tradeoff is that it only displays allocations in the current moment, which preserves more space on the server for the node itself.

For these reasons, we decided to create our own memory profiler for the TezEdge node by utilizing extended Berkeley Packet Filters (eBPF), a technology we’ve previously used in the firewall of the TezEdge node’s validation subsystem.

To see a detailed overview of the node’s current memory consumption (in real time), visit this link:

https://tezedge.com/#/resources/memory

Continue reading this article for more details on how our memory profiler works and how to use it.

Locating memory leaks

We want to use the profiler to identify how much memory each function costs, and thus optimize the way the node uses memory. A very effective way of lowering memory consumption is to search for memory leaks.

A memory leak is a bug in code which causes the code to ‘forget’ to mark a memory region that is no longer used as unused, leading to unnecessary memory consumption. Over time, even a single memory leak that is triggered many times will cause an application to stop running, as it will eventually exhaust all system memory.

Satisfying the hardware limit

Most memory profilers work with virtual memory, but we want to detect physical memory so that we can work within a hardware limit. Let’s say the hardware where the node is running has only 4 GB of RAM, which means we need to monitor physical memory usage rather than virtual.

The TezEdge node usually consumes more virtual memory than actual physical memory, because some memory regions might be zeroed and never referenced, and some might be ‘copy-on-write’. The operating system can detect such situations and does not spend real hardware capacity on them. That is why measuring virtual memory consumption does not help us satisfy a hardware limit.

Memory usage tracing

The BPF module is a file in the ELF format which contains several code sections written in the eBPF instruction set. Each section is bound to some tracepoint in the Linux kernel, so when the kernel hits the tracepoint, it executes the eBPF code. The main purpose of this code is to transmit information about the event which triggered the tracepoint. Additionally, the eBPF code contains simple state machines to filter out irrelevant events.

The BPF module of the memory profiler collects events related to physical memory allocation and the launching of processes.

It intercepts the ‘exec’ syscall that happens at the moment the TezEdge node is launched and stores the node’s process identifier (PID). If the TezEdge node was launched before the profiler, the profiler misses this syscall and will not know the PID. This is why it is important to launch the memory profiler before you launch the TezEdge node.
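The following is a minimal sketch, in libbpf-style C, of how such an interception can look. It is not the actual TezEdge module; the map, section and function names are illustrative.

```c
// Illustrative sketch: remember the PID of the process that calls exec.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u32);
} target_pid SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int on_execve(void *ctx)
{
    __u32 key = 0;
    __u32 pid = bpf_get_current_pid_tgid() >> 32;

    // The real module also checks which binary is being executed
    // before remembering the PID.
    bpf_map_update_elem(&target_pid, &key, &pid, BPF_ANY);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```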

Once we know the PID, the BPF module intercepts every memory allocation that happens in the context of this PID. Most of the pages allocated in the context of our PID are also deallocated in the context of the same PID, but this is not always the case.

When the application writes to a file, the data is not written immediately; instead, it might stay in memory for some time. This usually happens when the system has a lot of free memory. The OS presumes the application will read or write the same place in the file again, so it postpones the actual file input/output (IO) operation because it is expensive performance-wise.

This technique is called a write-back cache. The kernel allocates memory for such a cache and deallocates it whenever it decides to, usually when there is not enough free memory, when someone deletes the file the cache is dedicated to, or when the actual write to the file finally happens. That is why the BPF module should look at every memory deallocation in the system, not only those caused by the TezEdge node (which can be identified through its PID).
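Continuing the illustrative sketch from above (again, not the actual TezEdge code), this asymmetry can be expressed as follows: allocations are reported only for the tracked PID, while deallocations are reported unconditionally.

```c
// Illustrative: report page allocations only for the tracked PID,
// but report every page deallocation in the system.
SEC("tracepoint/kmem/mm_page_alloc")
int on_page_alloc(void *ctx)
{
    __u32 key = 0;
    __u32 pid = bpf_get_current_pid_tgid() >> 32;
    __u32 *target = bpf_map_lookup_elem(&target_pid, &key);

    if (!target || *target != pid)
        return 0;                      // not the TezEdge node, ignore
    // ...collect the call stack and send the event to userspace...
    return 0;
}

SEC("tracepoint/kmem/mm_page_free")
int on_page_free(void *ctx)
{
    // No PID filter here: the page might be freed by the kernel or by
    // another process, e.g. when a write-back cache is dropped.
    // ...send the deallocation event to userspace...
    return 0;
}
```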

Collecting stack frames

The BPF module calls a helper to unroll the stack of the TezEdge node. That’s how we can link memory allocation events with pieces of code. Stack unwinding requires some compile-time preparation: every executable binary and dynamic library should be built with the “-fno-omit-frame-pointer” (C and C++) or “force-frame-pointers” (Rust) option.

A dynamic library is a library which is not included in the TezEdge node binary, but is instead loaded dynamically (at runtime). First of all, these are glibc (the C standard library) and libstdc++ (the C++ standard library), but also libev and libsodium.

Therefore, we built them from source with “-fno-omit-frame-pointer”. Since we have to build these libraries from source anyway, we decided to build our own truly distroless Docker image. We were using one from Google, but now we use our own so that we can control what is bundled in our Docker images.
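As a rough illustration, the flags can be passed along these lines (the exact invocation depends on each library’s build system):

```sh
# C/C++ libraries (glibc, libstdc++, libev, libsodium): keep frame pointers
CFLAGS="-O2 -fno-omit-frame-pointer" CXXFLAGS="-O2 -fno-omit-frame-pointer" ./configure && make

# Rust code (the TezEdge node itself)
RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release
```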

The option “-fno-omit-frame-pointer” instructs the compiler to generate a few extra instructions in each function. These instructions store a pointer to the previous stack frame at the beginning of the current stack frame. Right next to it is the return address into the previous function, which is a virtual address of an instruction somewhere in that function’s code. This forms a linked list on the stack, so the kernel is able to traverse this linked list and collect the return addresses from each function in the call stack. This technique has some runtime cost, but the decrease in performance is not noticeable.

To illustrate the stack unwinding, consider C code along these lines:
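```c
#include <stdio.h>

/* Minimal example: foo calls bar, bar calls baz, baz calls printf. */
void baz(void) {
    printf("hello\n");   /* printf eventually performs a write syscall */
}

void bar(void) {
    baz();
}

void foo(void) {
    bar();
}

int main(void) {
    foo();
    return 0;
}
```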

Here we have the sequence of calls: foo calls bar, bar calls baz, and baz calls printf, which causes a syscall. When we compile this code with optimizations enabled, except for two particular optimizations (-O2 -fno-omit-frame-pointer -fno-optimize-sibling-calls), we receive assembly of roughly the following shape (simplified):
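```asm
baz:
        pushq   %rbp              # save the caller's frame pointer
        movq    %rsp, %rbp        # the new stack frame starts here
        leaq    .LC0(%rip), %rdi  # address of the format string
        xorl    %eax, %eax
        call    printf@PLT        # printf eventually performs a syscall
        popq    %rbp              # restore the caller's frame pointer
        ret

bar:
        pushq   %rbp
        movq    %rsp, %rbp
        call    baz
        popq    %rbp
        ret

foo:
        pushq   %rbp
        movq    %rsp, %rbp
        call    bar
        popq    %rbp
        ret
```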

At the beginning of each function, the caller’s rbp is stored on the stack (pushq %rbp), and the movq %rsp, %rbp instruction then makes the rbp register store the current stack pointer. At the end of each function, rbp is fetched back from the stack (popq %rbp).

The instruction call something is a shortcut for two actions: pushq %rip and jmp something (pseudocode). The value pushed onto the stack is %rip, the instruction pointer. Likewise, the instruction ret is a shortcut for two actions: popq %r and jmp %r (again, pseudocode just for illustration). Therefore, each stack frame contains an instruction pointer (“%rip”) and a link to the previous stack frame (“%rbp”) right next to one another. In the Linux kernel, this pair is represented by the following structure:
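```c
/* Simplified; declared in arch/x86/include/asm/stacktrace.h */
struct stack_frame {
    struct stack_frame *next_frame;   /* saved %rbp of the caller    */
    unsigned long return_address;     /* saved %rip (return address) */
};
```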

And here, in simplified form, is the code that traverses this linked list:
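```c
/* A simplified sketch of the traversal (the real eBPF code differs in
 * details): starting from the saved %rbp, follow the next_frame links
 * and record each return address. */
#define MAX_DEPTH 127

static __always_inline int collect_stack(const void *rbp, __u64 *ips)
{
    struct stack_frame frame;
    int depth = 0;

    #pragma unroll
    for (int i = 0; i < MAX_DEPTH; i++) {
        /* copy the frame from user memory; stop if it is unreadable */
        if (bpf_probe_read_user(&frame, sizeof(frame), rbp))
            break;
        /* stop when it no longer looks like a stack frame */
        if (frame.return_address == 0 || frame.next_frame == 0)
            break;
        ips[depth++] = frame.return_address;  /* store the return address */
        rbp = frame.next_frame;               /* jump to the next frame   */
    }
    return depth;
}
```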

It stores the return address, and then jumps to the next stack frame in a loop until it exceeds a limit, or hits something that does not look like a stack frame.

Processing data

The userspace part of the profiler listens to events from the BPF module. It performs multiple tasks in parallel:

1. Resolving virtual addresses

The first task is resolving the virtual addresses of functions. From the BPF module, we receive a collection of virtual addresses which form a call stack. For each address, we need to determine which file is mapped at that address and which function the address points to.

Once the profiler receives the first event, it knows the PID of the TezEdge node. With this PID, it fetches from the /proc/<pid>/maps file the description of each memory area in the address space of the TezEdge node. Among others, this contains the descriptions of the memory areas of the executable light-node binary and of the shared libraries used by the node. These descriptions allow translating the virtual address of a function into a filename and an offset in the file where the function resides. The map can change dynamically, because the node might load shared libraries at runtime.

The monitoring of /proc/<pid>/maps is a continuous process. Every time a new shared library (or executable file) appears in the map, the profiler loads the .symtab, .dynsym and .strtab sections from it. In most cases, this allows translating the offset into the symbolic name of the function. The last step is to translate the name into a human-readable form.
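As a sketch of the first half of this translation (illustrative C, not the profiler’s actual implementation, which keeps the parsed map in memory and refreshes it when needed), finding the file and offset for a virtual address looks roughly like this:

```c
#include <stdio.h>
#include <string.h>

/* Find which mapping of /proc/<pid>/maps contains the virtual address
 * `addr`, and return the backing file and the offset inside it. */
int resolve(int pid, unsigned long addr, char *path, unsigned long *offset)
{
    char maps[64], line[512];
    snprintf(maps, sizeof(maps), "/proc/%d/maps", pid);
    FILE *f = fopen(maps, "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        unsigned long start, end, file_off;
        char perms[8], file[256] = "";
        /* line format: start-end perms offset dev inode [path] */
        if (sscanf(line, "%lx-%lx %7s %lx %*s %*s %255s",
                   &start, &end, perms, &file_off, file) < 4)
            continue;
        if (addr >= start && addr < end) {
            strcpy(path, file);
            *offset = addr - start + file_off;  /* offset inside the file */
            fclose(f);
            return 0;
        }
    }
    fclose(f);
    return -1;
}
```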

2. Grouping allocated pieces of memory by call stack

The second task is to group the allocated pieces of memory by their call stacks. The profiler tracks the amount of allocated memory in each distinct function.
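A minimal sketch of this bookkeeping (illustrative only; the real profiler maintains a richer tree of functions) could key a running total of bytes by a hash of the call stack:

```c
#include <stdint.h>
#include <stddef.h>

/* FNV-1a style hash over the return addresses of a call stack. */
uint64_t stack_hash(const uint64_t *ips, size_t depth)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < depth; i++) {
        h ^= ips[i];
        h *= 1099511628211ULL;
    }
    return h;
}

#define MAX_STACKS 4096
static struct { uint64_t hash; int64_t bytes; } usage[MAX_STACKS];

/* Add `delta` bytes (positive on alloc, negative on free) to the
 * entry of this call stack; linear probing keeps the sketch simple. */
void account(const uint64_t *ips, size_t depth, int64_t delta)
{
    uint64_t h = stack_hash(ips, depth);
    size_t i = h % MAX_STACKS;
    for (size_t n = 0; n < MAX_STACKS; n++, i = (i + 1) % MAX_STACKS) {
        if (usage[i].hash == 0 || usage[i].hash == h) {
            usage[i].hash = h;
            usage[i].bytes += delta;
            return;
        }
    }
}
```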

3. Preparing the memory usage report

The last task is serving HTTP requests and preparing reports of memory usage as a tree-like structure. The function names and the amounts of memory are displayed in the nodes of the tree.

The result should be a visualization of the real-time usage of physical memory by each function of the TezEdge node and the system libraries.

The profiler itself uses around 1 GB of memory. Half of that is a buffer between the kernel and userspace. The buffer contains events and call-stack pointers. Each allocation produces around 1 KB of data, and sometimes many allocations/deallocations happen very fast, so the buffer is necessary to process all of them. Currently, the profiler does not use the disk at all.

Visualizing memory consumption

Each rectangle represents a point in the TezEdge node’s code which causes memory allocation to happen. You can click on a rectangle and see the points in the code which cause the rectangle you chose to be executed, and so on, all the way up to the “__clone” function or another point in the code that starts the entire application.

The entry point of the application is the “__clone” function in the glibc library. Functions call other functions and return back to the previous function (caller). It can be visualised as a directed graph — functions are vertices of the graph and calling events are edges (arrows) of the graph.

Eventually some function can call the kernel (do a syscall), or provoke page faults which might allocate some memory. We only monitor those arrows of the graph that cause memory allocation. When we intercept such an arrow, we trace it back to “__clone”.

Unfortunately, that is not always possible. At the root of the frontend view are all the functions that cause memory allocation; here we can see the total memory consumption together with its biggest consumers. Therefore, on the frontend, we see this call graph in reverse, from the end to the root. For an illustration, see the picture below.

On the frontend, we will see “d” and “f”. If you click on “d”, you will see “c”, “f” and “clone”: all the functions which have an arrow to “d”. If you click on “f”, you will see only “b”, because there is only a single arrow to “f”. All paths begin in “clone”, so on the frontend, all of them should end in “clone”. However, that is not always the case, because some arrows are invisible to the profiler and the tracing gets interrupted, which is why the frontend sometimes displays “unknown”.

The tooltip shows the full path of the function, its own size, and the total size of its nested child functions.

In this animation, you can see that the navigation also works by using the table and the breadcrumbs.

Comparison of RAM usage after optimization

Throughout the development of the TezEdge node, we’ve been able to greatly reduce RAM usage, part of which has been achieved thanks to the memory profiler. You can see a comparison as demonstrated by the screenshots below. You can also view the TezEdge node’s post-optimization RAM usage in real time by visiting tezedge.com.

In both graphs, we hovered our cursors over the 1 hour point of the node’s timeline, meaning that the node had been running for one hour. The point in time is marked by a gray line crossing through 3 colored dots (blue, green and purple). Note that the one hour mark is located in different places in each graph, as each version of the node had been running for a different amount of time.

Additionally, do not confuse the gray line denoting the 1 hour mark with the right-most gray line that crosses through three colored dots. This line marks the latest (current) measurement of RAM usage.

TezEdge 1.4

TezEdge 1.6 and later (after optimization via memory profiler)

Note that the pre-optimization and post-optimization graphs each represent a different timespan as well as different amounts of RAM usage.

How to use

Prerequisites:

  • Please note that the kernel version should be at least 5.11 to run this software. Ubuntu 21.04 already has kernel 5.11. If you are running an older OS, refer to the OS manual to update the kernel.
  • The Docker version has to be version 20.10.6 or newer.
  • docker-compose has to be version 1.29.1 or newer.

Step-by-step guide:

Please make sure that your kernel version is at least 5.11. Once you’ve confirmed this and the other version requirements, type the following into the command line:

  1. git clone https://github.com/tezedge/tezedge-debugger.git
  2. cd tezedge-debugger

The first two steps clone the source code from GitHub and move into the directory. Strictly speaking, not all of the source code is necessary; we only need the docker-compose.yml file, which is part of the source code. Alternatively, you can take the docker-compose.yml file from the GitHub page.

3. docker-compose pull && docker-compose up -d

The third step runs the TezEdge node along with the memory profiler and the front end.

After 10 to 30 seconds, you can see the result at http://localhost/#/resources/memory in your browser.

Run docker-compose down -v to stop the node and the memory profiler and clean its temporary files.

If you wish to run the memory profiler and the TezEdge node from Docker images without docker-compose, you need to give the profiler access to the same environment as the TezEdge node.

The profiler can copy the light-node binary and shared libraries from the TezEdge image at runtime. To enable this behavior, you need to set the TEZEDGE_NODE_NAME environment variable and mount /var/run/docker.sock from the host system. Also, the profiler’s Docker container needs privileged rights to insert the BPF module into the kernel. Then mount the /sys/kernel/debug and /proc directories from the host. The image will work on port 17832. The whole command might look like this:

docker run --rm --privileged -it -e TEZEDGE_NODE_NAME=<<name of tezedge container>> -p <<port you choose>>:17832 -v /var/run/docker.sock:/var/run/docker.sock:rw -v /proc:/proc:rw -v /sys/kernel/debug:/sys/kernel/debug:rw simplestakingcom/tezedge-memprof:latest

You need to choose the TezEdge node’s container name prior to launching either container. First, run the memory profiler using the command above, providing the chosen container name. Then run the TezEdge node using the same name as its container name, as in the sketch below.
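For example, if you pick the name tezedge-node (the node image and options in the second command are placeholders; use whatever you normally run the node with):

```sh
# 1. Start the profiler and tell it the future name of the node container
docker run --rm --privileged -it -e TEZEDGE_NODE_NAME=tezedge-node \
    -p 17832:17832 \
    -v /var/run/docker.sock:/var/run/docker.sock:rw \
    -v /proc:/proc:rw -v /sys/kernel/debug:/sys/kernel/debug:rw \
    simplestakingcom/tezedge-memprof:latest

# 2. Start the TezEdge node under exactly that container name
docker run --name tezedge-node <tezedge node image> <node options>
```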

We thank you for your time and hope that you have enjoyed reading this article. If you have any questions or feedback, feel free to contact me. To read more about Tezos and the TezEdge node, please visit our documentation, subscribe to our Medium, follow us on Twitter or visit our GitHub.
