Need for Speed: Hyrise

Tools to improve C++ code performance

Toni Stachewicz
Hyrise
4 min read · May 15, 2018

We (Adrian, Marcel, and Toni) make up the master’s project “C++ Low-Level Performance Engineering” at the Hasso Plattner Institute. The previous master’s project, “Query Plan Optimizations”, shared seven blog posts in the Hyrise series. Over the next months, we are going to share posts about how we improve the performance of Hyrise.

Over the course of the last months, many of Hyrise’s new features were developed with a strong focus on adding more database functionality. We do not want to suggest that performance was disregarded. However, we see high potential for improvement in some of those recently added components. The goal of our master’s project is to identify and understand bottlenecks in the code. To find and fix these performance offenders, we use code profiling tools.

What is Code Profiling?

Code profilers are tools that dynamically analyze the runtime behavior of programs. There are many factors influencing the quality and performance of software. Generally, we want to examine the resources used (e.g., CPU usage, main memory consumption), the duration of specific methods, or concurrency behavior. This allows us to locate performance bottlenecks in our database project. Profiling helps us focus on critical code paths and optimize the expensive parts instead of chasing the last two percent of performance. We conduct our code profiling on the compiler-optimized binary while the database system is running. The information produced by a run depends on the profiler itself. Most profilers provide a statistical summary of the observed events.

Different Profiling Tools

There are different profiling techniques and tools. We are currently getting familiar with tools that rely on different methodologies, e.g., instrumentation, sampling, or hardware performance counters. Each of these techniques has its own advantages and drawbacks. The overhead these tools introduce might increase the execution time, and inaccuracies and approximate measurements can also occur while profiling.

Callgrind (Valgrind)

Valgrind is a suite of tools for profiling and memory debugging. Essentially, it is a virtual machine that uses just-in-time compilation techniques to dynamically analyze programs. One tool of this instrumentation framework is Callgrind. Its usage is straightforward: prefix the program invocation with valgrind --tool=callgrind and profiling starts. To analyze a program’s performance, Callgrind adds instructions to it. Hence, the execution time increases tremendously (potentially by a factor of 100). The tool creates a file containing all collected information. Since this file is not in a human-readable format, you may want to use a visualization tool. In our project we use kcachegrind. It provides an understandable and interactive user interface, including a flat profile and a call graph, as depicted in Graphic 1.

Compared to other profilers that rely on sampling, this method is very precise when it comes to runtime measurements of a method call. The main drawback is the slow execution described above.

Graphic 1. User interface of kcachegrind

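To keep that slowdown manageable, Callgrind can also be restricted to a region of interest via its client-request macros. The following is only a minimal sketch, assuming the valgrind/callgrind.h header is installed; expensive_method() is a hypothetical stand-in for the code path to be profiled.

// Build: g++ -O2 callgrind_region.cpp -o callgrind_region
// Run:   valgrind --tool=callgrind --instr-atstart=no ./callgrind_region
#include <valgrind/callgrind.h>

// Hypothetical stand-in for the code path we want to profile.
void expensive_method() {
  volatile long sum = 0;
  for (long i = 0; i < 100000000; ++i) sum += i;
}

int main() {
  // With --instr-atstart=no, Callgrind only instruments the code between
  // these two macros, so the heavy slowdown is limited to this region.
  CALLGRIND_START_INSTRUMENTATION;
  expensive_method();
  CALLGRIND_STOP_INSTRUMENTATION;

  // Write the collected data to a callgrind.out.* file for kcachegrind.
  CALLGRIND_DUMP_STATS;
  return 0;
}
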
Apple Instruments

Apple Instruments is part of Apple’s Xcode tool set. Among other things, it provides a time profiler, which measures the total execution time of each function. These times are derived by sampling: the running program is interrupted regularly and its function call stack is inspected. After program execution, approximate runtimes are calculated from those call stacks. The GUI shows total execution times and overall weights.

The tool is easy to use, provides a comprehensible UI, and does not need much setup to profile software. After selecting a program’s executable, profiling can start. A major drawback of Instruments is its operating system dependency: it is only available on macOS. Furthermore, the results of a profiling run are not as precise as Callgrind’s. Due to the use of sampling, statistical errors occur and distort the results.

Hardware Performance Counters (PAPI)

Modern microprocessors have special registers reserved for hardware performance measurements. These counters can record hardware-related metrics (e.g., cycle counts, cache misses, or branch mispredictions). They are freely available but must be configured for a particular metric, and their number is limited and depends on the microprocessor. The Performance API (PAPI) provides an application programming interface for accessing these hardware performance counters. The following example shows how we can use PAPI to count certain events.

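A minimal sketch of such a measurement, using PAPI’s low-level event-set API; expensive_method() is a hypothetical stand-in for the code under examination, and the PAPI_MEM_SCY preset (cycles stalled waiting for memory accesses) may not be available on every CPU.

// Build: g++ -O2 papi_example.cpp -lpapi -o papi_example
#include <papi.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for the code path we want to examine.
void expensive_method() {
  volatile long sum = 0;
  for (long i = 0; i < 100000000; ++i) sum += i;
}

int main() {
  // Initialize the PAPI library.
  if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
    std::fprintf(stderr, "PAPI initialization failed\n");
    return EXIT_FAILURE;
  }

  // Create an event set containing the two metrics we are interested in:
  // total CPU cycles and cycles stalled waiting for memory accesses.
  int event_set = PAPI_NULL;
  PAPI_create_eventset(&event_set);
  PAPI_add_event(event_set, PAPI_TOT_CYC);
  PAPI_add_event(event_set, PAPI_MEM_SCY);

  // Enable event counting right before the method under examination.
  PAPI_start(event_set);

  expensive_method();

  // Disable counting and gather the accumulated counter values.
  long long values[2] = {0, 0};
  PAPI_stop(event_set, values);

  std::printf("Total cycles:             %lld\n", values[0]);
  std::printf("Cycles stalled on memory: %lld\n", values[1]);
  return EXIT_SUCCESS;
}
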
In this small example we count the total number of CPU cycles spent and the number of cycles spent waiting for memory accesses. Event counting is enabled right before expensive_method() is examined; after the method returns, counting is disabled and the collected metrics are gathered. One of the main advantages of PAPI is its comparably low overhead.

Conclusion

We tried out multiple code profiling tools and experienced their advantages and disadvantages. With the help of these performance profilers, we want to find bottlenecks, eliminate them, and thus make Hyrise even faster. Benchmark results, improvements, and insights will be shared in our next posts.

— Adrian, Marcel and Toni
