Shifting to the next gear

Toni Stachewicz
Hyrise
Jun 21, 2018

A line of code that causes a 13 % performance loss

In the last blog post, we described different profilers, their usage, advantages and disadvantages. In this article, we provide an example of finding a performance bottleneck with the help of such code profilers. Naturally, we also provide a solution which accelerates the slow code path.

Introduction

One of our project objectives is to detect performance-critical code snippets. In order to find lines or snippets of code that negatively affect performance, we have to execute and profile them. We do that by running benchmarks on Hyrise. A commonly used and standardized benchmark is TPC-H.

TPC-H is a decision support benchmark consisting of a collection of SQL queries for data analysis. The queries run on generated tables that simulate business-oriented data. Our research database already supports the TPC-H benchmark, but not to its full extent. Since Hyrise is still under development, it does not yet support all query types and data types. Hence, we can only run 8 out of the 22 queries, with slight modifications. Nevertheless, the benchmark is well suited to finding potential bottlenecks and improving the overall database performance.

Finding the first bottleneck

First of all, we ran the benchmark with Valgrind’s Callgrind tool enabled (see the last blog post). For each query, we created a stacked bar chart that depicts the relative time needed per operator (see Graphic 1). Since Callgrind uses the instrumentation profiling technique, additional instructions are added to the code, which increases the execution time tremendously. As we were only interested in approximate times, we chose a relatively small scale factor for the generated tables. It is important to keep in mind that this can lead to inaccurate results: when a table grows linearly, operator run-time often grows superlinearly (e.g., for joins).

Graphic 1. Distribution of time needed per Operator

Therefore, we also analyzed the performance with Apple Instruments and a higher scale factor. Additionally, we evaluated the time measurements given by Hyrise itself. The database is capable of printing the physical query plan including the execution times per operator.

The paper “TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark” contains a choke-point analysis that identifies a performance focus area for each TPC-H query. Comparing these focus areas with the profiling results of Hyrise allows us to find operators in Hyrise that account for a large share of the execution time.

In the beginning, we paid special attention to TPC-H query 1 and noticed that a lot of time is consumed by the projection operator. However, operators such as joins and aggregates are expected to have rather long runtimes, whereas projections tend to be faster. Because of that, we started digging deeper with Callgrind.

Graphic 2. Operator time distribution with different profiling tools

With the aid of a call graph generated by KCachegrind (see Graphic 3), we found the performance offender rather quickly: almost 34.5 % of the time spent in this component goes to the “get_column()” method.

Graphic 3. Call graph of method within the projection operator

What does this method do? Let us take a closer look (a sketch follows below). The benchmark query performs many operations on the data stored in different tables, and the database operators access specific columns of these tables during execution. The method “get_column()” provides access to the actual data of a column: an operator passes a column ID and receives a reference to the table’s column, which is then used for further operations. Whenever we iterate over the values of such a temporary result, “get_column()” is called, among other methods.
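
A minimal, hypothetical sketch of such an accessor is shown below; Chunk, BaseColumn, and ColumnID are simplified stand-ins for illustration, not Hyrise’s actual classes.

#include <cstddef>
#include <memory>
#include <vector>

// Simplified stand-ins for illustration; not Hyrise's actual classes.
using ColumnID = std::size_t;

struct BaseColumn {
  std::vector<int> values;  // simplified: real columns are typed and encoded
};

class Chunk {
 public:
  // Returns a handle to the column with the given ID.
  std::shared_ptr<BaseColumn> get_column(ColumnID column_id) const {
    // The atomic load discussed below guards against concurrent
    // modification of the column vector.
    return std::atomic_load(&_columns.at(column_id));
  }

 private:
  std::vector<std::shared_ptr<BaseColumn>> _columns;
};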

Why does this affect performance?

The method “get_column()” is part of the so-called “hot loop”, the part where the database accesses the actual data or runs operations on it, such as comparisons. These loops are executed very frequently. In this example, we call “get_column()” every time we access a single value. Since the values in the underlying column do not change during execution, we can safely store the column reference in a variable once. If our column has one million entries accessed by the database, the function is called one million times, although running it once would be sufficient.
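
Building on the sketch above, the difference could look roughly like this; the summing operator is a made-up example, not Hyrise’s actual projection code.

// Before: get_column(), and with it the atomic load, runs once per value access.
long sum_per_access(const Chunk& chunk, ColumnID column_id, std::size_t row_count) {
  long sum = 0;
  for (std::size_t row = 0; row < row_count; ++row) {
    const auto column = chunk.get_column(column_id);  // one million calls for one million rows
    sum += column->values[row];
  }
  return sum;
}

// After: the column does not change during execution, so a single lookup
// before the hot loop is sufficient.
long sum_hoisted(const Chunk& chunk, ColumnID column_id, std::size_t row_count) {
  const auto column = chunk.get_column(column_id);  // called exactly once
  long sum = 0;
  for (std::size_t row = 0; row < row_count; ++row) {
    sum += column->values[row];
  }
  return sum;
}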

As if that were not enough, the method performs an atomic load. A lock is created to prevent other threads from accessing the variable while it is read. This is a very time-consuming part and takes 14.6 % of the method’s execution time (cf. Graphic 3). If Hyrise runs with multiple threads, which it definitely does in production, every single thread has to wait to acquire the lock. Accessing the variable becomes a largely sequential process.

Do we really want to wait?

When one thread writes to a memory location while another thread reads or writes the same location, this can lead to a “data race”, and the program behavior becomes undefined. Atomic operations help to avoid data races. However, we saw that the returned columns are read-only objects during query execution. Thus, we do not need an atomic load in this method, and therefore we do not want to wait.
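
In terms of the sketch above, one possible shape of this change is to return the column handle as a plain shared_ptr copy instead of going through the atomic load; the actual change in Hyrise may look different.

// Illustrative only: since the columns are read-only during execution,
// copying the shared_ptr suffices; no lock is taken, so no thread waits.
std::shared_ptr<BaseColumn> get_column(ColumnID column_id) const {
  return _columns.at(column_id);
}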

Is Hyrise faster now?

As expected, the answer is yes. We increased the performance of TPC-H query 1 by 13 %. There is more good news: not only this specific query, but also other TPC-H queries benefit from the change. Their execution times decreased by between 2 % (TPC-H query 13) and 14 % (TPC-H query 7). If we look into the projection operator via KCachegrind again, the “get_column()” method no longer appears in the call graph, since the time spent in it is now negligible.

Stay tuned!

— Adrian, Marcel and Toni
