Parallelization efforts using VTune

Julian Menzler
Published in Hyrise
Oct 18, 2019

Up to this point, we have focused largely on query plan optimization for individual long-running queries (Sub-Plan Memoization, Disjunction Split-Up). In addition, we wanted to take a closer look at the Hyrise execution engine. To do so, we used Intel’s performance profiler VTune™ Amplifier to analyze the runtime behavior of Hyrise while executing TPC-DS queries.

Analysis

VTune Amplifier Hiccups

VTune Amplifier offers plenty of options and reports to play around with. For our studies, we found the basic Hotspots analysis to be sufficient.

Our test system consisted of four Intel Skylake 28-core processors, forming a NUMA system with 224 logical cores. We profiled our Hyrise TPC-DS executable in multithreaded mode, simulating 28 clients.

You might ask why we simulated only 28 clients. The reason is that we ran Hyrise and VTune on a single NUMA node. Initially, we tried to use all available NUMA nodes, but with a sampling interval of 5–10 ms, the amount of collected data simply became too large for the system to handle. We experienced numerous system crashes with messages indicating “sample loss” as the reason.

Interpreting the VTune report

There are many ways to look at a VTune report. For our purposes, we found the “Bottom-up” perspective to be the most useful one. Grouping by Source File and sorting by CPU time, we ended up with the following view (Figure 1):

Figure 1: Screenshot, Intel® VTune™ Amplifier 2019 for macOS

As you can see, the top three time-consuming files are join_hash_steps.hpp, bytell_hash_map.hpp and validate.cpp.

For our parallelization efforts, we decided to take a closer look at validate.cpp, which implements the MVCC validation process required by almost every transaction within Hyrise. Compared to the hash join (join_hash_steps.hpp) and the third-party hash map (bytell_hash_map.hpp), its concept and implementation are rather easy to understand.

MVCC Validation

What is MVCC?

Hyrise is an insert-only database. It implements Multiversion Concurrency Control (MVCC) to support efficient concurrent access. To this end, Hyrise stores additional metadata for each row, namely a begin and an end commit ID. The validation process implemented in validate.cpp uses these attributes to determine whether a row is visible to a given transaction.
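To make the visibility rule concrete, here is a minimal sketch of such a check. The function name, parameter names, and the exact handling of a transaction’s own uncommitted writes are illustrative assumptions, not Hyrise’s actual implementation:

```cpp
#include <cstdint>

using CommitID = uint32_t;
using TransactionID = uint32_t;

// Simplified visibility rule (illustrative, not Hyrise's actual code):
// a row is visible if it was inserted by a transaction that committed at or
// before our snapshot (begin_cid <= snapshot_cid) and has not been deleted
// from our point of view (end_cid > snapshot_cid). Rows inserted by our own,
// not-yet-committed transaction are also considered visible here.
bool is_row_visible(TransactionID our_tid, CommitID snapshot_cid,
                    TransactionID row_tid, CommitID begin_cid, CommitID end_cid) {
  const bool own_uncommitted_insert = (row_tid == our_tid) && begin_cid > snapshot_cid;
  const bool committed_before_snapshot = begin_cid <= snapshot_cid;
  const bool not_deleted_yet = end_cid > snapshot_cid;
  return own_uncommitted_insert || (committed_before_snapshot && not_deleted_yet);
}
```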

Implementation of the MVCC validation

Looking at validate.cpp, we noticed that the aforementioned validation process was implemented in a purely sequential manner. This means that for each transaction, tables have to be processed sequentially row by row at least once.

The MVCC validation effort is significant: for TPC-DS at scale factor 1, the inventory table contains 11,745,000 rows. So, in most cases, queries using the inventory table require validation of roughly 12,000,000 rows.

Parallelization

Strategy

In Hyrise, tables “are horizontally partitioned into chunks“ ¹. This means that operations, such as the MVCC validation, process rows chunk-wise.

Because the MVCC validation acts as a simple filter, there are no conflicting dependencies between input and output chunks. There is also no need to retain the order of input rows. Therefore, chunks can be validated independently of each other.

We create a validation job for each input chunk and add it to Hyrise’s scheduler. The scheduler, which maintains a number of worker threads, schedules our jobs appropriately. After execution, each validation job adds its results to a dedicated result table, synchronized by a mutex.

Our implementation worked flawlessly, but we refined one aspect: in some cases, a single job should process not just one, but multiple chunks. Experience has shown that jobs covering only tiny amounts of work create too much scheduling overhead. We therefore try to ensure that each job has around 100,000 rows to process. That number is also the default chunk size and has proven to be a good choice in terms of throughput for the TPC-H benchmark ¹.
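To illustrate the strategy, here is a minimal sketch of how chunks could be grouped into validation jobs of roughly 100,000 rows each and handed to a scheduler. The types and the scheduler interface (schedule_job, wait_for_all) are hypothetical placeholders, not Hyrise’s actual API:

```cpp
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-ins for Hyrise's chunked table and job scheduler.
struct Chunk { std::size_t row_count = 0; /* ... row data ... */ };
struct Table { std::vector<Chunk> chunks; };

constexpr std::size_t kMinRowsPerJob = 100'000;  // roughly one default-sized chunk

template <typename Scheduler, typename ValidateFn>
void validate_in_parallel(const Table& input, Table& result, Scheduler& scheduler,
                          ValidateFn validate_chunk) {
  std::mutex result_mutex;  // synchronizes access to the shared result table
  std::size_t first = 0;
  while (first < input.chunks.size()) {
    // Greedily group consecutive chunks until the job covers ~100,000 rows.
    std::size_t last = first;
    std::size_t rows = 0;
    while (last < input.chunks.size() && rows < kMinRowsPerJob) {
      rows += input.chunks[last].row_count;
      ++last;
    }
    scheduler.schedule_job([&, first, last]() {
      for (std::size_t i = first; i < last; ++i) {
        Chunk visible_rows = validate_chunk(input.chunks[i]);  // MVCC check per row
        std::lock_guard<std::mutex> guard(result_mutex);
        result.chunks.push_back(std::move(visible_rows));
      }
    });
    first = last;
  }
  scheduler.wait_for_all();  // all validation jobs have finished
}
```

The grouping keeps the number of jobs small for tables with many tiny chunks, while large tables such as inventory still produce enough jobs to keep all worker threads busy.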

Performance Results

To evaluate our changes, we benchmarked our (executable) TPC-DS queries before and after parallelizing the MVCC validation. We chose a TPC-DS scale factor of one and a system capable of handling 56 threads concurrently.

The following chart (Figure 2) represents the relative runtime reduction in percent per query. We use a relative time representation to also reflect the performance improvements of short-running queries.
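The relative reduction per query is presumably computed as (t_before − t_after) / t_before × 100 %, where t_before and t_after are a query’s runtimes before and after parallelizing the MVCC validation. A short query dropping from 20 ms to 12 ms thus shows the same 40 % reduction as a long one dropping from 10 s to 6 s.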

Figure 2: Relative query runtime after parallelizing the MVCC validation

As you can see, we achieved a noticeable reduction in runtime for 29 queries (blue bars). For four queries (grey bars), however, the validation is not a significant cost factor, for the following reasons:

  • Query 10 involves an EXISTS(..) OR EXISTS(..) construct. At the time we collected these numbers, this construct resulted in a high runtime of around 60 seconds, making the runtime impact of the MVCC validation rather negligible. We tackled this expensive construct later on by implementing our Disjunction Split-Up optimizer rule.
  • For query 28, the optimizer rearranges the query plan: the tables are filtered by predicates before the MVCC validation. The resulting tables are very small and do not benefit from parallel validation.
  • Queries 81 and 83 are based on tables with fewer than 150,000 and 100,000 records, respectively.

In summary, across all 33 TPC-DS queries tested, we achieved an average performance improvement of around 40 percent.

References

  1. Markus Dreseler, Jan Kossmann, Martin Boissier, Stefan Klauck, Matthias Uflacker, Hasso Plattner. 2019. Hyrise Re-engineered: An Extensible Database System for Research in Relational In-Memory Data Management, openproceedings.org
