Adding some Nitro

Marcel Ja
Hyrise
Published in
4 min readSep 4, 2018

Iterate referenced data faster

During our master’s project, we used code profilers several times in order to inspect the runtimes of many Hyrise components. At one point, we realized that accessing values while iterating over columns consumes much time. To be specific, iterating over columns that reference data from other tables. These columns are called ReferenceColumns in Hyrise. Since there are a lot of spots in the code where this happens, there is a high potential to improve the overall performance by just changing the way we iterate over those ReferenceColumns.

Why do we have ReferenceColumns?

Normally, the tables in the database have columns that contain values. When executing an SQL statement, Hyrise executes different operators (e.g. a table scan). The output of such an operator is a new table that only contains references to the actual values. By that, much less memory is needed to build the returned table. Let us assume we have the following table and want to execute this SQL statement:

SELECT * FROM “person_table” WHERE “age” < 42;

The resulting table has three entries. Both columns of the resulting table (output_table in the figure below) share a position list (pos_list) which points to the specific rows in “person_table”. The good thing is that we only need one position list for all columns of the input table. Otherwise, we would have to materialize and store each value which is matched during the table scan. For other operators (e.g. join), the ReferenceColumns are even more important because the resulting table might have much more rows than the input tables.

Hyrise tables consist of chunks

There is one more thing to understand how ReferenceColumns work. Tables in Hyrise are usually split into into multiple chunks. For the chunk concept to work, the position list also needs information about the corresponding chunk. Therefore, it stores an ID of the chunk and the offset of the value in the specific chunk. If the table scan example’s input table would be split into two chunks, the position list would look as follows.

Columns in Hyrise might be encoded to reduce the memory footprint. This can also speed up operators because less data needs to be transferred. The first chunk (chunk 0) is encoded with dictionary encoding. Thus, the first chunk stores the content of “name” as DictionaryColumn<std::string> (‘Alexander’ and ‘Mark’). Additionally, it stores the ages 27 and 49 as DictionaryColumn<int>. The second chunk (chunk 1) has the native, uncompressed encoding. Unencoded chunks have columns called ValueColumn. Therefore, the second chunk has one ValueColumn<std::string> and one ValueColumn<int>. Both ValueColumn and DictionaryColumn (and a few other columns) inherit from a class called BaseColumn. This is illustrated in the following class diagram.

The ReferenceColumn is referencing columns that can have different types. Therefore, it has pointers to BaseColumns. When we want to iterate over a ReferenceColumn, the BaseColumn in the current chunk is accessed and the subscript operator [] is called with the given offset of the value. Because the subscript operator [] is implemented as a virtual method of the BaseColumn, the call is redirected to the associated templated column. The BaseColumn is not a templated class and does not know the type of the values. Therefore, an AllTypeVariant gets returned, as shown in the class diagram. An AllTypeVariant is an object in Hyrise that can store values with different data types (e.g. strings, integers or floats). This is comfortable to use, but has a downside. There are additional construction costs and it adds more indirections. We would like to avoid this while iterating over columns because that costs performance.

Our solution for that problem is to introduce new column accessor classes. These classes are templated. The data type (T) of the values stored in the underlying columns are known. The ColumnAccessor<T, ColumnType> is also templated with the type of the accessed column (e.g. ValueColumn<T>). Every column containing actual data and not just references must implement a new method called get_typed_value(). Thereby, the column accessor can redirect the accesses to this method of the specific column. The ReferenceColumn can store pointers to BaseColumnAccessor<T>. Now the ReferenceColumn can access values without creating an AllTypeVariant object.

To check if our new approach is faster, we looked at the TPC-H benchmarks. More information about the TPC-H benchmarks can be found in one of our last blogposts. The following diagram shows that all benchmarked queries performed better.

This is the last blogpost from our master’s project “C++ Low-Level Performance Engineering”. It was a pleasure to give you some insights about what we did during the last months. We hope that you enjoyed the reading and remember how to use profilers and speed up your code.

Cheers,

Adrian, Marcel and Toni

--

--