Query Plan Optimization: Sub-Plan Deduplication

Marcel Weisgut
Hyrise
Published in
7 min readAug 14, 2019

In the previous blog posts, we introduced the Hyrise master’s project of the summer semester 2019 and gave an overview of the TPC-DS benchmark. Since the beginning of the project, the number of TPC-DS queries that Hyrise supports was increased and we got insights about the execution times of 32 queries. In this blog post, we present how deduplication of query sub-plans increased the performance by a factor 5. For this, we first show the benchmark results of Hyrise compared with those of DuckDB, then we take a look at the three queries for which Hyrise took the longest execution time, and finally present how query plan optimization, especially query sub-plan deduplication, increased the performance.

Benchmark Comparison: Hyrise vs. DuckDB

To identify querying scenarios in which Hyrise performs poorly, we need measurements of concrete execution times. Additionally, in order to better evaluate these results, a comparison of Hyrise’s execution times with these of other databases would give further insights. The more reference values we have the better we can evaluate individual query execution times.

We only take DuckDB’s TPC-DS benchmark results as a reference value into account. „DuckDB is an embedded database designed to execute analytical SQL queries fast while embedded in another process“ [1]. It is mostly implemented in C++ and DuckDB’s contributors set up continuous benchmarking, which allows us to get up-to-date query execution measurements and the physical query plan for each executed query. This transparency is a good prerequisite for being able to compare measurements.

Figure 1 depicts the TPC-DS execution times for the queries that are currently supported by Hyrise. Additionally, Figure 1 also shows the corresponding execution times of DuckDB.

Specifications of the machine on which the benchmarks were executed:
4x Intel(R) Xeon(R) CPU E7-8870 2.40GHz; 10 cores per CPU (2 hardware threads); 1.5 TB main memory; 32 KB L1, 256 KB L2-cache per core, 30 MB LLC.

Figure 1: TPC-DS benchmark result comparison between Hyrise and DuckDB; scale factor 1, single-threaded

Figure 1 shows that the majority of queries can be executed faster with DuckDB. Hyrise’s execution times for the queries 31, 35 and 93 are particularly conspicuous. Whereas the majority of queries can be executed in less than two seconds, mostly even in less than one second, these three queries go far beyond the two-second mark. We focus on these three queries as they seem to offer the highest potential for performance optimizations.

Two General Optimization Approaches

In general, there are two approaches to optimize the performance of query execution. On the one hand, one can improve the current runtime system. This can be achieved, for example, by optimizing physical operators, by using parallelization, by data partitioning, data placement and pruning or data compression techniques, etc. On the other hand, query execution times can be reduced by improving the query optimizer, which transforms the initial query plan into the best available sequence of operations on the actual data [2]. This logical component has a crucial influence on performance. As mentioned by Thomas Neumann [3], the effort for improving the optimizer usually pays off and the impact is often much larger than the impact of the runtime system. Therefore, let’s try to optimize the query plan for query 31.

A Closer Look at Query 31

Before we dive deep into the query plan, let’s state out the query in natural language:

List counties where the percentage growth in web sales is consistently higher compared to the percentage growth in store sales in the first three consecutive quarters for a given year.

The SQL query for this can be found here.

Figure 2: Simplified Logical Query Plan for TPC-DS Query 31

Figure 2 depicts the simplified logical query plan of query 31 generated by Hyrise. A logical query plan (LQP) in general is a directed acyclic graph whose nodes loosely resemble the operations of the relational algebra [4]. Based on the LQP, a physical query plan (PQP) is created that determines which actual physical operators are executed in which particular order. Therefore, the previous artifact — the logical query plan — should be highly optimized.

In Figure 2, the nodes Tree A and Tree B are just placeholders for comprehensive sub-trees. For example, the sub-tree Tree A can be seen in Figure 3, which is an excerpt of the overall LQP of the query. Among other nodes, this sub-tree contains two join nodes and one aggregate (GroupBy) node. Depending on the configuration, executing the corresponding physical operators can be very expensive.

Figure 3: Excerpt of the Overall Logical Query Plan Including “Tree A”
Figure 4: Physical Query Plan for “Tree A”

Figure 4 depicts the PQP for Tree A. Especially the aggregate operation with an execution time of more than one second is enormously expensive. Since both Tree A and Tree B are found three times in the query LQP, each corresponding partial PQP is executed three times. Even if the PQP of Tree B is not as expensive as the PQP of Tree A, it obviously reveals optimization potential.

The concrete execution times of the sub-trees on the previously mentioned machine are about 1.65 seconds (Tree A) and 0.45 seconds (Tree B). The overall query takes about 5.42 seconds. Therefore, the execution time might be reduced by 2*(the execution time of Tree A) + 2*(the execution time of Tree B), which would result in an execution time reduction of about 4.2 seconds.

Sub-Plan Deduplication: Creating Diamond Structures

The redundant execution of identical physical query sub-plans leads to avoidable execution time. Since a PQP is based on the previously created and optimized LQP, the problem can be solved on the LQP level by using duplicate elimination of logical query sub-plans. This can be implemented as a final phase in the query optimizer. This process uses depth-first traversal and integrates sub-plans into a memorization structure (e.g., a set), starting at the leaves of the overall LQP and works upwards to the root. If a sub-plan SP shall be stored but an identical one already exists in the memorization structure, the already stored sub-plan is set as a child of the parent(s) of SP and SP itself is discarded. This results in diamond structures as shown in Figure 5. The procedure is similar to Cascades‘ duplicate detection search for substitute expressions [5].

Figure 5: Improved Simplified Logical Query Plan for TPC-DS Query 31

Another approach to avoid the redundant execution of equal physical operators is to identify equal LQP sub-trees on the translation level. As mentioned before, the optimized LQP is translated into a PQP. During the translation, a hash map stores each created physical operator for the corresponding LQP node. Each time when an LQP node is translated, a lookup in the hash map is executed. If a semantically identical LQP node has already been translated and thus a physical operator has already been stored in the hash map, this operator is used as the result for the LQP node to be translated. After all, this leads to diamond structures in the overall PQP, while the overall LQP has no diamond structures.

Resulting Performance Improvement

To be able to measure the actual performance impact on query 31 using the explained procedure, the sub-tree duplicate elimination on the LQP translation level was implemented in Hyrise. This second approach was chosen since the proposed approach on the LQP optimization level would lead to architectural issues which are related to further projects and thus are not trivial to solve. Since the elimination of multiple executions of identical physical operators is the key aspect of this optimization strategy, the optimization on LQP translation level is also sufficient and was faster to achieve.

Figure 6 shows Hyrise execution times for TPC-DS queries where applying the duplicate elimination strategy results in a performance increase by a factor greater than 1.5. The execution time of Query 31 was reduced by 3.56 seconds. Therefore, the performance was increased by a factor of about 3. Besides the highest absolute improvement of Query 31 with 3.56 seconds, the highest relative improvement was measured in Query 88. The achieved factor is about 5.5.

The previously calculated performance improvement of 4.2 seconds for Query 31 was not achieved. The reason for this has not been investigated in detail. A partial reason for this is the effort for looking up the sub-trees in the hash map, which — of course — also costs execution time.

Figure 6: Resulting Performance Increase of Sub-tree Duplicate Elimination

References

  1. CWI Database Architectures Group. 2018. DuckDB — An Embeddable Analytical Database, [Online]. Available: https://www.duckdb.org/ (visited on 07/26/2019).
  2. Garcia-Molina & Hector, Hector & , Ullman & D, Jeffrey & , Widom & , Jennifer. 2002. Database Systems: The Complete Book.
  3. Thomas Neumann. Engineering High-Performance Database Engines
  4. Markus Dreseler, Jan Kossmann, Martin Boissier, Stefan Klauck, Matthias Uflacker, Hasso Plattner. 2019. Hyrise Re-engineered: An Extensible Database System for Research in Relational In-Memory Data Management
  5. Goetz Graefe. 1995. The Cascades Framework for Query Optimization

--

--