Query Plan Optimization: Disjunction Split-Up

Jonathan Janetzki
Hyrise
Published in
4 min readOct 15, 2019

In our previous blog post, we showed how optimizations of the Logical Query Plan (LQP) could improve the performance of Hyrise in the TPC-DS benchmark. When we presented the Sub-Plan Memoization as our first optimization, we pointed out that especially TPC-DS Query 35 still has performance bottlenecks. In the following, we introduce our new Disjunction Split-Up optimizer rule, which vastly decreased the runtime of this query.

A Closer Look at Query 35

Figure 1: TPC-DS Benchmark Result Before we Added Optimizations; Scale Factor 1, Single-Threaded

Among all of the 39 TPC-DS queries that Hyrise can execute, Query 35 sticks out because it has by far the longest execution time (see Figure 1). With 56s, it takes more than ten times longer than any other query.

Figure 2: SQL Excerpt from TPC-DS Query 35

Figure 2 shows an SQL excerpt from Query 35 containing the critical part. In natural language, this part resolves to:

List customers who have purchased from the web or the catalog.

The full SQL query can be found here.

Figure 3: Logical Query Plan Excerpt from TPC-DS Query 35

Figure 3 shows the general LQP structure of this part: a predicate node (WHERE clause) accessing two correlated subqueries. For each customer (table node), this leads to one execution of both subqueries. While Subquery A filters all customers who have purchased from the catalog, Subquery B filters all who have purchased from the web. For both of them, the correlated parameter is the customer’s ID. We are going to generalize this pattern and describe how it can be optimized.

The repeated execution of both subqueries results in the observed poor runtime behavior. We could overcome it if the subqueries only had to be executed once. There is already a so-called Subquery to Join Rule in Hyrise that flattens such subqueries. It pulls out the correlated parameter and then uses this parameter to perform a hash semi join on the input node with each of the remaining subqueries.

However, the Subquery to Join Rule only optimizes subqueries that are at the top-level of the predicate expression and ignores subqueries that are concatenated with a predicate. To make the subqueries accessible, we need to simplify the predicate expression into something that the rule can process. Hyrise already supports the Predicate Split-Up Rule, which splits conjunctions (ANDs) into predicate “chains”. To optimize structures like the above (see Figure 3), Hyrise also needs to split up disjunctions (ORs) to predicate “diamonds” (see Figure 4).

Implementation

We extended the Predicate Split-Up Rule by adding a Disjunction to Union Rule. Wherever it finds a disjunction in a predicate node, it breaks it apart into two predicate nodes and merges their results with a union node. If the original logical expression (e.g., A AND (B OR C OR D)) contains more than two atomic expressions, we split it up recursively.

Figure 4: Logical Query Plan Excerpt from TPC-DS Query 35 After Applying the Disjunction Split-Up

Our pattern from Query 35 provides an example of a Disjunction Split-Up with only two atomic expressions. As explained for the sub-plan memoization, we can create a diamond structure to avoid redundant executions of an input node. Figure 4 shows the resulting LQP.

The subqueries are now top-level expressions of the predicate and Hyrise can reformulate subqueries that are concatenated with any logical expression to a much faster hash semi join by applying the Subquery to Join Rule.

Even if a query contains no subqueries, the Disjunction Split-Up Rule may improve the execution time because Hyrise evaluates multiple predicate expressions with atomic logical expressions faster than single ones with complex expressions: The TableScan operator, which evaluates predicates in Hyrise, has multiple scan implementations to choose from. Instead of using a general-purpose implementation for concatenated predicates, the TableScan can now use scan implementations that are optimized for a single predicate.

Evaluation

In total, the Disjunction Split-Up Rule improves the performance of 18 TPC-DS and TPC-H queries by more than 20%, which are 30% of all executable queries, while no queries were negatively impacted. Overall, we achieved a 55% performance improvement for TPC-DS and a 5% performance improvement for TPC-H.

The execution time of Query 35 was reduced from 56s to 0.26s, which is a performance improvement of 22,000%.

--

--