A Matter of Self-Expression

Making Sense of SQL Expressions for Query Optimization

What has happened so far?

This post is part of a blog series about the progress of writing a query optimizer for Hyrise, a main-memory column-oriented research database.
So far, we have written about:

  1. The concept of operators in Hyrise
  2. Our overall test & benchmark environment
  3. The overall architecture of our optimizer
  4. Database Statistics in Hyrise

In the previous blog posts we introduced the skeleton of our query optimizer and showed, for example, how we plan to make use of statistics. So far we have not discussed how we actually want to translate any of the declarative SQL into executable, imperative code. Leaving the optimizer aside this is actually a pretty straightforward, rule-based translation. But this post shall not focus on these translators — instead, we will shed some light on a concept much more complex: expressions in SQL.

SQL, in general, is a powerful language that supports complex, nested statements. In the WHERE condition, for example, the user is free to chain multiple predicates with logical operators such as AND or OR. At the same time, he is also able to add arithmetic calculations to the projection list or to join results from sub-selects. We could go on endlessly.

In short, the user is able to express the desired output of the query in a very flexible way. In Hyrise we decided to represent this flexibility with two tree structures. The first one is the abstract syntax tree (AST) that we use to optimize the general structure of the SQL statement. The second one comes into play when we talk about expressions. Certain nodes of the AST hold references to instances of such an expression tree, e.g. the PredicateNode or the ProjectionNode.

Why do we take on this burden?

Interpreting and actually understanding user-defined expressions is not a trivial task. But there are cases in which we can simplify the expression without really understanding the intention.

Especially with Hyrise’s current operators it is crucial to avoid complex expressions as much as possible: e.g., for now, TableScans only support a single condition. Whenever there are multiple predicates joined by an AND, this is translated into multiple TableScan operators. To save valuable execution time, it is important to skip any redundant TableScan. Let’s have a look at an example.

In this example, we have multiple predicates in the WHERE condition. As Hyrise operators currently can not solve arithmetic expressions, we have to add additional stages to pre-compute these results. Since each of these stages would result in additional computational overhead, we want to leverage the optimizer as much as possible. In detail, we want to define optimization rules that recognize arithmetic operations on literals, like those in the BETWEEN clause, and solve them.

Having a look at Hyrise Expressions in detail

Hyrise’s SQL parser (which can be found here and will be called hsql from here on) is an external component that is designed to be a standalone, database-agnostic SQL parser. It creates an AST from a valid SQL query. Part of this tree will be an Expr node wherever expressions occur. Explaining the structure of Expr is beyond the point of this article — the bottom line is that we do not use it to optimize the structure of expressions or to evaluate expressions in Hyrise’s execution engine.

Instead, we create a new binary AST from hsql’s Expr objects by creating nodes of type ExpressionNode and work with them from here on. An ExpressionNode can represent a variety of different things, for example an arithmetic or logical operation, a column reference, a placeholder, or even SQL’s asterisk operator ‘*’, which some people still use to select all columns of a table. When representing a simple expression (a + 5) * 3 the expression tree would look like this:

(a + 5) * 3

We deliberately chose to design our expression tree without using inheritance, giving the ExpressionNode a type enum member instead, a variant which contains the applicable value, if any, and an optional alias. This way the code for the ExpressionNode itself is very compact. The code using and analyzing the nodes doesn’t have the need for a lot of casting boilerplate either.

How about expressions that we cannot simplify?

As described above, Hyrise currently uses different operators to eventually execute supported expressions.
Any kind of arithmetic expression is performed by the Projection operator. Hyrise’s support here is still limited, for example sub-expressions in braces can’t be evaluated in one step and need to be executed by another Projection beforehand. While this currently only works for the Projection, the same approach could also be used for expressions in WHERE and HAVING clauses. Logical expressions in WHERE clauses are translated into nested TableScans, Unions or, if applicable, into Joins.
Just as one would assume, the Aggregate operator handles aggregate functions, usually as a final step of the query execution. Any nested arithmetic operation that is passed to the aggregate function has to be executed via Projection again.

This chart depicting the execution plan of a most basic query illustrates which different types of operators are used to execute different kinds of expressions.

SELECT AVG(a + b) FROM t WHERE a < b

How to move forward

With this flexible approach of expression trees we are able to work with SQL expressions in the optimizer. However, we are not able to circumvent the limitations of Hyrise operators, which cannot work with complex expression trees. The optimizer might be able to simplify some of these expressions to avoid executing unnecessary Hyrise operators. In the long run, Hyrise operators might support limited expression trees, so that we can reduce the number of additional Projections and TableScans even further.

Until then we will keep you posted with further insights into our query optimizer. Stay tuned!