from SQL to function calls
Welcome to our blog post series on Query Plan Optimizations for Hyrise!
Last things first: Hyrise is a research-oriented, column-based, in-memory database. It is currently being developed at the Hasso Plattner Institute in Potsdam, Germany. At the moment Hyrise supports many important database operators and indexes, but lacks the functionality to generate efficient query plans for SQL strings. That’s where we come into play:
We, that is Fabian, Jonathan, Moritz, Sven, and Tim, constitute the master project “Query Plan Optimizations” at the Hasso Plattner Institute. The master project is a software project spanning six months of our Master’s degree. Our main tasks are
- to evaluate existing research in SQL parsing and query optimization,
- to develop an optimizer that builds efficient query plans by utilizing database statistics, and
- to execute the TPC-C, -H, and CH-Benchmarks with efficient query plans.
This spring and summer we will write short blog posts on our findings and progress.
What are Operators?
Hyrise is based on the concept of chaining multiple operators to execute queries. An operator is a transformation of one or multiple tables, which again results in a table. A query can be translated into a tree of operators as nodes and data tables as leaves. Some exemplary operators in Hyrise are
- scan, which selects rows of a table if, and only if, they match a criterion,
- aggregation of a column, like SUM, MIN, or MAX,
- projection, which selects a set of columns and can also compute arithmetic operations row-wise between columns,
- join, merging two tables by using a set of columns as key, and
- sort, well, it sorts.
Now, how do we translate SQL statements to operators? In the following paragraphs we want to highlight some queries that require somewhat unorthodox usage of those operators.
It is important to realize that not every functionality possible in SQL needs to be implemented in a separate operator. That is, we can make use of a different or a combination of operators to achieve the same results.
Using GROUP BY for DISTINCT
SQL offers a way to select all rows from a relation exactly once, based on the values of one or more specified column(s) (DISTINCT). In Hyrise, we currently have an Aggregate operator which is being used to evaluate aggregate functions. SQL makes it possible to group the results by a set of columns by using the GROUP BY-clause. Effectively, GROUP BY creates a group for each distinct value of the column and evaluates the aggregate function for each of them individually. Therefore, we can simply use the Aggregate operator to provide the DISTINCT functionality: we group by the respective column, but do not provide any aggregate functions. The result will be a list of distinct values for that column.
Resolving Algebraic Expressions
In analytical workloads it is quite common to do simple algebra in aggregate functions. As an example, here is a simplified query from the TPC-H benchmark:
Hyrise is currently able to handle simple algebraic expressions in projections. Projections are used to describe which columns shall be present in the result table. As of now, these expressions are limited to the five most prominent arithmetic operators (+, -, *, /, %). Hyrise does consider the order of operations for multiplication/division and addition/subtraction. However, two limitations regarding the sample query remain: we are not yet able to handle parantheses, and arithmetic expressions are not supported anywhere else than in projections (e.g. in aggregations, as in the sample query).
We can avoid both of these with the same concept, which we call “multi-level projections”. First, we split parts of expressions that are in parantheses and handle them as the first level of projections. While this query does not have multiple layers of parantheses, queries that do have to be resolved recursively. Once expressions only consist of a combination of the five base arithmetic operators, they are projected as the next level. As a next step, aggregates can be applied to the columns that contain the values of the arithmetic expressions. Intermediate projections are assigned a unique alias so that they can be identified in the succeeding operator.
This concept, however, brings with it a significant disadvantage. We split an arithmetic expression into multiple ones, which results in multiple iterations over a column vector. Each iteration calculates an intermediate result, which in turn has to be stored. It would be faster to compute the expression as a whole so that intermediate values are not required. We consider this to be an open issue and aim to improve that during the course of the project.
Handling Different Types of Aliases
The query above highlights another concept that is essential for more complex SQL queries: aliases.
We differentiate between two different types of aliases: column aliases and table aliases. Column aliases are used to give a name to a particular column. Most frequently they are used in combination with aggregate functions, as seen in the query above. These aliases are therefore visible to a user interacting with the database. The Projection operator in Hyrise is able to provide this functionality.
Table aliases, in contrast, are only used internally. They are used to specify names for (temporary) tables in sub-queries and joins. Aliases need to be resolved first by the optimizer. Due to the operator chaining in Hyrise, we can resolve the aliases in the optimizer itself.
Re-writing Sub-Queries as JOINs
One of the most important features for advanced SQL queries is the ability to write sub-queries at almost arbitrary locations in the query. The results of these sub-queries can in turn be interpreted as temporary tables, just like with any other operation. However, supporting them in each and every operator is a difficult feat. Consider the following example query, again a simplified one from the TPC-H benchmark:
In this case, our TableScan operator would have to support sub-queries. Fortunately, we can re-write the queries that have sub-queries by joining the result of the sub-query instead:
The optimizer first handles the sub-query, and then joins the result to the original query. It needs to create aliases for both the sub-query (table alias) as well as the aggregate function (column alias). It can then re-write the filter expression.
As of now, we have not been able to construct a case in which this approach does not work.
Insight this week:
Operators can be used for multiple SQL clauses. More than you think.
That’s it. We’ll be back with a new post in the coming weeks.
Sincerely & Cheers,
Fabian, Jonathan, Moritz, Sven, Tim
First edit (June 1st, 2017): Clarify that DISTINCT also allows to specify multiple columns.