Taming SQL for Query Optimization
This is the final post in a blog series about the progress of writing a query optimizer for Hyrise, a main-memory column-oriented research database.
So far, we have written about:
- The concept of operators in Hyrise
- Our overall test & benchmark environment
- The overall architecture of our optimizer
- Database Statistics in Hyrise
- Handling SQL Expressions
- Referring to columns in our optimizer
How we started
Back in April 2017, when we started this project, the rewrite of Hyrise was a merely half a year old. It was a research database written from scratch, built on clear design principles and clean engineering.
But: There was no convenient SQL interface available! Take this simple query:
Instead of four lines in SQL it took 20 lines of C++ to set up operators, their parameters and to connect them appropriately. That looked like this and is neither nice to read (as you will surely agree) nor to write (as we can assure you).
Our project “Query Plan Optimizations for In-Memory Databases” tried to overcome these flaws by adding support for SQL into Hyrise. Why not just call the project “SQL support for Hyrise”, you ask? When it comes to executing a statement in a declarative language like SQL, there are usually many different ways to calculate the requested result. Some are so efficient they will finish within the blink of an eye, others will literally take days. In the field of database engineering these ways are called query plans. Accordingly, the challenge of finding the best query plan, e.g., with regards to performance or memory consumption, is subsumed by the term Query Plan Optimization.
Thus, whenever you add SQL support to a database, you pretty much need a Query Plan Optimizer as well, these two just go hand in hand. And while supporting SQL is complicated as well (as we would soon learn…), Query Plan Optimization is just the way more interesting topic from an academic perspective.
Let’s take a look back and see how we approached it, shall we?
Stage I: Test-Driven Development for Query Optimization
Query Optimization is all about achieving performance in a database. And when you are writing a Query Optimizer you need to measure your improvements somehow. Luckily, there are standardized benchmarks for databases out of which we picked TPC-C and started to implement it.
Next to obtaining a benchmark to verify our soon-to-come optimizations with, we also aimed at automatically cross-verifying Hyrise’s results of the benchmark’s queries with those that SQLite produces. We really needed to make sure our optimized Query Plans were actually doing something valid!
The first big push back that we faced was implementing these benchmarks. On one hand we had to implement table generators according to the specification of each benchmark. On the other hand we had to re-implement the queries as operator trees which is a hazardous, error prone process that took a lot of time. Still, we implemented two of the TPC-C transactions in this manner and backed them off with an SQLite-driven testing, before steering around to implement a much more user friendly SQL interface. After all, these manual operator trees still find their role as a performance baseline for the SQL translator and optimizer.
Stage II: Translating SQL to Operators
Next, onto SQL support for Hyrise: luckily there is already an open-source SQLParser from a previous Hyrise version available, which we used for the parsing and lexing stage. The parser’s output is then translated into C++ objects representing the initial, naive query plan. The Optimizer will operate on this representation of the query, until finally another translator generates the operator tree from it.
While the second translator (Query Plan to Operator) is pretty much straightforward due to our design of the query plan nodes, the first one (SQL Syntax Tree to Query Plan) involved more work. At this stage we have to take care of all the specific characteristics of SQL, which, naturally, is quite a bit of work.
Stage III: Referring to columns using ColumnIDs
Aliases and name mapping in general was a topic that gave us quite some headaches when we initially approached it. To a large extent this was because SQL offers manifold ways to form queries in which name resolution, i.e. figuring out to what table and which column a name is referring to, is complex even to the human spectator. Let us look at an exemplifying query here:
When we started to reason about how to treat this mix of names, (sometimes optional) prefixes, and aliases when rewriting the query plan during optimization, we soon realized that the complexity would just go over heads (…just think of flattening a nested query and what you need to do to keep prefixes and aliases unambiguous!).
We realized we needed an approach that gets the whole naming/referring domain out of the way for optimization — using ColumnIDs to refer to columns is our answer. We dedicated a whole blog post to this topic, so we won’t go into details here, but what it boils down to is this: when translating the SQL syntax tree (which still uses strings to refer to columns) to our internal syntax tree, we replace every column-referring-name with the index of the column in the syntax tree node’s input table. Now optimization rules, when changing the tree, just need to keep these indices valid. Even better: We don’t need to do any name resolution in the execution engine anymore! A predicate wants to scan a table’s column 3 to be equal to a literal? Just grab Column::_columns, it doesn’t get any easier!
And just as a little illustration, this is roughly what the above query looks like with ColumnIDs if you would translate it back to pseudo-SQL. Note that the names “InnerResult”, “T1” and “T2” would not exist any more either and be replaced with references to either the left or right input of a syntax tree node. We just left them in here for readability.
Interlude: Estimating table sizes with Statistics
This one is essential! The goal of the Query Optimizer, among other things, is to formulate the Query Plan with the smallest intermediate results. The smaller the inputs to an operator are, in general, the faster it will be able to process them. Since we cannot exactly know, say, how many rows the join of two tables with a specific predicate will produce, we need to heuristically estimate them. In order to do so we need statistics about the data found in the columns of tables.
Our Statistics Component offers a very simple interface: For any node in the Query Plan, it will give its best shot at estimating how many rows the output of the node will have.
During our project we achieved support for those requests to some degree and built an extendable infrastructure. Our statistics assume equally distributed, independent columns, which is, to be honest, a long way from reality, but serves well for a start. These statistics are based on minimum, maximum, row count and null value ratio. We currently support statistics for all typical requests, including the most complex statistical estimations, namely predicates and joins.
Future improvements should include update strategies, and more detailed statistics like histograms to support more accurate estimations.
Our statistics component allowed developing optimization strategies using this information to balance between different query plans.
Stage IV: Transformations Rules
To execute a query, we could transform a SQL Query string to a syntax tree and translate this to an operator tree directly. The results it produces would be correct, but would probably come in at crippling speed. The translator that creates the initial query plan, derived directly from the SQL syntax tree, doesn’t pay any attention to performance.
This is where the Optimizer comes in. The optimizer knows about a set of algorithms, called rules, that transform one query plan into a functionally equivalent one in such a way that the performance will (hopefully) improve.
The optimizer will apply its rules to the query plan with no special attention to order until either an iteration budget is exhausted or the rules stop changing the query plan. This iterative approach allows for loose, implicit dependencies between rules: One rule could be responsible for ordering a hierarchy of Predicates and another one could be responsible for merging sets of Predicates that are not directly connected. This approach makes a clean separation of concerns possible without having to deal with dependency management of rules.
Remember, in any case, with any individual rule applied, the result of the generated query plan will not change.
We would have liked the implementation of such rules to be the majority of our efforts, since they are what makes and breaks an optimizer, but given the circumstances we ended up just implementing two of them. However, we’re confidant that we set the ground and the environment for more rules to come.
One of the rules we added, just to give you an idea what we’re talking about, is quite an essential one: join detection for cross joins. Consider a query like this:
The initial translation contains a cross product/join of tables a and b that will probably generate a very large table, before a predicate filters those rows where “a.id = b.id”. Obviously this is a waste of memory and computation power! The join detection rule will detect pairs of cross joins and predicates that can be merged together to form an inner join and transform the query plan accordingly. The new query plan, translated back to SQL, would look like this:
This rule, e.g., applies to the TPC-H benchmark extensively. Executing some of its queries without this rule applied beforehand would probably break the bounds of any computer’s memory, even on moderate input table sizes, because of repeated cross products.
So, while all query plans generated by the SQL translator are semantically valid, some would not be executable because of their memory consumption or general performance. As we implied at the very beginning of this article: The Optimizer and its Optimization Rules are some of the most vital parts of any database.
Outcome and Conclusions
Our initial goal of the project was to fully support TPC-C and TPC-H benchmarks and optimizations on an SQL level. After 6 months we have to admit that this goal was just a bit too ambitious. On our journey we often faced unexpected, but time-consuming problems — solving them properly cost time, but helped to improve Hyrise and set a base for further developments on the Optimizer.
At this point, Hyrise supports six out of 22 TPC-H and all of the TPC-C queries with minor modifications.
Even more beneficial for Hyrise is that the architecture of the optimizer and translators is quite simple and easy to enhance. Due to the switch to ColumnIDs we completely got rid of column name resolution in the execution engine and just have the respective code at one location.
PS: Lessons Learned
- C++ is sometimes (well, regularly…) frowned upon, but tame it a bit, be disciplined and do not over-engineer things and it almost never gets into the way.
- Different parts of a database need different data structures to represent queries. Do not try to find the holy grail that fits them all!
- Declarative languages are nice for query formulation, but hard to optimize.
- You cannot plan a research project right from the beginning, it will evolve.
- Good supervision makes a motivated team, makes good progress, makes a happy team, makes happy supervisors ;)
- Writing regular blog posts forces you to frequently reflect on what you’re doing and can actually be quite entertaining.