The Brain of Every Database
Using Statistics to Guide Query Optimization
What has happened so far?
This post is part of a blog series about the progress of writing a query optimizer for Hyrise, a main-memory column-oriented research database.
So far, we have written about:
In the last weeks, we have done quite some research about how other query optimizers operate, which of their strategies are applicable and would be beneficial to Hyrise, and how our optimizer might use concepts specific to Hyrise to further improve query plans.
As a general theme, we noticed a few things:
- Query optimizers heavily rely on selectivity estimates.
- Selectivity estimates are based on statistics about value distribution.
- These statistics are often incorrect.
As a result, the estimates are often incorrect, too. This can be traced back to two major reasons. First, modifying data operations change the value distribution. Updating the statistics with every modification incurs significant costs and is therefore typically not the prevalent approach of other databases.
Second, the estimates are usually calculated based on certain assumptions about the data. Most importantly, one assumption is that the values of two columns are independent of each other. This, however, is often not true for real-world data. A car model, for example, is not independent of the manufacturer (not every manufacturer offers every model).
Unfortunately, due to the lack of alternatives, statistics are pretty much all we have, and we use them as best we can. It’s almost as if Andrew Lang was thinking about us when he wrote:
“He uses statistics as a drunken man uses lamp-posts —
for support rather than for illumination.”
— Andrew Lang
Supporting the quest for optimal query planning
Implementing statistics in Hyrise
To decrease complexity we decided to separate the implementation of statistics from the optimizer and encapsulate it. The goal is to be able to predict the output size of any of the Hyrise operators. We stick with the concept of operator chaining, which allows us to use the predicted result for one operator as the input for the next prediction for the following operator.
To illustrate this, we look at a SELECT query which returns the orders of a customer with id 8169 from 2017. The table holds 1,000,000 orders of 10,000 customers from the last 25 years. We assume a uniform value distribution. To get the expected output size of our query, three steps are necessary:
- First, retrieve the number of entries in the order table: 1,000,000.
- Then, predict how many of the, e.g., 1,000,000 orders are from the customer with id 8169: 100.
- Finally, using the intermediate result of 100 orders predict the number of orders from 2017: 4.
Now that we know how the optimizer uses the statistics component we look into how the result size prediction works.
Typical use cases for statistics are the computation of expected output sizes for predicates and joins. These sizes only depend on the values in the filtered columns respectively the columns used by the join condition. Therefore, statistics are calculated per column. A column statistic includes the distinct count as well as the min and max values of the column. A column statistic is either a base column statistic computed using the actual column values or a filtered column statistic. This filtered statistic is computed using filter values, filter operand, and the corresponding column statistic. Our current implementation computes a base column statistic on demand and then caches it. This approach does not update already computed statistics for inserts, updates or deletes. All column statistics of a table together with the table size are part of a table statistic. Table statistics with filtered column statistics describe intermediary operator results and are therefore not cached. The optimizer expects a table statistic and only reads the table size from it.
Keeping intermediate results as small as possible
One of the fundamentals of query optimization is the aim to reduce the data to be processed as early and as much as possible, for fewer data allows faster processing.
Less is more unless you’re Al Gore.
— Jeff Rich
A straight-forward approach to reducing the amount of data is to optimize the order in which TableScans on the same table are being executed. In Hyrise, we currently execute TableScans one after another. As a result, if the first filter already reduces the amount of data significantly, the second filter needs to check fewer values. Consider the following, simple query, which retrieves the total amount of all orders for a given customer this year:
To decide which of the two filters we apply first, we need to reason about their selectivity. With the help of the statistics we collect about columns, we can estimate how many values we expect as a result of a filter operation.
For the given example, the underlying data likely has more rows matching the first filter than the second. That is, there have likely been many more orders in the current year than for a single customer. Right now, we would reach that conclusion based on the number of distinct values in a column. Suppose there are 10,000 different customer_ids in orders, we have stored orders for the last 25 years, and we have 1,000,000 orders in total. Then we expect to have about 1,000,000 / 25 = 40,000 orders in 2017, but only 1,000,000 / 10,000 = 100 orders for each customer_id. We would, therefore, re-order the filters and execute the customer_id filter first.
As of today, Hyrise is able to re-order equality predicates based on the assumption that all values in a column are distributed uniformly.
A set of rules to optimize them all
The type of transformations that we just had a look at is based on statistics. However, there are also rule-based optimizations that can be applied regardless of the actual data. These logical transformations of a query have been in place for many years and are most likely implemented by any query optimizer in the field.
A simple example for such a logical transformation would be to simplify expressions by folding constants. Especially in settings where SQL queries are generated at runtime, it might be the case that a query contains constant expressions. By evaluating this kind of expressions in the optimization phase the execution phase hopefully becomes cheaper and thus faster.
Consider the following query, in which we can simplify the predicate:
There are also more complex logical transformations possible, e.g. unnesting of subqueries. In SQL subqueries are usually executed independently of the parent query. Thus, transformations like join ordering and access path selection for the parent query cannot include the subquery.
However, there are rules that allow unnesting of these subqueries by replacing them with an equivalent join. Let’s have a look at the following query:
We can transform the subquery into the following join, as long as customer.id is unique:
There are dozens of other logical transformation rules, such as expanding ORs into UNIONs or join factorization. All these rules have in common that they improve the query basically for free, as they do not need any statistics or cost estimation.
Traditionally predicate pushdowns are also part of the rule set. However, for Hyrise we want to use the cost model to decide whether it is worth to push down a predicate, e.g. through a join. Let’s have a look at an example to clear this up. Usually, you want to filter your join inputs to have small intermediate results. But, assume for now that you have a filter condition that does not actually reduce your amount of data significantly. Further assume that your join condition is very strict, producing a join result with only a few rows. In that case, our cost model should figure out that pushing down the filter predicate will lead to another, expensive full table scan, which the join has to do anyways (due to the low-selectivity predicate). On the other hand, a table scan on the join result is rather cheap to do and should be preferred in situations like that.
Choosing intermediate query representations
As described in the previous sections, there are different techniques that an optimizer can make use of in order to generate a good query plan. We currently identify three major stages: applying the set of rules that universally lead to an improved query plan (e.g., subquery unnesting), using statistics to re-order parts of the query (e.g., predicate ordering), and finally identifying a good join order, if applicable.
Since the optimizer is also responsible for choosing the best operator for a certain part of the query, we want to model these parts as closely to the operator structure as possible. We decided to transform the output of the parser to a binary tree, for all of our operators (see our first blog post) have between zero and two inputs. We will use the binary tree for the first two stages.
Once these are complete, we currently plan to use the Join Graph structure proposed in Vertica, which itself refers to Starburst, to enumerate possible join orders and choose the one expected to perform the quickest. That structure focuses on the items that are to be joined and therefore does not necessarily need the additional information that is included in the binary tree.
The road ahead
One of our primary focus areas in the next weeks will be to implement the binary tree representation of a query and subsequently the first optimizations using that structure.
Additionally, we will continue to extend and improve both the accuracy and performance of the statistics we collect. Histograms will be crucial for estimating the cardinality of more complex queries, and we consider them to be our next milestone in that area.