How Much Is The Fish?

Maintaining and Exploiting Database Statistics in Hyrise

Tim Zimmermann
Hyrise
10 min read · Jul 17, 2017


What has happened so far?

This post is part of a blog series about the progress of writing a query optimizer for Hyrise, a main-memory column-oriented research database.
So far, we have written about:

  1. The concept of operators in Hyrise
  2. Our overall test & benchmark environment
  3. The overall architecture of our optimizer

In the last blog post, we introduced the use of statistics in our query optimizer and how they are currently implemented. This time, we give an outlook on how we think statistics could be used effectively.

So what is the goal again? We need an approximate row count of operator results to find a fast execution plan for a query. Let’s have a look at a simple example:

TABLE FISH: the example table (5 rows; its length and weight columns are used below)
Example query: a SELECT on the fish table with the predicates length > 85 and weight = 7

How much is the fish?
– Scooter

So we ask ourselves: What is the row count of the resulting table?

Statistics help to save time during query execution, but calculating them and especially maintaining them is time-consuming. Let’s have a look at the extreme cases:

  1. We could simply assume the row count to always be 30, which costs nothing but also does nothing to improve the quality of the optimization.
  2. At the other extreme, we could actually execute the operators and use the real row counts. This yields the highest possible quality, but also the maximum overhead: queries would take longer than if we used no statistics at all.

Statistic Types

So what is in between the two extreme cases? We consider three basic types of statistics:

  • Uniform, Independent Columns
    For this type of statistics, only the table's row count and, per column, the distinct count and the value bounds are stored. Assuming that the data is uniformly distributed and that no functional dependencies exist, this simple model yields good approximations, as shown later in this post. It lacks precision, however, when multiple columns are compared for equality, as the actual values they contain are unknown.
  • Independent Columns: Histogram
    A histogram can model the value distribution of a column very precisely. The only assumption is that the column has no functional dependencies with other columns.
  • Modeling Functional Dependencies: Multi-Dimensional Histograms
    When a histogram spans more than one column, it can also capture functional dependencies, up to the number of dimensions of the histogram. In the extreme case, this corresponds to the whole database without duplicate rows.

The usage of these statistics is explained in more detail below.
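To make the first type more concrete, here is a minimal C++ sketch of what such per-column and per-table records could look like. The names are illustrative only, not Hyrise's actual classes.

    #include <string>
    #include <unordered_map>

    // Minimal sketch of the "uniform, independent columns" statistic type: per column we
    // keep only the value bounds and the distinct count; the row count lives on table level.
    struct ColumnStatisticsSketch {
      double min;             // smallest value in the column
      double max;             // largest value in the column
      double distinct_count;  // number of distinct values (kept as double so it can be scaled)
    };

    struct TableStatisticsSketch {
      double row_count;  // estimated number of rows in the table
      std::unordered_map<std::string, ColumnStatisticsSketch> columns;  // keyed by column name
    };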

So far, we see statistics as a simple trade-off between quality and cost, but there is no clear sweet spot. Let’s have a closer look:

What is quality?

We argue that the quality of statistics in this context is not determined by their precision alone; rather, they should support the optimizer in minimizing the total query time. So we are especially interested in having more precise statistics for

  • tables and columns that are often queried, and for
  • data that does not follow simple rules (e.g. outliers, data with trends).

So the quality of the statistics should adapt to the complexity and to the query rate of the underlying data. This can be achieved in multiple ways:

  • The statistic types can differ between tables and columns, depending on the distribution of the underlying data. If a column is uniformly distributed and independent of other columns, it should be modeled with simple statistics instead of histograms.
  • Even within columns, different statistic types can be used. Having partial histograms might be very helpful here, as outliers in the data can be modeled explicitly, whereas the other data can still be modeled with uniform statistics.
  • Histograms use binning to aggregate multiple neighboring values, which is sufficiently accurate if the data follows certain trends (see the sketch below). However, it might also be useful to be even more precise and count every single value using a frequency/value distribution.
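As a rough illustration of the binning idea, the sketch below estimates the selectivity of a greater-than predicate from an equi-width histogram. The structure and names are made up for this post and assume values that are uniformly distributed within each bucket.

    #include <cstddef>
    #include <vector>

    // Illustrative equi-width histogram: counts[i] holds the number of values falling
    // into the i-th of equally wide buckets covering [min, max).
    struct EquiWidthHistogramSketch {
      double min;
      double max;
      std::vector<double> counts;

      // Estimated fraction of rows with value > v.
      double selectivity_greater_than(double v) const {
        if (counts.empty()) return 0.0;
        if (v < min) return 1.0;   // every value qualifies
        if (v >= max) return 0.0;  // no value qualifies
        const double bucket_width = (max - min) / counts.size();
        const std::size_t bucket = static_cast<std::size_t>((v - min) / bucket_width);
        double total = 0.0;
        for (const auto count : counts) total += count;
        double qualifying = 0.0;
        // Full buckets above the bucket containing v ...
        for (std::size_t i = bucket + 1; i < counts.size(); ++i) qualifying += counts[i];
        // ... plus the fraction of that bucket lying above v (uniformity assumption).
        const double bucket_start = min + bucket * bucket_width;
        qualifying += counts[bucket] * (1.0 - (v - bucket_start) / bucket_width);
        return total > 0.0 ? qualifying / total : 0.0;
      }
    };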

Teaching Hyrise How To Fish

As of now, statistics in Hyrise are created on demand — that is, when the optimizer tries to improve a query plan, it requests information from the statistics component. If the statistics component, in the process of answering this request, encounters a table or column for which statistics do not exist, it creates them first.
Statistics for these objects are cached and will be used from that point forward. Currently, they are not updated.
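A minimal sketch of this on-demand behavior, reusing the TableStatisticsSketch type from above (the component and method names are again hypothetical):

    #include <memory>
    #include <string>
    #include <unordered_map>

    // Hypothetical statistics component: statistics are generated lazily on the first
    // request for a table and cached afterwards; the cache is currently never invalidated.
    class StatisticsComponentSketch {
     public:
      std::shared_ptr<TableStatisticsSketch> get_or_create(const std::string& table_name) {
        const auto it = _cache.find(table_name);
        if (it != _cache.end()) return it->second;            // statistics already exist, reuse them
        auto statistics = _generate_from_table(table_name);   // scan the table once to build them
        _cache.emplace(table_name, statistics);
        return statistics;
      }

     private:
      // Builds statistics by scanning the table; omitted here.
      std::shared_ptr<TableStatisticsSketch> _generate_from_table(const std::string& table_name);

      std::unordered_map<std::string, std::shared_ptr<TableStatisticsSketch>> _cache;
    };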

Give a man a fish, and you feed him for a day.
Teach a man to fish, and you feed him for a lifetime.
– Anne Isabella Thackeray Ritchie

Update Strategies in other Database Systems
It is obvious that this approach can and has to be improved upon. Most importantly, statistics become stale quickly when the data in a table or column is modified frequently. The accuracy of estimates, and in consequence query performance, would suffer, which has to be addressed.

Most database systems collect and update statistics periodically. Periodic updates can be scheduled either by time (e.g., every day [Oracle]) or by events (e.g., every k data modifications, with k depending on the number of rows in the table [SQL Server]).

Statistics vs. Sampling
Additionally, to estimate cardinality, query optimizers might decide not to rely on existing statistics at all and use sampling instead. The given operation is executed on a subset of the data, and the estimate is based on the result of that sample.
Sampling has the advantage that it always operates on the current data. Its downside is that it is often significantly more expensive and complex than using statistics. With sampling, every time a query is processed by the optimizer, the optimizer will execute the operators on some part of the data, which incurs higher cost than looking up existing statistics. While the results can be cached for specific operators and values, the optimizer cannot easily draw conclusions for different values.
In contrast, statistics do not have to be updated with every statement. In today’s typical workloads the majority of the statements are read operations and as such do not have any impact on the statistics.

One of the key parameters to decide on for sampling is the sample size. A small sample size incurs the least cost but is also the least accurate. Additionally, samples should be chosen randomly, which implies that data access is scattered and does not benefit as much from prefetchers and caching as a regular, sequential scan in a column-oriented database does.
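A hedged sketch of the sampling approach: evaluate the predicate on a number of randomly chosen rows only and scale the match count up to the full column. The function and its interface are made up for illustration.

    #include <cstddef>
    #include <functional>
    #include <random>
    #include <vector>

    // Estimate how many rows of `column` satisfy `predicate` by probing `sample_size`
    // randomly chosen rows. The random access pattern is scattered and thus cache-unfriendly.
    double estimate_cardinality_by_sampling(const std::vector<int>& column,
                                            const std::function<bool(int)>& predicate,
                                            std::size_t sample_size) {
      if (column.empty() || sample_size == 0) return 0.0;
      std::mt19937 generator{std::random_device{}()};
      std::uniform_int_distribution<std::size_t> pick{0, column.size() - 1};
      std::size_t matches = 0;
      for (std::size_t i = 0; i < sample_size; ++i) {
        if (predicate(column[pick(generator)])) ++matches;
      }
      // Scale the sample result up to the full column size.
      return static_cast<double>(matches) / sample_size * column.size();
    }

    // Example use: estimate_cardinality_by_sampling(lengths, [](int v) { return v > 85; }, 100);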

Which of these two approaches works better depends on the specific situation. In fact, more advanced database systems dynamically decide which one is to be preferred for a given query. This depends on a variety of factors. If those systems find statistics to be outdated or simply insufficient for the given type of query, sampling will be used instead. Most prominently, many optimizers resort to sampling to estimate cardinalities for LIKE filters (regular expression patterns) or custom-defined functions. Maintaining accurate statistics for these operations is either prohibitively expensive or just downright impossible.

Challenges of Periodic Updates
Updating statistics periodically directly implies that they are likely to be outdated most of the time. If the update period is time-based, a lot of INSERTs, UPDATEs, and/or DELETEs in a short period of time might render the existing statistics inaccurate.
If the period is based on events instead, modifying operations have to be counted and managed. Luckily, we do not require these counters to be perfectly accurate, and therefore do not have to worry too much about thread-safety and other potentially complicating circumstances. Additionally, we can control the degree of inaccuracy that we want to tolerate for the statistics by lowering the threshold after which statistics should be updated. We therefore consider event-based updates superior to their time-based counterparts.
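In code, such an event-based trigger could be as simple as the following sketch. The per-table counter tolerates slightly inaccurate counts (relaxed atomic increments) in exchange for low overhead; the 10% threshold is an arbitrary example value.

    #include <atomic>
    #include <cstdint>

    // Hypothetical per-table counter for event-based statistics updates.
    struct ModificationCounterSketch {
      std::atomic<std::uint64_t> modifications_since_update{0};

      // Called for every INSERT, UPDATE, or DELETE on the table. Relaxed ordering is
      // sufficient because we do not need a perfectly accurate count.
      void on_modification() {
        modifications_since_update.fetch_add(1, std::memory_order_relaxed);
      }

      // Statistics are considered stale once, e.g., 10% of the rows have been modified.
      // Lowering the threshold trades more frequent updates for more accurate statistics.
      bool statistics_stale(std::uint64_t table_row_count) const {
        return modifications_since_update.load(std::memory_order_relaxed) > table_row_count / 10;
      }
    };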

Updating Statistics On-The-Fly
As described in our first blog post, Hyrise executes queries as a set of operators executed in order. Each operator takes as input the results of one (e.g., scan, sort) or two (e.g., join) preceding operators and itself produces a result.

In the process of finding a good query plan for a given query, the optimizer requests a number of statistics about the result of an operator from the statistics component. These include, e.g., the estimated number of rows, min and max values, and the distinct count. Arguably the most important of them is the estimated number of rows: if it is off significantly, the subsequent estimates will be off as well, and the optimizer is unlikely to generate a high-quality plan.

Consequently, one of our top priorities has to be to estimate this count as accurately as possible. Our idea is to make use of the intermediate results that are produced by each operator. The optimizer would, once it finishes a query plan, annotate the operators with the estimated statistics for each step. After an operator has computed its result, the estimated cardinality can be compared with the actual size of the result. This check is inexpensive because it only requires the size of the result table (a std::vector::size() call of constant complexity per chunk). If the actual cardinality differs from the predicted one by a certain fraction, we would update the statistics.
We would be able to learn quickly if we estimate incorrectly.
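In code, the check after an operator finishes could look roughly like the sketch below. Only the comparison itself is the point; the threshold and the function name are made up.

    #include <cmath>
    #include <cstdint>

    // After an operator has produced its result, compare the actual output size with the
    // cardinality the optimizer annotated on the operator. If the relative deviation
    // exceeds `tolerance`, the statistics should be refreshed.
    bool should_update_statistics(std::uint64_t estimated_row_count,
                                  std::uint64_t actual_row_count,
                                  double tolerance = 0.2) {
      const auto estimate = static_cast<double>(estimated_row_count);
      const auto actual = static_cast<double>(actual_row_count);
      if (actual == 0.0) return estimate > 0.0;  // we predicted rows but got none
      return std::abs(estimate - actual) / actual > tolerance;
    }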

Experience is what you get
when you did not get what you wanted.
– Randy Pausch

There are a few advantages that come with this concept.

  1. We can collect certain statistics almost for free.
  2. If a predicate is used frequently, we will also have up-to-date statistics, as the results can be compared to the estimates frequently.
  3. We are able to inexpensively track functional dependencies for sets of columns that appear together frequently. This is one of the most common causes of inaccurate statistics and is traditionally rather complex to detect.

Compared to the statistic types shown before, the concept of on-the-fly updates combines high-quality statistics (especially for frequently used data) with low maintenance costs.

Nevertheless, this way of collecting statistics cannot fully replace the other techniques. First, we cannot update statistics for each of the predicates of the query because the result of the second filter already depends on the result of the first (this is exactly the reason why we can track dependencies between columns, though). Second, the type of filter and the type of tracked statistics have to be compatible. If, for example, we have a filter looking for a specific value (equality scan), and we only have a bucketed histogram, we cannot simply update the whole bucket.
Additionally, it requires communication between the operators or scheduler and the statistics component, which currently are not connected at all.
We consider this approach very promising, but we will not focus on it in the immediate future because Hyrise is still missing more pressing functionality.

Exploiting Statistics within Hyrise

Now that we have an overview of possible, more sophisticated approaches to computing statistics, let us take a step back and look at our current implementation. For simplicity’s sake, we only take the min, max, and distinct count into account for our basic estimate and leave histograms for later. Although histograms will eventually replace the basic estimates, this initial implementation provides a fallback if they are not available for a column.

In our last blog post, we introduced our architecture of table and column statistics. To recap, a table statistic holds a column statistic for each column and describes the expected output of one operator. A table statistic can compute an estimated output size for any operator applied to the corresponding table. Once a table statistic receives an estimate request for an operator, it passes the request to the corresponding column statistic to obtain the operator's selectivity factor. The product of the calculated selectivity factor and the previous row count is the new row count.
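Schematically, and building on the ColumnStatisticsSketch and TableStatisticsSketch structs from earlier in this post, this interaction could look as follows. The scan types and formulas are simplified illustrations (integer columns, closed ranges), not the actual Hyrise interfaces.

    #include <string>

    enum class ScanTypeSketch { GreaterThan, Equals };

    // Ask the filtered column for the fraction of rows expected to satisfy the predicate.
    double column_selectivity(const ColumnStatisticsSketch& column, ScanTypeSketch scan_type, double value) {
      if (scan_type == ScanTypeSketch::GreaterThan) {
        if (value < column.min) return 1.0;   // every value qualifies
        if (value >= column.max) return 0.0;  // no value qualifies
        // The new inclusive integer range is [value + 1, max]; relate its size to [min, max].
        return (column.max - value) / (column.max - column.min + 1);
      }
      // Equals: assume the rows are uniformly distributed over the distinct values.
      if (value < column.min || value > column.max) return 0.0;
      return 1.0 / column.distinct_count;
    }

    // The table statistic delegates to the column statistic and scales its row count
    // by the returned selectivity factor.
    double estimate_predicate(TableStatisticsSketch& table, const std::string& column_name,
                              ScanTypeSketch scan_type, double value) {
      const double selectivity = column_selectivity(table.columns.at(column_name), scan_type, value);
      table.row_count *= selectivity;  // new row count = selectivity * previous row count
      return table.row_count;
    }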

Predicate result prediction

To illustrate the computation of the selectivity factor, let us look again at the SQL query, which is run against the fish table above.

The example query again: a SELECT on the fish table with the predicates length > 85 and weight = 7.

The query consists of two concatenated predicates: length greater than 85 and weight equal to 7.

Starting with the first one, the column statistic for the length column retrieves the min and max values of the column, 35 and 165. They allow an easy check of the edge cases of the greater-than operation, namely whether all or no values are selected. Obviously, the selectivity is 1 if 85 is less than the min, and 0 if 85 is greater than or equal to the max. As neither is the case, the selectivity will be greater than 0 and less than 1.

Min and max values define the value range of a column. The selectivity is the relative size of the new value range in comparison to the size of the current one. The new inclusive value range is 86 to 165, as all remaining values are greater than 85, which results in a selectivity of:

(165 - 86 + 1) / (165 - 35 + 1) = ~0.6

Wondering about the +1? The value range consists of all possible integers, including the min and max values themselves. Although not used here, the distinct count of the column can be used by later operations and therefore needs to be kept up to date. This happens by multiplying it with the selectivity:
0.6 * 4 = ~2.4

Finally, the column statistics returns the predicted row count by multiplying the selectivity with the row count: 0.6 * 5 = ~3

The equal-to operation from the second predicate requires a slightly different selectivity calculation. Again, min and max values are used to check the edge cases.

Since a SELECT query is usually meant to retrieve values, we expect the operation to produce output. This leads to the conclusion that the compared value matches one of the distinct values in the corresponding column. We assume a uniform distribution of the rows over the distinct values. Therefore, the expected selectivity is the inverse of the distinct count 4: 1/4 = 0.25

To finish up the calculation within the column statistic, the new distinct count, min, and max values are set to 1, 7, and 7. Then, the table statistic returns the updated row count: 3 * 0.25 = ~0.75
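Putting the sketch from above to work on the fish example reproduces these numbers up to rounding (the bounds assumed for the weight column are made up; only its distinct count of 4 matters here):

    #include <iostream>

    int main() {
      TableStatisticsSketch fish{};
      fish.row_count = 5.0;
      fish.columns["length"] = {35.0, 165.0, 4.0};
      fish.columns["weight"] = {3.0, 12.0, 4.0};  // bounds assumed for illustration only

      // length > 85: selectivity ~0.61, predicted row count ~3.05 (the ~3 from above)
      std::cout << estimate_predicate(fish, "length", ScanTypeSketch::GreaterThan, 85.0) << '\n';
      // weight = 7: selectivity 0.25, predicted row count ~0.76 (the ~0.75 from above)
      std::cout << estimate_predicate(fish, "weight", ScanTypeSketch::Equals, 7.0) << '\n';
      return 0;
    }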

Next steps

Currently, we are working on supporting estimates for operators that use multiple columns, e.g. joins.

Additionally, we are working on Hyrise’s support for more complex expressions, and continue to integrate more and more TPC-H queries into our benchmarking suite.
