Benchmark-driven Query Optimization

A few weeks have gone by since we started our master project “Query Plan Optimizations”. In our previous post we introduced the next version of Hyrise, a research database developed at the Hasso Plattner Institute. By now we have focused on setting up a benchmark environment, which shall support us in developing the query optimizer for Hyrise.

Since the main focus of a query optimizer is to improve the overall performance of query execution, we will use these benchmarks to measure our progress over the span of this project, hopefully seeing a decreasing trendline. At the same time this setup helps us verifying that certain optimizations are logically correct by checking the query results.

How we benchmark

In the field of database systems TPC benchmarks are probably the most popular, each focusing on a specific kind of database workload. In our case we decided to focus on

  • TPC-C, a transaction-based warehouse scenario, and
  • TPC-H, an analytical enterprise scenario.

In the first case the system records orders from customers for certain items and processes their payments. TPC-C queries are typically selecting and updating only few rows from a table. The second scenario models an enterprise system, which shall process and analyze large amounts of data to provide insights and support decision-making. In contrast to TPC-C this benchmark uses more complex queries, e.g. joins across multiple tables and aggregates on multiple columns. Due to their nature analytical queries typically are more complex than their transactional counterparts, and therefore take more time to process. More specifically, poor query plans for complex queries can have more severe consequences for the overall performance, making the optimization of TPC-H queries both more important and challenging.

To make sure that benchmarks on different database systems are comparable, both specifications define rules how to generate test data. They not only provide the table schemes on which the queries are based, but also table sizes and value distributions for each column. We implemented these rules within Hyrise, keeping in mind that we will use these tables not only in the benchmarks but also in additional test cases. For example, we scale table sizes, so that tests operate on tables which are smaller than those for the benchmarks.

Test Setup

In order to get a baseline for how Hyrise performs speed-wise on TPC-C transactions we started with manually converting the TPC-C-Queries into Hyrise operator trees. No attention to operator-ordering was paid here — so any automatically optimized plan will need to be at least as fast as these manual conversions, or we’re doing a very poor job when writing the optimizer.

Next to setting a performance baseline for our (later to follow) optimizer-generated plans, this also allows us to structure our benchmark without actually having an optimizer at hand just yet and helps us to mentally visualize the restructuring of these plans that the optimizer will have to perform.

Now, a benchmark is of no use if you cannot be sure that it is actually performing the logic you want it to and is not producing correct results. We need to make sure these manually and automatically generated plans work as intended. This is especially important when we implement new clever tricks in the optimizer: we, as developers, need immediate feedback whether these optimizations work, even outside of the scope of often all-too-basic unit tests.

Also, wouldn’t it be nice if we could test the optimizer using the same kind of queries we will deploy when benchmarking it, so we can be as sure as possible the benchmark is generating valid results without having to do extensive tests in the benchmarks themselves?

Yes, it would be, so the plan is the following: In order to heavily load-test both our manually written TPC-C plans as well as the plans generated by the optimizer we generate a big (but not quite as big as for the benchmarks) set of TPC-C requests (NEW_ORDER, ORDER_STATUS, etc.) and test whether Hyrise returns correct results from SELECT queries as well as transforms the database into the correct states with UPDATE/DELETE/INSERT queries.

Wait, but how do we know what the correct results from a “big” number of requests are? Well, the TPC-C tables are clearly too big and the number of requests will be too high for us to manually “perform” the queries. Instead, we will use an established database system, feed it with the same tables and TPC-C requests as we will later feed to Hyrise and use the results it generates to verify Hyrise’s behavior. And since SQLite is painlessly accessible from a Python script and is established enough for us to expect it to work correctly, we will use SQLite to be our ground truth.

Technically the whole setup looks like this:

  • We generate a bunch of TPC-C requests and store them sequentially in a JSON file. To illustrate this, an ORDER_STATUS query looks like this:
  • We create the tables of the TPC-C benchmark as specified and dump them as CSV files.
  • We load these CSVs into SQLite and process the TPC-C transactions from the JSON file. The results of the transactions are collected into another JSON file. The result of an ORDER_LINES transaction might look like this:
  • After all transactions are processed, we dump the resulting state of the SQLite database into CSV files again.
  • Now it is Hyrise’s turn! We load the same CSVs, process the same transactions, compare their results against SQLite’s and finally compare the contents of the tables after all transactions are completed.
  • If Hyrise passes all these tests we can be pretty confident it works for any TPC-C transaction and our benchmark results are actually meaningful.

A few possible culprits need to be kept in mind here, especially that the order of rows returned by a SELECT statement is not defined if it doesn’t contain an ORDER BY referencing a unique key. This becomes additionally unhandy when the results are LIMITed. Where necessary we have made adjustments to the TPC-C queries in order to make them deterministic and avoid these problems.

Speed Center

Every new commit triggers the execution of all benchmarks. To monitor and analyze the performance, the collected runtime data is stored in a speed center, which is an open source web application available on Github.

It shows for every benchmark the last runtime, the difference to the previous run and the trend for the runtimes.

The runtime over multiple commits can be viewed in charts. As an example of a runtime chart the one below shows the runtime of a cross join benchmark on two tables with 5,000 rows each.

This graph clearly shows a significant increase of runtime after a commit in May. It is caused by the introduction of supporting concurrent access on the the same table by different transactions.

Once our TPC-C and TPC-H benchmarks run and the first query optimizations are implemented we hope for different looking runtime graphs with a trend to decreasing runtimes.


In the next blog post we will discuss how to gather and use statistics about tables and columns to improve query performance.