Query Optimization w/ TPC-DS

Introduction to the Hyrise master’s project — summer term 2019

Julian Menzler
Hyrise
3 min readJun 5, 2019

--

Around two months ago, in April 2019, a new summer semester began at the Hasso Plattner Institute in Potsdam, Germany. In line with this, many student projects started. At the research group of Prof. Plattner and in the context of Hyrise, a master’s project called “Parallelization and Query Plan Optimizations for the TPC-DS benchmark“ began.

We are a team of four master’s students participating in this project: Marcel, Pascal, Jonathan, and Julian. Over the coming months, we want to share our findings and insights with you. This blog post is the first one in a series.

The Project

Motivation

Hyrise is an open-source in-memory database serving as a research platform for students and doctoral candidates. Its codebase was rewritten in 2017 and has been under continuous development since, performance- and functionality-wise.

Our master’s project aims for performance tuning, just like previous ones (compare 2017, 2018). Using the industry standard TPC-DS benchmark, we want to improve on Hyrise’s query performance. In comparison to its predecessor TPC-H, already built into Hyrise, the SQL query workload is considerably more complex. Therefore, it should offer a lot of optimization potential.

Scope

As mentioned, Hyrise does include support for the TPC-H benchmark. To support the TPC-DS benchmark, however, we are going to adapt the existing benchmarking infrastructure accordingly.

Also, we need to work on Hyrise itself because not all TPC-DS queries execute straight away. In any case, to support the analysis and optimization part of the project, we want a majority of the TPC-DS queries to execute correctly.

In summary, the project focuses on the following activities:

  1. Implement the TPC-DS benchmark.
  2. Improve Hyrise and the Parser to support the majority of TPC-DS queries.
  3. Analyze the performance of TPC-DS queries & identify performance bottlenecks.
  4. Improve query performance based on analysis results.

Status Quo

We have been working on the project for about two months now and familiarized ourselves with the available resources.

We adapted Hyrise’s file-based benchmark to load the TPC-DS table data and TPC-DS queries from flat files. These files can be generated using the tools dsdgen (data generator) and dsqgen (query generator). As a result, we were able to run TPC-DS queries in Hyrise. Unfortunately, only 18 queries executed correctly. The following pipeline chart illustrates our three stages of success:

Query Processing Pipeline & TPC-DS implementation progress

The first stage for success is the SQL Parser, which analyzes query strings and converts them into data structures that can be processed by Hyrise.

Stage two is about execution: The SQLTranslator creates a logical query plan (LQP) from the parsing result. Following this, the Optimizer and the LQPTranslator transform the resulting LQP for efficient execution. Finally, multiple operators are responsible for executing the query.

On success, the execution result enters stage three, the verification: We compare the results against the results of a different database implementation to make sure our results are correct. Currently, we use SQLite as a reference database.

Next Steps

Before moving on to query analysis and optimization, we want to raise the number of correctly executed queries. For this purpose, we refine existing Hyrise functionality and implement some new SQL features like WITH or STDDEV_SAMP.

In parallel, we also work on a faster table loading mechanism that creates table data in-memory instead of reading from flat files. The latter becomes particularly useful concerning higher scale factors (larger tables).

Our current verification process is based on the TPC-H verification in Hyrise and uses SQLite as a reference implementation. However, since we cannot execute all 99 TPC-DS queries with SQLite, we aim for an alternative: The TPC-DS specification describes a verification process using predefined tables, predefined queries, and a reference answer set. Instead of running a separate database in parallel, we compare results against the reference answer set. Thus, incompatibilities between SQLite and TPC-DS do no longer matter.

One Last Thing

Our master’s project also has an academic dimension. That is why we have studied the TPC-DS benchmark very thoroughly. Our next blog article is going to feature an extensive summary of the benchmark, its history, and its context.

So stay tuned!

– Marcel, Pascal, Jonathan, and Julian

--

--

Julian Menzler
Hyrise

Interested in database engineering. Currently working at AWS Redshift in Berlin, Germany.