Having a closer look under the hood

Instrumenting code on demand via USDT

Toni Stachewicz
Hyrise
Jul 19, 2018


In our first blog post, we introduced instrumentation profiling, the technique used by code profilers such as callgrind or gprof. Since the instrumentation instructions are generated and inserted automatically, the programmer cannot control what gets traced. Additionally, the scope is limited: a developer has no way to examine the code at a higher level, such as the runtime of jobs. Wouldn’t it be great if one could analyze not only method execution times but also single lines or local variables? And what about doing this only on demand?

The best thing is: there is a solution! It is called USDT (User-Level Statically Defined Tracing). Essentially, it is an event notification system. The developer can place a marker (also called a probe) at interesting places in the code. Each marker resolves to only a few lines of assembly code (effectively a no-op as long as no tracer is attached). Thus, the increase in execution time of the original software is negligible.

But there are more benefits. The probe code runs only on demand: if programmers want to analyze their application via USDT, they have to write and run a script, and only then are the probes activated and events raised. Programs that evaluate such probes are called tracers. Well-known examples are DTrace and SystemTap. In our work we focus on the latter, since it has better support for Linux (our development servers run Linux). The tracer scripts we write are translated and executed by SystemTap. This can happen at any time, before you start your application or while it is running. Suppose you want to know which SQL queries are currently being executed in the system: you just run a script, without restarting the application. As soon as you are done with your work, you simply detach the tracer. It is really easy!

Let us have a look at an example where we used USDT in order to find a bottleneck.

USDT in Hyrise

Some database operations, such as joins or table scans, can run in parallel. For this, jobs are created, which are then executed by the built-in scheduler. Since the rewrite of Hyrise is still at an early stage, job creation is not optimal yet. Hence, we analyzed Hyrise’s multithreading performance. The graphic below shows the impact of parallelization on each TPC-H query: as soon as more than five CPU cores are used, performance does not increase as expected, and some queries even get slower.

Graphic 1. Multi-threading behavior

In order to examine this strange behavior, we had a closer look at the jobs created by Hyrise. As we did not want to inspect every single component that creates tasks, we decided to use USDT. The following code snippet shows how we added two probes to measure the duration of jobs.
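
Schematically, such an instrumented method looks like the following sketch. The identifiers (execute_job, the provider name HYRISE, the probe argument) are placeholders for illustration, not the actual Hyrise code:

    #include <sys/sdt.h>  // provides the DTRACE_PROBEn macros
    #include <cstdint>

    // Placeholder for Hyrise's job execution method: one probe fires when
    // the job starts, one when it ends, each passing the job id as argument.
    void execute_job(uint64_t job_id) {
      DTRACE_PROBE1(HYRISE, job_start, job_id);
      // ... the job's actual work happens here ...
      DTRACE_PROBE1(HYRISE, job_end, job_id);
    }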

Code snippet 1. Job execution method containing USDT probes
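
The tracer side can be sketched like this; we assume the probes pass the job id as their first argument and that the instrumented binary is called ./hyriseBenchmark (both are placeholders):

    global start_time, durations

    probe process("./hyriseBenchmark").mark("job_start") {
      start_time[$arg1] = gettimeofday_us()  // timestamp when the probe fired
      printf("job %d started\n", $arg1)      // live output on the command line
    }

    probe process("./hyriseBenchmark").mark("job_end") {
      if ($arg1 in start_time) {
        // duration = second timestamp minus the one stored at job_start
        durations <<< gettimeofday_us() - start_time[$arg1]
        delete start_time[$arg1]
      }
    }

    probe end {
      // printed when the SystemTap session is shut down
      print(@hist_log(durations))
    }
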
Code snippet 2. Tracer script probing “job_start” event

Our script evaluating the probes is written in the SystemTap language (cf. code snippet 2). When a new task starts, the script stores the timestamp at which the probe fired and prints some information on the command line. When the job finishes, another probe fires; at this point, we take a second timestamp and compute the difference between the two. As soon as the benchmark run stops, we shut down our SystemTap session and let the script print a histogram of the job durations. Visualized, this output looks as follows:

Graphic 2. Execution times of all jobs

One might notice that a lot of jobs (28.9%, in relative terms) have a runtime of less than ten microseconds. This is not what we expected. It might sound good that jobs finish quickly, but it is not: finishing a task and scheduling a new one takes time. A so-called context switch usually takes five to ten microseconds on modern CPUs (highlighted by the red line in the diagrams) [Weaver]. In other words, the scheduling overhead is larger than the job’s execution time itself. A possible solution would be to pool small jobs into larger ones. But where to start? There are several locations in the code where tasks are created.

Who is the evildoer?

The previous approach is not sufficient for our investigation, but it gives us a hint why multi-threading is not as efficient as expected. How can we now find out which database operation created a job? The answer is simple: USDT! We do not even need another probe in the code. SystemTap has built-in functionality to walk the user-space call stack and thus determine the caller. The only thing we have to do is add these instructions to our tracing script. You can find the updated tracer script below (cf. code snippet 3).

We are now able to examine which operator created a specific task and how long its execution takes. The results can be automatically plotted as histograms. After another benchmark run, the bad guy was evident: the hash join implementation. The following graphic depicts the histogram for jobs started by this specific operator.
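
Sketched in the same spirit as before, the extended script could look as follows. SystemTap’s user-space unwinding functions (ustack() and usymname()) resolve the caller’s symbol name; for this to work, the session typically needs the binary’s symbol information (e.g. stap -d <binary> --ldd). Names and paths are again placeholders:

    global start_time, caller_of, durations

    probe process("./hyriseBenchmark").mark("job_start") {
      start_time[$arg1] = gettimeofday_us()
      // resolve the symbol name of the function that started the job
      caller_of[$arg1] = usymname(ustack(1))
    }

    probe process("./hyriseBenchmark").mark("job_end") {
      if ($arg1 in start_time) {
        // aggregate durations separately per calling operator
        durations[caller_of[$arg1]] <<< gettimeofday_us() - start_time[$arg1]
        delete start_time[$arg1]
        delete caller_of[$arg1]
      }
    }

    probe end {
      foreach (caller in durations) {
        printf("jobs created by %s:\n", caller)
        print(@hist_log(durations[caller]))
      }
    }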

Code snippet 3. Updated SystemTap script analyzing a job’s caller
Graphic 3. Execution times of jobs created by hash join operator

Now we know that the hash join creates a lot of jobs that only need a few microseconds to execute. This is a good starting point to look into its implementation and check where jobs are created. Hyrise uses a radix hash join, which has four phases. Each phase is parallelized and thus creates jobs. More information about how the radix hash join works in Hyrise can be found here.

One spot where jobs are created is the build phase. Here, the values of the smaller join column are hashed. To that end, the input data is partitioned by a series of bits. Since the partitions are created before the actual data is processed, it can happen that some of them are empty.

The following code snippet shows the jobs’ task. The values for the hash table are only created if there is at least one element in the current partition (cf. code snippet 4, highlighted box). During our benchmarks it turned out that partitions quite often contain no elements at all, so many jobs finish right after they start. This leads to high scheduling overhead. A simple solution is to create a new job only if the partition is not empty.
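
Condensed into a sketch (JobTask, build_hash_table and the surrounding variables are placeholders for the actual Hyrise code), the situation and the fix look like this:

    // Before: the emptiness check lives INSIDE the job, so an empty
    // partition still causes a job to be scheduled, only to return at once.
    jobs.emplace_back(std::make_shared<JobTask>([&, partition_id]() {
      if (!partitions[partition_id].empty()) {
        build_hash_table(partitions[partition_id]);
      }
    }));
    jobs.back()->schedule();

    // After: test for emptiness BEFORE spawning the job, so empty
    // partitions never reach the scheduler at all.
    if (!partitions[partition_id].empty()) {
      jobs.emplace_back(std::make_shared<JobTask>(
          [&, partition_id]() { build_hash_table(partitions[partition_id]); }));
      jobs.back()->schedule();
    }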

Code snippet 4. Job creation during the hash table creation phase

What has changed after integrating this fix?

Due to the removal of jobs without any workload, 32% fewer tasks are spawned within the hash join. The outliers in the histogram at one and two microseconds were eliminated (cf. graphic 4).

Graphic 4. Execution times of jobs created by hash join operator after optimization

Furthermore, the overall number of jobs decreased by almost 19%. The share of jobs with a runtime of less than ten microseconds dropped from 28.9% to 12.6%.

In the future, we can use this approach to find and remove more jobs with very short runtimes. Analogously, it helps us find jobs with very long runtimes; these might be candidates for being split.

Interested in writing your own probes?

Here is a rough summary of what has to be done. First, add a trace point at an interesting location in your code. The syntax is DTRACE_PROBEn(provider, probe_name, arg1, arg2, …, argn) with 0 <= n <= 12. Do not forget to include the SystemTap header for statically defined traces in the respective file: simply add #include <sys/sdt.h>. Now you can write your SystemTap script that evaluates the probed event.
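
For instance, a minimal (made-up) program with a single probe could look like this:

    #include <sys/sdt.h>  // statically defined tracing macros

    int main() {
      for (int i = 0; i < 1000000; ++i) {
        // provider "myapp", probe "loop_iter", one argument (the counter)
        DTRACE_PROBE1(myapp, loop_iter, i);
      }
      return 0;
    }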

With that, the setup is done and the tracing can begin! Compile your project, start your script via sudo stap <path/to/your/script>, and run your binary. You will be amazed how easy it is!
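
A matching script for the hypothetical probe above can be as small as this (saved, for example, as trace.stp):

    probe process("./myapp").mark("loop_iter") {
      if ($arg1 % 100000 == 0)
        printf("reached iteration %d\n", $arg1)
    }

Then run sudo stap ./trace.stp, start ./myapp in another terminal, and watch the events come in.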

For a more technical description, I recommend reading our Hyrise wiki article about USDT. A setup tutorial, a detailed explanation of both the probe design and the SystemTap scripts, as well as additional examples are waiting for you there.

Happy tracing!

Adrian, Marcel and Toni
