Reducing Hyrise’s build time

Markus Dreseler
Hyrise
Oct 14, 2019

Over time, we have identified a number of factors that directly influence our developers’ productivity. These cover areas such as code style, test coverage, documentation, and iteration speed. Today, we want to look at the last factor. By iteration speed, we mean the time from making a change to seeing its impact on Hyrise’s performance. It is composed of several parts:

t(iteration) = t(compiler) + t(tests) + t(benchmark) + t(typing)

As the compile step, the tests, and the benchmarks can each be executed with a single command, we consider the time spent actually typing these commands to be close to optimal. The execution time of the tests and the benchmarks shall be the subject of another post. Let us focus on the time spent compiling (more precisely: compiling and linking) Hyrise.

Currently, building Hyrise from scratch takes 940 seconds, or more than 15 minutes, on a 2016 MacBook Pro. An incremental build after changing a single cpp file (here: optimizer.cpp) still takes 132 seconds. When more complex cpp files or even header files are changed, it takes longer. My estimate is that many code changes take less time than that, meaning that we waste more than half of our time waiting for the build process. Let us have a look at how this can be improved:

Linking jemalloc as an interface

We ship jemalloc as a git submodule rather than depending on the system’s allocator or a pre-installed jemalloc version because it gives us more fine-grained control over the version and the compile flags. The benefits of jemalloc are shown in the original pull request, and others see similar results [1].

As jemalloc is built using autoconf (an issue suggesting a move to cmake is approaching its fourth birthday), we use externalproject_add to build the binaries and target_link_libraries(hyrise PUBLIC jemalloc) to add it to the Hyrise project.

During the build process, we can see ninja waiting for the jemalloc build before starting to build any Hyrise files. This is unnecessary, as none of the Hyrise files directly requires the jemalloc build to be finished.

By changing the link type to INTERFACE, the hyrise library can be built independently. Only once the binaries (such as hyriseServer or hyriseConsole) are linked is jemalloc required.
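The change can be sketched as follows. This is a minimal illustration, not the actual CMakeLists.txt; the target names follow the text, and the ExternalProject_Add arguments are placeholders:

```cmake
# jemalloc is built via ExternalProject_Add (autoconf-based; arguments elided).
# Before: a PUBLIC link makes the hyrise library itself depend on jemalloc,
# so ninja waits for the autoconf build before compiling any Hyrise file:
#   target_link_libraries(hyrise PUBLIC jemalloc)

# After: an INTERFACE link propagates jemalloc only to consumers of the
# library, so it is needed only when hyriseServer etc. are finally linked:
target_link_libraries(hyrise INTERFACE jemalloc)
```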

As this improves the parallelization of the build process, it shaves off 63 seconds.

Building with -O0

Had you asked me before what the point of -O0 is, I would have had no answer. I always expected the compiler and the linker to default to no optimization. For gcc, this is the case, and is explicitly mentioned in the man page:

-O0 Reduce compilation time and make debugging produce the expected
results. This is the default.

For clang on OS X, -O0 does not seem to be the default, as explicitly setting it makes a significant difference for the link performance. On my machine, it reduces the overall build duration by 117 seconds. I do not yet have a good explanation for this. Some sources suggest that -O2 is the default for clang, but the man page gives no definite answer.
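One way to make the flag explicit, sketched here as a hedged example rather than our exact configuration, is to append it to the debug flags when clang is detected:

```cmake
# Force -O0 explicitly in debug builds instead of relying on the toolchain
# default, which does not appear to hold for Apple clang.
if(CMAKE_CXX_COMPILER_ID MATCHES "Clang")
  set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -O0")
endif()
```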

Unity builds

Calling the compiler independently for each cpp file has a number of drawbacks. By grouping multiple cpp files into a single unity (a.k.a. jumbo or SCU) file, build times can be reduced. If you are interested in the reasons for this, have a look at this blog post.

We had looked at unity builds before but decided against them, because the available methods (see the “How to maintain” section in the linked blog post) all looked too intrusive. As most of our developers do not work on Hyrise full-time, we do not want to introduce additional development steps or additional complexity to our cmake files.

Luckily, cmake introduces native unity builds with the upcoming 3.16 version. In theory, all that is needed is to add -DCMAKE_UNITY_BUILD=On to the cmake command line. For us, a minor cleanup of the code base was necessary: removing ODR violations from anonymous namespaces and disabling unity builds for third-party modules.
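In cmake 3.16 terms, this looks roughly as follows (the third-party target name is a placeholder):

```cmake
# Enable unity builds globally on the command line:
#   cmake -DCMAKE_UNITY_BUILD=On ..

# Opt out per target for third-party code that is not ODR-clean:
set_target_properties(some_third_party_lib PROPERTIES UNITY_BUILD OFF)
```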

By doing that, we reduced the OS X compile time by 522 seconds. However, there are some caveats: First, as the compilation units become bigger, minor changes to the code trigger a bigger rebuild. This makes incremental builds (for the purpose of this article, changing optimizer.cpp) 11 seconds slower.

Second, when looking at release builds, most of our time is spent in a handful of operators that are heavily optimized. Without unity builds, all scan implementations are built in parallel. If a unity build groups all of these in a single compilation unit, the compile time goes through the roof.

As the build speed of a release build is dominated by the optimization phase, which does not profit from unity builds, we are using unity builds only for debug builds.

Precompiled headers

To drill deeper into the compilation cost, we used clang 9’s new -ftime-trace feature, which allows us to see where the compiler spends its time:

When compiling with -ftime-trace, the resulting json files can be opened in chrome://tracing and analyzed for the most expensive steps in the build process.
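Enabling the flag for the whole build can be sketched like this; the version guard is an assumption, while the flag itself is clang 9’s real option:

```cmake
# -ftime-trace emits a .json trace next to each object file, which can then
# be loaded in chrome://tracing.
if(CMAKE_CXX_COMPILER_ID MATCHES "Clang" AND
   CMAKE_CXX_COMPILER_VERSION VERSION_GREATER_EQUAL 9)
  add_compile_options(-ftime-trace)
endif()
```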

We found that a handful of Hyrise headers are not only included in virtually every compilation unit, but are also taking multiple seconds per include. For example, all_type_variant.hpp, the file that defines our data types, instantiates the AllTypeVariant, a boost::variant that can hold each of our supported data types. Including boost and instantiating that data type should only be done once instead of once per cpp file (or unity file).

Another great feature of the upcoming cmake 3.16 version is transparent precompiled headers. By pre-compiling the most expensive header files, we can save 12 seconds for incremental builds.

As a pre-compiled header that is too big is also bad for performance, we hand-picked the header files that are used most commonly. For the future, we wish for a tool that automates the process of looking at time traces and selecting the most expensive includes.
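With cmake 3.16, this uses target_precompile_headers; the header list below is illustrative, based on the files named above, and the path is an assumption:

```cmake
# Precompile the hand-picked, most expensive headers once, instead of
# re-parsing them in every compilation unit.
target_precompile_headers(hyrise PRIVATE
  src/lib/all_type_variant.hpp  # path is an assumption
  <boost/variant.hpp>           # angle-bracket form for external headers
)
```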

Linking Hyrise as a shared library

Currently, Hyrise is defined as add_library(hyrise STATIC …), creating a libhyrise.a file, which is then used by the different binaries. Now that we have improved the performance of the other build steps, we noticed that building this archive file takes several seconds. This is because all previously generated object files have to be copied. Additionally, each binary performs the same linking steps for that archive.

To improve this step, we changed from a static to a shared library, now producing libhyrise.dylib (or libhyrise.so on Linux). This way, we do not have to copy .o files and perform the linking only once. As all performance-critical work is done within the hyrise library, the costs of dynamic loading are insignificant. The savings of 36 seconds for an incremental build, however, are.
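The change itself is a single keyword, shown here schematically (the real add_library call passes the actual source list):

```cmake
# Before: add_library(hyrise STATIC ${SOURCES})  # produces libhyrise.a
add_library(hyrise SHARED ${SOURCES})            # produces libhyrise.dylib/.so
```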

Using lld as the linker under Linux

The standard GNU ld linker is quite slow. Under Linux, there are two alternatives: GNU gold and llvm’s lld. In recent years, the latter has improved significantly. Thanks to multi-threaded linking and other optimizations, it is often several times faster than ld. In our case, it reduces the cost of incremental debug builds from 26 to 8 seconds.
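One way to opt into lld, sketched here as an assumption rather than our exact setup, is via the -fuse-ld=lld driver flag, which both clang and recent gcc accept:

```cmake
# Use lld on Linux; add_link_options requires cmake >= 3.13.
if(UNIX AND NOT APPLE)
  add_link_options(-fuse-ld=lld)
endif()
```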

Unfortunately, I did not manage to get it to work on Mac. On the clang mailing list, it is labeled as “bitrotted”, which is why we stick with ld64 for OS X for now.

Evaluation

The image on the left shows the reductions in the runtime of the build process. It shows benchmarks for two systems, my Macbook Pro 2016 (8 logical cores @ 2.9 GHz, 16 GB RAM) and one of our servers (224 logical cores @ 2.5 GHz, 2 TB RAM).

Most importantly, the time for an incremental build on a local machine was reduced by 90%. Also, the initial build time was reduced by more than half, making the initial Hyrise experience more pleasurable and reducing the cost of large rebuilds due to header changes.

For full builds, unity builds make the biggest difference, shaving off two thirds of the build time on my Macbook with eight logical CPU cores. On our server, the effect is not as visible, simply because you do not have to worry too much about the compiler repeating work if you have 224 logical cores.

Incremental rebuilds of release builds being faster than debug builds might be surprising at first, but considering that the file touched to trigger a rebuild (optimizer.cpp) is not expensive to compile itself, the difference can be explained by the lack of debug symbols and the smaller file sizes.

References

[1] Durner et al.: Experimental Study of Memory Allocation for High-Performance Query Processing, ADMS 2019
