Multi-Tier Compilation in GraalVM

Aleksandar Prokopec
Published in graalvm · 19 min read · Apr 20, 2021

Introduction

In this blog post, we will investigate the performance benefits of multi-tier compilation in GraalVM Truffle languages such as Graal.js, TruffleRuby and Java on Truffle, and take a brief look under the hood to understand how the multi-tier mode works.

We will also talk about multi-tier compilation for Truffle languages on April 22nd, 5 PM CET, at twitch.tv/thomaswue.

GraalVM 21.1 introduces a new feature called multi-tier compilation for languages implemented on Truffle. The multi-tier mode improves warmup behavior, and is especially beneficial for programs with large codebases, improving startup times by 30%-50%. The core idea of the multi-tier mode is to separate the compilation of call targets into two tiers — the faster first tier and the slower second tier. First-tier compilations perform fewer optimizations, but are available quickly and thus help the program start up faster. Second-tier compilations are slower because they do more optimizations, but they ensure that the program eventually reaches peak performance. As of the 21.1 release, multi-tier compilation in GraalVM is turned on by default.

There are several experimental options that control the behavior of multi-tier compilations. Usually, the default values work best, but more adventurous users may want to experiment with these options on their workloads (an example invocation is shown after the list):

  • MultiTier — this option enables multi-tier compilation. It is true by default since GraalVM 21.1.
  • FirstTierCompilationThreshold — the number of calls or loop iterations executed in a call target before it gets scheduled for first-tier compilation.
  • LastTierCompilationThreshold — the number of calls or loop iterations executed in a call target before it gets scheduled for second-tier compilation.
  • FirstTierInliningPolicy — determines which inlining policy should be used in first-tier compilations. By default, it is set to TrivialOnly in GraalVM 21.1, which means that only small methods will be inlined.
  • FirstTierUseEconomy — this option determines whether first-tier compilations should use only a minimal set of optimizations. It is true by default.
  • FirstTierBackedgeCounts — this option determines whether the loop iterations should be counted towards the second-tier compilation threshold. It is true by default.
  • SingleTierCompilationThreshold — this option determines the compilation threshold when multi-tier compilation is turned off. It has the same effect as CompilationThreshold, which is now deprecated — it was effectively renamed to this option.
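For example, a hypothetical invocation that lowers both thresholds so that call targets are compiled earlier could look like this (the threshold values and the app.js script are purely illustrative placeholders, not recommendations):

<graalvm-home>/bin/node --experimental-options --engine.MultiTier=true --engine.FirstTierCompilationThreshold=400 --engine.LastTierCompilationThreshold=10000 app.js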

Improvements in startup times

To demonstrate the benefits of multi-tier compilation, we first introduce the concept of a warmup curve. A program tends to get increasingly faster the longer the same workload runs, because its code is not compiled initially, and only gets compiled while the program is running in the VM. The sooner the program gets faster, the better the warmup. A warmup curve is a way to characterize how the program’s execution speed changes over time. In very general terms, a warmup curve plots how a performance metric changes with each repetition of the program. In most of this post, the performance metric will be the total execution time since the VM started. The lower the total execution time, the better the warmup.

Let’s see an example of a warmup curve. To plot a warmup curve, we need to select a program in which we can output the elapsed time periodically, after a well-defined amount of work has been executed. One way to do this is to pick a benchmark and repeat it N times — after every repetition, we record the total elapsed time. When analyzing the benefits of multi-tier compilation, we are interested in programs that have a relatively large amount of code, in which warmup takes longer because GraalVM must compile more call targets. So, we’ll use a few larger benchmarks from the Web Tooling Benchmark Suite, and run them on Graal.js, the JavaScript engine that runs Node.js code. We modified this benchmark suite to collect the total elapsed time since the start of the suite.
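Conceptually, the modification boils down to recording a timestamp after every repetition. A minimal sketch of such a measurement loop, written here in Java purely for illustration (the actual harness is JavaScript), could look as follows:

// Illustrative sketch only: records the y-values of an elapsed-time warmup curve.
final class WarmupCurve {
    static long[] measure(Runnable workload, int repetitions) {
        long[] elapsedMillis = new long[repetitions];
        long start = System.nanoTime();                 // start of the experiment
        for (int i = 0; i < repetitions; i++) {
            workload.run();                             // one repetition of the benchmark
            elapsedMillis[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return elapsedMillis;                           // total elapsed time after each repetition
    }
}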

We’ll start with the babel benchmark from Web Tooling, and repeat it 10 times. We will plot the total time that elapsed from the start of the VM until the end of the nth repetition. We will repeat this experiment twice — once without multi-tier compilation, and once with multi-tier compilation. This can be achieved with the following pair of commands (note the “-i” parameter, which tells the harness how many times the benchmark should be repeated before exiting the VM):

<graalvm-home>/bin/node --experimental-options --engine.MultiTier=false src/cli.js -i 10 -b babel

<graalvm-home>/bin/node --experimental-options --engine.MultiTier=true src/cli.js -i 10 -b babel

The resulting plot consists of two warmup curves, and is shown in the following figure:

In the multi-tier execution, the program spends about 33% less time to reach the end of the first iteration, and about 43% less time to reach the end of the 10th iteration. The plot also shows that the two warmup curves diverge — on the babel benchmark, the fast first-tier compilations are quicker to reduce the slope of the warmup curve, meaning that it takes less time to complete each iteration. The single-tier compilations also gradually reduce the slope of the curve (as is visible after the 7th iteration), but this takes longer.

When a warmup curve represents the total elapsed time after every repetition of the workload, the slope of the curve represents the speed with which the program executes — the lower the slope, the faster the program.
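To make the relationship between these curves concrete: if T(n) denotes the total elapsed time after the nth repetition (with T(0) = 0), then the time spent in the nth repetition alone is T(n) − T(n−1). This difference is the slope of the elapsed-time curve, it is what the repetition-time curves later in this post plot directly, and the throughput curves near the end of the post plot its inverse, 1 / (T(n) − T(n−1)).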

We can repeat this experiment for a few other benchmarks that are relatively large in terms of code size. In the espree and the esprima benchmarks, the slopes of the warmup curves for the two executions do not significantly diverge. This indicates that (after the initial delay) the single-tier execution reaches peak performance relatively soon on these particular benchmarks, which furthermore implies that these benchmarks have less code than babel. However, the multi-tier execution gets ahead already during the 1st iteration, and the absolute time difference stays the same thereafter.

Generally, the specific performance gain of multi-tier compilation will depend on several factors, including the characteristics of the program and the CPU that you run the VM on — your mileage will vary. For example, in the acorn benchmark, the difference between the two warmup curves is less pronounced:

“One-Shot” Metric

You may have noticed that the warmup curve holds a lot of information about how the program executes: the time after each iteration, the rate at which the speed of the program changes (i.e., the slope of the warmup curve), or potential spikes in the execution speed, as we will see later. All this makes it very difficult to compare different runs of the VM — we sometimes want to restrict ourselves to a metric that captures a specific aspect of the warmup, and ignore everything else. For example, we can define the “one-shot” metric as the total time it takes to complete the 1st repetition of the benchmark (this is the y-value of the first point on the warmup curve).

The “one-shot” metric is the amount of time between VM start and the end of the 1st repetition of the workload.

The “one-shot” metric is particularly interesting, because it captures the dynamics at the start of the execution. In many practical scenarios, we are interested in short-running programs such as command-line tools or single unit-test executions. In other scenarios, we expect that the program achieves decent execution performance as soon as possible, even if it achieves the best possible peak performance much later.

The following figure shows the comparison of the “one-shot” metric with and without multi-tier compilation for the acorn benchmark from the previous section.

All of the previous warmup curves were produced on a 16-core Ryzen 5950X CPU, set to a fixed 2.8GHz frequency. We also executed these benchmarks on a dual-socket Xeon E5-2699 v3 configuration. The latter machine has 44 CPU cores, so GraalVM creates a larger set of compiler threads. Consequently, more first-tier compilations are executed during the first iteration, some of which turn out to be performance-critical, so the gap between the single-tier and the multi-tier compilation mode is larger on the dual Xeon than on the Ryzen machine.

Other GraalVM languages

Having defined the concept of the “one-shot” metric, let’s now examine the warmup behavior of several other GraalVM languages by looking at the “one-shot” metric instead of the warmup curve.

The Ruby Asciidoctor library converts text in the AsciiDoc format into PDF and other formats — the asciidoctor-convert benchmark converts a large body of AsciiDoc text. The following plot shows the “one-shot” metric of TruffleRuby with and without multi-tier compilation, compared to the JRuby implementation that uses the invokedynamic bytecode instruction.

On the “one-shot” metric, the multi-tier execution is about 33% faster than the single-tier execution, and 60% faster than JRuby. A smaller benchmark (in terms of total code size) called asciidoctor-load-file, which consists only of loading a file using Asciidoctor, is shown next. Here the “one-shot” difference is smaller, but still noticeable.

Finally, we examine a larger benchmark called rubykon, in which a bot plays the game of Go. It was previously demonstrated that, while TruffleRuby has the best peak performance of all Ruby implementations, it also has particularly bad warmup. Below, we show that TruffleRuby’s warmup performance improved considerably, and based on our measurements, TruffleRuby is now in first place — the “one-shot” metric with multi-tier TruffleRuby is ~2.75x faster than with MRI, and 2.4x faster than with JRuby.

Next, we examine the startup of several benchmarks from the Scalabench suite, on the Espresso engine — the implementation of Java using GraalVM’s Truffle framework. The scalac benchmark is particularly large in terms of code size, and multi-tier compilation reduces the “one-shot” time by ~41%.

On the scaladoc and the scalaxb benchmarks, multi-tier compilation reduces the “one-shot” metric value by ~39% and ~45%, respectively, as can be seen in the following plots.

As part of GraalVM development, we regularly track warmup performance on these benchmarks (and a few others), and we use them to optimize multi-tier compilation. In the next part of the post, we briefly cover the various aspects in which we improved warmup in GraalVM.

A look under the hood

There are many factors that contribute to warmup performance. To see this, we should first briefly consider what it means for the program to be fully “warmed up”, or in other words, to have reached peak performance. One way to define the point at which the program is warmed up is to mandate that no subsequent compilation will improve the performance of the program any further. Here is where the peak point is on the acorn benchmark, shown on a warmup curve that tracks the time of the nth iteration (this is the derivative of the elapsed-time warmup curve):

As you can see on this warmup curve, the time required to complete a repetition of the benchmark gets successively smaller and smaller. Occasionally, there are some bumps (typically caused by GC), and the repetition time goes up a bit, but the general trend is downwards. After the peak point, the time required to complete a repetition is never reduced further.

The factors that contribute to warmup performance are then everything that affects the program’s execution up until the peak. These factors determine the shape of the warmup curve and the position of the peak point. Let’s take a look at some of them.

Interpreter impact

By definition, in the execution before the peak point, some of the code is already compiled by the JIT compiler, and some of it is still executed by the interpreter. Superimposed on the previous plot, the time distribution between the interpreted and the compiled code could look as follows (note that this distribution is somewhat hard to measure, and the following is just a rough guess):

Although the interpreted-vs-compiled time in the previous figure is just a rough guess, one thing is fairly certain — the time spent in compiled code does not exceed the time at the peak (this generally holds assuming that the set of compiled code only grows over time). Obviously then, the performance of the program during the very early parts of the warmup is dominated by the time spent in the interpreter.

Improving the interpreter performance pushes the points of the repetition-time warmup curve downwards, and the peak point to the left.

Therefore, in the past months, the GraalVM team focused on improving the interpreter performance of the individual languages. We tracked the performance of the interpreter-only execution mode on several benchmarks, and across several GraalVM languages. The most notable improvements were made in TruffleRuby, whose performance improved by around 20–25% on the deltablue and richards benchmarks. The following plots show the daily interpreter-benchmark runs since January 22nd, 2021.

On Graal.js, the interpreter improvements were a bit smaller, but still noticeable. The following plot shows the performance changes on the Sieve-of-Eratosthenes benchmark since December 2020.

While most of the other interpreters are relatively mature, GraalWasm is the youngest addition to the GraalVM family. In the last year, there were several obvious improvement opportunities for the GraalWasm interpreter. The following plot shows the performance changes on the WebAssembly version of the Sieve-of-Eratosthenes benchmark. We can see two sharp improvements in November 2020 — but, more details about that in another blog post.

Compilation time

Another factor that plays a role in the warmup dynamics is the compilation time. The faster the compiler is, the earlier the program reaches peak performance.

Improvements in the compilation performance move the points on the warmup curve, including the peak point, to the left.

Generally, there are two ways to make an optimizing compiler faster. The first approach is to inspect the compiler implementation, and to improve the places in the code where the compiler performance is sub-optimal (the compiler is itself just a program, so it can be optimized like any other). The second approach is to reduce the number of optimizations that the compiler does. While this makes an optimizing compiler faster, it also reduces peak performance. Thus, while the peak point moves to the left, making the compiler faster by removing optimizations may also move it upwards. This is demonstrated by the warmup curves on the following plot, where we ran acorn twice, once with the Graal compiler that uses the full set of optimizations (orange), and then once again with a reduced set of optimizations (blue). The blue curve reaches the peak sooner, but the peak performance is sub-optimal.

So, here’s how we could combine the benefits of both execution modes, and achieve a better warmup curve — all of the code should initially be compiled with fewer optimizations, to make the compilation faster, and to escape the interpreter mode as soon as possible. We call this the first-tier compilation. However, the code should later be compiled with more optimizations, to reach better peak performance — this is called the second-tier compilation. That way, we push the warmup curve downwards when the program starts, but the peak point is not pushed upwards. The mechanics of when and for which code to switch from first-tier to second-tier compilation are subtle, but that’s the basic idea.
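As a rough sketch of that basic idea (the names and thresholds below are made up for illustration and do not come from the actual Truffle sources), each call target can be thought of as carrying a counter and its current tier, and crossing a threshold schedules a compilation in the next tier:

// Hypothetical illustration of two-tier scheduling; not the actual Truffle implementation.
final class TieredCallTarget {
    interface Compiler {
        void scheduleCompilation(TieredCallTarget target, int tier);
    }

    static final int FIRST_TIER_THRESHOLD = 400;      // illustrative values, not the real defaults
    static final int LAST_TIER_THRESHOLD = 10_000;

    private final Compiler compiler;
    private int callsAndLoopIterations;                // incremented while running in the interpreter
    private int tier;                                  // 0 = interpreted, 1 = first tier, 2 = second tier

    TieredCallTarget(Compiler compiler) {
        this.compiler = compiler;
    }

    // Called by the interpreter on every call (and, if back-edge counting is on, on loop iterations).
    void profileExecution() {
        callsAndLoopIterations++;
        if (tier == 0 && callsAndLoopIterations >= FIRST_TIER_THRESHOLD) {
            tier = 1;
            compiler.scheduleCompilation(this, 1);     // fast compilation, few optimizations
        } else if (tier == 1 && callsAndLoopIterations >= LAST_TIER_THRESHOLD) {
            tier = 2;
            compiler.scheduleCompilation(this, 2);     // slow compilation, full optimizations
        }
    }
}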

In our current implementation of Truffle multi-tier compilation, first-tier compilations have a lower compilation threshold, they perform almost no inlining of other functions and methods, and the compiler backend performs only the most crucial optimizations and transformations. To estimate the compilation time spent in first-tier compilation, we’d need to disable second-tier compilations. Here’s a trick to do that — we can set the second-tier compilation threshold to a very high value. Here’s an example command that achieves this (note that we set the repetition count to 200, because we know that all call targets in the program will be compiled long before that — we traced the compilations to confirm that):

<graalvm-home>/bin/node --experimental-options --engine.MultiTier=true --engine.LastTierCompilationThreshold=1000000000 src/cli.js -i 200 -b acorn

The effect of setting the last-tier compilation threshold to such a high value is that second-tier compilations are never triggered — reaching a call count this high is uncommon in large programs, and if the call count of some call target occasionally reaches this value, it should be just noise in most cases.

In addition to that, we will use the -Dgraal.Time option to dump the compilation-time breakdown on VM exit:

<graalvm-home>/bin/node --experimental-options --engine.MultiTier=true --engine.LastTierCompilationThreshold=1000000000 --vm.Dgraal.Time= src/cli.js -i 200 -b acorn

The entries PartialEvaluationTime_Accm and GraalCompiler_Accm show how much time was spent in Truffle’s partial evaluation, and how much time was spent in the Graal compiler itself. The sum of these two times is roughly the total compilation time. We will also collect the data for the second-tier-only and the multi-tier runs:

<graalvm-home>/bin/node --experimental-options --engine.MultiTier=false --vm.Dgraal.Time= src/cli.js -i 200 -b acorn

<graalvm-home>/bin/node --experimental-options --engine.MultiTier=true --vm.Dgraal.Time= src/cli.js -i 200 -b acorn

From the previous plot, it is obvious why the reduced-optimizations warmup curve reaches the peak sooner — less time is spent in compilation overall, meaning that more call targets get compiled in the short period after the program starts. Moreover, what’s interesting in the previous plot is that the multi-tier compilation time is slightly smaller than the time spent in the second-tier-only mode. If, in the multi-tier mode, the code is compiled twice (once in the first tier, and then again in the second tier), then you’d expect the total time to equal the sum of the first two bars, right? Well, that’s not true in this case, because the second-tier compilation threshold in multi-tier mode is much larger than the second-tier threshold in non-multi-tier mode. This means that, in multi-tier mode, the code gets compiled in the second tier only if it is substantially hotter than it would have to be in non-multi-tier mode. Consider why this is the case. The second-tier-only compilation has to remove all the call targets from the interpreter as early as possible, even if efficiently compiling each of those call targets is not that important for peak performance. Thus, the second-tier-only compilation spends an excessive amount of compilation budget even on call targets that are hot only during a brief initial period (such as initialization code), but for which expensive optimizations bring little benefit. In the multi-tier mode, the first-tier compilations already do the job of removing the call targets from interpreted execution, so the second-tier (i.e. optimization-heavy) compilations usually only need to focus on call targets that are really invoked often. Thus, the same peak performance was achieved on acorn, but it turned out that the compilation time is even smaller (note that this effect depends on the specifics of the benchmark).

Let’s now consider the elapsed-time warmup curve for acorn, for these three variants. We can see that first-tier-only compilations (blue) reach reasonable performance very fast, but their slope does not improve thereafter. Second-tier-only compilations (grey) are much slower initially, but they eventually reach a better peak, and surpass first-tier-only. Multi-tier combines the best of both worlds — it is initially fast, and it reaches the better peak.

The throughput warmup curves are shown in the following plot (each of these curves is an inverse of the derivative of the corresponding curve from the previous plot).

The throughput warmup curves are generally a bit bumpier than the elapsed-time curves, because each point represents the speed of the current repetition, which may vary. Here is a zoom on the first 9 iterations. Part of the bumpiness in the multi-tier curve is caused by deoptimizations of the first-tier-compiled code, which must be replaced by second-tier-compiled code. Nevertheless, we can see that multi-tier approaches the peak faster, especially early on.

Queuing policies

Another factor that contributes to the shape of the warmup curve is the order in which the compilation requests are handled. At the beginning of the program, the interpreter executes a lot of call targets, and they tend to get pushed to the queue faster than the compiler threads can respond to these requests. The queue thus grows — when this happens, a compiler thread has to select the requests in a smart way. As a general rule of thumb, the compiler thread should pick the call target in which it expects that most time will be spent in the interpreter in the future (to remove it from the interpreter-time balance as soon as possible). To see why this is important, consider two possible queuing heuristics: in the first, the thread manages to guess the call target in which most interpreter time will be spent. In the second, the compiler thread always picks the least important compilation request. Obviously, much more time is spent in the interpreter in the second case, so the two corresponding warmup curves may look like this:

The “worst” policy leads to a warmup curve that initially improves throughput only slightly, and handles the most important call targets only at the end. An unwanted side-effect is also that, while the “worst” policy slowly works its way through the request queue, more and more unimportant call targets reach the threshold, get pushed to the queue, and cause an additional slowdown by postponing the compilation of useful call targets. This extreme example illustrates why it is important to first compile the call targets in which most time is spent. Ignoring all other factors, an effective queuing policy can make the warmup curve more concave.

The queuing policy regulates the “shape” of the warmup curve. It may move the peak point to the left or to the right.

As part of our work on multi-tier compilation, we made several changes to compilation queuing. First of all, the compilation requests are no longer served in the order in which they arrive — first-tier compilation requests are prioritized ahead of second-tier compilation requests. Furthermore, requests are prioritized within their respective tiers according to their call count and call-count growth rate — the hotter the code, the sooner it should be compiled, so even if something arrives at the compilation queue later, it may get compiled sooner.
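To sketch that prioritization (again with made-up names and a deliberately simplified notion of hotness, not the actual GraalVM implementation), the queue can be modeled as a priority queue whose ordering puts first-tier requests ahead of second-tier requests and, within a tier, prefers hotter call targets:

import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical illustration of a tier- and hotness-aware compilation queue;
// the real GraalVM policy is more involved than this sketch.
final class PrioritizedCompilationQueue {
    static final class Request {
        final int tier;            // 1 = first-tier request, 2 = second-tier request
        final long callCount;      // how often the call target has executed so far
        final double growthRate;   // recent growth of the call count

        Request(int tier, long callCount, double growthRate) {
            this.tier = tier;
            this.callCount = callCount;
            this.growthRate = growthRate;
        }

        double hotness() {
            return callCount + growthRate;   // simplified weighting, for illustration only
        }
    }

    private final PriorityQueue<Request> queue = new PriorityQueue<>(
            Comparator.<Request>comparingInt(r -> r.tier)                                         // first tier before second tier
                      .thenComparing(Comparator.comparingDouble(Request::hotness).reversed()));   // hotter targets first

    void submit(Request request) { queue.add(request); }

    Request takeNext() { return queue.poll(); }          // what an idle compiler thread picks next
}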

Another feature that proved useful is dynamic scaling of the compilation thresholds: as an experimental feature, compilation thresholds can get automatically adjusted depending on the size of the compilation-request queue.
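As an illustration of that idea only (the formula below is invented for this sketch and is not the actual scaling function used by GraalVM), the adjustment could scale a base threshold by how overloaded the queue is:

// Invented illustration of queue-load-based threshold scaling; not the real GraalVM formula.
final class DynamicThresholds {
    static int scaledThreshold(int baseThreshold, int queueSize, int targetQueueSize) {
        double load = Math.max(1.0, (double) queueSize / targetQueueSize);
        return (int) (baseThreshold * load);   // a fuller queue raises the threshold, admitting fewer new requests
    }
}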

However, compilation-queuing is a complex topic altogether, so more about that in a different blog post!

Community impact

Since warmup is especially noticeable when running UI applications, we asked Fabio Niephaus, an active contributor to the GraalVM project, for feedback on the new multi-tier compiler option. As the creator of TruffleSqueak, he had the following to say:

The Graal compiler provides great peak performance in JVM mode, but warmup can be slow for some applications. In TruffleSqueak, we can sometimes experience this directly, because TruffleSqueak is a Smalltalk-based IDE implemented as a GraalVM language. Poor warmup can cause noticeable stuttering or lag within the user interface as different parts of TruffleSqueak are executed for the first time. We documented this effect in our MPLR’19 paper using two UI benchmarks (see Figure 5).

TruffleSqueak is kept up-to-date with the latest releases of GraalVM and we were excited to learn that the GraalVM team is working on a tiered compilation feature for GraalVM languages. To provide some feedback on this, we re-ran our BouncingAtomsMorph UI benchmark from the paper, a simple gas tank simulation that warms up the Smalltalk UI framework. Here’s a screenshot of what that looks like:

And, here are the results:

The plot shows two runs of the same prepared benchmarking environment: a run with and a run without the new multi-tier option (orange vs. blue). The two darker lines visualize the frame rate over time (y-axis on the left), while the two brighter lines show the corresponding sizes of the compilation queue (y-axis on the right).

The frame rate of the environment running the simulation increases noticeably quicker when tiered compilation is enabled and the compilation queue is emptied significantly faster. As an expected tradeoff, it takes a bit longer to reach peak performance. Interestingly, the frame rate gets very close to peak performance after around 90s and then drops just a bit below the rate of the run without the multi-tier option. At the same time, the compiler’s queue is filled with new compilation tasks. According to the compilation traces, this is in fact due to the tier 2 compilation kicking in and replacing methods compiled by tier 1. This might also explain the drop in the frame rate at around 90s. To keep things simple, note that the plot only shows data of two single runs, so the numbers aren’t necessarily statistically representative. We ran more than these two experiments and the overall results remained the same. If you’d like to reproduce our benchmark results, you’ll be able to evaluate TruffleSqueakUtilities setUpUIBenchmark in the next TruffleSqueak release for GraalVM 21.1.

We’ve been using the multi-tier option for a while now and feel that it generally reduces the load of the compilation queue. And more importantly, TruffleSqueak’s user interface feels smoother now as well.

Acknowledgements

Multi-tier compilation and the warmup improvements described in this post were a collaborative effort by a large number of people from the GraalVM team. Multi-tier compilation mode was initially implemented by Aleksandar Prokopec, and subsequently heavily tuned by Boris Spasojević, who also introduced and tuned the new queuing policies. Many contributed to this GraalVM feature, including Peter Hofer, Christian Humer, Roland Schatz, and Josef Eisl. The GraalVM interpreters were improved by Andreas Wöss, Daniele Bonetta, Benoit Daloze, Alfonso² Peterssen, and many others.

Aleksandar Prokopec is a senior research manager at Oracle Labs. He previously worked at Google, and received a PhD working on Scala at EPFL. Book author: http://amzn.to/1K2Fogk