Where Has All My Run Time Gone?
Introduction
Developers are always looking to improve the performance of applications. Usually, the first step in this process is understanding the performance characteristics of the application and the most common tool for that job is a CPU Sampler. A CPU Sampler, in general, observes a running application, periodically takes a snapshot of the call stack, and summarizes this data to approximate how much time was spent in which part of the application — the more times a particular function is seen on the call stack, the more application run time was spent in that particular function.
GraalVM comes with a built-in CPU Sampler for all languages built on the Truffle framework. This CPU Sampler is completely language-agnostic: it works with any third-party language implementation built using the Truffle API, as well as in polyglot environments where multiple Truffle languages are combined during execution.
In the 21.3.0 release, this CPU Sampler received a major rework. The remainder of this blog post demonstrates its features, as well as the improvements the rework introduced.
An Example
Let’s consider a small JavaScript application contained in a file called primes.js as a running example. This application verifies that two approaches to calculating prime numbers produce the same results for the first 1000 prime numbers. One approach is completely naive — checking each consecutive number for divisors — and the other is based on the Sieve of Eratosthenes algorithm.
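The full source isn't reproduced here, but based on this description it might look roughly like the following sketch. Only the names findPrime, isPrime, next, and primes.js come from the post; everything else is an assumption, and the line numbers quoted later (15 and 69) refer to the original file, not this sketch.

```javascript
// Naive approach: check each consecutive number for divisors.
function isPrime(n) {
    for (let i = 2; i <= Math.sqrt(n); i++) {
        if (n % i === 0) {
            return false;
        }
    }
    return n > 1;
}

// Finds the i-th prime by testing numbers one by one, from scratch.
function findPrime(i) {
    let candidate = 1;
    for (let found = 0; found < i; ) {
        candidate++;
        if (isPrime(candidate)) {
            found++;
        }
    }
    return candidate;
}

// Incremental approach based on the Sieve of Eratosthenes idea:
// remembers the primes found so far and only computes the next one.
class Sieve {
    constructor() { this.known = []; }
    next() {
        let candidate = (this.known[this.known.length - 1] || 1) + 1;
        // A composite candidate always has a factor among the primes
        // already found, since primes are generated in increasing order.
        while (this.known.some(p => candidate % p === 0)) {
            candidate++;
        }
        this.known.push(candidate);
        return candidate;
    }
}

// Verify that both approaches agree on the first 1000 primes.
const sieve = new Sieve();
for (let i = 1; i <= 1000; i++) {
    const naive = findPrime(i);
    const fromSieve = sieve.next();
    if (naive !== fromSieve || !isPrime(naive)) {
        throw new Error("Mismatch at prime " + i);
    }
}
```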
Let’s use the CPU Sampler to get an overview of how the run time is spent.
The CPU Sampler has several different ways to output the profile of an execution. A convenient starting point, and the default output, is the histogram view printed in the command line interface. In the 21.3.0 release, this view was simplified with common use cases in mind, while more advanced options were added for digging deeper into the details.
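Enabling the sampler is as simple as adding the --cpusampler flag; assuming GraalVM's js launcher, the invocation for our example would look like this:

```
js --cpusampler primes.js
```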
The following is the default histogram output for our running example:
Looking at the Total Time column shows us how much time was spent in each function, including all of its callees. We can see that we spend over 95% of the time in findPrime — the naive prime calculator — and only 4.3% in next — the Sieve of Eratosthenes approach. Since both of these functions do the same job, i.e. calculate primes, we can conclude that the approach based on the Sieve of Eratosthenes algorithm is over 20x more efficient than the naive one. The Self Time column shows what percentage of the run time each individual function took up by itself. Almost all of the run time was spent in the isPrime function. Understanding why we spend so much time in that one function is the key to improving the performance of this program.
We can see in the source code that the isPrime function is called in two places — once to find the prime number using the naive approach (line 15), and once more to verify that the prime calculators actually produce a prime number (line 69). Since most of our run time is spent in that function, it would be informative to know how much each of those call sites contributes to the run time. We can do this by using the call tree output of the CPU Sampler.
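This view is selected via the sampler's Output option; for our example, the invocation would be something like:

```
js --cpusampler --cpusampler.Output=calltree primes.js
```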
In this view each entry represents a call site rather than aggregated function information. The indentation of the Name column encodes the tree structure of the output. We can see that the call to isPrime from the findPrime function is the one that consumes most of the run time, and the call from :program is negligible.
Investigating the findPrime function, we can see that it calculates the i-th prime from scratch on every call. On the other hand, the Sieve of Eratosthenes approach caches the last prime and only calculates the next one. This is a substantial difference because we compare the first 1000 primes in order, which gives the Sieve of Eratosthenes approach a great advantage. Modifying our original source so that the findPrime function takes the value of the last prime and finds the next one makes it a more even playing field. Let's call this modified version primes-opt.js; a sketch of the change and the CPU Sampler histogram of that version follow below.
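The modification might look roughly like this (a hypothetical sketch, consistent with the shape of the earlier code):

```javascript
// Instead of recomputing the i-th prime from scratch, findPrime now
// takes the previous prime and returns the next one, mirroring the
// incremental behavior of the sieve.
function findPrime(lastPrime) {
    let candidate = lastPrime + 1;
    while (!isPrime(candidate)) {
        candidate++;
    }
    return candidate;
}
```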
We can see that this has a very significant impact on the execution time of the whole application, reducing the time spent in isPrime from ~7s down to ~60ms. Looking at the total time, we can see that the next function of the Sieve of Eratosthenes approach now accounts for 85% of the run time, while the findPrime function accounts for only 7.5% — meaning that on even footing, the naive approach is over 10x faster. The reasons for this are left as an exercise to the reader — let us know in the comments down below.
A More Interactive Visualization
The CLI output we’ve seen so far is useful, but a bit static and not very visual — especially for large applications. The CPU Sampler does offer the option to dump the raw data as a JSON file, which allows arbitrary processing and visualization, but that requires additional effort which not everyone is willing to make. With GraalVM version 21.3.0, Duncan MacGregor introduced a new output mode for the CPU Sampler — the flame graph. This mode produces an SVG file containing a visual and interactive flame graph, a common profile visualization. A flame graph is essentially a stacked representation of the call tree: each entry’s width represents how much time was spent in that function relative to its caller, which is the entry below it. Like the call tree output we’ve seen before, it reports data per call site, which can show whether time is concentrated at particular call sites. For convenience, the visualization also contains a histogram view below the flame graph. You can get a flame graph of your execution by adding --cpusampler --cpusampler.Output=flamegraph --cpusampler.OutputFile=profile.svg to your command line.
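For our running example, the full invocation would look something like this:

```
js --cpusampler --cpusampler.Output=flamegraph --cpusampler.OutputFile=profile.svg primes.js
```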
The following is a flame graph produced for the popular Ruby benchmark Optcarrot:
Feel free to open the file in your browser and explore the interactive aspects. The visualization allows you to zoom into individual functions, as well as cycle through three different color schemes which show functions, compilation info, and language info.
VisualVM and Chrome Inspector Integration
The CPU Sampler provides an API that other GraalVM tools can use to expose its data to other applications. Two such examples are the VisualVM and Chrome inspector integrations, shown in the screenshots below. Both of these tools leverage the integrated CPU Sampler to gather the profile and then present it to the end user in a familiar way. This not only gives end users more options; because these tools share the CPU Sampler as their data-gathering core, every improvement to that core is implicitly an improvement to these tools as well.
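For instance, one way to use the Chrome inspector integration is to start the application with the --inspect flag and open the printed URL in Chrome, where the DevTools profiling view is then available (a sketch, again assuming the js launcher):

```
js --inspect primes.js
```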
The Impact of the JIT
Since we are working with JIT-compiled languages, we might need to consider the impact the JIT compiler has on performance: Which functions were compiled? How much time was spent in compiled code? And, since GraalVM 21.1.0 introduced multi-tier compilation for Truffle languages, in which compilation tier was a function compiled? All these questions can be answered by adding --cpusampler.ShowTiers=true to our original example command line; the resulting output is shown after the command below.
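Concretely, the full command might look something like this:

```
js --cpusampler --cpusampler.ShowTiers=true primes.js
```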
This output shows what percentage of its time each function spent in each of the compilation tiers — quite useful for understanding the impact the JIT compiler has on performance. In this particular case, we can see that all the impactful functions spend most of their time (>98%) in code compiled by the tier 2 compiler — the highest level of optimization available.
Under the Hood
Prior to the 21.3.0 release of GraalVM, the CPU Sampler was implemented using the Truffle instrumentation framework: it kept a shadow stack during the execution of a program and periodically sampled that shadow stack to gather call stack data. This was a straightforward implementation of a CPU Sampler with one major drawback — overhead. Keeping the shadow stack up to date meant manipulating an additional stack data structure on every function entry and exit, which introduced significant overhead.
GraalVM 21.1.0 introduced the notion of Truffle safepoints. Truffle safepoints provide language and tool developers with a way to submit thread-local actions that are executed at points where it is safe to interrupt the execution of the application. The CPU Sampler in GraalVM 21.3.0 was rewritten to use this new functionality, removing the need for a shadow stack and thus greatly reducing the CPU and memory overhead of using the CPU Sampler.
Conclusion
A CPU Sampler is a very common tool for understanding the performance characteristics of an application. GraalVM provides a language-agnostic CPU Sampler for all the languages built using the Truffle framework, and this tool was significantly improved in the 21.3.0 release.
One of the goals of GraalVM is to be an excellent language and tool development platform. This is strongly supported by the ability to develop and improve language-agnostic tooling such as the CPU Sampler. It is a massive force multiplier: every improvement to a single tool is immediately available to the whole group of languages, as well as to future languages, at no extra effort to the individual language developers. We look forward to providing new and improved tooling support for GraalVM.
You can find more information about the CPU Sampler here, more information on how to implement your own tool for GraalVM here, and if you are interested in doing an internship with the GraalVM team take a look at the internship program.