Profile Guided Optimizations for Native Image

Boris Spasojević · Published in graalvm · Feb 12, 2024

One of the advantages that JIT compilers have over AOT compilers is the ability to analyze the run-time behaviour of the application they are compiling. For example, HotSpot keeps track of how many times each branch of an if statement is executed. This information is passed to the top-tier JIT compiler (i.e. Graal) in a form that we call a profile. The profile is a summary of how a particular method has executed during run time. The JIT compiler then assumes that the method will continue to behave in the same manner, and uses the information in the profile to better optimize that method.

AOT compilers typically do not have profiling information, and are usually limited to a static view of the code they are compiling. This means that, barring heuristics, an AOT compiler sees each branch of every if statement as equally likely to be taken at run time, every method as equally likely to be invoked as any other, and every loop as repeating the same number of times. This puts the AOT compiler at a disadvantage: without profile information, it is hard to generate machine code of the same quality as a JIT compiler does.

Profile Guided Optimization (PGO) is a technique that brings profile information to an AOT compiler to improve the quality of its output in terms of performance and size.

What is a Profile?

As mentioned earlier, a profile during JIT compilation is a summary of the run-time behaviour of the methods in the code. The same is true for AOT compilers, with the caveat that we have no runtime (i.e. no JVM) to provide this information to the compiler since compilation happens ahead-of-time. This makes the gathering of the profiles more challenging, but at a high level the content of the profile is very similar. In practice, the profile is a summarized log of how many times certain events happened during run time. These events are chosen based on what information will be useful for the compiler to make better decisions. Examples of such events are:

  • How many times was this method called?
  • How many times did this if-statement take the true branch? How many times did it take the false branch?
  • How many times did this method allocate an object?
  • How many times was a String value passed to a particular instanceof check?
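
To make this concrete, here is a minimal sketch of what such a summarized log could conceptually look like. The event names and counts are invented for illustration, and the actual profile format used by Native Image is different and richer:

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: conceptually, a profile maps events the compiler
// cares about to how often they happened at run time. The event names and
// counts below are invented for this example.
public class ProfileSketch {
    public static void main(String[] args) {
        Map<String, Long> profile = new LinkedHashMap<>();
        profile.put("invocations of someMethod(String[])", 103L);
        profile.put("someMethod: branch 'length < 3' was true", 3L);
        profile.put("someMethod: branch 'length < 3' was false", 100L);
        profile.put("someMethod: allocations of int[]", 100L);
        profile.forEach((event, count) -> System.out.println(event + " -> " + count));
    }
}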

How Do I Obtain a Profile of My Application?

When running an application on the JVM, the profiling of the application is handled by the runtime environment, with no extra steps needed from the developer. While this is undoubtedly simpler, the profiling that the runtime does is not free: it adds overhead to the code being profiled, both in terms of execution time and memory usage. This causes warmup issues: the application reaches predictable peak performance only after enough time has passed for the key parts of the application to be profiled and JIT-compiled. For long-running applications, this overhead usually pays for itself, yielding a performance boost later. On the other hand, for short-lived applications and applications that need to start with good, predictable performance as soon as possible, this overhead is counterproductive.

Gathering a profile for an AOT-compiled application is more involved and requires extra steps from the developer, but it introduces no overhead in the final application. Here, profiles must be gathered by observing the application while it is running. This is commonly done by compiling the application in a special mode that inserts instrumentation code into the application binary. The instrumentation code increments counters for the events that are of interest to the profile. We call this an instrumented executable, and the process of adding these counters is called instrumentation. Naturally, the instrumented build of the application will not be as performant as the default build due to the overhead of the instrumentation code, so it is not recommended to run the instrumented binary regularly in production. However, executing synthetic but representative workloads on the instrumented build allows us to gather a representative profile of the application (just as the runtime would do for the JIT compiler). When building an optimized binary, the AOT compiler then has both the static view and the dynamic profile of the application, and the resulting profile-guided-optimized binary performs better than the default AOT-compiled binary.
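
To illustrate the idea, the sketch below uses hand-written counters to stand in for the counters that the instrumented build inserts automatically. The class, method, and workload are made up for this example:

// Illustrative sketch only: manual counters standing in for the counters that
// instrumentation inserts automatically. At exit, the counts form the profile.
public class InstrumentationSketch {
    static long trueCount;   // times the condition was true
    static long falseCount;  // times the condition was false

    static boolean isShort(String s) {
        if (s.length() < 3) {
            trueCount++;     // event: true branch taken
            return true;
        } else {
            falseCount++;    // event: false branch taken
            return false;
        }
    }

    public static void main(String[] args) {
        String[] workload = {"a", "ab", "abc", "abcd", "abcde"};
        for (String s : workload) {
            isShort(s);
        }
        // The "profile" gathered by this run: true 2 times, false 3 times.
        System.out.println("true: " + trueCount + ", false: " + falseCount);
    }
}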

How Does a Profile “Guide” Optimization?

Compiler optimizations often have to make decisions during compilation. For example, in the following method, the function-inlining optimization needs to decide which call sites to inline and which to leave as calls.

private int run(String[] args) {
    if (args.length < 3) {
        return handleNotEnoughArguments(args);
    } else {
        return doActualWork(args);
    }
}

For illustrative purposes, let's imagine that the inlining optimization has a limit on how much code it can generate, and can hence inline only one of the calls. Looking only at the static view of the code being compiled, the doActualWork and handleNotEnoughArguments invocations are practically indistinguishable. Without any heuristics, the inlining optimization would have to guess which one is the better choice to inline, and making the wrong choice can lead to less efficient code. Let's assume that run is most commonly called with the right number of arguments at run time. Then inlining handleNotEnoughArguments would increase the code size of the compilation unit without giving any performance benefit, since the call to doActualWork still needs to be made most of the time.

A run-time profile of the application gives the compiler data that makes this choice easy. For example, if our run-time profile recorded the if condition (args.length < 3) as false 100 times and true 3 times, we should probably inline doActualWork. This is the essence of PGO: using information from the profile, i.e. from the run-time behaviour of the application being compiled, to ground the compiler's decisions in data. The actual decisions and the actual events the profile records vary from phase to phase, but the preceding example illustrates the general idea.
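
As an illustration of the outcome, the sketch below shows, in source form, what inlining the hot call and keeping the cold call would conceptually produce. The compiler performs inlining on its intermediate representation rather than on source code, and the method bodies here are invented:

// Illustrative sketch: the hot doActualWork call is inlined, the cold
// handleNotEnoughArguments call is left as a regular call. Bodies are invented.
public class InliningSketch {
    private static int run(String[] args) {
        if (args.length < 3) {
            return handleNotEnoughArguments(args);  // cold path: left as a call
        } else {
            // hot path: the body of doActualWork is inlined here
            int sum = 0;
            for (String arg : args) {
                sum += arg.length();
            }
            return sum;
        }
    }

    private static int handleNotEnoughArguments(String[] args) {
        System.err.println("expected at least 3 arguments, got " + args.length);
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(run(new String[] {"alpha", "beta", "gamma"}));
    }
}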

Notice here that PGO expects a representative workload to be run on the instrumented build of the application. Providing a misleading profile, i.e. one that records the exact opposite of the application's actual run-time behaviour, will hurt rather than help. In our example, this would mean running the instrumented build with a workload that invokes the run method with too few arguments, while the actual application does not. This would lead the inlining phase to inline handleNotEnoughArguments, reducing the performance of the optimized build.

Hence, the goal is to gather profiles on workloads that match the production workloads as much as possible. The gold standard for this is to run the exact same workloads we expect to run in production on the instrumented build.

A Game Of Life example

To understand the usage of PGO in the context of GraalVM Native Image, let's consider an example application. This application is an implementation of Conway's Game of Life simulation on a 4000 by 4000 grid. Please note that this is a very simple application that is not representative of real-world workloads, but it should serve well as a running example. The application takes as input a file specifying the initial state of the world, a file path to output the final state of the world to, and an integer declaring how many iterations of the simulation to run.

The entire source code of the application can be found in GameOfLife.java; it is referenced here so the results can be reproduced. Feel free to skip past it, as there is no need to understand it in detail for now.
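
For orientation, here is a minimal sketch of the kind of work a single Game of Life iteration performs. This is an illustrative sketch only, not the referenced GameOfLife.java (which additionally reads the initial state from a file, writes the final state to a file, and parses the iteration count):

// Minimal sketch of a single Game of Life step on a square grid.
public class LifeStepSketch {
    static boolean[][] step(boolean[][] world) {
        int n = world.length;
        boolean[][] next = new boolean[n][n];
        for (int row = 0; row < n; row++) {
            for (int col = 0; col < n; col++) {
                int liveNeighbours = 0;
                for (int dr = -1; dr <= 1; dr++) {
                    for (int dc = -1; dc <= 1; dc++) {
                        if (dr == 0 && dc == 0) continue;
                        int r = row + dr;
                        int c = col + dc;
                        if (r >= 0 && r < n && c >= 0 && c < n && world[r][c]) {
                            liveNeighbours++;
                        }
                    }
                }
                // Standard rules: a live cell survives with 2 or 3 live
                // neighbours; a dead cell becomes alive with exactly 3.
                next[row][col] = world[row][col]
                        ? (liveNeighbours == 2 || liveNeighbours == 3)
                        : liveNeighbours == 3;
            }
        }
        return next;
    }

    public static void main(String[] args) {
        boolean[][] world = new boolean[5][5];
        world[2][1] = world[2][2] = world[2][3] = true; // a horizontal "blinker"
        world = step(world);
        // After one step the blinker is vertical: prints true
        System.out.println(world[1][2] && world[2][2] && world[3][2]);
    }
}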

We are interested in elapsed time as a measure of the application's performance, i.e. of how well the application was optimized. The assumption is that the better the optimizations applied to the application, the less time it will take to complete a workload. We will build and run the same application in two different ways: with GraalVM Native Image without PGO, and with GraalVM Native Image with PGO.

Build instructions

Assuming that the environment variable JAVA_HOME points to an installation of GraalVM, we can run the following command:

$ $JAVA_HOME/bin/java -version
java version "21.0.1" 2023-10-17
Java(TM) SE Runtime Environment Oracle GraalVM 21.0.1+12.1 (build 21.0.1+12-jvmci-23.1-b19)
Java HotSpot(TM) 64-Bit Server VM Oracle GraalVM 21.0.1+12.1 (build 21.0.1+12-jvmci-23.1-b19, mixed mode, sharing)

The output confirms that the environment variable is set up correctly and points to the GraalVM version we expect.

Our first step is to compile our .java file to a class file.

$ $JAVA_HOME/bin/javac GameOfLife.java

Next, we build the default native executable of the application, as follows.

$ $JAVA_HOME/bin/native-image -cp . GameOfLife -o gameoflife-default
========================================================================================================================
GraalVM Native Image: Generating 'gameoflife-default' (executable)...
========================================================================================================================
For detailed information and explanations on the build output, visit:
https://github.com/oracle/graal/blob/master/docs/reference-manual/native-image/BuildOutput.md
------------------------------------------------------------------------------------------------------------------------
[1/8] Initializing... (3.5s @ 0.14GB)
Java version: 21.0.1+12, vendor version: Oracle GraalVM 21.0.1+12.1
Graal compiler: optimization level: 2, target machine: x86-64-v3, PGO: ML-inferred
...

Now we can move on to building a PGO-enabled native executable. As outlined before, the first step is to build an instrumented executable that will produce a profile of the run-time behaviour of our application. We do this by adding the --pgo-instrument option to the native-image command, as shown below.

$ $JAVA_HOME/bin/native-image -cp . GameOfLife -o gameoflife-instrumented --pgo-instrument
========================================================================================================================
GraalVM Native Image: Generating 'gameoflife-instrumented' (executable)...
========================================================================================================================
For detailed information and explanations on the build output, visit:
https://github.com/oracle/graal/blob/master/docs/reference-manual/native-image/BuildOutput.md
------------------------------------------------------------------------------------------------------------------------
[1/8] Initializing... (3.6s @ 0.14GB)
Java version: 21.0.1+12, vendor version: Oracle GraalVM 21.0.1+12.1
Graal compiler: optimization level: 2, target machine: x86-64-v3, PGO: instrument
...

This results in the gameoflife-instrumented executable, which is the instrumented build of our application. It does everything our application normally does, but just before exiting it produces a .iprof file, the format Native Image uses to store run-time profiles. By default, the instrumented build stores the profiles in default.iprof, but we can specify the exact name/path of the iprof file where we want the profiles saved. We do this with the -XX:ProfilesDumpFile argument when launching the instrumented build of the application. Below, we run the instrumented build and specify that we want the profile in the gameoflife.iprof file. We also provide the standard inputs the application expects: the initial state of the world (input.txt), where to write the final state of the world (output.txt), and how many iterations of the simulation to run (in this case 10).

$ ./gameoflife-instrumented -XX:ProfilesDumpFile=gameoflife.iprof input.txt output.txt 10

Once this run finishes, we have a run-time profile of our application in the gameoflife.iprof file. This finally enables us to build the optimized executable of the application, by providing the run-time profile via the --pgo option as shown below.

$ $JAVA_HOME/bin/native-image -cp . GameOfLife -o gameoflife-pgo --pgo=gameoflife.iprof
========================================================================================================================
GraalVM Native Image: Generating 'gameoflife-pgo' (executable)...
========================================================================================================================
For detailed information and explanations on the build output, visit:
https://github.com/oracle/graal/blob/master/docs/reference-manual/native-image/BuildOutput.md
------------------------------------------------------------------------------------------------------------------------
[1/8] Initializing... (3.6s @ 0.14GB)
Java version: 21.0.1+12, vendor version: Oracle GraalVM 21.0.1+12.1
Graal compiler: optimization level: 3, target machine: x86-64-v3, PGO: user-provided
...

With all this in place, we can finally move on to evaluating the run-time performance of our application built in these different modes.

Evaluation

We will run both of our application executables using the same inputs. We measure elapsed time using the Linux time command with a custom output format (--format=">> Elapsed: %es"). Note: we fixed the CPU clock at 2.5GHz during all measurements to minimize noise and improve reproducibility.

1 Iteration

Let’s start off small and run our Game Of Life application for a single iteration. The commands and output of both our application builds are shown below.

$ time  ./gameoflife-default input.txt output.txt 1
>> Elapsed: 1.62s

$ time ./gameoflife-pgo input.txt output.txt 1
>> Elapsed: 0.99s

Looking at the elapsed time, we see that the PGO executable is substantially faster, completing the single iteration in roughly 40% less time. The half-second or so of difference does not have a huge impact for a single run of this application, but if this were a serverless application that executes frequently, the cumulative performance gain would start to add up.

100 Iterations

We now move on to running our application for 100 iterations. As before, the executed commands and the time output are shown below.

$ time  ./gameoflife-default input.txt output.txt 100
>> Elapsed: 24.40s

$ time ./gameoflife-pgo input.txt output.txt 100
>> Elapsed: 14.40s

Native Image performance comparison with and without PGO

In both of our example runs (1 and 100 iterations), the PGO build outperforms the default native-image build significantly. The amount of improvement that PGO provides in this case is of course not representative of the gains for real-world applications, since our Game Of Life application is small and does exactly one thing, so the profiles we provided are based on the exact same workload we are measuring. But it illustrates the general point: profile-guided optimizations allow AOT compilers to perform tricks similar to those JIT compilers use to improve the performance of the code they generate.

Binary size

As a bonus perk of using PGO for our native-image build, let's compare the sizes of the default executable and the PGO executable of our application. We will use the Linux du command as shown below.

$ du -hs gameoflife-default
7.9M gameoflife-default

$ du -hs gameoflife-pgo
6.7M gameoflife-pgo

As we can see, the PGO build produces a ~15% smaller binary than the default build. Recall that the PGO version outperformed the default version for both iteration counts we tested. Recall also that certain optimizations, such as the function inlining mentioned earlier, increase binary size in order to improve performance. So how can PGO builds produce smaller but better-performing binaries?

This is because the profiles we provided for the optimizing build allow the compiler to differentiate between code that is important for performance (i.e. hot code, where most of the run time is spent) and code that is not (i.e. cold code, where little run time is spent, such as error handling). With this differentiation available, the compiler can focus more heavily on optimizing the hot code and spend less effort, or none at all, on the cold code. This is very similar to what a JVM does: it identifies the hot parts of the code at run time and JIT-compiles just those parts. The main difference is that Native Image PGO does the profiling and the optimizing ahead-of-time.

Conclusion

In this text, we presented an overview of the main ideas behind Profile-Guided Optimization (PGO), with a special focus on the implementation of PGO for Native Image. We discussed how recording the behaviour of an application at run time (i.e. profiling) and storing this information for later use (i.e. the profile, stored in an .iprof file for Native Image) can give an ahead-of-time compiler access to information that it normally does not have. This information can be used to guide decision making in the compiler, which can result in better performance as well as smaller binaries. We illustrated the benefits of PGO on a toy Game of Life example.

It is also important to note that PGO is not a trivial technique to use: to be beneficial, it requires executing the instrumented build with realistic workloads. Bear in mind that PGO is only as good as the profiles provided to the optimizing build. A profile gathered on an unrepresentative workload can be counter-productive, and workloads that cover only a part of the application's functionality will likely yield a smaller performance gain than a realistic workload with better coverage.

In cases where you aren’t using PGO, Oracle GraalVM will by default use ML-based profile inference, which provides around 6% runtime speedup compared to not having profiling information.
