Improving performance of GraalVM native images with profile-guided optimizations

Published in graalvm · 8 min read · Aug 29, 2019

The GraalVM Native Image tool rightfully attracts a lot of attention, as it offers significant improvements in startup speed and overall memory usage. However, if you benchmark peak performance, you may observe that a native image does not always deliver better throughput as well.

In this article we look at the pros and cons of GraalVM Native Image. We then show an easy way of generating profile-guided optimization (PGO) profiles, introduced in GraalVM 19.2, that significantly improves the throughput of native images for known workloads.

A guide on how to use profile-guided optimizations for GraalVM native images.

Benefits of Native Image

Are you seeking a system with fast startup, low system requirements and multi-threaded communication? Do you want to code in a higher-level language than C? Do you want to enjoy the benefits of memory safety and automatic garbage collection? There are a few languages designed specifically for this space, and you may be considering rewriting your application in one of them. However, if you are living in the JVM world, there is an easier option. Use GraalVM's native-image tool!

Native Image takes the bytecode of your application and compiles it to a native executable ahead of time. As a result, you can program in any JVM language (Java, Scala, Kotlin) and get a single, self-contained executable file as output. A single file has many benefits. It can be copied from one system to another by itself, because it contains all the application code as well as the necessary runtime support, such as the garbage collector. A single file gets loaded and is ready to run: there is no need to look for various JARs, properties and other miscellaneous files and wait for them to be opened, loaded and initialized. The file generated by Native Image gives us instant startup. In addition, the Native Image tool is able to capture a snapshot of the application's memory: you can bring your system into a ready-to-run state, and when the generated native executable starts, it continues exactly from where it was. This eliminates repetitive initialization and makes startup even faster.

Another benefit of ahead-of-time compilation is lower memory consumption. A typical JVM keeps an enormous amount of metadata in memory in addition to the JIT-generated native code. This metadata is needed to be able to de-optimize at almost any moment. Nothing like that is needed with Native Image: the generated code covers all possible code paths and never de-optimizes. The native code is sufficient on its own, so all the metadata can be dropped when the native executable is generated.

Despite all of the above, Native Image preserves the most important aspects of the JVM. You can use the language of your choice, be it Java, Scala, Kotlin or another JVM language. You can benefit from all the development tools available for the JVM. You can rely on the strong concurrency guarantees of the JVM, and you don't need to manage memory yourself because garbage collection is automatic. The rich JVM ecosystem of useful libraries, tools and frameworks is waiting to be compiled ahead of time.

Trade-offs of Native Image

The previous text might make you believe that Native Image is great and that it should replace the Java HotSpot VM immediately. That would not be accurate. The benefits brought by Native Image aren't free; they come with a cost. There are some aspects where Native Image limits its users more than the classic Java HotSpot VM does.

Obviously, the native executable can only run on a single platform. If you generate the image for 64-bit Linux, it only runs on Linux. If for Mac, it runs on Mac. If the executable is generated for Windows, it is going to run only on Windows. Portability is restricted compared to a classic JAR file.

Another limitation is caused by the missing metadata at run time. The previous section mentioned missing metadata as a benefit, but it also has its cost. Since by default a native image doesn't retain information about classes and methods, the ability to perform reflection is limited. Reflection is still possible, but it has to be configured and compiled into the native executable. As there are many Java frameworks that rely on reflective access, getting them to run on Native Image may require additional configuration.

Yet another restriction comes from the fact that the Native Image runtime may not support all features of Java. Running the Swing UI toolkit may not be possible, as it is too dynamic. On the other hand, Native Image has successfully run Javac, Netty, Micronaut, Helidon and Fn Project, all large and nontrivial applications running on top of the JDK.

The last drawback associated with ahead-of-time compilation is speed. What? I thought Native Image starts faster! Well, it does start significantly faster than a comparable JVM application, but in the end, when the application runs for a long time, the just-in-time compiler can actually outperform the AOT one. As the helidon.io team puts it:

“On the other hand, everything is always a tradeoff. Long running applications on traditional JVMs are still demonstrating better performance than GraalVM native executables due to runtime optimization. The key word here is long-running; for short-running applications like serverless functions, native executables have a performance advantage. So, you need to decide yourself between fast startup time and small size (and the additional step of building the native executable) versus better performance for long-running applications.”

Now we are getting to the main topic of this post. Let's look at why the peak performance of ahead-of-time compiled code is lower, and then let's speed it up!

There is no Free Lunch!

By removing most of the typical metadata associated with JVM execution, Native Image gives up on further optimizations based on execution profiles. The ahead-of-time generated code is what one gets. There is no chance to do more inlining, to co-locate code on hot paths, or to optimize aggressively on speculation and rely on a trap to signal the need for de-optimization and a more conservative recompilation. These are exactly the optimizations that make the JVM so great at reaching excellent peak performance. During ahead-of-time compilation, Native Image doesn't have enough information to generate such optimal code.

On the other hand, there is no need for initial interpretation of the bytecode. There is no need for de-optimizations, and there is no support for arbitrary reflection poking around your classes. As a result, a short-lived application starts faster as a native image and uses less memory overall. The benefits are huge, but everything comes at some cost. There is no free lunch. Or is there?

Improving Peak Performance of Native Image

A commonly used technique to compensate for the missing just-in-time optimization is to gather execution profiles in one run and then use them to optimize subsequent compilations. GraalVM 19.2.0 Enterprise comes with an improved profile-guided optimization system. Let's demonstrate its functionality on a classic object-oriented demo application that works with geometric shapes:

Shape.java
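As a minimal sketch, Shape.java could look like the following. Several things here are assumptions rather than facts from this article: the layout (an interface with nested implementations, chosen to match the Shape$Circle.class style of names in the directory listing further down), the argument order of count, seed, rounds and shape kinds (inferred from the command lines used later), and the helper names create and generate, which are hypothetical. Timings from this sketch should not be expected to match the figures quoted in this article; with a small count a round may be too brief to register in milliseconds.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch only: layout, helper names and argument order are assumptions.
interface Shape {
    double area();

    final class Circle implements Shape {
        private final double radius;
        Circle(double radius) { this.radius = radius; }
        @Override public double area() { return Math.PI * radius * radius; }
    }

    final class Square implements Shape {
        private final double side;
        Square(double side) { this.side = side; }
        @Override public double area() { return side * side; }
    }

    final class Rectangle implements Shape {
        private final double width, height;
        Rectangle(double width, double height) { this.width = width; this.height = height; }
        @Override public double area() { return width * height; }
    }

    final class Triangle implements Shape {
        private final double base, height;
        Triangle(double base, double height) { this.base = base; this.height = height; }
        @Override public double area() { return base * height / 2; }
    }

    // Sums all areas; shape.area() is the virtual call discussed in the text.
    static double computeArea(Shape[] all) {
        double sum = 0;
        for (Shape shape : all) {
            sum += shape.area();
        }
        return sum;
    }

    // Hypothetical helper: builds one shape of the named kind with random dimensions.
    static Shape create(String kind, Random random) {
        switch (kind) {
            case "circle": return new Circle(random.nextDouble());
            case "square": return new Square(random.nextDouble());
            case "rectangle": return new Rectangle(random.nextDouble(), random.nextDouble());
            case "triangle": return new Triangle(random.nextDouble(), random.nextDouble());
            default: throw new IllegalArgumentException("unknown shape: " + kind);
        }
    }

    // Hypothetical helper: fills an array with randomly chosen kinds, reproducibly for a given seed.
    static Shape[] generate(int count, long seed, String[] kinds) {
        Random random = new Random(seed);
        Shape[] all = new Shape[count];
        for (int i = 0; i < count; i++) {
            all[i] = create(kinds[random.nextInt(kinds.length)], random);
        }
        return all;
    }

    // Assumed usage: <count> <seed> <rounds> <kind>...
    static void main(String[] args) {
        if (args.length < 4) {
            System.err.println("usage: Shape <count> <seed> <rounds> <kind>...");
            System.exit(1);
        }
        Shape[] all = generate(Integer.parseInt(args[0]), Long.parseLong(args[1]),
                Arrays.copyOfRange(args, 3, args.length));
        int rounds = Integer.parseInt(args[2]);
        double sum = 0;
        long tookMs = 0;
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            sum = computeArea(all);
            tookMs = (System.nanoTime() - start) / 1_000_000;
        }
        System.out.println("total area " + sum); // printed so the work cannot be optimized away
        System.out.println("last round " + tookMs + " ms.");
    }
}
```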

The program introduces the Shape interface and its four implementations: Circle, Square, Rectangle and Triangle. The base interface defines an area() method, and each of the geometric classes overrides it with an implementation suitable for its shape. Those who know how object-oriented languages are implemented can already smell the problem: if we create an array of shapes and iterate over it, the code has to be ready for virtual method dispatch. Let's do it:

computeArea method

The array of shapes can contain instances of any of these classes, so the call shape.area() has to be able to reach any of the actual methods. That's usually done with a virtual method table associated with each geometric class: determine that the current shape is, say, a Circle, look up the actual implementation of Circle.area() and call it. Doing this costs a little extra work on every call. To demonstrate that, let's generate a huge array of random objects and measure how long invoking the computeArea method takes:

the main method which generates shapes and measures time

If you put all the above code into a file named Shape.java (do it in an empty directory), you can compile it with GraalVM's Native Image tool. To get started, download GraalVM Enterprise Edition as well as the GraalVM Enterprise Edition Native Image tool. Unpack GraalVM and use its gu tool to install the bin/native-image utility (gu install --file native-image-installable-svm-svmee-*-19.2.0.jar). Then you can:

$ /graalvm-ee-19.2.0/bin/javac Shape.java
$ /graalvm-ee-19.2.0/bin/native-image Shape
$ ls -1
graalvm-ee-19.2.0
shape
'Shape$Circle.class'
'Shape$Rectangle.class'
'Shape$Square.class'
'Shape$Triangle.class'
Shape.class
Shape.java

A shape executable has been generated. When you run it, it is going to be completely standalone, start fast, require little memory, but it won't be optimized. Try it:

$ ./shape 15000 43243223423 30 square rectangle
last round 35 ms.
$ ./shape 15000 43243223423 30 triangle circle
last round 34 ms

The actual execution time may vary depending on the speed of your computer. The absolute values do not matter much; we just want to make the execution faster. Let's train our program to be ready for squares and rectangles. To do so, we need to capture data about the actual program execution, so let's generate the PGO data:

$ /graalvm-ee-19.2.0/bin/java -Dgraal.PGOInstrument=shape.iprof Shape 15000 43243223423 130 square rectangle

The shape.iprof file is generated once the execution is over. If you inspect its content, you will find a reference to Shape$Square, but no reference to Shape$Circle. Of course: we've been training the program for squares and rectangles, not circles! The fact that Shape$Circle is missing from the shape.iprof file signals that the training was successful. Let's now use the data and regenerate our native image:

$ /graalvm-ee-19.2.0/bin/native-image --pgo=shape.iprof Shape
$ ./shape 15000 43243223423 30 square rectangle
last round 25 ms.

Speedup! Instead of 35 ms, the trained program now completes a round in 25 ms. Just by training it, recording the compiler decisions and using them to guide the compilation, we have cut the time per round by almost 30% (10 ms out of 35 ms, or about 29%).
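One way to picture why a profile can help is a hand-written analogy. The sketch below is not the code the GraalVM compiler emits, and whether it applies exactly this transformation here is an assumption. It shows a common profile-guided technique: when the profile says that only Square and Rectangle reach a call site, the virtual call can be replaced by cheap exact-type checks with the method bodies inlined, keeping the virtual call only as a fallback. The class name GuardedDispatch and the method computeAreaProfiled are hypothetical.

```java
// Analogy only: a hand-specialized loop imitating what a type profile makes possible.
final class GuardedDispatch {
    interface Shape { double area(); }

    static final class Square implements Shape {
        final double side;
        Square(double side) { this.side = side; }
        @Override public double area() { return side * side; }
    }

    static final class Rectangle implements Shape {
        final double width, height;
        Rectangle(double width, double height) { this.width = width; this.height = height; }
        @Override public double area() { return width * height; }
    }

    static final class Circle implements Shape {
        final double radius;
        Circle(double radius) { this.radius = radius; }
        @Override public double area() { return Math.PI * radius * radius; }
    }

    // Unprofiled version: every iteration performs a virtual call.
    static double computeArea(Shape[] all) {
        double sum = 0;
        for (Shape shape : all) {
            sum += shape.area();
        }
        return sum;
    }

    // "Profiled" version: the two types seen during training get inlined fast paths.
    static double computeAreaProfiled(Shape[] all) {
        double sum = 0;
        for (Shape shape : all) {
            if (shape.getClass() == Square.class) {
                Square s = (Square) shape;
                sum += s.side * s.side;       // body of Square.area(), inlined
            } else if (shape.getClass() == Rectangle.class) {
                Rectangle r = (Rectangle) shape;
                sum += r.width * r.height;    // body of Rectangle.area(), inlined
            } else {
                sum += shape.area();          // fallback: virtual call for unprofiled types
            }
        }
        return sum;
    }
}
```

In this analogy, a shape outside the profiled set has to fail both type checks before it reaches the fallback call, which is one way to picture why a workload that differs from the training run can end up slower than with no profile at all.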

Note that this result is still not quite on par with a warmed-up JIT compiler. If we run the same code with a JIT compiler, we still see better results.

$ java Shape 15000 43243223423 130 square rectangle
last round 17 ms.

We’re working on enabling better optimizations in the GraalVM compiler used ahead-of-time, so the performance of native images should improve further in the future.

If you're wondering whether the PGO numbers translate well to real-world applications, you can try profile-guided optimizations on a larger project, for example the Micronaut demo application for GraalVM. In our initial tests, PGO shows good results there. We plan to expand on this topic in further articles.

Of course, the speedup from PGO is only visible when the real workload resembles the one we trained for. If the program input diverges and execution ends up on the non-optimized paths, the program can actually be slower than without any profiles:

$ ./shape 15000 43243223423 30 triangle circle
last round 49 ms.

Should something like that happen, it is time to re-profile your application, gather new PGO data and recompile. Note that prior to 19.2.0 one needed to build a special instrumented native image of the program to collect the profile information; collecting it by simply running the application on the JVM, without preparing an instrumented native image, is much simpler.

Conclusions

It is well known that GraalVM Native Image gives you quick startup and consumes less memory. GraalVM 19.2.0 Enterprise brings you a simplified way to use profile-guided optimizations (PGO): with its help, it is possible to train your application for specific workloads and significantly improve its peak performance. Download GraalVM 19.2.0 Enterprise and try it yourself.