Graal vs C2 who runs Kotlin faster?

See how the new compiler can make Kotlin run faster

Uberto Barbini

Published in

Javarevisited

12 min readMar 13, 2020

More than a year ago I wrote a post about GraalVM performance in Kotlin for the Java Advent Calendar 2018.

It’s time to re-evaluate the results using the new GraalVM v.20.0 and with both the Community Edition and the Enterprise Edition of GraalVM.

You may have heard of Graal, the new JIT compiler for the JVM written in Java. It is available inside the JDK since Java10 and sometime in the future could become the standard compiler in the JDK.

If you are interested, you can find more information here: https://www.infoq.com/articles/Graal-Java-JIT-Compiler

Quick introduction

Since Java 1.3 HotSpot contains two Just In Time compilers: one very quick called C1 and another one slow but production more optimized code called C2. JVM uses C1 for a quick start-up whilst C2 is used later to improve performance.

Graal is the new compiler for JVM Bytecode written in Java. It uses the JVMCICompiler interface introduced with Java9. It can be run in JIT or in Ahead Of Time mode to produce native applications.

GraalVM is a complete replacement for the standard JDK which uses Graal compiler and other related technologies (i.e. Truffle, Substrate).

So how can we give Graal a run? Since Java 10 you can activate Graal compiler on any JDK simply using two VM Options -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler. Unfortunately, the Graal compiler included inside the JDK is not always the most up-to-date version, so a better solution is to switch to GraalVM as JDK.

If you have SdkMan installed, you just need three commands:

sdk list java
//choose the right GraalVM versionsdk install java 20.0.0.r11-grl
//install itsdk use java 20.0.0.r11-grl
//select it as current JDK

If you don’t have SdkMan installed you can download Graal from the Oracle site, expand it somewhere and set up JAVA_HOME… but really, why don’t just install SdkMan instead?

GraalVM comes in two flavors: CE and EE. You can read about it here, but the gist is that CE is fully open source, while EE is only free for evaluation and not-production use. EE is based on CE with some additional optimizations, that may be or may not be important for you, as we will see.

To download the Enterprise Edition you need to register on Oracle and sign the license agreement.

I have prepared three algorithms to test the compilers: a Mandelbrot Set generator, a Knapsack solver and a Modular Algebra calculator. The full code is available on Github.

I’ve tried to write the code in the most idiomatic Kotlin way, taking advantage of extension functions, data classes, context programming, for each loop. Not only the resulting code is very easy to read, but also the performances are not significantly slower than using a “C” like syntax.

I have chosen these three algorithms to have some realistic example of very different CPU-intensive tasks, and the code is completely in Kotlin, without using any Java library.

Mandelbrot Set

The Mandelbrot Set is probably the most famous fractal shape and you may have seen it even if you don’t know the name.

Mathematically is defined as the Set of all points in the Complex plane where the function z <- z²+c will not diverge when iterated. It is quite easy to generate the set iterating the functions for some points on the complex plane and creating an image out of them.

Since the goal here is the performance and not the graphics I kept things simple using text graphic.

Let’s start looking at the code for the Mandelbrot Set. Here is the definition of a type to represent the Complex numbers:

data class Complex(val r: Double, val i: Double){
    operator fun times(other: Complex) =
        Complex(
            r = this.r * other.r - this.i * other.i,
            i = this.i * other.r + this.r * other.i
        )
    operator fun plus(other: Complex) =
        Complex(r = this.r + other.r, i = this.i + other.i)
    operator fun minus(other: Complex) =
        Complex(r = this.r - other.r, i = this.i - other.i)
    fun squared() = this * this
    fun squaredModule() = r * r + i * i
    fun Double.toComplex() = Complex(r=this, i=0.0)
}

Then the actual calculation for each point of the Set:

fun mandelSet(initZ: Complex, c:Complex, maxIter: Int): Int {
    var z = initZ
    (1..maxIter).forEach{
        z = z.squared() + c
        if (z.squaredModule() >= 4)
            return it
    }
    return maxIter
}

You can see here how using operator overloading and a data class to represent complex numbers can really simplify the code and making easier to understand.

Once we defined in the Complex class the rules to do operations on complex numbers, the mandelSet function has only to check if the operation z <- z²+c is “escaping” or not and in case after how many iteration it will exceed the threshold of 4.

Here you can see the characteristic cardioid shape of Mandelbrot Set in the output rendered in AsciiArt:

Knapsack problem

The Knapsack problem can be defined in several ways. Imagine a thief that just broke into a watch shop. He can steal as many watches as he wants, provided that he doesn’t exceed the maximum weight that he can carry on his knapsack.

Being a rational thief, he definitely wants to optimize the value of the watches in his knapsack. Every watch has a price and a weight attached. So we need an algorithm to find the set of Watches with the maximum total price for a given weight.

Real-world applications of this algorithm include optimizing cuts and materials for CNC machines and marketing strategies for allocating advertising budget.

For example, let’s start with a shop with only 3 watches defined like this:

val shop = Knapsack.shop(
    Watch(weight = 1, price = 1),
    Watch(weight = 3, price = 2),
    Watch(weight = 1, price = 3)
)

If we have a maximum weight of 1, it’s better for us to pick up the third watch, rather than the first, because the value is higher.

If we have a maximum weight of 3, we can choose the number 2 (price 2) or number 1 and 3 (price 1 + 3). In this case, it’s better to pick up 1 and 3 even if their combined weight is less than the maximum.

These are the complete solutions for this shop:

assertEquals(3, selectWatches(shop, maxWeight = 1))
assertEquals(4, selectWatches(shop, maxWeight = 2))
assertEquals(4, selectWatches(shop, maxWeight = 3))
assertEquals(5, selectWatches(shop, maxWeight = 4))
assertEquals(6, selectWatches(shop, maxWeight = 5))

As you can see, as the number of watches available increase, the number of possible choices become high very, very quickly. It is a classical NP-Hard problem.

To solve it in reasonable times we need to cheat a bit and use Dynamic Programming. We can build a map with the already optimized solution for each set of watches and so we can avoid recalculating them each time.

The general algorithm is based on an exhaustive search based on recursion. This is the Kotlin code to solve it, separated in a memoization function and the recursive search of maximum value.

typealias Memoizer = MutableMap<String, Int>fun priceAddingElement(memo: Memoizer, shop: Set<Watch>, choice: Set<Watch>, maxWeight: Int, priceSum: Int): Int =
    shop.filter { !(it in choice) && it.weight <= maxWeight }
        .map {
            selectWatches(
                memo,
                shop,
                maxWeight - it.weight,
                choice + it,
                priceSum + it.price) }
        .filter { it > priceSum }
        .max() ?: priceSum
fun selectWatches(memo: Memoizer, shop: Set<Watch>, maxWeight: Int, choice: Set<Watch>, priceSum: Int): Int =
    memoization(memo, generateKey(choice)) {
        priceAddingElement(memo, shop, choice, maxWeight, priceSum)}
private fun memoization(memo: Memoizer, key: String, f: () -> Int): Int = when (val w = memo[key]) {
        null -> f().also { memo[key] = it }
        else -> w
    }

I really like how Kotlin allows expressing the intention clearly without having to repeat yourself.

Modular Algebra

Modular algebra is a very useful and interesting branch of mathematics. It’s also called “clock algebra” because it resembles the operation with hours in a clock, for example, 7 + 6 usually is equal to 13, but if we are talking about the time it will be equal to 1. Here the clock is a modulo 12 algebra, but we can generalize it to any number.

It has a lot of interesting properties, for example keeping adding a number which is prime with the modulo will generate a sequence with all the numbers up to the modulo. This means that if we add a number prime to 12 in our clock algebra we will generate a sequence with all the hours. For example, 5 is the smallest number prime to 12, the sequence of 5 in modulo 12 is: 5, 10, 3, 8, 1, 6, 11, 4, 9, 2, 7, 0 and then it repeats itself.

This is the class to define a modular number and its operations. Note that division is quite problematic in modular algebras and I used a brute force approach as a shortcut:

data class ModularNumber(val num: Int, val modulo: Int): Comparable<ModularNumber>  {

    override fun compareTo(other: ModularNumber): Int = num.compareTo(other.num)

    operator fun plus(other: ModularNumber) = (num + other.num).toModularNumber()
    operator fun minus(other: ModularNumber) = (num - other.num).toModularNumber()
    operator fun times(other: ModularNumber) = (num * other.num).toModularNumber()
    operator fun div(other: ModularNumber) = ((1..modulo).first { (it * other.num) % modulo == (this.num % modulo) }).toModularNumber()
    operator fun inc() = (num + 1).toModularNumber()

    fun squared() = (num*num).toModularNumber()

    fun Int.toModularNumber(): ModularNumber =
        ModularNumber(this % modulo, modulo)

}

What can we do with such a charming algebra? We can draw nice patterns! To produce they we only need to compare the square of two numbers, inside modular algebra.

data class PlaneGrid(val size: Int, val initPredicate: (Int) -> Boolean) {
    private val grid = BooleanArray(size*size, initPredicate)

    operator fun get(x: Int, y: Int): Boolean = grid[coord(x,y)]

    private fun coord(x: Int, y: Int) = (y-1) * size + x -1

    fun count(): Int = grid.count {it}
}

val compareSquares: (ModularNumber, ModularNumber) -> Boolean =
    { x, y -> x.squared() >= y.squared() }

fun compareSquares(size: Int, modulo: Int): Int =
    ModularField(modulo).applyFunction(compareSquares)(size).count()

fun sumOfFunction(from: Int, to: Int, f: (Int) -> Int) =
    (from..to)
        .map { f(it) }
        .sum()data class ModularField(val modulo: Int) {  fun applyFunction(f: (ModularNumber, ModularNumber) -> Boolean):    (Int) -> PlaneGrid =
        { size ->
            PlaneGrid(size) { index ->
                val xMod = (index % size).toModularNumber()
                val yMod = (index / size).toModularNumber()
                f(xMod, yMod)
            }
        }  fun Int.toModularNumber(): ModularNumber =
        ModularNumber(this % modulo, modulo)
}

These are some examples of modular patterns generated by the above functions:

Modulo 5              Modulo 15            Modulo 12
@@@@@@@@@@@@@@@@@@@@  @@@@@@@@@@@@@@@@@@@@ @@@@@@@@@@@@@@@@@@@@
@@@@ @@@@ @@@@ @@@@   @@@@@@@@@@@@@@ @@@@  @@@@@ @@@@@ @@@@@ @
@@   @@   @@   @@     @@ @@@@@@ @@   @@    @@@   @@@   @@@
@@   @@   @@   @@     @ @    @ @     @     @     @     @
@@@@ @@@@ @@@@ @@@@   @@@@@@@@@@@@@@ @@@@  @@@   @@@   @@@
@@@@@@@@@@@@@@@@@@@@  @    @               @@@@@ @@@@@ @@@@@ @
@@@@ @@@@ @@@@ @@@@   @ @@  @@ @     @     @@@@@@@@@@@@@@@@@@@@
@@   @@   @@   @@     @@ @@@@@@ @@   @@    @@@@@ @@@@@ @@@@@ @
@@   @@   @@   @@     @@ @@@@@@ @@   @@    @@@   @@@   @@@
@@@@ @@@@ @@@@ @@@@   @ @@  @@ @     @     @     @     @
@@@@@@@@@@@@@@@@@@@@  @    @               @@@   @@@   @@@
@@@@ @@@@ @@@@ @@@@   @@@@@@@@@@@@@@ @@@@  @@@@@ @@@@@ @@@@@ @
@@   @@   @@   @@     @ @    @ @     @     @@@@@@@@@@@@@@@@@@@@
@@   @@   @@   @@     @@ @@@@@@ @@   @@    @@@@@ @@@@@ @@@@@ @
@@@@ @@@@ @@@@ @@@@   @@@@@@@@@@@@@@ @@@@  @@@   @@@   @@@
@@@@@@@@@@@@@@@@@@@@  @@@@@@@@@@@@@@@@@@@@ @     @     @
@@@@ @@@@ @@@@ @@@@   @@@@@@@@@@@@@@ @@@@  @@@   @@@   @@@
@@   @@   @@   @@     @@ @@@@@@ @@   @@    @@@@@ @@@@@ @@@@@ @
@@   @@   @@   @@     @ @    @ @     @     @@@@@@@@@@@@@@@@@@@@
@@@@ @@@@ @@@@ @@@@   @@@@@@@@@@@@@@ @@@@  @@@@@ @@@@@ @@@@@ @

Benchmarks

And now the part you were waiting for all along. Let’s compare the performance of the two compilers!

Let’s remember that Graal is written in Java and it is taking advantage of new researches in the compiler space, but it is still relatively young. C2 on the other side is extremely well-tuned and mature.

Now a word about testing methodology: All tests run on my Notebook with i7 4.4Ghz 32Gb. I ran the tests several times with different parameters and the results have been consistent. The actual values in the graphs have been taken in the same session without other applications running.

I run a test called PerformanceTest in each project which executes an end to end benchmark continuously. In this way, the compiler cannot optimize the code for a specific case. Then I choose the faster result, assuming it is the one without GC and OS pauses. All tests are running mono-thread.

Here are the results in microseconds as an average of many runs:

These results are consistent in showing a definite performance gain using Graal over Hotspot. The main reason is the more sophisticated inlining and escape analysis optimizations.

In some cases, Graal EE special optimizations (like loop unrolling) give it a further advantage.

As a methodology, I always suggest starting with an end-to-end continuous running benchmark, as it is the safest way to collect performance metrics. It will give us the best approximation of production performance.

If we want to understand better where the bottlenecks are, we need to create more specific micro-benchmarks. Fortunately for us, there is an incredibly good library to do exactly this.

JMH

Enter Java Micro-benchmark Harness. JMH is a library to simplify the task to implement Java microbenchmarks correctly. JMH has been created by some JVM developers and it is used for JVM internal performance test. We can take this as a good indication of its quality.

How to set up JMH using Gradle (and Kotlin)?
You just need to edit your build.gradle file. First, add the dependencies:

testImplementation("org.openjdk.jmh:jmh-core:1.23")
testImplementation("org.openjdk.jmh:jmh-generator-annprocess:1.23")

Then, add the Gradle JMH plugin:

plugins {
    kotlin("jvm") version "1.3.70"
    id("me.champeau.gradle.jmh") version "0.5.0"
}

Finally, add a section with the properties you want to set. You can find the full list here.

jmh {
    jmhVersion = "1.23"
    jvmArgs = listOf(
        "-Xms6g",
        "-Xmx6g",
        "-Dgraal.ShowConfiguration=info",
        "-XX:+AlwaysPreTouch",
        "-XX:+UnlockExperimentalVMOptions",
        "-XX:+UseJVMCICompiler"
    )
    warmupForks = 0
    fork = 2
    warmupIterations = 5
    iterations = 5
}

How to create a test for JMH?

To create a new benchmark we only need to put an annotation around a function.

@Benchmark
fun knapsack(blackhole: Blackhole) {
    blackhole.consume(Knapsack.selectWatches(shop, 199))
}

There are many possible optional parameters for the benchmark annotation, the default mode is to count the operations for a second, but there are many other possible ways to run the test.

In this test, you can see we are usingblackhole class to prevent the compiler to consider the benchmark as “dead code” since we are not doing anything with the result of calculations.

Let’s see the results:

From this graph, we can see a bigger difference in raw CPU performance than in the continuous run test.

One difference is that JMH uses some techniques to avoid garbage collection cycles, why the continuous run after a while (usually from a few seconds to some minutes) would fill all the memory and require a full GC cycle.

Graal garbage collection is currently not as good as the Hotspot one, so this can partly explain the results. On the positive side, RedHat is currently investing to bring their great Shenandoah garbage collector to GraalVM. Expect new benchmarks when this will happen!

As a reference here is the exact setup for each test:

Hotspot /C2

# JMH version: 1.23
# VM version: JDK 11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+9-jvmci-20.0-b02
# VM invoker: /home/ubertobarbini/.sdkman/candidates/java/20.0.0.r11-grl/bin/java
# VM options: -Xms6g -Xmx6g -Dgraal.ShowConfiguration=info -XX:+AlwaysPreTouch -XX:+UnlockExperimentalVMOptions -XX:-UseJVMCICompiler
# Warmup: 1 iterations, 10 s each
# Measurement: 1 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/timeBenchmark                                  Mode  Cnt   Score   Error  Units
PerformanceMicroBenchmark.knapsack        thrpt   10  12.686 ± 0.719  ops/s
PerformanceMicroBenchmark.mandelbrot      thrpt   10  49.591 ± 1.796  ops/s
PerformanceMicroBenchmark.modularAlgebra  thrpt   10  37.240 ± 3.422  ops/s

Graal 20.0 CE

# JMH version: 1.23
# VM version: JDK 11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+9-jvmci-20.0-b02
# VM invoker: /home/ubertobarbini/.sdkman/candidates/java/20.0.0.r11-grl/bin/java
# VM options: -Xms6g -Xmx6g -Dgraal.ShowConfiguration=info -XX:+AlwaysPreTouch -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/timeBenchmark                                  Mode  Cnt   Score    Error  Units
PerformanceMicroBenchmark.knapsack        thrpt   10  19.060 ±  1.854  ops/s
PerformanceMicroBenchmark.mandelbrot      thrpt   10  85.828 ±  4.836  ops/s
PerformanceMicroBenchmark.modularAlgebra  thrpt   10  71.653 ± 10.713  ops/s

Graal 20.0 EE

# JMH version: 1.23
# VM version: JDK 11.0.6, Java HotSpot(TM) 64-Bit Server VM, 11.0.6+8-LTS-jvmci-20.0-b02
# VM invoker: /home/ubertobarbini/jvm/graalvm-ee-java11-20.0.0/bin/java
# VM options: -Xms6g -Xmx6g -Dgraal.ShowConfiguration=info -XX:+AlwaysPreTouch -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/timeBenchmark                                  Mode  Cnt    Score     Error  Units
PerformanceMicroBenchmark.knapsack        thrpt   10   24.595 ±   2.797  ops/s
PerformanceMicroBenchmark.mandelbrot      thrpt   10   92.343 ±   8.087  ops/s
PerformanceMicroBenchmark.modularAlgebra  thrpt   10  170.381 ± 106.960  ops/s

Conclusions

Last year Graal was an interesting new technology to look at, now I think it’s ready for the production time or at least serious consideration. Interestingly enough, using Kotlin removes one of the major obstacles to Graal adoption, namely the supported java version. Currently Graal comes in two flavours: Java8 or Java11. It will be a while before Graal will be on a par with the current JDK version but for the Kotlin developers this is not a big issue.

As usual, you can find all the code of this post on my GitHub repository.

If you are interested in more similar posts please follow me here or on my twitter account @ramtop