Better Java Streams performance with GraalVM

The functional Streams API introduced in Java 8 is a neat and efficient way to express data-processing programs declaratively. However, because of the abstraction overhead that comes with lambdas, Stream-based programs are known to run slower than their low-level loop-based counterparts.

This article will show, using concrete examples, that GraalVM achieves a performance improvement of between 1.7x and 5x on Stream programs compared to Java HotSpot VM.

We will go over several Stream-based programs, measure their performance on GraalVM and on the standard Java HotSpot VM, and conclude the article with an example of a larger Stream-based program.

You can download GraalVM, set up a simple JMH project and try to conduct the measurements yourself, but it is not necessary for reading the article.

Setup

If you want to run the examples that we will show below, you will need to download GraalVM here. If you just want to read the article without running the examples, feel free to skip to the next section.

We will use the Enterprise edition of GraalVM to run the example programs. We will unpack the archive to the ~/graalvm directory, and check the Java version of our default java command and the one provided by GraalVM:

$ java -version
java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)
$ ~/graalvm/bin/java -version
java version "1.8.0_172"
Java(TM) SE Runtime Environment (build 1.8.0_172-b11)
GraalVM 1.0.0-rc5 (build 25.71-b01-internal-jvmci-0.46, mixed mode)

Next, we will create a simple JMH project using Maven, so please ensure that you have Maven installed on your machine. JMH has a simple command used to generate a benchmarking project, with a src/main/java directory for the source code.

$ mvn archetype:generate \
    -DinteractiveMode=false \
    -DarchetypeGroupId=org.openjdk.jmh \
    -DarchetypeArtifactId=jmh-java-benchmark-archetype \
    -DgroupId=org.sample \
    -DartifactId=test \
    -Dversion=1.0

Alternatively, you can download the complete source code for this article here.

Now, let’s create a Streams.java file in src/main/java with a simple JMH benchmark. This program creates a stream from a 2M element array, and maps each number using several mapping functions before calling reduce to sum the mapped values. The most important part of the program is in the mapReduce method, which invokes these Stream operations:
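The listing itself is not reproduced in this text, so here is a minimal sketch of what such a pipeline could look like. It is written as plain Java so that it runs without JMH on the classpath; in the JMH project the class would carry @State(Scope.Benchmark) and the method would carry @Benchmark. The class name StreamsSketch, the zero-filled array and the three mapping functions are assumptions made for illustration.

```java
import java.util.Arrays;

// Plain-Java sketch of the benchmarked pipeline (names and lambdas are assumptions).
class StreamsSketch {
    // 2M-element input array; left zero-filled here for simplicity.
    static final double[] values = new double[2_000_000];

    // In the JMH project this method would be annotated with @Benchmark.
    static double mapReduce() {
        return Arrays.stream(values)
            .map(x -> x + 1)          // hypothetical mapping functions
            .map(x -> x * 2)
            .map(x -> x + 5)
            .reduce(0, Double::sum);  // sum the mapped values
    }

    public static void main(String[] args) {
        System.out.println(mapReduce());
    }
}
```

Each map call adds one more lambda invocation per element, which is exactly the kind of abstraction overhead that the JIT compiler has to remove.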

We package and run this example as follows:

$ mvn package

Measuring Performance

We can now run the mapReduce example with the standard HotSpot java command. The extra -f1, -wi 4, and -i4 arguments mean that we will use 1 fork of the JVM (note: JMH does the measurements in a separate JVM process to avoid profile pollution and noise), with 4 warmup runs, and 4 measurement runs.

$ java -jar target/benchmarks.jar mapReduce -f1 -wi 4 -i4

On our machine, with an i7-4900MQ CPU and 32 GB RAM, we get the following output from JMH:

Benchmark          Mode   Cnt   Score   Error  Units
Streams.mapReduce  thrpt    4  38.514 ± 2.172  ops/s

When running on HotSpot, JMH reports that the benchmarked program ran around 38 times per second. Similarly, we can run the example with GraalVM with the following command:

$ ~/graalvm/bin/java -jar target/benchmarks.jar mapReduce \
    -f1 -wi 4 -i4

The output from JMH shows a performance improvement of around 1.8x when using GraalVM:

Benchmark          Mode   Cnt   Score   Error  Units
Streams.mapReduce  thrpt    4  69.237 ± 2.524  ops/s

Let’s change the benchmark slightly by adding a call to parallel(), which makes the stream pipeline run in parallel, and then measure again.
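As a rough illustration of that change, the sketch below adds a single parallel() call to a map-and-reduce pipeline. It is plain Java rather than a JMH benchmark, and the class name ParallelStreamsSketch, the zero-filled array and the mapping functions are assumptions, not the article's exact listing.

```java
import java.util.Arrays;

// Plain-Java sketch of the parallel variant (names and lambdas are assumptions).
class ParallelStreamsSketch {
    static final double[] values = new double[2_000_000];

    // In the JMH project this method would be annotated with @Benchmark.
    static double mapReduce() {
        return Arrays.stream(values)
            .parallel()               // the only change: process the stream in parallel
            .map(x -> x + 1)          // hypothetical mapping functions
            .map(x -> x * 2)
            .map(x -> x + 5)
            .reduce(0, Double::sum);  // Double::sum is associative enough for this data
    }

    public static void main(String[] args) {
        System.out.println(mapReduce());
    }
}
```

Parallel streams run on the common fork-join pool by default, so the measured throughput also depends on the number of available cores.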

With standard HotSpot, we get around 112 operations per second:

Benchmark          Mode   Cnt    Score   Error  Units
Streams.mapReduce  thrpt    4  112.904 ± 3.297  ops/s

With GraalVM, we get approximately 190 operations per second:

Benchmark          Mode   Cnt    Score   Error  Units
Streams.mapReduce  thrpt    4  190.514 ± 4.361  ops/s

In this example, we see a slightly slower warmup with GraalVM, but roughly 1.7x the peak performance of HotSpot.

Let’s now take a look at a larger example. We will declare a class Person to hold data about different people. This class will track the age, height and the hairstyle of a specific person. The Hairstyle enumeration will track the hair length.
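The declarations are not reproduced in this text, so the following is a minimal sketch of how they might look. The field names, the enclosing class name PeopleModel and, in particular, the values of LONG_RATIO, MAX_AGE and MAX_HEIGHT are assumptions chosen for illustration.

```java
// Sketch of the data model; constant values are hypothetical.
class PeopleModel {
    // Tracks the hair length.
    enum Hairstyle { LONG, SHORT }

    // Immutable record of one person: hairstyle, age in years, height in cm.
    static final class Person {
        final Hairstyle hairstyle;
        final int age;
        final int height;

        Person(Hairstyle hairstyle, int age, int height) {
            this.hairstyle = hairstyle;
            this.age = age;
            this.height = height;
        }
    }

    static final double LONG_RATIO = 0.4;  // assumed share of people with long hair
    static final int MAX_AGE = 100;        // assumed maximum age
    static final int MAX_HEIGHT = 200;     // assumed maximum height in cm

    public static void main(String[] args) {
        Person p = new Person(Hairstyle.SHORT, 30, 180);
        System.out.println(p.hairstyle + " " + p.age + " " + p.height);
    }
}
```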

In the code above, we also declared several constants — LONG_RATIO to set the expected percentage of people with long hair, MAX_AGE to set the maximum age, and MAX_HEIGHT to set the maximum height. We use these constants to generate a random set of people, as shown in the following generatePeople method:
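One way such a generator could be written is sketched below. The count and seed parameters, the constant values and the use of java.util.Random are assumptions; the Person type is repeated in compact form only so that the snippet compiles on its own.

```java
import java.util.Random;

// Sketch of a random dataset generator; parameters and constants are hypothetical.
class PeopleGenerator {
    enum Hairstyle { LONG, SHORT }

    static final class Person {
        final Hairstyle hairstyle;
        final int age;
        final int height;

        Person(Hairstyle hairstyle, int age, int height) {
            this.hairstyle = hairstyle;
            this.age = age;
            this.height = height;
        }
    }

    static final double LONG_RATIO = 0.4;  // assumed value
    static final int MAX_AGE = 100;        // assumed value
    static final int MAX_HEIGHT = 200;     // assumed value, in cm

    // Generates 'count' people; a fixed seed keeps benchmark runs comparable.
    static Person[] generatePeople(int count, long seed) {
        Random random = new Random(seed);
        Person[] people = new Person[count];
        for (int i = 0; i < count; i++) {
            // Roughly LONG_RATIO of the people get long hair.
            Hairstyle hairstyle = random.nextDouble() < LONG_RATIO
                ? Hairstyle.LONG
                : Hairstyle.SHORT;
            int age = random.nextInt(MAX_AGE + 1);        // 0..MAX_AGE inclusive
            int height = random.nextInt(MAX_HEIGHT + 1);  // 0..MAX_HEIGHT inclusive
            people[i] = new Person(hairstyle, age, height);
        }
        return people;
    }

    public static void main(String[] args) {
        System.out.println(generatePeople(10, 42L).length);
    }
}
```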

We can now do some data crunching on our dataset of people.

Let’s say that we are interested only in people with short hair, and, among them, we want to compute the average height of young people. We will define young people as all the people whose age is below the average age of short-haired people. We first need to find all the short-haired people, and compute their average age — the figure below shows the set of people we filter first.

We then filter only the people whose age is below the previously computed average age — for these people, we then compute their average height. We achieve this with two Stream pipelines as follows:
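The two pipelines described above can be sketched as follows. This is an illustration under stated assumptions rather than the article's exact listing: the class and method names are hypothetical, the Person type is repeated in compact form so the snippet is self-contained, and an empty result is mapped to 0.0.

```java
import java.util.Arrays;

// Sketch of the two-pipeline computation; names are hypothetical.
class YoungShortHaired {
    enum Hairstyle { LONG, SHORT }

    static final class Person {
        final Hairstyle hairstyle;
        final int age;
        final int height;

        Person(Hairstyle hairstyle, int age, int height) {
            this.hairstyle = hairstyle;
            this.age = age;
            this.height = height;
        }
    }

    // In a JMH project this method would be the @Benchmark method.
    static double averageHeightOfYoung(Person[] people) {
        // Pipeline 1: average age of all short-haired people.
        double averageAge = Arrays.stream(people)
            .filter(p -> p.hairstyle == Hairstyle.SHORT)
            .mapToInt(p -> p.age)
            .average()
            .orElse(0.0);
        // Pipeline 2: average height of short-haired people younger than that average.
        return Arrays.stream(people)
            .filter(p -> p.hairstyle == Hairstyle.SHORT)
            .filter(p -> p.age < averageAge)
            .mapToInt(p -> p.height)
            .average()
            .orElse(0.0);
    }

    public static void main(String[] args) {
        Person[] people = {
            new Person(Hairstyle.SHORT, 20, 170),
            new Person(Hairstyle.SHORT, 30, 180),
            new Person(Hairstyle.SHORT, 70, 190),
            new Person(Hairstyle.LONG, 10, 150)
        };
        // Average age of short-haired people is 40; the two younger ones average 175 cm.
        System.out.println(averageHeightOfYoung(people));  // prints 175.0
    }
}
```

The second pipeline cannot be fused with the first, because the age threshold is only known once the first pipeline has consumed the whole dataset.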

We can now run this benchmark and get some performance numbers. The following chart shows the warmup curves for Java HotSpot VM and for GraalVM.
During the initial iteration, GraalVM reaches a throughput of around 5000 repetitions per second, and then keeps improving until its throughput is around 2.1x that of HotSpot.

In specific cases, the performance difference can be even larger. Assume that we want to pick out the people who, a year from now, will have the potential to be future volleyball stars, and then compute their average age. We will define a potential volleyball star as a person whose age is between 18 and 21, and whose height is above 198 cm (we assume that the height of the people will not change in 1 year). In the preceding figure, we can find only 3 such people; not everyone can be a volleyball professional.

Using Streams, we can achieve this by first mapping each Person object into another Person object with the age field incremented by 1 (since we are interested in people that will have volleyball potential a year from now). Next, we filter people whose height is above 198 cm, and then filter people whose age is between 18 and 21. Finally, we map each Person object into its respective age, and we then compute the average.
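These steps can be sketched as a single pipeline. The class and method names are hypothetical, the age range is assumed to be inclusive on both ends, an empty result is mapped to 0.0, and the Person type is repeated in compact form so the snippet stands alone.

```java
import java.util.Arrays;

// Sketch of the map-filter-filter-map-average pipeline; names are hypothetical.
class VolleyballStars {
    enum Hairstyle { LONG, SHORT }

    static final class Person {
        final Hairstyle hairstyle;
        final int age;
        final int height;

        Person(Hairstyle hairstyle, int age, int height) {
            this.hairstyle = hairstyle;
            this.age = age;
            this.height = height;
        }
    }

    // In a JMH project this method would be the @Benchmark method.
    static double averageAgeOfFutureStars(Person[] people) {
        return Arrays.stream(people)
            // Age everyone by one year; this allocates a new Person per element.
            .map(p -> new Person(p.hairstyle, p.age + 1, p.height))
            .filter(p -> p.height > 198)               // taller than 198 cm
            .filter(p -> p.age >= 18 && p.age <= 21)   // assumed inclusive range
            .mapToInt(p -> p.age)
            .average()
            .orElse(0.0);
    }

    public static void main(String[] args) {
        Person[] people = {
            new Person(Hairstyle.SHORT, 17, 200),  // 18 next year, tall enough
            new Person(Hairstyle.LONG, 20, 199),   // 21 next year, tall enough
            new Person(Hairstyle.SHORT, 21, 205),  // 22 next year, too old
            new Person(Hairstyle.LONG, 19, 198)    // not above 198 cm
        };
        System.out.println(averageAgeOfFutureStars(people));  // prints 19.5
    }
}
```

The intermediate Person objects created by the first map step never leave the pipeline, so a compiler that can remove such short-lived allocations has a lot to gain on this kind of code.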

On this benchmark, JMH reports a throughput difference in favour of GraalVM of more than 5 times!

Larger example

To conclude this article, we note that all the examples shown so far were very small programs. This is useful for demonstration purposes, because it is easier to understand what’s going on.

However, GraalVM optimizes Stream programs that are more than just simple microbenchmarks. To show the effect of these optimizations on larger Stream-based programs, we took the open-source Scrabble benchmark for Stream programs, made by Jose Paumard, and we ran it with HotSpot's C2 compiler and with Graal. This benchmark uses Streams to functionally encode a Scrabble solver that picks words from Shakespeare’s works.

@Benchmark
public int scrabble() {
    return JavaScrabble.run();
}

You can find the complete source code in this repository.

When running this benchmark on GraalVM and on Java HotSpot VM, we see around 1.9x higher peak performance with GraalVM.

Conclusions

GraalVM runs Java code that uses the Streams API faster. The exact performance boost will, as always, depend on the code and the workload, but in the sample measurements for this article we got performance improvements from 1.7x to 5x when running on GraalVM compared to Java HotSpot VM.

You can get the GraalVM binaries and try it on your code and performance tests.

If you have more Streams API examples that you would like to share, let us know! And if you have any feedback on GraalVM, leave us a comment, join the GraalVM community, or ping us on Twitter: @graalvm.