Enhanced Automated Vectorization

Andrew Craik
Published in graalvm
Jul 29, 2021

In this article we look at the loop and linear vectorization optimizations in GraalVM Enterprise, including the new linear vectorizer introduced in 21.2, and how they target modern hardware architectures. We also discuss considerations related to the GraalVM Native Image feature and walk through a practical demo that you can try yourself. The particular optimizations discussed here are part of GraalVM Enterprise, which is available with the Oracle Java SE Subscription.

Every new generation of chip manufacturing technology brings an increase in the number of transistors on a chip. These transistors are used to add features to the processor to increase performance. One recent trend across manufacturers is the use of these transistors to add, enhance, and expand the SIMD (Single Instruction Multiple Data) capabilities of their architectures: examples include Intel’s AVX-512 extensions and ARM’s NEON extensions. SIMD instructions allow a single instruction to perform the same operation on multiple packed data elements, a so-called vector, at the same time. To provide an intuition for how these instructions operate, consider the following loop:

for (int i = 0; i < 4; i++) {
    array[i] = array[i] + 1;
}

The scalar, or non-SIMD, assembly implementation of this loop would require four loads, four adds, and four stores. Using SIMD instructions, the implementation of the loop would require only one load, one add, and one store, each processing four data elements at the same time. Exploiting these instructions in a language runtime can help deliver the great performance users expect, but it requires special support from the compiler to find and take advantage of opportunities to use them.

One of the great things about running your application on GraalVM is that the GraalVM Enterprise Edition compiler offers a rich set of automatic vectorization optimizations designed to allow your application to benefit from the latest SIMD instruction capabilities available on your target hardware platform. In this article we will give you a brief overview of the capabilities offered by GraalVM and show you a sample application you can try that demonstrates the benefits of this technology in action.

Vectorization Optimizations

How does a compiler find the opportunity to utilize SIMD instructions? In GraalVM Enterprise Edition we have two optimizations which target two different potential sources of SIMD opportunities:

  • Loop vectorizer — analyzes loops to try and reduce the entire loop into a SIMD form
  • Linear vectorizer — analyzes sequential code for SIMD opportunities

The GraalVM LoopVectorization phase is enabled by default in all recent releases of GraalVM Enterprise Edition while the SIMDVectorization phase is new in GraalVM Enterprise Edition 21.2 and can be enabled using -Dgraal.VectorizeSIMD=true.

Loop Vectorization

Loops are one of the most significant sources of computational complexity in applications. Loop vectorization works best on certain kinds of loops that are common in application domains like machine learning, computer graphics, and scientific computing. These are typically pure arithmetic computations on large arrays of numbers. Being able to transform these computationally complex loops to use SIMD instructions can have a significant, positive impact on application performance.

The GraalVM loop vectorizer looks at entire loops and identifies computation patterns that have SIMD parallelism across adjacent loop iterations. In effect, a vectorized loop will execute several iterations of the original loop in parallel. The loop vectorizer recognizes two primary loop patterns: map loops and fold loops.

“Map” loops sequentially read elements from one or more arrays, perform arithmetic operations on these elements, and write the resulting values sequentially to a target array. The source and target arrays may be the same:

for (int i = 0; i < z.length; i++) {
    z[i] = 3 * x[i] + y[i] + 7;
}

“Fold” or “reduce” loops sequentially read elements from one or more arrays and apply arithmetic operations to combine them with an accumulator. At the end of the computation the result is a single value. However, unlike with “map” loops, the arithmetic involved must be associative and commutative. Because floating-point arithmetic is not associative (due to rounding), we can only generate SIMD code for “fold” loops on integers:

int sum = 0;
for (int i = 0; i < array.length; i++) {
    sum += array[i];
}
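To make the transformation concrete, here is a scalar Java sketch of what the vectorized form of this fold conceptually does: lane-wise partial sums that the hardware computes with one SIMD add per iteration, followed by a final horizontal reduction. The lanes array simply stands in for a SIMD register; the four-lane width and the tail handling are illustrative assumptions.

int[] lanes = new int[4];                  // stands in for one four-lane SIMD register
int i = 0;
for (; i + 4 <= array.length; i += 4) {
    lanes[0] += array[i];                  // conceptually a single SIMD add
    lanes[1] += array[i + 1];
    lanes[2] += array[i + 2];
    lanes[3] += array[i + 3];
}
int sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];   // horizontal reduction
for (; i < array.length; i++) {
    sum += array[i];                       // scalar tail for leftover elements
}

This rewrite is only legal because integer addition is associative and commutative, which is exactly why the loop vectorizer restricts SIMD "fold" loops to integer arithmetic.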

Branching computations in loops are difficult to vectorize in general and this is an area of ongoing development in the GraalVM loop vectorizer. An important case that already works well on the JVM involves guards. Guards are special constructs inside the compiler marking code that is expected to execute very rarely, or never. For example, the JVM’s profiling information allows the compiler to identify pieces of code that have never been executed so far, and which aren’t expected to be executed in the future either.

Consider the following snippet of code adapted from a real-world workload, performing computations on an array of doubles, checking each for erroneous NaN (“not a number”) values:

boolean seenError = false;
for (int i = 0; i < array.length; i++) {
    if (Double.isNaN(array[i])) {
        seenError = true;
    }
    array[i] = array[i] + increment;
}

If the isNaN check has never failed before this code is compiled, the entire “error” branch is marked with a guard as unreached code, and the actual contents of the branch are irrelevant and even invisible to the compiler. As long as the guard’s condition can be vectorized (in this case, it is a simple SIMD comparison), this loop shape can be vectorized by GraalVM. Only if the guard ever fails do we need to recompile the full code and revert to a non-SIMD loop implementation.

The loop vectorizer can also vectorize “map” and “fold” operations using conditional expressions where the condition selecting the value to produce or accumulate is based on the value being read. In general, the GraalVM loop vectorizer can transform loops containing any number of “fold” or “map” operations. However, loops containing guards can only be vectorized if their constituent operations are purely “fold” operations or “map” operations — mixed “fold” and “map” operations are not supported in loops with guards.
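As an illustration of such conditional shapes (the array, out, and threshold names are hypothetical), a conditional "fold" and a conditional "map" might look like this:

// Conditional "fold": only values below a threshold contribute to the sum.
int sum = 0;
for (int i = 0; i < array.length; i++) {
    sum += array[i] < threshold ? array[i] : 0;
}

// Conditional "map": clamp negative inputs to zero while copying.
for (int i = 0; i < array.length; i++) {
    out[i] = array[i] < 0 ? 0 : array[i];
}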

Finally, some SIMD instruction sets include a “gather” instruction, which reads array elements from non-consecutive indices. When supported by the CPU, the GraalVM loop vectorizer supports generation of SIMD code for such gather operations, including the implicit guards for checking whether each index accessed is in bounds for the underlying array. The loop vectorizer does not currently support the dual “scatter” operation for writes to non-consecutive array elements.
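A loop of the following shape (with hypothetical table, indices, and out arrays) is the kind of access pattern a SIMD gather instruction can accelerate:

for (int i = 0; i < indices.length; i++) {
    // table[indices[i]] touches non-consecutive elements; a gather instruction
    // can load several of them at once, with an implicit bounds check on each index
    out[i] = table[indices[i]] * 2.0;
}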

Linear Vectorizer

The linear vectorizer is a new optimization in GraalVM Enterprise Edition version 21.2 which is designed to complement the loop vectorizer. The linear vectorizer focuses on linear code segments to identify opportunities for the use of SIMD instructions. It ignores loop bodies that are of forms handled by the loop vectorizer, but will process linear code segments in loop bodies not handled by the loop vectorizer in an attempt to accelerate individual loop iterations.

The linear vectorizer is built around a pattern-matching mechanism that operates in three phases:

  1. Match load operations and group them based on the object and field / element being read
  2. Match architecture-specific operations
  3. Match store operations and group them based on the object and field / element being written

Once matching is complete, the algorithm groups the architecture-specific operations, trying to connect stores to loads. If connections are successfully made, the program is transformed to use SIMD instructions for the grouped operations. The optimization also has special support for processing loop bodies and for pattern-matching operations that decompose a value into constituent parts, or that compose parts to create a new value.

As a more concrete example, consider the following snippet, which increments four adjacent array elements:

int[] data = …;
data[0] = data[0] + 1;
data[1] = data[1] + 1;
data[2] = data[2] + 1;
data[3] = data[3] + 1;

The linear vectorizer can pattern-match the load, add, and store operations and transform the program to perform the load, the add, and the store as three SIMD instructions, as shown in the following pseudo-code:

int[] data = …;
data[<0,1,2,3>] = data[<0,1,2,3>] + <1,1,1,1>

The linear vectorizer is not enabled by default in GraalVM Enterprise Edition version 21.2; it can be enabled using -Dgraal.VectorizeSIMD=true. In some situations the optimization can measurably increase the compile time of some methods, so we do not currently recommend enabling it where VM start-up time or the time to ramp up to peak throughput is critical. We will address this limitation before we look to enable the linear vectorizer by default.

Native Image Considerations

Both the loop and linear vectorization optimizations are supported when building GraalVM native images, but there are some limitations to be aware of:

  • Some optimizations which transform the program in ways that enable the vectorization optimizations require information about loop frequencies. As a result, only native images built with profile-guided optimization (PGO) will see the full benefit of GraalVM’s vectorization capabilities.
  • Due to differences in how guards are implemented in Native Image, the vectorization of loops with guards is not always possible when building a native image with GraalVM.
  • The vectorization optimizations know how to target the AVX, AVX2, and AVX-512 SIMD instruction sets on x86 processors. However, each of these instruction sets is optional and not necessarily supported by every processor. As a result, you need to use the NativeArchitecture or CPUFeatures flags to tell the compiler which instruction subsets it is allowed to use, based on the processors you want the resulting image to run on (a sketch follows this list).
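As a rough illustration only (the exact option spellings vary between GraalVM versions, so treat this as a sketch rather than a recipe), a PGO build that also targets the build machine's own instruction set would first build an instrumented image, run it on a representative workload to collect a profile, and then rebuild with the collected profile and the NativeArchitecture flag:

> native-image --pgo-instrument SignalCorrelation
> ./signalcorrelation
> native-image --pgo=default.iprof -H:+NativeArchitecture SignalCorrelation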

We are working to mitigate the impact of instruction set variation on Native Image performance. Currently, there are no instruction subset limitations on the AArch64 platform.

A Practical Demonstration

Now that we have described the capabilities of the vectorization optimizations in GraalVM Enterprise Edition 21.2 we want to leave you with a sample program that you can run to see the benefits of the loop and linear vectorization optimizations. This example is based on a dot product — a common mathematical operation often found in the domains of machine learning and physics.

Our sample program uses a dot product to compute the correlation of two vectors of data — in this case a sine wave and a cosine wave:
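A minimal sketch of such a program is shown below. The class name and the sinTable, cosTable, and correlate methods match the description that follows, while the table size, repetition count, and normalization details are illustrative choices rather than the exact listing:

public class SignalCorrelation {

    // Table size and repetition count are illustrative assumptions.
    private static final int SIZE = 10_000_000;
    private static final int REPETITIONS = 10;

    // "Map" loop: fill a table with sine samples.
    static double[] sinTable(int size) {
        double[] table = new double[size];
        for (int i = 0; i < size; i++) {
            table[i] = Math.sin(i * 0.0001);
        }
        return table;
    }

    // "Map" loop: fill a table with cosine samples.
    static double[] cosTable(int size) {
        double[] table = new double[size];
        for (int i = 0; i < size; i++) {
            table[i] = Math.cos(i * 0.0001);
        }
        return table;
    }

    // Dot product: the floating-point "fold" loop discussed below.
    static double correlate(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    // Normalized correlation built from three dot products.
    static double correlation(double[] x, double[] y) {
        return correlate(x, y) / Math.sqrt(correlate(x, x) * correlate(y, y));
    }

    public static void main(String[] args) {
        for (int rep = 0; rep < REPETITIONS; rep++) {
            long start = System.currentTimeMillis();
            double[] sin = sinTable(SIZE);
            double[] cos = cosTable(SIZE);
            System.out.println("table computation: " + (System.currentTimeMillis() - start) + "ms");

            start = System.currentTimeMillis();
            System.out.println("sin/sin correlation: " + correlation(sin, sin));
            System.out.println("sin/cos correlation: " + correlation(sin, cos));
            System.out.println("cos/sin correlation: " + correlation(cos, sin));
            System.out.println("cos/cos correlation: " + correlation(cos, cos));
            System.out.println("correlation: " + (System.currentTimeMillis() - start) + "ms");
        }
    }
}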

In this example there are three loops of interest which form the bodies of the sinTable, cosTable, and correlate functions. When the application is run it will print the time it takes to generate the data tables (dominated by sinTable and cosTable) as well as the time taken to compute the correlation (dominated by correlate).

The sinTable and cosTable loops are classic examples of “map” loops which are well handled by the GraalVM Enterprise Edition loop vectorization optimization. The correlate loop computes a dot product. This loop is computationally intensive and looks like it should benefit from vectorization. Because the loop is accumulating a floating point value, it is not well suited to the loop vectorizer — changing the order of floating point operations can produce subtly different results due to different rounding of intermediate results. If we do not care about these minor potential variances we can manually unroll the loop to expose intra-iteration parallelism which can be exploited by the linear vectorizer:
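A sketch of that manual unrolling, assuming an unroll factor of four (the names and tail handling are illustrative), might look like this:

static double correlate(double[] x, double[] y) {
    // Locally allocated array holding the intermediate values adjacent in memory.
    double[] partial = new double[4];
    int i = 0;
    for (; i + 4 <= x.length; i += 4) {
        partial[0] += x[i] * y[i];
        partial[1] += x[i + 1] * y[i + 1];
        partial[2] += x[i + 2] * y[i + 2];
        partial[3] += x[i + 3] * y[i + 3];
    }
    double sum = partial[0] + partial[1] + partial[2] + partial[3];
    for (; i < x.length; i++) {
        sum += x[i] * y[i];   // scalar tail for lengths not divisible by 4
    }
    return sum;
}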

Note that we use a locally allocated array to hold the intermediate values so that the intermediate values are guaranteed to be adjacent in memory to make reading and writing the values amenable to grouping into SIMD operations. This array will not actually be allocated in memory — once optimization is complete the array component values will actually be held in a SIMD register. The following pseudo-code shows the result of vectorization:
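In the vector notation used earlier for the linear vectorizer, the hot part of the unrolled loop would end up looking roughly like this (illustrative):

partial[<0,1,2,3>] = partial[<0,1,2,3>] + x[<i,i+1,i+2,i+3>] * y[<i,i+1,i+2,i+3>]
...
sum = partial[0] + partial[1] + partial[2] + partial[3]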

You can compile and run the example as follows — the times are read from the final repetition of the table computation and correlation times.

> javac SignalCorrelation.java
> java SignalCorrelation
table computation: 2407ms
sin/sin correlation: 1.0
sin/cos correlation: 0.9949874370939154
cos/sin correlation: 0.9949874370939154
cos/cos correlation: 1.0
correlation: 731ms
...

Having understood the program and exposed potential parallelism in the correlate function we can now run some experiments to observe the benefits of loop and linear vectorization. We begin by running the program on GraalVM Enterprise Edition 21.2 with different values for -Dgraal.VectorizeLoops which shows the effect of the loop vectorization optimization on the table computation time – approximately a 30% improvement:

> java -Dgraal.VectorizeLoops=false SignalCorrelation
...
table computation: 1227ms
sin/sin correlation: 1.0
sin/cos correlation: 0.9949874370939154
cos/sin correlation: 0.9949874370939154
cos/cos correlation: 1.0
correlation: 436ms
> java -Dgraal.VectorizeLoops=true SignalCorrelation
...
table computation: 878ms
sin/sin correlation: 1.0
sin/cos correlation: 0.9949874370939154
cos/sin correlation: 0.9949874370939154
cos/cos correlation: 1.0
correlation: 415ms

In GraalVM 21.2 the partial loop unrolling optimization may interact poorly with the linear vectorizer. We will fix this poor interaction in a future version of GraalVM, but for now we can disable partial loop unrolling using -Dgraal.PartialUnroll=false and run with and without -Dgraal.VectorizeSIMD=true. In doing this we see linear vectorization improves performance by approximately 30%:

> java -Dgraal.PartialUnroll=false SignalCorrelation
...
table computation: 851ms
sin/sin correlation: 1.0
sin/cos correlation: 0.9949874370939154
cos/sin correlation: 0.9949874370939154
cos/cos correlation: 1.0
correlation: 453ms
> java -Dgraal.PartialUnroll=false -Dgraal.VectorizeSIMD=true SignalCorrelation
...
table computation: 859ms
sin/sin correlation: 1.0
sin/cos correlation: 0.9949874370939154
cos/sin correlation: 0.9949874370939154
cos/cos correlation: 1.0
correlation: 350ms

Finally, we can run the best performing GraalVM configuration (-Dgraal.PartialUnroll=false -Dgraal.VectorizeSIMD=true) and compare the performance to that of HotSpot C2 where we see GraalVM Enterprise Edition 21.2 is approximately 40% faster overall:

> java -Dgraal.PartialUnroll=false -Dgraal.VectorizeSIMD=true SignalCorrelation
...
table computation: 881ms
sin/sin correlation: 1.0
sin/cos correlation: 0.9949874370939154
cos/sin correlation: 0.9949874370939154
cos/cos correlation: 1.0
correlation: 364ms
> java -server -XX:-UseJVMCICompiler -d64 SignalCorrelation
...
table computation: 2444ms
sin/sin correlation: 1.0
sin/cos correlation: 0.9949874370939154
cos/sin correlation: 0.9949874370939154
cos/cos correlation: 1.0
correlation: 537ms

We hope that you have enjoyed learning about some of the amazing optimization technology in GraalVM Enterprise Edition and that you are inspired to try out GraalVM on your own workloads to experience the performance it provides.
