Java’s new Vector API: How fast is it? — Part 1

Tomer Zeltzer
8 min read · Jul 23, 2023


This weekend I finally got around to taking the new Vector API for a test run, and it was interesting, to say the least.

Keep in mind, though, that the Vector API is still incubating, so it will probably get better than what is shown here.

GitHub: https://github.com/tomerr90/JavaVectorAPI-1

TL;DR (Conclusions & Future work)

  • For the most part, the JIT does a pretty good job of vectorizing the code for you (at least for simple expressions); however, there are some cases where it doesn't, and in those we saw up to a 16x speedup.
  • If you are doing a lot of vector calculations, such as in AI or vector search/similarity, definitely give it a try; the gains are significant enough that some big projects (such as Apache Lucene) decided it's worth adopting even though it's still incubating.
  • It can also be used to reduce expensive branch instructions (if-statements), or eliminate them altogether!
  • There are some oddities in the results, mostly for small arrays but also at 32K in SimpleSum and ComplexExpression, which I intend to investigate further in a follow-up post.

What is SIMD?

To understand what the new Vector API is, we first need to understand what SIMD (Single instruction, multiple data) is.

Let's look at the following Java code:

int[] a = new int[] { ... }; // 8 elements
int[] b = new int[] { ... }; // 8 elements
int[] c = new int[8];

for (int i = 0; i < 8; i++) {
    c[i] = a[i] + b[i];
}

Each cell in array c will be calculated using (roughly) 1 CPU instruction, meaning we'll need 8 instructions, or 8 CPU cycles, to calculate array c.

This is the “default” way a CPU works and is called SISD (Single instruction, single data).

Modern CPUs have special registers called SIMD registers. What makes them special is their size: usually 256 or 512 bits, depending on the CPU.

What's cool about these registers is that performing an operation on them also takes just 1 instruction (CPU cycle)!

Let's assume our CPU has 256-bit SIMD registers. An integer takes 32 bits, meaning we can pack the entire array into one register (8 × 32 = 256) and calculate array c using just 1 instruction!

It looks something like this (a rough sketch):
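a: | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 |   <- one 256-bit register (8 x 32-bit ints)
                     +                             (a single lane-wise add instruction)
b: | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 |   <- one 256-bit register
                     =
c: | a0+b0 | a1+b1 | a2+b2 | a3+b3 | a4+b4 | a5+b5 | a6+b6 | a7+b7 |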

What is the new Vector API?

This type of optimization has existed in the JIT's arsenal for a while (auto-vectorization); however, it puts us at the mercy of the JIT and leaves us with no reliable way of writing code that uses SIMD instructions.

That changed with JEP 338: now we can write the above code in a way that reliably compiles to SIMD instructions on supported hardware:

import jdk.incubator.vector.IntVector;

int[] a = new int[] { ... }; // 8 elements
int[] b = new int[] { ... }; // 8 elements
int[] c = new int[8];

// Load both arrays into SIMD registers, add them lane-wise, and store the result
IntVector aVector = IntVector.fromArray(IntVector.SPECIES_PREFERRED, a, 0);
IntVector bVector = IntVector.fromArray(IntVector.SPECIES_PREFERRED, b, 0);
IntVector cVector = aVector.add(bVector);

cVector.intoArray(c, 0);
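Note that since the API is incubating, it lives in the jdk.incubator.vector module, which has to be added explicitly at both compile time and run time:

java --add-modules jdk.incubator.vector MyBenchmark

(MyBenchmark is just a placeholder for your own class.)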

Alright, introductions out of the way, let's take it for a spin!

Test setup

  • Intel(R) Xeon(R) Platinum 8375C CPU
  • 256GB RAM / 16GB Heap
  • L1, L2, L3 caches = 1.5MB (for data), 40MB, 54MB
  • OpenJDK 17.0.7 (Vector API Second Incubator)

A note before we begin

Main memory accesses are expensive compared to CPU cycles, "costing" anywhere from 60–100+ CPU cycles per access, which is 60–100+ SIMD operations' worth of work per access!

Given that, the bigger the arrays get, the smaller the fraction of them that fits into the CPU caches (L1/L2/L3). That means more main memory accesses, which lessens the benefit of the Vector API.

Test 1: Simple Sum

private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

public void arrays() {
    for (int i = 0; i < ARRAY_SIZE; i++) {
        result[i] = a[i] + b[i];
    }
}

public void vectors() {
    for (int i = 0; i < ARRAY_SIZE; i += SPECIES.length()) {
        FloatVector aVector = FloatVector.fromArray(SPECIES, a, i);
        FloatVector bVector = FloatVector.fromArray(SPECIES, b, i);

        aVector.add(bVector).intoArray(result, i);
    }
}
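The benchmark sizes are all multiples of the vector length, so the loop above never runs past the end of the array. For arbitrary lengths you would cap the vector loop with VectorSpecies.loopBound and handle the remainder with a scalar tail; a minimal sketch (vectorsWithTail is a hypothetical variant, not part of the benchmark):

public void vectorsWithTail() {
    // Largest multiple of the lane count that still fits in the array
    int upperBound = SPECIES.loopBound(ARRAY_SIZE);
    int i = 0;
    for (; i < upperBound; i += SPECIES.length()) {
        FloatVector aVector = FloatVector.fromArray(SPECIES, a, i);
        FloatVector bVector = FloatVector.fromArray(SPECIES, b, i);
        aVector.add(bVector).intoArray(result, i);
    }
    // Scalar tail for the leftover elements
    for (; i < ARRAY_SIZE; i++) {
        result[i] = a[i] + b[i];
    }
}

The results: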
Benchmark          (arraySize)  Mode  Cnt          Score         Error  Units
SimpleSum.arrays            64  avgt   25         10.752 ±       0.030  ns/op
SimpleSum.arrays           512  avgt   25         68.539 ±       0.063  ns/op
SimpleSum.arrays          4096  avgt   25        617.995 ±       0.459  ns/op
SimpleSum.arrays         32768  avgt   25       5564.030 ±       7.887  ns/op
SimpleSum.arrays        262144  avgt   25     141673.490 ±     689.289  ns/op
SimpleSum.arrays       2097152  avgt   25     809495.940 ±   37403.943  ns/op
SimpleSum.arrays      16777216  avgt   25   13788130.242 ±  287193.094  ns/op
SimpleSum.arrays     134217728  avgt   25  112553636.860 ± 1241376.007  ns/op
SimpleSum.vectors           64  avgt   25          8.168 ±       0.089  ns/op
SimpleSum.vectors          512  avgt   25         69.964 ±       0.063  ns/op
SimpleSum.vectors         4096  avgt   25        671.500 ±       0.343  ns/op
SimpleSum.vectors        32768  avgt   25       6892.828 ±       1.494  ns/op
SimpleSum.vectors       262144  avgt   25     136210.386 ±     387.514  ns/op
SimpleSum.vectors      2097152  avgt   25     760036.142 ±   18445.247  ns/op
SimpleSum.vectors     16777216  avgt   25   13584702.141 ±  261731.672  ns/op
SimpleSum.vectors    134217728  avgt   25  111513654.191 ± 2826089.455  ns/op

Looks like using the Vector API had practically no benefit, and in some cases it even hurt performance. What's going on?

As mentioned earlier, SIMD is not a new concept to the JIT; it has existed there for a while in the form of auto-vectorization.

What we are likely seeing here is the result of the JIT re-writing the arrays method to also use SIMD instructions, which would make both of our tested methods very similar.

Let's disable that optimization (-XX:-UseSuperWord) and check again. With JMH, the flag can be passed per fork; a minimal sketch (the repository's actual setup may differ):
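import org.openjdk.jmh.annotations.Fork;

// Re-run the same benchmarks with HotSpot's auto-vectorization pass disabled
@Fork(jvmArgsAppend = {"-XX:-UseSuperWord"})
public class SimpleSumNoSuperWord {
    // ... same arrays() and vectors() methods as above
}

The results: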

Benchmark                     (arraySize)  Mode  Cnt          Score         Error  Units
SimpleSumNoSuperWord.arrays            64  avgt   25         21.712 ±       0.027  ns/op
SimpleSumNoSuperWord.arrays           512  avgt   25        295.518 ±       0.053  ns/op
SimpleSumNoSuperWord.arrays          4096  avgt   25       2386.931 ±       0.822  ns/op
SimpleSumNoSuperWord.arrays         32768  avgt   25      20008.974 ±       7.453  ns/op
SimpleSumNoSuperWord.arrays        262144  avgt   25     438206.699 ±     177.381  ns/op
SimpleSumNoSuperWord.arrays       2097152  avgt   25    1032702.108 ±   21293.023  ns/op
SimpleSumNoSuperWord.arrays      16777216  avgt   25   12373763.662 ±  118032.230  ns/op
SimpleSumNoSuperWord.arrays     134217728  avgt   25  101077061.448 ±  633988.753  ns/op
SimpleSumNoSuperWord.vectors           64  avgt   25          8.184 ±       0.089  ns/op
SimpleSumNoSuperWord.vectors          512  avgt   25         69.964 ±       0.079  ns/op
SimpleSumNoSuperWord.vectors         4096  avgt   25        671.851 ±       0.517  ns/op
SimpleSumNoSuperWord.vectors        32768  avgt   25       7009.130 ±       2.189  ns/op
SimpleSumNoSuperWord.vectors       262144  avgt   25     163383.357 ±     882.544  ns/op
SimpleSumNoSuperWord.vectors      2097152  avgt   25     770218.071 ±   34027.675  ns/op
SimpleSumNoSuperWord.vectors     16777216  avgt   25   13564171.702 ±  255358.090  ns/op
SimpleSumNoSuperWord.vectors    134217728  avgt   25  112384608.402 ± 2028171.818  ns/op

This looks much better.

As mentioned above, the bigger the arrays, the more time we spend on expensive main memory accesses instead of actual calculations. Since that time is the same for both implementations, the speedup in the actual calculations matters less and less as the arrays grow.

Test 2: Complex Expression

private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

public void arrays() {
    for (int i = 0; i < arraySize; i++) {
        result[i] = ((a[i] * b[i] + b[i] - b[i]) / b[i]) +
                a[i] + a[i] + a[i] + a[i];
    }
}

public void vectors() {
    for (int i = 0; i < arraySize; i += SPECIES.length()) {
        FloatVector aVector = FloatVector.fromArray(SPECIES, a, i);
        FloatVector bVector = FloatVector.fromArray(SPECIES, b, i);

        // The same expression, written as a chain of lane-wise operations
        aVector
                .mul(bVector)
                .add(bVector)
                .sub(bVector)
                .div(bVector)
                .add(aVector)
                .add(aVector)
                .add(aVector)
                .add(aVector)
                .intoArray(result, i);
    }
}
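As an aside, the leading a[i] * b[i] + b[i] could be folded into a single lane-wise fused multiply-add via FloatVector.fma; a sketch of that variant (not what the benchmark measures, and note that fma rounds once instead of twice, so results can differ in the last bit):

// a*b + b as one fused multiply-add, then the rest of the chain as before
aVector.fma(bVector, bVector)
        .sub(bVector)
        .div(bVector)
        .add(aVector)
        .add(aVector)
        .add(aVector)
        .add(aVector)
        .intoArray(result, i);

The results: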
Benchmark                  (arraySize)  Mode  Cnt          Score         Error  Units
ComplexExpression.arrays            64  avgt   25         30.624 ±       0.219  ns/op
ComplexExpression.arrays           512  avgt   25        120.797 ±       0.526  ns/op
ComplexExpression.arrays          4096  avgt   25        946.221 ±      20.957  ns/op
ComplexExpression.arrays         32768  avgt   25       6737.679 ±     160.232  ns/op
ComplexExpression.arrays        262144  avgt   25     361199.573 ±     760.639  ns/op
ComplexExpression.arrays       2097152  avgt   25     762356.007 ±   33823.662  ns/op
ComplexExpression.arrays      16777216  avgt   25   12627542.845 ±  296061.846  ns/op
ComplexExpression.arrays     134217728  avgt   25  106105812.376 ± 2503255.353  ns/op
ComplexExpression.vectors           64  avgt   25         12.458 ±       0.017  ns/op
ComplexExpression.vectors          512  avgt   25         98.974 ±       0.180  ns/op
ComplexExpression.vectors         4096  avgt   25        991.589 ±       4.627  ns/op
ComplexExpression.vectors        32768  avgt   25       7933.439 ±     419.246  ns/op
ComplexExpression.vectors       262144  avgt   25     369727.074 ±    1024.303  ns/op
ComplexExpression.vectors      2097152  avgt   25     765980.046 ±   50602.787  ns/op
ComplexExpression.vectors     16777216  avgt   25   12503692.775 ±  208763.290  ns/op
ComplexExpression.vectors    134217728  avgt   25  102388046.086 ± 2165302.078  ns/op

Seems like even with a more complex expression the JIT is still able to auto-vectorize our code, so let's turn that optimization off again:

Benchmark                     (arraySize)  Mode  Cnt          Score         Error  Units
ComplexExpressionNSW.arrays            64  avgt   25         86.835 ±       0.739  ns/op
ComplexExpressionNSW.arrays           512  avgt   25        720.777 ±       7.063  ns/op
ComplexExpressionNSW.arrays          4096  avgt   25       5785.453 ±      70.299  ns/op
ComplexExpressionNSW.arrays         32768  avgt   25      46618.108 ±     470.062  ns/op
ComplexExpressionNSW.arrays        262144  avgt   25    1104864.860 ±     338.645  ns/op
ComplexExpressionNSW.arrays       2097152  avgt   25    2875094.929 ±   25523.785  ns/op
ComplexExpressionNSW.arrays      16777216  avgt   25   24870112.853 ±  195418.788  ns/op
ComplexExpressionNSW.arrays     134217728  avgt   25  198879883.287 ± 1454468.451  ns/op
ComplexExpressionNSW.vectors           64  avgt   25         12.456 ±       0.015  ns/op
ComplexExpressionNSW.vectors          512  avgt   25         98.940 ±       0.181  ns/op
ComplexExpressionNSW.vectors         4096  avgt   25       1084.581 ±       5.529  ns/op
ComplexExpressionNSW.vectors        32768  avgt   25       8277.051 ±     482.405  ns/op
ComplexExpressionNSW.vectors       262144  avgt   25     367207.320 ±    1804.581  ns/op
ComplexExpressionNSW.vectors      2097152  avgt   25     778058.584 ±   34984.022  ns/op
ComplexExpressionNSW.vectors     16777216  avgt   25   12248559.300 ±   76245.541  ns/op
ComplexExpressionNSW.vectors    134217728  avgt   25   99745414.077 ±  958472.028  ns/op

As expected, we see that the more complicated the expression, the more beneficial it is to use the Vector API.

Test 3: Array stats

So the JIT is pretty good at automatically vectorizing mathematical expressions, but how will it fare with if-statements?

Given two arrays, we would like to count how many cells of the first array are equal to, greater than, or lower than the corresponding cell in the second array.

private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

public void arrays() {
    for (int i = 0; i < arraySize; i++) {
        if (a[i] == b[i]) {
            eq++;
        } else if (a[i] > b[i]) {
            gt++;
        } else {
            lt++;
        }
    }
}

public void vectors() {
    for (int i = 0; i < arraySize; i += SPECIES.length()) {
        FloatVector aVector = FloatVector.fromArray(SPECIES, a, i);
        FloatVector bVector = FloatVector.fromArray(SPECIES, b, i);

        // Lane-wise comparisons produce masks; trueCount() counts the set lanes
        int ltCount = aVector.lt(bVector).trueCount();
        int eqCount = aVector.eq(bVector).trueCount();

        eq += eqCount;
        lt += ltCount;
        // Whatever is neither lower-than nor equal must be greater-than
        gt += SPECIES.length() - ltCount - eqCount;
    }
}
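For what it's worth, the greater-than bucket could also be counted with an explicit third comparison via VectorOperators; a sketch (the version above saves one compare per iteration by deriving that count arithmetically):

import jdk.incubator.vector.VectorOperators;

// Equivalent result, but with three comparisons instead of two
int gtCount = aVector.compare(VectorOperators.GT, bVector).trueCount();

The results: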
Benchmark           (arraySize)  Mode  Cnt          Score         Error  Units
ArrayStats.arrays            64  avgt   25         78.085 ±       0.883  ns/op
ArrayStats.arrays           512  avgt   25        480.687 ±      17.253  ns/op
ArrayStats.arrays          4096  avgt   25       3520.486 ±      14.364  ns/op
ArrayStats.arrays         32768  avgt   25     129068.801 ±    4531.479  ns/op
ArrayStats.arrays        262144  avgt   25    1247707.572 ±    1294.815  ns/op
ArrayStats.arrays       2097152  avgt   25   10094594.371 ±   47276.090  ns/op
ArrayStats.arrays      16777216  avgt   25   81380875.237 ± 1365443.776  ns/op
ArrayStats.arrays     134217728  avgt   25  664508819.256 ± 4166652.382  ns/op
ArrayStats.vectors           64  avgt   25         38.399 ±       0.254  ns/op
ArrayStats.vectors          512  avgt   25        130.115 ±       0.022  ns/op
ArrayStats.vectors         4096  avgt   25        830.619 ±       0.359  ns/op
ArrayStats.vectors        32768  avgt   25       8389.896 ±       2.192  ns/op
ArrayStats.vectors       262144  avgt   25      76565.740 ±     705.466  ns/op
ArrayStats.vectors      2097152  avgt   25     623986.030 ±   20350.313  ns/op
ArrayStats.vectors     16777216  avgt   25    8397410.833 ±   75566.255  ns/op
ArrayStats.vectors    134217728  avgt   25   70185587.855 ±  640284.849  ns/op

Aha! Seems like the JIT has met its match: using the new Vector API we get up to a 16x speedup, and even for very large arrays we get almost a 10x speedup!

The reason for this speedup is that the vectors implementation is what you would call "branchless code": the lane-wise comparisons replace conditional jumps, so there are no branch mispredictions to flush the CPU's pipeline, and the CPU is utilized much better.

That's it from me!

Thank you for tuning in; until next time, stay efficient!
