Java’s new Vector API: How fast is it? — Part 1

Tomer Zeltzer
8 min read · Jul 23, 2023


This weekend I finally got around to taking the new Vector API for a test run, and it was interesting, to say the least.

Keep in mind, though, that the Vector API is still incubating, so it will probably get better than what is shown here.

GitHub: https://github.com/tomerr90/JavaVectorAPI-1

TL;DR (Conclusions & Future work)

  • For the most part, the JIT does a pretty good job of vectorizing the code for you (at least for simple expressions); however, there are some cases where it doesn't, and in those we saw up to a 16x speedup.
  • If you are doing a lot of vector calculations, such as in AI or vector search/similarity, definitely give it a try; the gains are significant enough that some big projects (such as Apache Lucene) decided it's worth adopting even though it's still incubating.
  • It can also be used to reduce expensive branch instructions (if-statements), or eliminate them altogether!
  • There are some oddities in the results, mostly for small arrays but also at 32K in SimpleSum and ComplexExpression, which I intend to investigate further in a follow-up post.

What is SIMD?

To understand what the new Vector API is, we first need to understand what SIMD (Single instruction, multiple data) is.

Let's look at the following Java code:

int[] a = new int[] { ... }; // 8 elements
int[] b = new int[] { ... }; // 8 elements
int[] c = new int[8];

for (int i = 0; i < 8; i++) {
    c[i] = a[i] + b[i];
}

Each cell in array c will be calculated using (roughly) 1 CPU instruction, meaning we'll need 8 instructions, or 8 CPU cycles, to calculate array c.

This is the “default” way a CPU works and is called SISD (Single instruction, single data).

Modern CPUs have special registers called SIMD registers. What makes them special is their size: usually 256 or 512 bits, depending on the CPU.

What's cool about these registers is that performing an operation on them also takes just 1 instruction (CPU cycle)!

Let's assume our CPU has 256-bit SIMD registers. An integer takes 32 bits, meaning we can pack the entire array into one register (8 × 32 = 256) and calculate array c using just 1 instruction!

It looks something like this (a rough sketch):
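a: | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 |   <- one 256-bit register (8 x 32-bit ints)
                     +                             (a single lane-wise add instruction)
b: | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 |   <- one 256-bit register
                     =
c: | a0+b0 | a1+b1 | a2+b2 | a3+b3 | a4+b4 | a5+b5 | a6+b6 | a7+b7 |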

What is the new Vector API?

This type of optimization has existed in the JIT's arsenal for a while (auto-vectorization); however, it puts us at the mercy of the JIT and leaves us with no reliable way of writing code that uses SIMD instructions.

That changed with JEP 338: now we can write the above code in a way that reliably compiles to SIMD instructions on supported hardware:

import jdk.incubator.vector.IntVector;

int[] a = new int[] { ... }; // 8 elements
int[] b = new int[] { ... }; // 8 elements
int[] c = new int[8];

// Load both arrays into SIMD registers, add them lane-wise, and store the result
IntVector aVector = IntVector.fromArray(IntVector.SPECIES_PREFERRED, a, 0);
IntVector bVector = IntVector.fromArray(IntVector.SPECIES_PREFERRED, b, 0);
IntVector cVector = aVector.add(bVector);

cVector.intoArray(c, 0);
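Note that since the API is incubating, it lives in the jdk.incubator.vector module, which has to be added explicitly at both compile time and run time:

java --add-modules jdk.incubator.vector MyBenchmark

(MyBenchmark is just a placeholder for your own class.)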

Alright, introductions out of the way, let's take it for a spin!

Test setup

  • Intel(R) Xeon(R) Platinum 8375C CPU
  • 256GB RAM / 16GB Heap
  • L1, L2, L3 caches = 1.5MB (for data), 40MB, 54MB
  • OpenJDK 17.0.7 (Vector API Second Incubator)

A note before we begin

Main memory accesses are expensive compared to CPU cycles, "costing" anywhere from 60–100+ CPU cycles per access, which is 60–100+ SIMD operations' worth of work per access!

Given that, the bigger the arrays get, the smaller the fraction of them that fits into the CPU caches (L1/L2/L3). That means more main memory accesses, which lessens the benefit of the Vector API.

Test 1: Simple Sum

private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

public void arrays() {
    for (int i = 0; i < ARRAY_SIZE; i++) {
        result[i] = a[i] + b[i];
    }
}

public void vectors() {
    for (int i = 0; i < ARRAY_SIZE; i += SPECIES.length()) {
        FloatVector aVector = FloatVector.fromArray(SPECIES, a, i);
        FloatVector bVector = FloatVector.fromArray(SPECIES, b, i);

        aVector.add(bVector).intoArray(result, i);
    }
}
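The benchmark sizes are all multiples of the vector length, so the loop above never runs past the end of the array. For arbitrary lengths you would cap the vector loop with VectorSpecies.loopBound and handle the remainder with a scalar tail; a minimal sketch (vectorsWithTail is a hypothetical variant, not part of the benchmark):

public void vectorsWithTail() {
    // Largest multiple of the lane count that still fits in the array
    int upperBound = SPECIES.loopBound(ARRAY_SIZE);
    int i = 0;
    for (; i < upperBound; i += SPECIES.length()) {
        FloatVector aVector = FloatVector.fromArray(SPECIES, a, i);
        FloatVector bVector = FloatVector.fromArray(SPECIES, b, i);
        aVector.add(bVector).intoArray(result, i);
    }
    // Scalar tail for the leftover elements
    for (; i < ARRAY_SIZE; i++) {
        result[i] = a[i] + b[i];
    }
}

The results: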
Benchmark          (arraySize)  Mode  Cnt          Score         Error  Units
SimpleSum.arrays            64  avgt   25         10.752 ±       0.030  ns/op
SimpleSum.arrays           512  avgt   25         68.539 ±       0.063  ns/op
SimpleSum.arrays          4096  avgt   25        617.995 ±       0.459  ns/op
SimpleSum.arrays         32768  avgt   25       5564.030 ±       7.887  ns/op
SimpleSum.arrays        262144  avgt   25     141673.490 ±     689.289  ns/op
SimpleSum.arrays       2097152  avgt   25     809495.940 ±   37403.943  ns/op
SimpleSum.arrays      16777216  avgt   25   13788130.242 ±  287193.094  ns/op
SimpleSum.arrays     134217728  avgt   25  112553636.860 ± 1241376.007  ns/op
SimpleSum.vectors           64  avgt   25          8.168 ±       0.089  ns/op
SimpleSum.vectors          512  avgt   25         69.964 ±       0.063  ns/op
SimpleSum.vectors         4096  avgt   25        671.500 ±       0.343  ns/op
SimpleSum.vectors        32768  avgt   25       6892.828 ±       1.494  ns/op
SimpleSum.vectors       262144  avgt   25     136210.386 ±     387.514  ns/op
SimpleSum.vectors      2097152  avgt   25     760036.142 ±   18445.247  ns/op
SimpleSum.vectors     16777216  avgt   25   13584702.141 ±  261731.672  ns/op
SimpleSum.vectors    134217728  avgt   25  111513654.191 ± 2826089.455  ns/op

Looks like using the Vector API had practically no benefit, and in some cases it even hurt performance. What's going on?

As mentioned earlier, SIMD is not a new concept to the JIT; it has existed there for a while in the form of auto-vectorization.

What we are likely seeing here is the result of the JIT re-writing the arrays method to also use SIMD instructions, which would make both of our tested methods very similar.

Let's disable that optimization (-XX:-UseSuperWord) and check again. With JMH, the flag can be passed per fork; a minimal sketch (the repository's actual setup may differ):
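import org.openjdk.jmh.annotations.Fork;

// Re-run the same benchmarks with HotSpot's auto-vectorization pass disabled
@Fork(jvmArgsAppend = {"-XX:-UseSuperWord"})
public class SimpleSumNoSuperWord {
    // ... same arrays() and vectors() methods as above
}

The results: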

Benchmark                     (arraySize)  Mode  Cnt          Score         Error  Units
SimpleSumNoSuperWord.arrays            64  avgt   25         21.712 ±       0.027  ns/op
SimpleSumNoSuperWord.arrays           512  avgt   25        295.518 ±       0.053  ns/op
SimpleSumNoSuperWord.arrays          4096  avgt   25       2386.931 ±       0.822  ns/op
SimpleSumNoSuperWord.arrays         32768  avgt   25      20008.974 ±       7.453  ns/op
SimpleSumNoSuperWord.arrays        262144  avgt   25     438206.699 ±     177.381  ns/op
SimpleSumNoSuperWord.arrays       2097152  avgt   25    1032702.108 ±   21293.023  ns/op
SimpleSumNoSuperWord.arrays      16777216  avgt   25   12373763.662 ±  118032.230  ns/op
SimpleSumNoSuperWord.arrays     134217728  avgt   25  101077061.448 ±  633988.753  ns/op
SimpleSumNoSuperWord.vectors           64  avgt   25          8.184 ±       0.089  ns/op
SimpleSumNoSuperWord.vectors          512  avgt   25         69.964 ±       0.079  ns/op
SimpleSumNoSuperWord.vectors         4096  avgt   25        671.851 ±       0.517  ns/op
SimpleSumNoSuperWord.vectors        32768  avgt   25       7009.130 ±       2.189  ns/op
SimpleSumNoSuperWord.vectors       262144  avgt   25     163383.357 ±     882.544  ns/op
SimpleSumNoSuperWord.vectors      2097152  avgt   25     770218.071 ±   34027.675  ns/op
SimpleSumNoSuperWord.vectors     16777216  avgt   25   13564171.702 ±  255358.090  ns/op
SimpleSumNoSuperWord.vectors    134217728  avgt   25  112384608.402 ± 2028171.818  ns/op

This looks much better.

As mentioned above, the bigger the arrays, the more time we spend on expensive main memory accesses instead of actual calculations. Since that time is the same for both implementations, the speedup in the actual calculations matters less and less as the arrays grow.

Test 2: Complex Expression

private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

public void arrays() {
    for (int i = 0; i < arraySize; i++) {
        result[i] = ((a[i] * b[i] + b[i] - b[i]) / b[i]) +
                a[i] + a[i] + a[i] + a[i];
    }
}

public void vectors() {
    for (int i = 0; i < arraySize; i += SPECIES.length()) {
        FloatVector aVector = FloatVector.fromArray(SPECIES, a, i);
        FloatVector bVector = FloatVector.fromArray(SPECIES, b, i);

        // The same expression, written as a chain of lane-wise operations
        aVector
                .mul(bVector)
                .add(bVector)
                .sub(bVector)
                .div(bVector)
                .add(aVector)
                .add(aVector)
                .add(aVector)
                .add(aVector)
                .intoArray(result, i);
    }
}
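As an aside, the leading a[i] * b[i] + b[i] could be folded into a single lane-wise fused multiply-add via FloatVector.fma; a sketch of that variant (not what the benchmark measures, and note that fma rounds once instead of twice, so results can differ in the last bit):

// a*b + b as one fused multiply-add, then the rest of the chain as before
aVector.fma(bVector, bVector)
        .sub(bVector)
        .div(bVector)
        .add(aVector)
        .add(aVector)
        .add(aVector)
        .add(aVector)
        .intoArray(result, i);

The results: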
Benchmark                  (arraySize)  Mode  Cnt          Score         Error  Units
ComplexExpression.arrays            64  avgt   25         30.624 ±       0.219  ns/op
ComplexExpression.arrays           512  avgt   25        120.797 ±       0.526  ns/op
ComplexExpression.arrays          4096  avgt   25        946.221 ±      20.957  ns/op
ComplexExpression.arrays         32768  avgt   25       6737.679 ±     160.232  ns/op
ComplexExpression.arrays        262144  avgt   25     361199.573 ±     760.639  ns/op
ComplexExpression.arrays       2097152  avgt   25     762356.007 ±   33823.662  ns/op
ComplexExpression.arrays      16777216  avgt   25   12627542.845 ±  296061.846  ns/op
ComplexExpression.arrays     134217728  avgt   25  106105812.376 ± 2503255.353  ns/op
ComplexExpression.vectors           64  avgt   25         12.458 ±       0.017  ns/op
ComplexExpression.vectors          512  avgt   25         98.974 ±       0.180  ns/op
ComplexExpression.vectors         4096  avgt   25        991.589 ±       4.627  ns/op
ComplexExpression.vectors        32768  avgt   25       7933.439 ±     419.246  ns/op
ComplexExpression.vectors       262144  avgt   25     369727.074 ±    1024.303  ns/op
ComplexExpression.vectors      2097152  avgt   25     765980.046 ±   50602.787  ns/op
ComplexExpression.vectors     16777216  avgt   25   12503692.775 ±  208763.290  ns/op
ComplexExpression.vectors    134217728  avgt   25  102388046.086 ± 2165302.078  ns/op

Seems like even with a more complex expression the JIT is still able to auto-vectorize our code, so let's turn that optimization off again:

Benchmark                     (arraySize)  Mode  Cnt          Score         Error  Units
ComplexExpressionNSW.arrays            64  avgt   25         86.835 ±       0.739  ns/op
ComplexExpressionNSW.arrays           512  avgt   25        720.777 ±       7.063  ns/op
ComplexExpressionNSW.arrays          4096  avgt   25       5785.453 ±      70.299  ns/op
ComplexExpressionNSW.arrays         32768  avgt   25      46618.108 ±     470.062  ns/op
ComplexExpressionNSW.arrays        262144  avgt   25    1104864.860 ±     338.645  ns/op
ComplexExpressionNSW.arrays       2097152  avgt   25    2875094.929 ±   25523.785  ns/op
ComplexExpressionNSW.arrays      16777216  avgt   25   24870112.853 ±  195418.788  ns/op
ComplexExpressionNSW.arrays     134217728  avgt   25  198879883.287 ± 1454468.451  ns/op
ComplexExpressionNSW.vectors           64  avgt   25         12.456 ±       0.015  ns/op
ComplexExpressionNSW.vectors          512  avgt   25         98.940 ±       0.181  ns/op
ComplexExpressionNSW.vectors         4096  avgt   25       1084.581 ±       5.529  ns/op
ComplexExpressionNSW.vectors        32768  avgt   25       8277.051 ±     482.405  ns/op
ComplexExpressionNSW.vectors       262144  avgt   25     367207.320 ±    1804.581  ns/op
ComplexExpressionNSW.vectors      2097152  avgt   25     778058.584 ±   34984.022  ns/op
ComplexExpressionNSW.vectors     16777216  avgt   25   12248559.300 ±   76245.541  ns/op
ComplexExpressionNSW.vectors    134217728  avgt   25   99745414.077 ±  958472.028  ns/op

As expected, we see that the more complicated the expression, the more beneficial it is to use the Vector API.

Test 3: Array stats

So the JIT is pretty good at automatically vectorizing mathematical expressions, but how will it fare with if-statements?

Given two arrays, we would like to count how many cells of the first array are equal to, greater than, or lower than the corresponding cell in the second array.

private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

public void arrays() {
    for (int i = 0; i < arraySize; i++) {
        if (a[i] == b[i]) {
            eq++;
        } else if (a[i] > b[i]) {
            gt++;
        } else {
            lt++;
        }
    }
}

public void vectors() {
    for (int i = 0; i < arraySize; i += SPECIES.length()) {
        FloatVector aVector = FloatVector.fromArray(SPECIES, a, i);
        FloatVector bVector = FloatVector.fromArray(SPECIES, b, i);

        // Lane-wise comparisons produce masks; trueCount() counts the set lanes
        int ltCount = aVector.lt(bVector).trueCount();
        int eqCount = aVector.eq(bVector).trueCount();

        eq += eqCount;
        lt += ltCount;
        // Whatever is neither lower-than nor equal must be greater-than
        gt += SPECIES.length() - ltCount - eqCount;
    }
}
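For what it's worth, the greater-than bucket could also be counted with an explicit third comparison via VectorOperators; a sketch (the version above saves one compare per iteration by deriving that count arithmetically):

import jdk.incubator.vector.VectorOperators;

// Equivalent result, but with three comparisons instead of two
int gtCount = aVector.compare(VectorOperators.GT, bVector).trueCount();

The results: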
Benchmark           (arraySize)  Mode  Cnt          Score         Error  Units
ArrayStats.arrays            64  avgt   25         78.085 ±       0.883  ns/op
ArrayStats.arrays           512  avgt   25        480.687 ±      17.253  ns/op
ArrayStats.arrays          4096  avgt   25       3520.486 ±      14.364  ns/op
ArrayStats.arrays         32768  avgt   25     129068.801 ±    4531.479  ns/op
ArrayStats.arrays        262144  avgt   25    1247707.572 ±    1294.815  ns/op
ArrayStats.arrays       2097152  avgt   25   10094594.371 ±   47276.090  ns/op
ArrayStats.arrays      16777216  avgt   25   81380875.237 ± 1365443.776  ns/op
ArrayStats.arrays     134217728  avgt   25  664508819.256 ± 4166652.382  ns/op
ArrayStats.vectors           64  avgt   25         38.399 ±       0.254  ns/op
ArrayStats.vectors          512  avgt   25        130.115 ±       0.022  ns/op
ArrayStats.vectors         4096  avgt   25        830.619 ±       0.359  ns/op
ArrayStats.vectors        32768  avgt   25       8389.896 ±       2.192  ns/op
ArrayStats.vectors       262144  avgt   25      76565.740 ±     705.466  ns/op
ArrayStats.vectors      2097152  avgt   25     623986.030 ±   20350.313  ns/op
ArrayStats.vectors     16777216  avgt   25    8397410.833 ±   75566.255  ns/op
ArrayStats.vectors    134217728  avgt   25   70185587.855 ±  640284.849  ns/op

Aha! Seems like the JIT has met its match: using the new Vector API we get up to a 16x speedup, and even for very large arrays we get almost a 10x speedup!

The reason for this speedup is that the vectors implementation is what you would call "branchless code": the lane-wise comparisons replace conditional jumps, so there are no branch mispredictions to flush the CPU's pipeline, and the CPU is utilized much better.

That's it from me!

Thank you for tuning in; until next time, stay efficient!
