Java’s new Vector API: How fast is it? — Part 2

Tomer Zeltzer
7 min read · Jul 29, 2023


After we had some fun with the new Vector API in Part 1, a few oddities showed up in the test results that I’d like to drill into and explain, but first…

TL;DR (Conclusions)

  • The JIT’s loop unrolling of Vector API code can be improved, though the cases where it fails to unroll are pretty rare.
  • With the Vector API the JIT is able to generate more efficient code; however, the difference is only significant for smaller arrays.
  • Branch mispredictions are painful!

What is Loop Unrolling?

Loop Unrolling is an optimization technique performed automatically by the JIT, and sometimes manually by the developer.

Let’s say I have a method foo that I want to call 100 times. I could do the following:

public void foo(int i) { ... }

foo(0);
foo(1);
...
foo(99);

But unless you want people to look at you like you’ve lost your mind, we need something else… we can use a for loop!

public void foo(int i) { ... }

for (int i = 0; i < 100; i++) {
    foo(i);
}

But as the saying goes, there are no free lunches, and now we have added an overhead of roughly 100 condition checks (i < 100), 100 variable increments (i++) and 100 jumps.

So what can we do about that overhead? Combine both of the options above!

public void foo(int i) { ... }

for (int i = 0; i < 100; i += 5) {
    foo(i);
    foo(i + 1);
    foo(i + 2);
    foo(i + 3);
    foo(i + 4);
}

Now we only have 20 condition checks and 20 jumps (the roughly 100 additions are still there, between the counter increments and the i + k offsets), and we have successfully “unrolled” the loop.

Furthermore, we have increased ILP (instruction level parallelism, not to be confused with concurrency or multithreading!) since we now have more independent instructions (each call of foo) that can run in parallel.

We can “unroll” the loop more or less by changing the number of calls to foo accordingly. It’s a balance between the performance gain from unrolling and code size, because more unrolling → more code → less of it fits in the code cache → more code cache misses → performance suffers.

Reminder of the test setup

  • Intel(R) Xeon(R) Platinum 8375C CPU
  • 256GB RAM / 16GB Heap
  • L1, L2, L3 caches = 1.5MB (for data), 40MB, 54MB
  • OpenJDK 17.0.7 (Vector API Second Incubator)

Let’s revisit some results…

These are the results we got without turning off any JIT optimizations:

Questions:

  • Why is the speedup more pronounced for 64 elements?
  • Why is the vectors implementation significantly slower for 32K elements?

The answers to these questions are the same for both SimpleSum and ComplexExpressions, so I’ll focus on SimpleSum for simplicity.

Let’s take a look under the JIT’s hood and see what code is actually running. For that, we’ll add the following JVM flags:

-XX:+UnlockDiagnosticVMOptions
-XX:+PrintAssembly
-Xlog:class+load=info
-XX:+LogCompilation
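
For reference, here is roughly what the two SimpleSum implementations from Part 1 look like. This is a sketch from memory rather than the exact benchmark code, and the class and method names are mine; the vectors version assumes FloatVector.SPECIES_PREFERRED resolves to the widest species the hardware supports (512-bit on this machine).

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class SimpleSum {

    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // "arrays": a plain loop that the JIT is free to auto-vectorize.
    static void sumArrays(float[] a, float[] b, float[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    // "vectors": explicit vectorization, SPECIES.length() lanes per iteration.
    static void sumVectors(float[] a, float[] b, float[] c) {
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            va.add(vb).intoArray(c, i);
        }
        for (; i < a.length; i++) { // scalar tail for any leftovers
            c[i] = a[i] + b[i];
        }
    }
}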

arrays:


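// Scalar pre-loop: 1 element per iteration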
L0000: vmovss 0x10(%rdi,%rdx,4),%xmm0
0x00007fa0b9019cf0: vaddss 0x10(%rsi,%rdx,4),%xmm0,%xmm1
0x00007fa0b9019cf6: vmovss %xmm1,0x10(%rcx,%rdx,4)

0x00007fa0b9019cfc: inc %edx

0x00007fa0b9019cfe: cmp %r11d,%edx
0x00007fa0b9019d01: jl L0000


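// Main SIMD loop: 4 elements (128 bits) per iteration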
L0002: vmovdqu 0x10(%rdi,%rdx,4),%xmm0
0x00007fa0b9019d46: vaddps 0x10(%rsi,%rdx,4),%xmm0,%xmm0
0x00007fa0b9019d4c: vmovdqu %xmm0,0x10(%rcx,%rdx,4)

0x00007fa0b9019d52: add $0x4,%edx

0x00007fa0b9019d55: cmp %r11d,%edx
0x00007fa0b9019d58: jl L0002


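// Scalar post-loop: leftover elements, 1 per iteration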
L0004: vmovss 0x10(%rdi,%rdx,4),%xmm0
0x00007fa0b9019d72: vaddss 0x10(%rsi,%rdx,4),%xmm0,%xmm1
0x00007fa0b9019d78: vmovss %xmm1,0x10(%rcx,%rdx,4)

0x00007fa0b9019d7e: inc %edx

0x00007fa0b9019d80: cmp %ebp,%edx
0x00007fa0b9019d82: jl L0004

Our loop was broken into 3 smaller loops, and while it would seem we are using SIMD instructions in all of them (judging from the use of xmm0/xmm1) we actually are not.

The vmovss/vaddss instructions in L0000/L0004 operate on SIMD registers, but they only use the lowest 32 bits of those registers, so we are really calculating just 1 cell at a time, as opposed to L0002, where we use “true” SIMD instructions and calculate 4 cells at a time.

Another hint for this is that in L0000/L0004 we increment the loop counter by 1 (0x00007fa0b9019cfc / 0x00007fa0b9019d7e) while in L0002 we increment it by 4 (0x00007fa0b9019d52).

I am not sure why the code is generated this way, but my theory is that it was done so we get good performance on older SIMD hardware (SSE) as well as newer hardware: older SIMD hardware performed very badly on misaligned memory accesses, so the first loop goes cell by cell until it reaches an aligned address, the main loop then runs with “true” SIMD instructions, and the final loop handles the remaining elements.

Furthermore, in the “main loop” (L0002) we only seem to be calculating 4 cells (128 bits) at a time, which we can spot quickly from the fact that we increment the loop counter by 4 (0x00007fa0b9019d52), while my machine has 256-bit and 512-bit SIMD registers, so we could do more per instruction.
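
In Java terms, the three generated loops correspond to something like the following. This is just a conceptual sketch of what the JIT emitted, not real code; alignBound and mainBound stand in for the limits the JIT computes at runtime.

int i = 0;

// L0000: scalar pre-loop, 1 element per iteration, until the accesses are aligned.
for (; i < alignBound; i++) {
    c[i] = a[i] + b[i];
}

// L0002: main SIMD loop, 4 floats (128 bits) per iteration, one vaddps each.
for (; i < mainBound; i += 4) {
    // c[i..i+3] = a[i..i+3] + b[i..i+3]
}

// L0004: scalar post-loop for whatever is left.
for (; i < c.length; i++) {
    c[i] = a[i] + b[i];
}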

vectors:


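// Single loop: 16 elements (512 bits) per iteration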
L0001: vmovdqu32 0x10(%r8,%rdx,4),%zmm0
0x00007f45f502794b: vaddps 0x10(%rcx,%rdx,4),%zmm0,%zmm0
0x00007f45f5027956: vmovdqu32 %zmm0,0x10(%r9,%rdx,4)


0x00007f45f5027961: add $0x10,%edx
0x00007f45f5027964: cmp %edi,%edx
0x00007f45f5027966: jl L0001

It seems that for the vectors implementation, the JIT was able to understand that the memory alignment part is not needed (my SIMD hardware is newer), and it uses a single loop that calculates 16 cells (512 bits) at a time.

Which is why we are seeing such a speedup for 64 elements.

Now let’s take a look at 32K elements:

arrays:


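// Scalar pre-loop: 1 element per iteration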
L0000: vmovss 0x10(%rax,%r11,4),%xmm1
0x00007f37489bd674: vaddss 0x10(%rdi,%r11,4),%xmm1,%xmm0
0x00007f37489bd67b: vmovss %xmm0,0x10(%rdx,%r11,4)

0x00007f37489bd682: inc %r11d

0x00007f37489bd685: cmp %r9d,%r11d
0x00007f37489bd688: jl L0000


// Unrolled loop
// c[i...i+15] = a[i...i+15] + b[i...i+15]
L0002: vmovdqu32 0x10(%rax,%r11,4),%zmm0
0x00007f37489bd6db: vaddps 0x10(%rdi,%r11,4),%zmm0,%zmm0
0x00007f37489bd6e6: vmovdqu32 %zmm0,0x10(%rdx,%r11,4)

// c[i+16...i+31] = a[i+16...i+31] + b[i+16...i+31]
0x00007f37489bd6f1: vmovdqu32 0x50(%rax,%r11,4),%zmm0
0x00007f37489bd6fc: vaddps 0x50(%rdi,%r11,4),%zmm0,%zmm0
0x00007f37489bd707: vmovdqu32 %zmm0,0x50(%rdx,%r11,4)

// c[i+32...i+47] = a[i+32...i+47] + b[i+32...i+47]
0x00007f37489bd712: vmovdqu32 0x90(%rax,%r11,4),%zmm0
0x00007f37489bd71d: vaddps 0x90(%rdi,%r11,4),%zmm0,%zmm0
0x00007f37489bd728: vmovdqu32 %zmm0,0x90(%rdx,%r11,4)

// c[i+48...i+63] = a[i+48...i+63] + b[i+48...i+63]
0x00007f37489bd733: vmovdqu32 0xd0(%rax,%r11,4),%zmm0
0x00007f37489bd73e: vaddps 0xd0(%rdi,%r11,4),%zmm0,%zmm0
0x00007f37489bd749: vmovdqu32 %zmm0,0xd0(%rdx,%r11,4)

0x00007f37489bd754: add $0x40,%r11d

0x00007f37489bd758: cmp %esi,%r11d
0x00007f37489bd75b: jl L0002


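// Vector post-loop: 16 elements per iteration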
L0003: vmovdqu32 0x10(%rax,%r11,4),%zmm0
0x00007f37489bd793: vaddps 0x10(%rdi,%r11,4),%zmm0,%zmm0
0x00007f37489bd79e: vmovdqu32 %zmm0,0x10(%rdx,%r11,4)

0x00007f37489bd7a9: add $0x10,%r11d

0x00007f37489bd7ad: cmp %r10d,%r11d
0x00007f37489bd7b0: jl L0003


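// Scalar post-loop: 1 element per iteration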
L0005: vmovss 0x10(%rax,%r11,4),%xmm1
0x00007f37489bd7bf: vaddss 0x10(%rdi,%r11,4),%xmm1,%xmm0
0x00007f37489bd7c6: vmovss %xmm0,0x10(%rdx,%r11,4)

0x00007f37489bd7cd: inc %r11d

0x00007f37489bd7d0: cmp %ebp,%r11d
0x00007f37489bd7d3: jl L0005

This time, we see that the JIT added an extra loop, L0003, and also did some loop unrolling in L0002; the rest is mostly the same as for 64 elements.

Based on the loop counter increment at 0x00007f37489bd754 (0x40 = 64) and the use of zmm registers, which are 512 bits wide, we can deduce that each iteration of L0002 handles 64 elements, or 4 vectors, since each float is 32 bits and 512 bits / 32 bits = 16 elements per vector.

But what if after the alignment loop (L0000) and the main loop (L0002) we are left with, for example, 62 elements?

We can’t do another iteration of L0002 because it needs 64 elements, and we don’t want to calculate those 62 elements 1 cell at a time, so we calculate as many as we can 1 vector (16 elements) at a time and do the “leftovers” 1 cell at a time.

Which is exactly what L0003 and L0005 are for.
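
Putting it together, the generated code for 32K elements behaves roughly like this (again a conceptual sketch with made-up bound names):

int i = 0;

// L0000: scalar pre-loop until alignment, 1 element per iteration.
for (; i < alignBound; i++) {
    c[i] = a[i] + b[i];
}

// L0002: unrolled main loop, 4 vectors x 16 floats = 64 elements per iteration.
for (; i < unrolledBound; i += 64) {
    // four 512-bit vaddps per iteration
}

// L0003: vector "drain" loop, 1 vector (16 elements) per iteration.
for (; i < vectorBound; i += 16) {
    // one 512-bit vaddps per iteration
}

// L0005: scalar tail, 1 element per iteration.
for (; i < c.length; i++) {
    c[i] = a[i] + b[i];
}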

vectors:


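// Single loop: 16 elements per iteration, no unrolling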
L0001: vmovdqu32 0x10(%rdx,%r11,4),%zmm0
0x00007f16c09d314e: vaddps 0x10(%rdi,%r11,4),%zmm0,%zmm0
0x00007f16c09d3159: vmovdqu32 %zmm0,0x10(%rcx,%r11,4)

0x00007f16c09d3164: add $0x10,%r11d
0x00007f16c09d3168: cmp %eax,%r11d
0x00007f16c09d316b: jl L0001

For some reason, unlike for the arrays implementation, the JIT decided not to do any loop-unrolling, which is why the vectors implementation is so much slower for 32K elements.
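
If that ever bites you, nothing stops you from unrolling the Vector API loop by hand. Here is a rough, untested sketch (reusing the SPECIES constant from the earlier snippet); whether it actually beats the single-vector loop would of course have to be measured:

// Hand-unrolled Vector API loop: 4 vectors (64 floats with a 512-bit species)
// per iteration, then a 1-vector drain loop, then a scalar tail.
static void sumVectorsUnrolled(float[] a, float[] b, float[] c) {
    int s = SPECIES.length();                   // 16 lanes for a 512-bit species
    int vectorBound = SPECIES.loopBound(a.length);
    int i = 0;
    for (; i <= vectorBound - 4 * s; i += 4 * s) {
        FloatVector.fromArray(SPECIES, a, i).add(FloatVector.fromArray(SPECIES, b, i)).intoArray(c, i);
        FloatVector.fromArray(SPECIES, a, i + s).add(FloatVector.fromArray(SPECIES, b, i + s)).intoArray(c, i + s);
        FloatVector.fromArray(SPECIES, a, i + 2 * s).add(FloatVector.fromArray(SPECIES, b, i + 2 * s)).intoArray(c, i + 2 * s);
        FloatVector.fromArray(SPECIES, a, i + 3 * s).add(FloatVector.fromArray(SPECIES, b, i + 3 * s)).intoArray(c, i + 3 * s);
    }
    for (; i < vectorBound; i += s) {           // drain loop: one vector at a time
        FloatVector.fromArray(SPECIES, a, i).add(FloatVector.fromArray(SPECIES, b, i)).intoArray(c, i);
    }
    for (; i < a.length; i++) {                 // scalar tail
        c[i] = a[i] + b[i];
    }
}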

Questions:

  • Why is the speedup significantly worse for arrays of size up to 4K?
  • Why does the speedup fall starting from 16M elements?

As mentioned in the previous post, the reason for the speedup is that the vectors implementation is branchless, meaning we are not paying the penalty for branch mispredictions.
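
I won’t reproduce the Part 1 benchmark here, but the general pattern looks like this: where the arrays version has a data-dependent if inside the loop, the vectors version computes both outcomes and picks per lane with a comparison mask and a blend, so the hot loop contains no conditional branch at all. A hedged sketch with an illustrative condition, not the actual benchmark code (it also needs jdk.incubator.vector.VectorMask and VectorOperators on top of the earlier imports):

// Scalar version: the branch outcome depends on the (random) data,
// so the branch predictor has a hard time.
static void scalarConditional(float[] a, float[] b, float[] c) {
    for (int i = 0; i < a.length; i++) {
        if (a[i] > 0f) {
            c[i] = a[i] + b[i];
        } else {
            c[i] = a[i] - b[i];
        }
    }
}

// Vector version: compute both results, then pick per lane with a mask, no branch.
static void vectorConditional(float[] a, float[] b, float[] c) {
    int i = 0;
    for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
        FloatVector va = FloatVector.fromArray(SPECIES, a, i);
        FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
        VectorMask<Float> positive = va.compare(VectorOperators.GT, 0f);
        va.sub(vb).blend(va.add(vb), positive).intoArray(c, i);
    }
    for (; i < a.length; i++) {                 // scalar tail
        c[i] = a[i] > 0f ? a[i] + b[i] : a[i] - b[i];
    }
}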

Let’s use the Linux perf profiler to look at the branch misprediction rate of the arrays implementation for 4K/32K/16M elements:

4K:
2240489501040 branches:u # 4.454 G/sec
99890832 branch-misses:u # 0.00% of all branches

32K:
496365740442 branches:u # 986.617 M/sec
48553678225 branch-misses:u # 9.78% of all branches

16M:
402842299626 branches:u # 794.185 M/sec
51458009370 branch-misses:u # 12.77% of all branches

It seems the branch predictor is able to predict branches pretty well for up to 4K random elements; from 32K elements, however, the branch misprediction rate jumps dramatically, which is why we see such an increase in the speedup.

However, that does not explain why the speedup drops again from 16M elements: the misprediction rate keeps going up, so shouldn’t the speedup follow? Not quite.

During the test we access 3 arrays (a, b and result), each holding 16M floats. That is 3 * 16M * 4 (bytes per float) = 192MB, while the L3 cache is only 54MB, which means we are probably going out to main memory a lot.

The speedup becomes less pronounced for 16M because:

  • Main memory accesses are expensive, especially compared to the instructions we run here
  • More L3 misses → more main memory accesses → the total run time of the test becomes dominated by memory access time (which the Vector API has no effect on) rather than by the time it takes to run the actual instructions, which lessens the effect of the Vector API.

Let’s verify that with perf as well:

2M:
1386935 LLC-load-misses:u # 0.86% of all LL-cache accesses

16M:
86374788 LLC-load-misses:u # 65.85% of all LL-cache accesses

LLC stands for Last Level Cache, or L3 in this case, and indeed the miss rate for the L3 cache is A LOT higher for 16M elements.

That’s it from me!

Thank you for tuning in, until next time, stay efficient!
