Auto-vectorize C and C++ code

Auto-vectorization of for loops can yield significant performance improvements

EventHelix
Software Design
6 min read · Dec 31, 2021

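Developers can take advantage of the auto-vectorization support built into GCC. Let’s explore vectorization with the GCC 11.2 compiler, using compiler options that enable auto-vectorization. In GCC 11 the vectorizer is turned on at -O3 (or explicitly with -ftree-vectorize), and the ymm registers in the generated code below require an AVX-capable target.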

Vector operations with 8 iterations

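Let’s start with the following code. It performs element-wise vector operations in a loop whose iteration count, 8 (VECLEN), is known to the compiler. Also note the __restrict keyword, GCC's spelling of the C99 restrict qualifier, which also works in C++. It tells the compiler that memory written through the pointer c is not also accessed through a or b. The compiler therefore does not have to assume that a store to c might change a value it still needs to read from a or b, so it can load all the inputs before writing any result.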
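A minimal sketch of the kind of function this section describes might look like the following. The function name vector_multiply and the parameter order (a, b, c) are assumptions, chosen so that under the x86-64 System V calling convention a, b, and c arrive in rdi, rsi, and rdx, which matches the registers discussed in this article. The flags in the comment are also an assumption, not necessarily the exact options used here.

```c
/* Sketch only: the function name, parameter order, and flags are assumptions.
   One way to inspect the output: gcc -O3 -mavx2 -S vector_multiply.c */
#define VECLEN 8

/* __restrict promises that memory written through c is not read through a or b. */
void vector_multiply(const float *a, const float *b, float *__restrict c)
{
    for (int i = 0; i < VECLEN; i++)
    {
        c[i] = a[i] * b[i];
    }
}
```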

The compiler generates no loop at all: the 8 multiplications are performed by a single short sequence of vector instructions.

The generated code performs the following operations:

vmovups ymm0, YMMWORD PTR [rdi]

This instruction loads all 8 entries of a from memory into the 256-bit ymm0 register as 8 packed single-precision floats. The rdi register points to the start of the a array.

vmulps ymm0, ymm0, YMMWORD PTR [rsi]

Here the processor multiplies the 8 floats of a (in ymm0) element-wise by the 8 floats of b, which are read directly from memory (the pointer to b is in the rsi register). The result is written back to the ymm0 register.

vmovups YMMWORD PTR [rdx], ymm0

Now the processor stores the 8 results from ymm0 into the array c (the pointer to c is in the rdx register). This is the last step before ret returns from the function.
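The three steps above (load, multiply, store) can also be written out explicitly in C. The sketch below uses the GCC/Clang vector_size extension, which is a different technique from auto-vectorization and is shown only to illustrate the data flow. The type name v8sf and the function name are hypothetical, and the compiler is not guaranteed to emit exactly the instructions named in the comments.

```c
#include <string.h>

/* GCC/Clang extension: one value holding 8 packed floats (32 bytes),
   the same shape as a ymm register. The name v8sf is hypothetical. */
typedef float v8sf __attribute__((vector_size(32)));

void vector_multiply_explicit(const float *a, const float *b, float *c)
{
    v8sf va, vb, vc;

    memcpy(&va, a, sizeof va);  /* load 8 floats of a, like vmovups ymm0, [rdi]   */
    memcpy(&vb, b, sizeof vb);  /* load 8 floats of b (a memory operand of vmulps) */
    vc = va * vb;               /* element-wise multiply, like vmulps              */
    memcpy(c, &vc, sizeof vc);  /* store 8 results, like vmovups [rdx], ymm0       */
}
```

Because every input is loaded before any output is stored, this version behaves like the original loop only when c does not overlap a or b, which is the guarantee __restrict provides.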

View in compiler explorer.

Vector operations with 16 iterations

When the code is changed to 16 iterations, the generated code simply repeats the block we saw with 8 iterations, once for each group of 8 floats.

The generated code unrolls the loop.
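Conceptually, the unrolled result handles the 16 elements as two groups of 8. The sketch below spells that out in C; the function name and the LANES constant are hypothetical, and it illustrates the grouping rather than the exact code GCC produces.

```c
#define VECLEN 16
#define LANES  8   /* floats per 256-bit ymm register */

void vector_multiply_16(const float *a, const float *b, float *__restrict c)
{
    /* Group 0: elements 0..7, at byte offset 0 from each pointer. */
    for (int i = 0; i < LANES; i++)
        c[i] = a[i] * b[i];

    /* Group 1: elements 8..15, at byte offset 32 from each pointer. */
    for (int i = LANES; i < VECLEN; i++)
        c[i] = a[i] * b[i];
}
```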

View in compiler explorer.

Vector operations with 128 iterations

Even with VECLEN set to 128, there is still no loop in sight.

View in compiler explorer.

Vector operations with 256 iterations

With VECLEN set to 256, we do get a loop in the generated code. Note that even with the loop, the processor performs 8 multiplications per loop iteration. Thus, the generated loop iterates only 256 / 8 = 32 times.
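The 32-iteration count can be made visible by writing the loop in blocks of 8. This is a sketch of the iteration structure only; the function name and LANES are hypothetical, and the inner loop stands in for a single 8-wide vector multiply.

```c
#define VECLEN 256
#define LANES  8   /* floats per 256-bit ymm register */

/* 256 elements at 8 per vector operation gives 32 loop iterations. */
_Static_assert(VECLEN / LANES == 32, "expected 32 vector iterations");

void vector_multiply_256(const float *a, const float *b, float *__restrict c)
{
    for (int block = 0; block < VECLEN / LANES; block++)
    {
        /* One pass of this inner loop corresponds to one vector multiply. */
        for (int lane = 0; lane < LANES; lane++)
        {
            int i = block * LANES + lane;
            c[i] = a[i] * b[i];
        }
    }
}
```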

View in compiler explorer.

Vector operations with n iterations

In the above examples, we have considered scenarios where the loop iteration count is known at compile time. Now let’s consider the case where n iterations have to be performed.

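The compiler cannot size or fully unroll the loop in advance. The generated code typically contains run-time checks on n, a vectorized main loop, and extra code for the elements left over when n is not a multiple of the vector width, so we get a lot of code.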
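A rough C picture of that structure is a main loop over full groups of 8 followed by a scalar loop for the remainder. The function name, the size_t n parameter, and LANES are assumptions for illustration; the code GCC actually emits has more cases than this sketch.

```c
#include <stddef.h>

#define LANES 8   /* floats per 256-bit ymm register */

void vector_multiply_n(const float *a, const float *b,
                       float *__restrict c, size_t n)
{
    size_t i = 0;
    size_t vec_end = n - (n % LANES);  /* largest multiple of LANES <= n */

    /* Main loop: full groups of 8, each one a candidate for a vector multiply. */
    for (; i < vec_end; i += LANES)
        for (size_t lane = 0; lane < LANES; lane++)
            c[i + lane] = a[i + lane] * b[i + lane];

    /* Remainder loop: the last n % 8 elements, one at a time. */
    for (; i < n; i++)
        c[i] = a[i] * b[i];
}
```

With n = 19, for example, the main loop covers elements 0 to 15 and the remainder loop covers elements 16 to 18.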

View in compiler explorer.

Vector operations when the __restrict keyword is not used

Finally, let’s see the impact of not using the __restrict keyword. Consider our first 8-iteration loop without the __restrict qualifier.

Now the compiler cannot assume that a and b do not overlap with c. The compiler ends up generating explicit run-time checks, plus a separate code path to handle the overlapping case.
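The idea behind such a check can be sketched in C: compare the address ranges and use the all-at-once path only when they are disjoint. The helper ranges_overlap and the function name are hypothetical, and the test GCC actually emits may be narrower than this full range comparison.

```c
#include <stddef.h>
#include <stdint.h>

#define VECLEN 8

/* True if the VECLEN-float ranges starting at p and q share any byte. */
static int ranges_overlap(const float *p, const float *q)
{
    uintptr_t ps = (uintptr_t)p;
    uintptr_t qs = (uintptr_t)q;
    size_t bytes = VECLEN * sizeof(float);
    return ps < qs + bytes && qs < ps + bytes;
}

void vector_multiply_checked(const float *a, const float *b, float *c)
{
    if (ranges_overlap(c, a) || ranges_overlap(c, b))
    {
        /* Overlap: keep strict element-by-element order, because a store
           to c[i] may change an input read by a later iteration. */
        for (int i = 0; i < VECLEN; i++)
            c[i] = a[i] * b[i];
    }
    else
    {
        /* No overlap: safe to compute all 8 products before storing any. */
        float tmp[VECLEN];
        for (int i = 0; i < VECLEN; i++)
            tmp[i] = a[i] * b[i];
        for (int i = 0; i < VECLEN; i++)
            c[i] = tmp[i];
    }
}
```

If c points one element past a, each store feeds the next multiplication, so the element-by-element path is the only one that reproduces the loop as written.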

View in compiler explorer.

Key takeaways

Auto-vectorization can result in significant performance improvement because a single instruction performs multiple operations. In addition, when the iteration count is known, the compiler can fully unroll the vectorized loop even for a fairly large number of iterations, as we saw with 128.

Note, however, that many of these benefits accrue only if the compiler knows the number of iterations at compile time. If the iteration count is not known at compile time, some performance is lost to run-time checks that select the appropriate path through the generated code.

Letting the compiler know that the input and output vectors don’t overlap also reduces the code bloat. The compiler can avoid costly runtime checks to look for the overlap.
