Auto-vectorize C and C++ code
Auto-vectorization of for loops results in significant performance improvement
Developers can take advantage of built-in auto-vectorization support in GCC. Let’s explore vectorization with the GCC 11.2 compiler with the following compiler options to enable auto-vectorization.
-O3 -march=skylake-avx512 -mno-vzeroupper
Vector operations with 8 iterations
Let’s start with the following code. We are performing vector operations where the compiler knows that the code will loop 8 times (VECLEN
). Another thing to note is the use of the __restrict
keyword. This keyword tells the compiler that the pointer c
does not overlap with a
and b
. Thus, the compiler does not need to keep updating intermediate results.
typedef float vec;
#define VECLEN 8
void vector_op(vec *a , vec *b, vec * __restrict c) {
for (int i=0; i < VECLEN; i++) {
c[i] = a[i] * b[i];
}
}
The compiler generates no loop, and the 8 operations are performed with a single set of instructions.
vector_op(float*, float*, float*):
vmovups ymm0, YMMWORD PTR [rdi]
vmulps ymm0, ymm0, YMMWORD PTR [rsi]
vmovups YMMWORD PTR [rdx], ymm0
ret
Here the compiler is performing the following operations:
vmovups ymm0, YMMWORD PTR [rdi]
The processor has copied the entire array of 8 entries of a
from memory into 8 packed floats stored in the ymm0
register. Note that rdi
register points to the start of the a
array.
vmulps ymm0, ymm0, YMMWORD PTR [rsi]
Here the processor is performing a vector multiplication for a
(ymm0)
and b
(pointer to b
is stored in the rsi
register). The final result is saved in the ymm0
register.
vmovups YMMWORD PTR [rdx], ymm0
Now the processor is copying the results to the array c
. (pointer to c
is stored in the rdx
register). This is the last step before ret
returns from the function.
Vector operations with 16 iterations
When the code is changed to 16 iterations, the generated code simply repeats the block we saw with 8 iterations.
typedef float vec;
#define VECLEN 16
void vector_op(vec *a , vec *b, vec * __restrict c) {
for (int i=0; i < VECLEN; i++) {
c[i] = a[i] * b[i];
}
}
The generated code unrolls the loop.
vector_op(float*, float*, float*):
vmovups ymm1, YMMWORD PTR [rdi]
vmulps ymm0, ymm1, YMMWORD PTR [rsi]
vmovups YMMWORD PTR [rdx], ymm0
vmovups ymm0, YMMWORD PTR [rdi+32]
vmulps ymm0, ymm0, YMMWORD PTR [rsi+32]
vmovups YMMWORD PTR [rdx+32], ymm0
ret
Vector operations with 128 iterations
Even with VECLEN
set to 128, there is still no loop in sight.
vector_op(float*, float*, float*):
vmovups ymm1, YMMWORD PTR [rdi]
vmovups ymm2, YMMWORD PTR [rdi+32]
vmulps ymm0, ymm1, YMMWORD PTR [rsi]
vmovups ymm3, YMMWORD PTR [rdi+64]
vmovups ymm4, YMMWORD PTR [rdi+96]
vmovups ymm5, YMMWORD PTR [rdi+128]
vmovups ymm6, YMMWORD PTR [rdi+160]
vmovups YMMWORD PTR [rdx], ymm0
vmulps ymm0, ymm2, YMMWORD PTR [rsi+32]
vmovups ymm7, YMMWORD PTR [rdi+192]
vmovups ymm1, YMMWORD PTR [rdi+224]
vmovups ymm2, YMMWORD PTR [rdi+256]
vmovups YMMWORD PTR [rdx+32], ymm0
vmulps ymm0, ymm3, YMMWORD PTR [rsi+64]
vmovups ymm3, YMMWORD PTR [rdi+288]
vmovups YMMWORD PTR [rdx+64], ymm0
vmulps ymm0, ymm4, YMMWORD PTR [rsi+96]
vmovups ymm4, YMMWORD PTR [rdi+320]
vmovups YMMWORD PTR [rdx+96], ymm0
vmulps ymm0, ymm5, YMMWORD PTR [rsi+128]
vmovups YMMWORD PTR [rdx+128], ymm0
vmulps ymm0, ymm6, YMMWORD PTR [rsi+160]
vmovups YMMWORD PTR [rdx+160], ymm0
vmulps ymm0, ymm7, YMMWORD PTR [rsi+192]
vmovups YMMWORD PTR [rdx+192], ymm0
vmulps ymm0, ymm1, YMMWORD PTR [rsi+224]
vmovups YMMWORD PTR [rdx+224], ymm0
vmulps ymm0, ymm2, YMMWORD PTR [rsi+256]
vmovups YMMWORD PTR [rdx+256], ymm0
vmulps ymm0, ymm3, YMMWORD PTR [rsi+288]
vmovups YMMWORD PTR [rdx+288], ymm0
vmulps ymm0, ymm4, YMMWORD PTR [rsi+320]
vmovups YMMWORD PTR [rdx+320], ymm0
vmovups ymm5, YMMWORD PTR [rdi+352]
vmovups ymm6, YMMWORD PTR [rdi+384]
vmulps ymm0, ymm5, YMMWORD PTR [rsi+352]
vmovups ymm7, YMMWORD PTR [rdi+416]
vmovups ymm1, YMMWORD PTR [rdi+448]
vmovups YMMWORD PTR [rdx+352], ymm0
vmulps ymm0, ymm6, YMMWORD PTR [rsi+384]
vmovups YMMWORD PTR [rdx+384], ymm0
vmulps ymm0, ymm7, YMMWORD PTR [rsi+416]
vmovups YMMWORD PTR [rdx+416], ymm0
vmulps ymm0, ymm1, YMMWORD PTR [rsi+448]
vmovups YMMWORD PTR [rdx+448], ymm0
vmovups ymm0, YMMWORD PTR [rdi+480]
vmulps ymm0, ymm0, YMMWORD PTR [rsi+480]
vmovups YMMWORD PTR [rdx+480], ymm0
ret
Vector operations with 256 iterations
With VECLEN
set to 256, we do get a loop in the generated code. Note that even with the loop, the processor is performing 8 multiplications per loop. Thus, the generated loop iterates only 32 times.
vector_op(float*, float*, float*):
xor eax, eax
.L2:
vmovups ymm1, YMMWORD PTR [rsi+rax]
vmulps ymm0, ymm1, YMMWORD PTR [rdi+rax]
vmovups YMMWORD PTR [rdx+rax], ymm0
add rax, 32
cmp rax, 1024
jne .L2
ret
Vector operations with n iterations
In the above examples, we have considered scenarios where the loop iteration count is known at compile time. Now let’s consider the case where n
iterations have to be performed.
typedef float vec;
void vector_op(vec *a , vec *b, vec * __restrict c, int n) {
for (int i=0; i < n; i++) {
c[i] = a[i] * b[i];
}
}
The compiler cannot optimize the loop in advance, so we get some run-time checks and a lot of code.
vector_op(float*, float*, float*, int):
test ecx, ecx
jle .L1
lea r9d, [rcx-1]
cmp r9d, 6
jbe .L8
mov r8d, ecx
shr r8d, 3
sal r8, 5
xor eax, eax
.L4:
vmovups ymm1, YMMWORD PTR [rdi+rax]
vmulps ymm0, ymm1, YMMWORD PTR [rsi+rax]
vmovups YMMWORD PTR [rdx+rax], ymm0
add rax, 32
cmp rax, r8
jne .L4
mov eax, ecx
and eax, -8
mov r8d, eax
cmp ecx, eax
je .L11
.L3:
mov r10d, ecx
sub r9d, eax
sub r10d, eax
cmp r9d, 2
jbe .L6
vmovups xmm2, XMMWORD PTR [rdi+rax*4]
vmulps xmm0, xmm2, XMMWORD PTR [rsi+rax*4]
vmovups XMMWORD PTR [rdx+rax*4], xmm0
mov eax, r10d
and eax, -4
add r8d, eax
cmp r10d, eax
je .L1
.L6:
movsx rax, r8d
vmovss xmm0, DWORD PTR [rdi+rax*4]
vmulss xmm0, xmm0, DWORD PTR [rsi+rax*4]
vmovss DWORD PTR [rdx+rax*4], xmm0
lea eax, [r8+1]
cmp ecx, eax
jle .L1
cdqe
vmovss xmm0, DWORD PTR [rsi+rax*4]
add r8d, 2
vmulss xmm0, xmm0, DWORD PTR [rdi+rax*4]
vmovss DWORD PTR [rdx+rax*4], xmm0
cmp ecx, r8d
jle .L1
movsx r8, r8d
vmovss xmm0, DWORD PTR [rdi+r8*4]
vmulss xmm0, xmm0, DWORD PTR [rsi+r8*4]
vmovss DWORD PTR [rdx+r8*4], xmm0
.L1:
ret
.L11:
ret
.L8:
xor eax, eax
xor r8d, r8d
jmp .L3
Vector operations when the __restrict keyword is not used
Finally, let’s see the impact of not using __restrict
keyword. Consider our first 8 iteration loop without __restrict
prefix.
typedef float vec;
#define VECLEN 8
void vector_op(vec *a , vec *b, vec *c) {
for (int i=0; i < VECLEN; i++) {
c[i] = a[i] * b[i];
}
}
Now the compiler cannot assume that a
and b
do not overlap with c
. The compiler ends up generating explicit checks to handle the overlapping case. The generated code is:
vector_op(float*, float*, float*):
lea rcx, [rdi+4]
mov rax, rdx
sub rdx, rcx
cmp rdx, 24
jbe .L2
lea rcx, [rsi+4]
mov rdx, rax
sub rdx, rcx
cmp rdx, 24
jbe .L2
vmovups ymm0, YMMWORD PTR [rsi]
vmulps ymm0, ymm0, YMMWORD PTR [rdi]
vmovups YMMWORD PTR [rax], ymm0
ret
.L2:
vmovss xmm0, DWORD PTR [rdi]
vmulss xmm0, xmm0, DWORD PTR [rsi]
vmovss DWORD PTR [rax], xmm0
vmovss xmm0, DWORD PTR [rdi+4]
vmulss xmm0, xmm0, DWORD PTR [rsi+4]
vmovss DWORD PTR [rax+4], xmm0
vmovss xmm0, DWORD PTR [rdi+8]
vmulss xmm0, xmm0, DWORD PTR [rsi+8]
vmovss DWORD PTR [rax+8], xmm0
vmovss xmm0, DWORD PTR [rdi+12]
vmulss xmm0, xmm0, DWORD PTR [rsi+12]
vmovss DWORD PTR [rax+12], xmm0
vmovss xmm0, DWORD PTR [rdi+16]
vmulss xmm0, xmm0, DWORD PTR [rsi+16]
vmovss DWORD PTR [rax+16], xmm0
vmovss xmm0, DWORD PTR [rdi+20]
vmulss xmm0, xmm0, DWORD PTR [rsi+20]
vmovss DWORD PTR [rax+20], xmm0
vmovss xmm0, DWORD PTR [rdi+24]
vmulss xmm0, xmm0, DWORD PTR [rsi+24]
vmovss DWORD PTR [rax+24], xmm0
vmovss xmm0, DWORD PTR [rdi+28]
vmulss xmm0, xmm0, DWORD PTR [rsi+28]
vmovss DWORD PTR [rax+28], xmm0
ret
Key takeaways
Auto-vectorization of the code can result in significant performance improvement as multiple operations are performed in a single instruction. This has the additional benefit that the compiler can unroll loops with a large number of iterations.
Note however that a lot of benefits accrue only if the compiler knows the number of iterations at compile time. If the iteration count is not known at compile-time, some performance is lost by runtime checks to pick the optimal leg of the generated code.
Letting the compiler know that the input and output vectors don’t overlap also reduces the code bloat. The compiler can avoid costly runtime checks to look for the overlap.