Auto-vectorize C and C++ code

Auto-vectorization of for loops can result in significant performance improvements

EventHelix
Software Design
6 min read · Dec 31, 2021


Developers can take advantage of the built-in auto-vectorization support in GCC. Let’s explore vectorization using the GCC 11.2 compiler with the following options, which enable auto-vectorization:

-O3 -march=skylake-avx512 -mno-vzeroupper
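
To reproduce the listings locally, the command line might look like the following sketch (the file name is hypothetical; the listings in this article were generated with Compiler Explorer). The optional -fopt-info-vec flag asks GCC to report which loops it vectorized.

# Emit assembly for inspection (file name is hypothetical).
g++ -O3 -march=skylake-avx512 -mno-vzeroupper -S -o vector_op.s vector_op.cpp

# Optionally report the loops that were vectorized.
g++ -O3 -march=skylake-avx512 -mno-vzeroupper -fopt-info-vec -c vector_op.cpp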

Vector operations with 8 iterations

Let’s start with the following code. We are performing vector operations where the compiler knows at compile time that the loop runs 8 times (VECLEN). Also note the use of the __restrict keyword, which tells the compiler that the array pointed to by c does not overlap with a or b. The compiler can therefore load, multiply, and store in bulk instead of writing each result back before reading the next input.

typedef float vec;
#define VECLEN 8
void vector_op(vec *a, vec *b, vec * __restrict c) {
    for (int i = 0; i < VECLEN; i++) {
        c[i] = a[i] * b[i];
    }
}

The compiler generates no loop; the 8 multiplications are performed by a single sequence of three vector instructions.

vector_op(float*, float*, float*):
        vmovups ymm0, YMMWORD PTR [rdi]
        vmulps ymm0, ymm0, YMMWORD PTR [rsi]
        vmovups YMMWORD PTR [rdx], ymm0
        ret

Let’s walk through the generated instructions:

vmovups ymm0, YMMWORD PTR [rdi]

This instruction copies all 8 entries of a from memory into the ymm0 register as 8 packed single-precision floats. Note that the rdi register points to the start of the a array.

vmulps ymm0, ymm0, YMMWORD PTR [rsi]

Here the processor performs a vector multiplication of a (already in ymm0) with b (the pointer to b is stored in the rsi register). The result is saved back in the ymm0 register.

vmovups YMMWORD PTR [rdx], ymm0

Finally, the processor copies the results into the array c (the pointer to c is stored in the rdx register). This is the last step before ret returns from the function.

View in Compiler Explorer.
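
For reference, the three instructions above map closely to the following hand-written AVX intrinsics. This is an illustrative sketch (the function name is ours), not something the compiler produced:

#include <immintrin.h>

void vector_op_intrinsics(float *a, float *b, float *c) {
    __m256 va = _mm256_loadu_ps(a);                    // vmovups ymm0, YMMWORD PTR [rdi]
    __m256 vr = _mm256_mul_ps(va, _mm256_loadu_ps(b)); // vmulps ymm0, ymm0, YMMWORD PTR [rsi]
    _mm256_storeu_ps(c, vr);                           // vmovups YMMWORD PTR [rdx], ymm0
}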

Vector operations with 16 iterations

When the code is changed to 16 iterations, the generated code simply repeats the block we saw with 8 iterations.

typedef float vec;
#define VECLEN 16
void vector_op(vec *a, vec *b, vec * __restrict c) {
    for (int i = 0; i < VECLEN; i++) {
        c[i] = a[i] * b[i];
    }
}

The generated code unrolls the loop.

vector_op(float*, float*, float*):
        vmovups ymm1, YMMWORD PTR [rdi]
        vmulps ymm0, ymm1, YMMWORD PTR [rsi]
        vmovups YMMWORD PTR [rdx], ymm0
        vmovups ymm0, YMMWORD PTR [rdi+32]
        vmulps ymm0, ymm0, YMMWORD PTR [rsi+32]
        vmovups YMMWORD PTR [rdx+32], ymm0
        ret

View in Compiler Explorer.

Vector operations with 128 iterations

Even with VECLEN set to 128, there is still no loop in sight. The compiler fully unrolls the loop into 16 vector blocks and interleaves the loads with the multiplies to keep the execution units busy.

vector_op(float*, float*, float*):
        vmovups ymm1, YMMWORD PTR [rdi]
        vmovups ymm2, YMMWORD PTR [rdi+32]
        vmulps ymm0, ymm1, YMMWORD PTR [rsi]
        vmovups ymm3, YMMWORD PTR [rdi+64]
        vmovups ymm4, YMMWORD PTR [rdi+96]
        vmovups ymm5, YMMWORD PTR [rdi+128]
        vmovups ymm6, YMMWORD PTR [rdi+160]
        vmovups YMMWORD PTR [rdx], ymm0
        vmulps ymm0, ymm2, YMMWORD PTR [rsi+32]
        vmovups ymm7, YMMWORD PTR [rdi+192]
        vmovups ymm1, YMMWORD PTR [rdi+224]
        vmovups ymm2, YMMWORD PTR [rdi+256]
        vmovups YMMWORD PTR [rdx+32], ymm0
        vmulps ymm0, ymm3, YMMWORD PTR [rsi+64]
        vmovups ymm3, YMMWORD PTR [rdi+288]
        vmovups YMMWORD PTR [rdx+64], ymm0
        vmulps ymm0, ymm4, YMMWORD PTR [rsi+96]
        vmovups ymm4, YMMWORD PTR [rdi+320]
        vmovups YMMWORD PTR [rdx+96], ymm0
        vmulps ymm0, ymm5, YMMWORD PTR [rsi+128]
        vmovups YMMWORD PTR [rdx+128], ymm0
        vmulps ymm0, ymm6, YMMWORD PTR [rsi+160]
        vmovups YMMWORD PTR [rdx+160], ymm0
        vmulps ymm0, ymm7, YMMWORD PTR [rsi+192]
        vmovups YMMWORD PTR [rdx+192], ymm0
        vmulps ymm0, ymm1, YMMWORD PTR [rsi+224]
        vmovups YMMWORD PTR [rdx+224], ymm0
        vmulps ymm0, ymm2, YMMWORD PTR [rsi+256]
        vmovups YMMWORD PTR [rdx+256], ymm0
        vmulps ymm0, ymm3, YMMWORD PTR [rsi+288]
        vmovups YMMWORD PTR [rdx+288], ymm0
        vmulps ymm0, ymm4, YMMWORD PTR [rsi+320]
        vmovups YMMWORD PTR [rdx+320], ymm0
        vmovups ymm5, YMMWORD PTR [rdi+352]
        vmovups ymm6, YMMWORD PTR [rdi+384]
        vmulps ymm0, ymm5, YMMWORD PTR [rsi+352]
        vmovups ymm7, YMMWORD PTR [rdi+416]
        vmovups ymm1, YMMWORD PTR [rdi+448]
        vmovups YMMWORD PTR [rdx+352], ymm0
        vmulps ymm0, ymm6, YMMWORD PTR [rsi+384]
        vmovups YMMWORD PTR [rdx+384], ymm0
        vmulps ymm0, ymm7, YMMWORD PTR [rsi+416]
        vmovups YMMWORD PTR [rdx+416], ymm0
        vmulps ymm0, ymm1, YMMWORD PTR [rsi+448]
        vmovups YMMWORD PTR [rdx+448], ymm0
        vmovups ymm0, YMMWORD PTR [rdi+480]
        vmulps ymm0, ymm0, YMMWORD PTR [rsi+480]
        vmovups YMMWORD PTR [rdx+480], ymm0
        ret

View in Compiler Explorer.

Vector operations with 256 iterations

With VECLEN set to 256, we do get a loop in the generated code. Note that even with the loop, the processor performs 8 multiplications per iteration: rax advances 32 bytes (8 floats) on each pass and the loop exits at 1024 bytes (256 floats), so the loop iterates only 32 times.

vector_op(float*, float*, float*):
        xor eax, eax
.L2:
        vmovups ymm1, YMMWORD PTR [rsi+rax]
        vmulps ymm0, ymm1, YMMWORD PTR [rdi+rax]
        vmovups YMMWORD PTR [rdx+rax], ymm0
        add rax, 32
        cmp rax, 1024
        jne .L2
        ret

View in Compiler Explorer.

Vector operations with n iterations

In the above examples, we have considered scenarios where the loop iteration count is known at compile time. Now let’s consider the case where the count n is known only at run time.

typedef float vec;
void vector_op(vec *a, vec *b, vec * __restrict c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] * b[i];
    }
}

The compiler cannot size the loop in advance, so we get run-time checks and a lot of code: a vectorized main loop, a 128-bit SSE path, and a scalar epilogue that handles the leftover elements.

vector_op(float*, float*, float*, int):
        test ecx, ecx
        jle .L1
        lea r9d, [rcx-1]
        cmp r9d, 6
        jbe .L8
        mov r8d, ecx
        shr r8d, 3
        sal r8, 5
        xor eax, eax
.L4:
        vmovups ymm1, YMMWORD PTR [rdi+rax]
        vmulps ymm0, ymm1, YMMWORD PTR [rsi+rax]
        vmovups YMMWORD PTR [rdx+rax], ymm0
        add rax, 32
        cmp rax, r8
        jne .L4
        mov eax, ecx
        and eax, -8
        mov r8d, eax
        cmp ecx, eax
        je .L11
.L3:
        mov r10d, ecx
        sub r9d, eax
        sub r10d, eax
        cmp r9d, 2
        jbe .L6
        vmovups xmm2, XMMWORD PTR [rdi+rax*4]
        vmulps xmm0, xmm2, XMMWORD PTR [rsi+rax*4]
        vmovups XMMWORD PTR [rdx+rax*4], xmm0
        mov eax, r10d
        and eax, -4
        add r8d, eax
        cmp r10d, eax
        je .L1
.L6:
        movsx rax, r8d
        vmovss xmm0, DWORD PTR [rdi+rax*4]
        vmulss xmm0, xmm0, DWORD PTR [rsi+rax*4]
        vmovss DWORD PTR [rdx+rax*4], xmm0
        lea eax, [r8+1]
        cmp ecx, eax
        jle .L1
        cdqe
        vmovss xmm0, DWORD PTR [rsi+rax*4]
        add r8d, 2
        vmulss xmm0, xmm0, DWORD PTR [rdi+rax*4]
        vmovss DWORD PTR [rdx+rax*4], xmm0
        cmp ecx, r8d
        jle .L1
        movsx r8, r8d
        vmovss xmm0, DWORD PTR [rdi+r8*4]
        vmulss xmm0, xmm0, DWORD PTR [rsi+r8*4]
        vmovss DWORD PTR [rdx+r8*4], xmm0
.L1:
        ret
.L11:
        ret
.L8:
        xor eax, eax
        xor r8d, r8d
        jmp .L3

View in Compiler Explorer.
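
If n is known only at run time but you know it is always a positive multiple of 8, you can tell GCC so and let it drop much of the epilogue code. The following is a sketch of one common idiom; the promise is undefined behavior if violated, and how much code GCC removes varies by version:

typedef float vec;
void vector_op(vec *a, vec *b, vec * __restrict c, int n) {
    // Promise GCC that n is a positive multiple of 8.
    // Reaching __builtin_unreachable() is undefined behavior,
    // so the compiler may assume the condition is never true.
    if (n <= 0 || n % 8 != 0)
        __builtin_unreachable();
    for (int i = 0; i < n; i++) {
        c[i] = a[i] * b[i];
    }
}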

Vector operations when the __restrict keyword is not used

Finally, let’s see the impact of not using the __restrict keyword. Consider our first 8-iteration loop without the __restrict qualifier.

typedef float vec;
#define VECLEN 8
void vector_op(vec *a, vec *b, vec *c) {
    for (int i = 0; i < VECLEN; i++) {
        c[i] = a[i] * b[i];
    }
}

Now the compiler cannot assume that a and b do not overlap with c, so it generates explicit run-time checks and falls back to scalar code when the arrays might alias. The generated code is:

vector_op(float*, float*, float*):
        lea rcx, [rdi+4]
        mov rax, rdx
        sub rdx, rcx
        cmp rdx, 24
        jbe .L2
        lea rcx, [rsi+4]
        mov rdx, rax
        sub rdx, rcx
        cmp rdx, 24
        jbe .L2
        vmovups ymm0, YMMWORD PTR [rsi]
        vmulps ymm0, ymm0, YMMWORD PTR [rdi]
        vmovups YMMWORD PTR [rax], ymm0
        ret
.L2:
        vmovss xmm0, DWORD PTR [rdi]
        vmulss xmm0, xmm0, DWORD PTR [rsi]
        vmovss DWORD PTR [rax], xmm0
        vmovss xmm0, DWORD PTR [rdi+4]
        vmulss xmm0, xmm0, DWORD PTR [rsi+4]
        vmovss DWORD PTR [rax+4], xmm0
        vmovss xmm0, DWORD PTR [rdi+8]
        vmulss xmm0, xmm0, DWORD PTR [rsi+8]
        vmovss DWORD PTR [rax+8], xmm0
        vmovss xmm0, DWORD PTR [rdi+12]
        vmulss xmm0, xmm0, DWORD PTR [rsi+12]
        vmovss DWORD PTR [rax+12], xmm0
        vmovss xmm0, DWORD PTR [rdi+16]
        vmulss xmm0, xmm0, DWORD PTR [rsi+16]
        vmovss DWORD PTR [rax+16], xmm0
        vmovss xmm0, DWORD PTR [rdi+20]
        vmulss xmm0, xmm0, DWORD PTR [rsi+20]
        vmovss DWORD PTR [rax+20], xmm0
        vmovss xmm0, DWORD PTR [rdi+24]
        vmulss xmm0, xmm0, DWORD PTR [rsi+24]
        vmovss DWORD PTR [rax+24], xmm0
        vmovss xmm0, DWORD PTR [rdi+28]
        vmulss xmm0, xmm0, DWORD PTR [rsi+28]
        vmovss DWORD PTR [rax+28], xmm0
        ret

View in Compiler Explorer.
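
The lea/sub/cmp sequence at the top of the listing is a cheap aliasing test. Roughly, and with names of our own choosing, it behaves like the sketch below: the scalar path at .L2 is taken only when c points strictly inside the 8-float block loaded from a (or b). Note that c == a still takes the vector path, which is safe because each a[i] is read before c[i] is written in both the scalar and the vector order:

#include <stdint.h>

// Sketch of the generated overlap test (names are ours, not GCC's).
// True when dst points at elements 1..7 of the 8-float block
// starting at src; 24 = 32-byte vector width minus two floats.
static int unsafe_overlap(const float *src, const float *dst) {
    return (uintptr_t)dst - (uintptr_t)(src + 1) <= 24u;
}

// The vector path runs only when neither input aliases the output:
//     if (!unsafe_overlap(a, c) && !unsafe_overlap(b, c)) { ... }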

Key takeaways

Auto-vectorization can result in significant performance improvements because a single instruction operates on multiple data elements. As an added benefit, the compiler can fully unroll even loops with fairly large iteration counts.

Note, however, that many of the benefits accrue only when the compiler knows the number of iterations at compile time. If the iteration count is known only at run time, some performance is lost to the run-time checks that select the appropriate leg of the generated code.

Letting the compiler know that the input and output vectors don’t overlap also reduces code bloat, as the compiler can avoid the costly run-time overlap checks.
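
As a final note, __restrict is a compiler extension (supported by GCC, Clang, and MSVC) that works in C++. In C code compiled as C99 or later, the standard restrict keyword expresses the same no-overlap promise:

// C99 and later: the standard restrict keyword carries the same promise.
void vector_op(const float *a, const float *b, float *restrict c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] * b[i];
    }
}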
