Harnessing the Power of SIMD Programming Using AVX

Nevin Baiju
Oct 27, 2023


In the fast-evolving world of Computer Science, optimizing your code for performance is often the key to staying competitive. One powerful technique that can significantly boost the efficiency of your computations is SIMD programming using AVX (Advanced Vector Extensions). But what exactly is SIMD, why should you care about it, and how can you use it to supercharge your deep learning projects? This comprehensive guide will provide you with all the answers you need.

Not everyone will need to learn SIMD programming. But understanding the inner workings of different computational kernels gives you a better perspective when you write code, even if you work in an interpreted language.

What is SIMD Programming?

Let’s start by breaking down the term. SIMD stands for Single Instruction, Multiple Data. It’s a type of parallelism used in computing that allows a single instruction to perform the same operation on multiple data elements simultaneously. In simpler terms, SIMD enables you to process multiple pieces of data with a single command, which can lead to significant performance improvements.

We are all aware of the Arithmetic and Logic Units (ALUs) inside our processors that perform all the additions and multiplications. Our initial mental model of how a processor adds a set of numbers is that it takes two numbers at a time and adds them, one pair after another. That works, but engineers went further and devised SIMD hardware that can add multiple numbers at the same time. Here, the ‘Single Instruction’ is the add, and the ‘Multiple Data’ is a group of numbers. Later in this blog, we will walk through an actual C++ program that does exactly this.

Why Should You Care About SIMD?

Before we dive deeper into the world of SIMD programming using AVX, it’s important to understand why it matters in the context of deep learning and data science, even if you primarily work with high-level languages like Python.

1. Speed and Efficiency

Deep learning models, particularly neural networks, involve a tremendous amount of mathematical computations. These computations are often performed on large datasets, which can be time-consuming. SIMD allows you to accelerate these calculations by performing operations on multiple data points simultaneously. This means faster training times and quicker model evaluation.

2. Parallelism

Modern processors are designed with parallelism in mind. Each core contains wide vector units that can operate on several data elements at once, and the chip as a whole contains multiple cores that can execute instructions concurrently. SIMD exploits the first kind of parallelism, inside a single core, and it multiplies with multi-core parallelism: every core can be crunching several numbers per instruction at the same time. This can result in a significant reduction in the time it takes to complete tasks.

3. Performance Optimization

Deep learning frameworks like TensorFlow and PyTorch are built on lower-level libraries that utilize SIMD instructions to optimize performance. By understanding SIMD programming, you can gain insights into how these frameworks work under the hood. This knowledge empowers you to write more efficient code and customize your deep learning pipelines for maximum speed. Furthermore, there are processor-specific optimizations we can perform if we understand the underlying computational mechanisms well.

What is AVX?

Now that we’ve established the importance of SIMD programming, let’s take a closer look at AVX, or Advanced Vector Extensions. AVX is an instruction set extension for x86 processors, introduced by Intel and also supported by AMD. It extends the SIMD capabilities of these processors by introducing wider registers and new instructions for performing vector operations.

[Figure: all 8 numbers get added in a single instruction in AVX]

Modern processors have several vector registers (in simple terms, vector registers are wide registers that hold multiple data elements so that a single instruction operates on all of them), and many cores can issue more than one vector instruction per clock cycle. So, can you estimate how many additions an Intel processor with a max clock speed of 3.5GHz and 12 cores can do per second?
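As a rough back-of-envelope estimate, assuming (hypothetically) that each core retires just one 8-wide AVX add per cycle: 3.5 × 10⁹ cycles/second × 12 cores × 8 floats ≈ 336 billion single-precision additions per second. Cores with two vector execution ports could roughly double that figure.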

Why AVX Matters

AVX is a game-changer for SIMD programming because it provides an extensive set of instructions to perform vector processing. Some of these instructions are:

  • Add
  • Multiply
  • Store
  • Load
  • Shuffle and permute operations to move values within and between registers.

Furthermore, Intel has blessed us with an extensive reference for all of these instructions in the form of the Intel Intrinsics Guide.

AVX and AVX-512

Before we look at how we can write code using AVX, let us briefly discuss what AVX and AVX-512 mean. AVX works with 256-bit registers: each register holds 256 bits of data, and a single instruction processes all of it. Recall how much a single-precision float takes up in memory. You’re right, it is 32 bits. So we can fit 8 floats, each 32 bits in size, into one of our 256-bit AVX registers.

If you’re connecting the dots, that means we can process 8 floating-point numbers at the same time using the SIMD programming paradigm. And as the name suggests, AVX-512 registers hold 512 bits of data. However, AVX-512 is only available on some higher-end and server-grade Intel processors. For most of us with personal computers, we will have to experiment with the regular AVX instructions.
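If you are unsure what your own machine supports, here is a minimal sketch that checks for AVX and AVX-512 at runtime, assuming you are on an x86 machine with GCC or Clang (which provide the __builtin_cpu_supports builtin):

#include <cstdio>

int main() {
    __builtin_cpu_init();  // initialize CPU feature detection (GCC/Clang builtin)

    // Query the CPU for AVX and AVX-512 Foundation support
    if (__builtin_cpu_supports("avx"))
        std::printf("AVX is supported\n");
    if (__builtin_cpu_supports("avx512f"))
        std::printf("AVX-512 Foundation is supported\n");
    return 0;
}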

How to Implement SIMD Programming Using AVX

Now that we understand the concepts behind SIMD and AVX, let’s dive into the practical implementation. We’ll walk through some real-world coding examples in C++ to illustrate the power of SIMD programming.

Scenario: Adding two arrays

Imagine you have two large arrays of size n, and you want to add them together element by element. Let us examine how we would do this using AVX. First, let us look at the pseudocode of how we would perform this:

Loop: While i < n:
    Load 8 elements from Array A into vector register X
    Load 8 elements from Array B into vector register Y
    Perform vector addition: Z = X + Y
    Store the 8 elements from vector register Z into Array C
    Increment i by 8 (to move to the next 8 elements)
End Loop

To contrast this with the straightforward approach: imagine we have two arrays A and B. We want to add them at every index and store the results in an array C. Ordinarily, we would iterate from 0 to n and write simple code something like:

Loop: While i < n:
    C[i] = A[i] + B[i]
    Increment i by 1
End Loop
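In plain C++, that scalar loop is just the following (a minimal sketch; the name vectorSumScalar is only for illustration):

// Plain scalar version: one addition per loop iteration
void vectorSumScalar(float* A, float* B, float* C, int n) {
    for (int i = 0; i < n; ++i) {
        C[i] = A[i] + B[i];
    }
}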

However, when it comes to writing AVX code, we have to load the values into registers explicitly, and afterwards we have to write them back to the array explicitly. It might seem like a long-winded way of doing things, but the advantage is that we get to perform 8 additions in a single step.

Writing it in C++

#include <immintrin.h>  // Include AVX intrinsics header

void vectorSum(float* A, float* B, float* C, int n) {
    // Process 8 elements at a time; limit is the largest multiple of 8 <= n
    int limit = n - (n % 8);
    for (int i = 0; i < limit; i += 8) {
        // Load 8 elements from Arrays A and B into AVX registers
        __m256 avx_a = _mm256_loadu_ps(&A[i]);
        __m256 avx_b = _mm256_loadu_ps(&B[i]);

        // Perform vector addition
        __m256 avx_result = _mm256_add_ps(avx_a, avx_b);

        // Store the result back into Array C
        _mm256_storeu_ps(&C[i], avx_result);
    }

    // Handle any remaining (n % 8) elements with a scalar loop
    for (int i = limit; i < n; ++i) {
        C[i] = A[i] + B[i];
    }
}
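To see it in action, here is a small driver you could append to the same file; the array size 16 and the sample values are just for illustration:

#include <cstdio>

int main() {
    const int n = 16;
    float A[n], B[n], C[n];
    for (int i = 0; i < n; ++i) {
        A[i] = static_cast<float>(i);      // 0, 1, 2, ...
        B[i] = static_cast<float>(2 * i);  // 0, 2, 4, ...
    }

    vectorSum(A, B, C, n);

    for (int i = 0; i < n; ++i) {
        std::printf("%.1f ", C[i]);  // expect 0.0 3.0 6.0 ...
    }
    std::printf("\n");
    return 0;
}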

Trust the process! If you’re with me so far, don’t worry, and don’t let the __m_blah_blah names scare you. They are pretty simple once you understand what they mean.

  • First and foremost, since we are using AVX intrinsics, we need to include the immintrin.h header file.
  • __m256 declares a 256-bit register variable and is typically used for holding 8 floats. There are other variants, __m256i and __m256d, for integers and doubles respectively (see the double-precision sketch a little further down).
  • _mm256_loadu_ps: As you can guess, it is used for loading data. But what does the ps mean? Packed Single precision, i.e. our float data! It expects the address of an array with 8 values ready to be loaded; the u means the address does not need to be 32-byte aligned. And like __m256, it has variants for doubles, integers, and shorter numbers.
  • _mm256_add_ps: Takes two registers, adds the 8 numbers in them lane by lane, and returns a register with the result.
  • _mm256_storeu_ps: Just like load, but in reverse: it stores the 8 values in a register back to an array address.

Also, note that we increment the index by 8 on each iteration, since we process 8 numbers at a time.
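For example, the double-precision variant of the same kernel looks almost identical, just with the d suffix and 4 elements per 256-bit register (a minimal sketch, with the remainder handled the same way as before; the name vectorSumDouble is only for illustration):

#include <immintrin.h>

// Same idea with doubles: __m256d holds 4 x 64-bit values
void vectorSumDouble(double* A, double* B, double* C, int n) {
    int limit = n - (n % 4);
    for (int i = 0; i < limit; i += 4) {
        __m256d avx_a = _mm256_loadu_pd(&A[i]);
        __m256d avx_b = _mm256_loadu_pd(&B[i]);
        _mm256_storeu_pd(&C[i], _mm256_add_pd(avx_a, avx_b));
    }
    for (int i = limit; i < n; ++i) {
        C[i] = A[i] + B[i];  // scalar tail
    }
}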

This all might feel overwhelming now, but the Intel Intrinsics guide is your best friend when it comes to navigating what these functions are and what they mean.

Finally, you compile this program just like a normal C++ program, except that you need to add the ‘-mavx’ flag to let the compiler know that you are using AVX instructions.
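With g++ (or clang++) that looks something like the following; the file name vector_sum.cpp is just a placeholder:

g++ -O2 -mavx vector_sum.cpp -o vector_sum
./vector_sum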

Benefits of Implementing SIMD with AVX

Implementing SIMD programming using AVX can yield substantial benefits in terms of speed and efficiency. Let’s summarize these advantages:

1. Speed Boost

By processing multiple data elements in parallel, AVX accelerates your code, leading to faster execution times. This speed boost is crucial for applications like real-time image processing, video rendering, and deep learning.

2. Reduced Energy Consumption

Faster execution not only saves time but also reduces energy consumption. This is particularly important in applications running on battery-powered devices, where energy efficiency is a primary concern.

3. Compatibility

AVX is widely supported on modern x86 processors, ensuring compatibility with a broad range of hardware. This means your optimized code will run efficiently on a variety of systems.

4. Optimizing Low-Level Libraries

Many interpreted languages like Python get their computational code up to speed by calling into C libraries, and those libraries are written using these low-level instructions. Most of us will encounter them if we ever venture into the quest of optimizing runtime for our deep learning models. So when that day comes, don’t forget that your journey started here! :)

Conclusion

This was meant to be a high-level introduction to SIMD programming using AVX. There are a lot of advanced concepts that need to be understood before we can actually leverage the full potential of our machines. One key point to note is that existing libraries and languages are written with a high degree of optimization.

Of course, there is a lot of scope for improvement; however, using this low-level programming paradigm the wrong way can actually worsen performance. But that should not stop you from looking into how BLAS libraries or deep learning libraries are written. Finally, computation is one part of the equation; the other part is memory, but that is for another day!

Happy coding!
