What is SIMD and how to use it

Anılcan Gülkaya
15 min read · Mar 19, 2024


SIMD stands for 'Single Instruction, Multiple Data', and it does exactly what the name says: it lets us process multiple pieces of data with a single instruction.
By 'process' I mean performing operations like addition, subtraction, multiplication, and division, as well as logical operations such as 'and', 'or', and 'xor'.

In a typical program we do this:

int Add(int a, int b)
{
    int c = a + b;
    return c;
}

What if we want to add multiple arrays or vectors?

void Add(int a[4], int b[4], int c[4])
{
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
}

In the first example, the numbers 'a' and 'b' are stored in registers, added with a single CPU instruction, and the result is returned. The generated assembly may look like this:

Add:
add eax, ebx
ret

With the second example, the number of instructions increases to 14 when we compile with Compiler Explorer with optimizations enabled (-O3); this doesn't look good:

Add:                           
mov eax, dword ptr [rdx]
add eax, dword ptr [rcx]
mov dword ptr [r8], eax
mov eax, dword ptr [rdx + 4]
add eax, dword ptr [rcx + 4]
mov dword ptr [r8 + 4], eax
mov eax, dword ptr [rdx + 8]
add eax, dword ptr [rcx + 8]
mov dword ptr [r8 + 8], eax
mov eax, dword ptr [rdx + 12]
add eax, dword ptr [rcx + 12]
mov dword ptr [r8 + 12], eax
ret

More than twenty years ago, Intel introduced a new instruction set, new registers, and dedicated hardware for vector processing with the Pentium III. It allows us to store vector data in registers and perform calculations on it. Here is how we would add the arrays of numbers with Intel SSE instructions:

#include <immintrin.h> // you can include emmintrin.h if you only need SSE2
__m128i Add(__m128i a, __m128i b)
{
    return _mm_add_epi32(a, b);
}

A common analogy is cutting multiple vegetables at once instead of one by one, as we do in the kitchen.

With the above code we have reduced the number of instructions to two. Here is the generated assembly:

Add:
paddd xmm0, xmm1 ; add a and b, 4 lanes at once
ret

The function arguments a and b are two vector registers; each can hold 4 32-bit integers, 8 16-bit shorts, or 16 8-bit integers.

The function _mm_add_epi32 adds two vectors of numbers together:
_mm_ is the prefix of Intel intrinsic functions,
_add_ is the operation we want to perform (sub, mul, div…),
_epi32 is the data type we operate on (32-bit integers).

Other suffixes:
_epi8 for 8-bit integers
_epi16 for 16-bit integers
_ps for floats
_pd for doubles

So if we wanted to add two arrays of 16-bit numbers, we could write this:

__m128i Add(__m128i a, __m128i b)
{
    return _mm_add_epi16(a, b); // notice epi16 this time
}

The above vectorized code is equivalent to this scalar code:

void Add(short a[8], short b[8], short c[8])
{
    for (int i = 0; i < 8; i++)
        c[i] = a[i] + b[i];
}
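The register-only examples above assume the data is already in vector registers. In practice the numbers live in memory, so we load them into registers, operate, and store the result back. Here is a minimal sketch, assuming the pointers may be unaligned:

#include <immintrin.h>

void Add4(const int* a, const int* b, int* c)
{
    __m128i va = _mm_loadu_si128((const __m128i*)a); // load 4 ints from a (unaligned load)
    __m128i vb = _mm_loadu_si128((const __m128i*)b); // load 4 ints from b
    __m128i vc = _mm_add_epi32(va, vb);              // add all 4 lanes with one instruction
    _mm_storeu_si128((__m128i*)c, vc);               // store 4 results back to memory
}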

What Is SIMD?

Before delving deep into what SIMD is and how we can utilize it, let's discuss its essence. CPUs feature an ALU (Arithmetic Logic Unit), responsible for operating on integer registers to perform mathematical and logical operations such as addition, multiplication, subtraction, AND, XOR, OR, comparison, and more. In addition, there used to be a separate FPU (Floating-Point Unit) that handled floating-point numbers.

A Vector Processing Unit (VPU) acts somewhat like a combined ALU/FPU, as it can typically execute both integer and floating-point arithmetic. What sets a VPU apart is its capability to apply arithmetic operations to vectors of input data.

In modern CPUs, there isn’t a distinct FPU component as such. Instead, all floating-point calculations, including those involving scalar float values, are handled by the VPU. The removal of the FPU reduces the transistor count on the CPU die, allowing these transistors to be allocated for implementing larger caches, more complex out-of-order execution logic, and so forth.
(Source: Game Engine Architecture Book)

SSE, short for 'Streaming SIMD Extensions', was introduced by Intel with the Pentium III. Later on, Intel and AMD kept adding new versions, like SSE2, SSE3, and SSE4.2.

These new versions added more instructions, like dot product and string manipulation.

Then came AVX, which lets us handle 256-bit data instead of just 128-bit. This means we can do twice as much in one go.

Then came AVX512, but it’s relatively new hardware that many computers don’t support yet. It’s mainly found in server computers and cloud computing setups, although some of the latest high-end desktop CPUs do support AVX512.

Throughout this text, I’ll showcase C code examples, but it’s worth noting that other programming languages also offer support for SIMD instructions. For instance, C#, Rust, Zig, and many more languages have SIMD instruction support.

Pro tip: In C#, the Vector4 structure is vectorized by default.

Additionally, there’s the ARM NEON instruction set, which allows similar functionality with a slightly different coding approach.

#include <arm_neon.h>
#define PI 3.14159265f

float32x4_t a = vdupq_n_f32(2.0f); // [2.0, 2.0, 2.0, 2.0] 4x float
float32x4_t b = vdupq_n_f32(PI); // [PI, PI, PI, PI] 4x float
float32x4_t res = vaddq_f32(a, b); // res: [2+PI, 2+PI, 2+PI, 2+PI]

// Same with SSE
#include <immintrin.h>
__m128 a = _mm_set1_ps(2.0f);
__m128 b = _mm_set1_ps(PI);
__m128 res = _mm_add_ps(a, b);
// if you want to create a vector by specifying all elements in SSE
__m128 all = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f); // [1.0, 2.0, 3.0, 4.0] 4x floats

In ARM NEON, function names typically start with 'v',
followed by the operation (e.g. 'dup', 'add', 'mul'),
then 'q' indicating quad, meaning it operates on 4 elements (a full 128-bit register),
and finally '_f32' to indicate 32-bit float lanes.
With this naming scheme we can guess most of the functions
without having to look them up.

Example usage:
vmulq_f32(a, b) -> multiply two float vectors
vsubq_u32(a, b) -> subtract two uint32 vectors // [a.x-b.x, a.y-b.y…]
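Putting that naming scheme to work, here is a small sketch that adds two float arrays four elements at a time with NEON loads and stores (assuming n is a multiple of 4):

#include <arm_neon.h>

void AddArrays(const float* a, const float* b, float* c, int n)
{
    for (int i = 0; i < n; i += 4)
    {
        float32x4_t va = vld1q_f32(a + i);   // load 4 floats from a
        float32x4_t vb = vld1q_f32(b + i);   // load 4 floats from b
        vst1q_f32(c + i, vaddq_f32(va, vb)); // add them and store 4 results
    }
}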

SIMD Is Everywhere
It is used in video encoders, audio processing, image processing,
game engines, cloud computing (databases), machine learning, hashing, cryptography… the list goes on and on.

Advantages

Suppose we aim to optimize a loop using multiple threads, so we spawn four threads. However, creating threads is slow, and after processing we may need to wait for all of them to finish. With SIMD, we can often get a 4x or better speedup without spawning any threads, and it's usually easier and more convenient to optimize this way. Nonetheless, if desired, we can also multi-thread the vectorized code.

We can imagine it works like this:
SIMD
||||||||||||||||||||||||||
||||||||||||||||||||||||||
||||||||||||||||||||||||||
||||||||||||||||||||||||||

Multi Threading
||||||||||||||||||||||||
||||||||||||||||||||
||||||||||||||||||||||||||
|||||||||||||||||

We can combine multithreading with SIMD code, as each CPU core has a VPU, allowing us to significantly optimize our code. Since we are performing 4, 8, or 16 tasks simultaneously, we can potentially achieve performance up to ThreadCount * SIMDLaneCount, and even more if we eliminate branches. Additionally, there are fused instructions, such as fused Multiply Add (FMADD), which multiply two numbers and add a third one. This may improve performance compared to performing addition and multiplication separately.

SSE: _mm_fmadd_ps(a, b, c) // a * b + c (requires the FMA extension)
NEON: vfmaq_f32(c, a, b) // a * b + c

Some instructions are slower than others; for example, the reciprocal square root estimate is faster than a full square root, and division is slower than multiplication. You can find detailed information about the latency and throughput of each instruction in the CPU vendors' documentation. These characteristics also vary between processor architectures. Considering these factors is important when optimizing our code for performance.
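As a rough illustration of that trade-off, when full precision isn't required a division can be replaced with a multiply by the hardware's approximate reciprocal (roughly 12 bits of precision); a sketch:

__m128 DivExact(__m128 a, __m128 b)  { return _mm_div_ps(a, b); }             // slower, full precision
__m128 DivApprox(__m128 a, __m128 b) { return _mm_mul_ps(a, _mm_rcp_ps(b)); } // faster, approximate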

How to Learn

There are guidelines from CPU vendors for optimizing with SIMD:
ARM Neoverse Optimization Guide PDF
Intel® 64 and IA-32 Architectures Optimization Reference Manual Volume 1
ARM Neon Intrinsics Reference
The Intel Intrinsics Guide website
There are lots of CppCon videos on YouTube
Daniel Lemire has awesome blog posts about SIMD and software engineering

Auto Vectorization

Compilers can recognize certain patterns and vectorize your code automatically; sometimes the result is even better than hand-written vectorized code.

Given this simple function, the compiler produces much more assembly than you might expect when we compile with optimization flags enabled:

int Sum(int* arr, int n)
{
    int res = 0;
    for (int i = 0; i < n; i++)
        res += arr[i];
    return res;
}

We can inspect the generated code, and if we don't like it we can write our own vectorized version.
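For comparison, a hand-written SSE version of the Sum function might look like the sketch below (assuming SSE2 only; elements that don't fill a whole vector are handled with a scalar tail loop):

#include <immintrin.h>

int SumSSE(const int* arr, int n)
{
    __m128i acc = _mm_setzero_si128();
    int i = 0;
    for (; i + 4 <= n; i += 4) // 4 ints per iteration
        acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i*)(arr + i)));

    int lanes[4];
    _mm_storeu_si128((__m128i*)lanes, acc); // horizontal sum of the 4 lanes
    int res = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    for (; i < n; i++) // scalar tail for the remaining 0-3 elements
        res += arr[i];
    return res;
}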

If we want, we can disable auto-vectorization using a macro like this one:

#ifndef AX_NO_UNROLL
#if defined(__clang__)
# define AX_NO_UNROLL _Pragma("clang loop unroll(disable)") _Pragma("clang loop vectorize(disable)")
#elif defined(__GNUC__) && __GNUC__ >= 8
# define AX_NO_UNROLL _Pragma("GCC unroll 0")
#elif defined(_MSC_VER)
# define AX_NO_UNROLL __pragma(loop(no_vector))
#else
# define AX_NO_UNROLL
#endif
#endif

// usage:
AX_NO_UNROLL while (i < 10)
{
    // do things
}

// with for
AX_NO_UNROLL for (size_t i = 0; i < 256; i += 32)
{
    // do things
}

It’s also worth knowing that clang can generate pretty good simd code using its vector extension for multiple architectures automatically: https://godbolt.org/z/E1es9qW3f

Basic Math

Here is a scalar vector lerp function:

Vector3 Lerp(Vector3 a, Vector3 b, float t)
{
    Vector3 v;
    v.x = a.x + (b.x - a.x) * t;
    v.y = a.y + (b.y - a.y) * t;
    v.z = a.z + (b.z - a.z) * t;
    return v;
}

This is the SIMD version:

__m128 VecLerp(__m128 a, __m128 b, float t)
{
    __m128 aToB = _mm_sub_ps(b, a);
    __m128 progress = _mm_mul_ps(aToB, _mm_set1_ps(t));
    __m128 result = _mm_add_ps(a, progress);
    return result;
}

// the same function, optimized with fused multiply add:
__m128 VecLerp(__m128 a, __m128 b, float t)
{
    return _mm_fmadd_ps(_mm_sub_ps(b, a), _mm_set1_ps(t), a);
}

// ARM Neon version:
float32x4_t VecLerp(float32x4_t a, float32x4_t b, float t)
{
    return vfmaq_f32(a, vsubq_f32(b, a), vdupq_n_f32(t));
}
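Calling the SSE version could look like this (a small usage sketch; load the inputs, lerp halfway, store the result):

float a[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
float b[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
float out[4];
_mm_storeu_ps(out, VecLerp(_mm_loadu_ps(a), _mm_loadu_ps(b), 0.5f)); // out = {0.5, 1.0, 1.5, 2.0}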

But writing both versions every time can be a pain, so I use macros for this purpose. Another reason I use macros is that if I abstract the intrinsics with operator overloading or functions, in debug builds the code compiles down to call instructions, which are slower than the instruction itself, and this prevents many optimizations the compiler could otherwise do.
To give an example, I use ARM's astc-encoder, which wraps intrinsics in operator overloads, and it is painfully slow when I compile my project in debug mode: when I tried to compress all of the textures in my scene, I waited more than half an hour and it still wasn't finished, so I gave up waiting.
But when I compile in release mode, all of that performance loss is gone, and after two minutes all of the textures were compressed. That's why I use macros to inline all of the code.

Here is how I abstracted ARM Neon and SSE :

#if defined(AX_SUPPORT_SSE) && !defined(AX_ARM)
/*//////////////////////////////////////////////////////////////////////////*/
/* SSE */
/*//////////////////////////////////////////////////////////////////////////*/
typedef __m128 vec_t;
typedef __m128 veci_t;
typedef __m128i vecu_t;

#define VecZero() _mm_setzero_ps()
#define VecOne() _mm_set1_ps(1.0f)
#define VecSet1(x) _mm_set1_ps(x) /* {x, x, x, x} */
#define VeciSet1(x) _mm_set1_epi32(x) /* {x, x, x, x} */
#define VecSet(x, y, z, w) _mm_set_ps(x, y, z, w) /* {w, z, y, x} */
#define VecSetR(x, y, z, w) _mm_setr_ps(x, y, z, w) /* {x, y, z, w} */
#define VecLoad(x) _mm_loadu_ps(x) /* unaligned load from x pointer */
#define VecLoadA(x) _mm_load_ps(x) /* load from x pointer */

// Arithmetic
#define VecAdd(a, b) _mm_add_ps(a, b) /* {a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w} */
#define VecSub(a, b) _mm_sub_ps(a, b) /* {a.x - b.x, a.y - b.y, a.z - b.z, a.w - b.w} */
#define VecMul(a, b) _mm_mul_ps(a, b) /* {a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w} */
#define VecDiv(a, b) _mm_div_ps(a, b) /* {a.x / b.x, a.y / b.y, a.z / b.z, a.w / b.w} */

#define VecFmadd(a, b, c) _mm_fmadd_ps(a, b, c) /* a * b + c */
#define VecFmsub(a, b, c) _mm_fmsub_ps(a, b, c) /* a * b - c */
#define VecHadd(a, b) _mm_hadd_ps(a, b) /* pairwise add: {a.x+a.y, a.z+a.w, b.x+b.y, b.z+b.w} */

#define VecNeg(a) _mm_sub_ps(_mm_setzero_ps(), a) /* -a */
#define VecRcp(a) _mm_rcp_ps(a) /* 1.0f / a */
#define VecSqrt(a) _mm_sqrt_ps(a)

// Vector Math
#define VecDot(a, b) _mm_dp_ps(a, b, 0xff) /* SSE4.1 required */
#define VecDotf(a, b) _mm_cvtss_f32(_mm_dp_ps(a, b, 0xff))
#define VecNorm(v) _mm_div_ps(v, _mm_sqrt_ps(_mm_dp_ps(v, v, 0xff)))
#define VecNormEst(v) _mm_mul_ps(_mm_rsqrt_ps(_mm_dp_ps(v, v, 0xff)), v)
#define VecLenf(v) _mm_cvtss_f32(_mm_sqrt_ss(_mm_dp_ps(v, v, 0xff)))
#define VecLen(v) _mm_sqrt_ps(_mm_dp_ps(v, v, 0xff))

// Logical
#define VecMax(a, b) _mm_max_ps(a, b) /* [max(a.x, b.x), max(a.y, b.y)...] */
#define VecMin(a, b) _mm_min_ps(a, b) /* [min(a.x, b.x), min(a.y, b.y)...] */

#define VecCmpGt(a, b) _mm_cmpgt_ps(a, b) /* [a.x > b.x, a.y > b.y...] */
#define VecCmpGe(a, b) _mm_cmpge_ps(a, b) /* [a.x >= b.x, a.y >= b.y...] */
#define VecCmpLt(a, b) _mm_cmplt_ps(a, b) /* [a.x < b.x, a.y < b.y...] */
#define VecCmpLe(a, b) _mm_cmple_ps(a, b) /* [a.x <= b.x, a.y <= b.y...] */
#define VecMovemask(a) _mm_movemask_ps(a)

#define VecSelect(V1, V2, Control) _mm_blendv_ps(V1, V2, Control)
#define VecBlend(a, b, c) _mm_blendv_ps(a, b, c)
#define VeciBlend(a, b, c) _mm_blendv_ps(a, b, _mm_cvtepi32_ps(c))

#elif defined(AX_ARM)
/*//////////////////////////////////////////////////////////////////////////*/
/* NEON */
/*//////////////////////////////////////////////////////////////////////////*/

typedef float32x4_t vec_t;
typedef uint32x4_t veci_t;
typedef uint32x4_t vecu_t;

#define VecZero() vdupq_n_f32(0.0f)
#define VecOne() vdupq_n_f32(1.0f)
#define VecNegativeOne() vdupq_n_f32(-1.0f)
#define VecSet1(x) vdupq_n_f32(x)
#define VeciSet1(x) vdupq_n_u32(x)
#define VecSet(x, y, z, w) ARMCreateVec(w, z, y, x) /* -> {w, z, y, x} */
#define VecSetR(x, y, z, w) ARMCreateVec(x, y, z, w) /* -> {x, y, z, w} */
#define VecLoad(x) vld1q_f32(x)
#define VecLoadA(x) vld1q_f32(x)
#define Vec3Load(x) ARMVector3Load(x)

// Arithmetic
#define VecAdd(a, b) vaddq_f32(a, b)
#define VecSub(a, b) vsubq_f32(a, b)
#define VecMul(a, b) vmulq_f32(a, b)
#define VecDiv(a, b) ARMVectorDevide(a, b)

#define VecFmadd(a, b, c) vfmaq_f32(c, a, b) /* a * b + c */
#define VecFmsub(a, b, c) vnegq_f32(vfmsq_f32(c, a, b)) /* a * b - c (vfmsq_f32(c, a, b) computes c - a*b) */
#define VecHadd(a, b) vpaddq_f32(a, b) /* pairwise add: {a.x+a.y, a.z+a.w, b.x+b.y, b.z+b.w} */
#define VecSqrt(a) vsqrtq_f32(a)
#define VecRcp(a) vrecpeq_f32(a)
#define VecNeg(a) vnegq_f32(a)

// Logical
#define VecMax(a, b) vmaxq_f32(a, b) /* [max(a.x, b.x), max(a.y, b.y)...] */
#define VecMin(a, b) vminq_f32(a, b) /* [min(a.x, b.x), min(a.y, b.y)...] */

#define VecCmpGt(a, b) vcgtq_f32(a, b) /* greater than */
#define VecCmpGe(a, b) vcgeq_f32(a, b) /* greater or equal */
#define VecCmpLt(a, b) vcltq_f32(a, b) /* less than */
#define VecCmpLe(a, b) vcleq_f32(a, b) /* less or equal */
#define VecMovemask(a) ARMVecMovemask(a)

#define VecSelect(V1, V2, Control) vbslq_f32(Control, V2, V1)
#define VecBlend(a, b, Control) vbslq_f32(Control, b, a)
#elif NON_SIMD
// not shown in this article
#endif

The source code for the above is here.
It might look complicated at first glance, but it's just a wrapper around the instructions, nothing fancy.
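One small note on the SSE side: _mm_dp_ps needs SSE4.1, so on older targets the dot product can be built from a multiply and two horizontal adds instead. A sketch of that fallback (not part of the library above):

// dot(a, b) replicated into all four lanes, using only SSE3
static inline __m128 VecDotSSE3(__m128 a, __m128 b)
{
    __m128 m = _mm_mul_ps(a, b); // {ax*bx, ay*by, az*bz, aw*bw}
    m = _mm_hadd_ps(m, m);       // {x+y, z+w, x+y, z+w}
    return _mm_hadd_ps(m, m);    // sum in every lane
}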

Remember the lerp code above?
Now we can write it once and run it on multiple platforms:

inline vec_t VecLerp(vec_t x, vec_t y, float t)
{
    return VecFmadd(VecSub(y, x), VecSet1(t), x);
}

We can also multiply a matrix with a vector:

vec_t VECTORCALL Vector4Transform(vec_t v, const vec_t r[4])
{
    vec_t m0;
    m0 = VecMul(r[0], VecSplatX(v));             // r[0] * v[0]
    m0 = VecAdd(VecMul(r[1], VecSplatY(v)), m0); // r[1] * v[1] + m0
    m0 = VecAdd(VecMul(r[2], VecSplatZ(v)), m0); // r[2] * v[2] + m0
    m0 = VecAdd(VecMul(r[3], VecSplatW(v)), m0); // r[3] * v[3] + m0
    return m0;
}

Note: if we call this function four times, we have multiplied two matrices; a sketch of that follows below.
Matrix multiplication is not our topic, but you get the idea of how to use it.
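Here is that idea as a sketch: a full 4x4 matrix multiply is just four of these transforms (assuming row-major matrices stored as four vec_t rows, with vectors treated as rows):

void MatrixMul(vec_t out[4], const vec_t a[4], const vec_t b[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = Vector4Transform(a[i], b); // each row of 'a' transformed by 'b'
}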

We can define a macro like the one below for the vector calling convention. It hints to the compiler that we are passing vectors to the function; VECTORCALL may improve the quality of the generated code.

#ifdef _MSC_VER
# include <intrin.h>
# define VECTORCALL __vectorcall
#elif defined(__clang__)
# define VECTORCALL [[clang::vectorcall]]
#elif defined(__GNUC__)
# define VECTORCALL
#endif

Open Source libraries

DirectXMath: My favorite math library, by Microsoft; it works cross-platform with the SSE, AVX, and ARM Neon extensions.

Eigen: A C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms. It provides high-performance implementations using SIMD instructions for many operations.

GLM: Even though this is not primarily a SIMD math library, it contains SIMD code that you can enable.

Google Highway: A relatively new API for writing cross-platform SIMD code for high-performance tasks; it supports vector extensions from all the CPU vendors, WebAssembly included. But I found it harder to use compared to my own library; it has lots of features that I don't need, while my library is more game-engine oriented and has faster debug speed and faster compile times.

Linear Algebra & Trigonometry

To understand SIMD better, I want to give more examples. We don't have to understand the math behind them; I just want to show how it works.
Arc tangent estimation:

static const float sa1 =  0.99997726f, sa3 = -0.33262347f, sa5  = 0.19354346f,
                   sa7 = -0.11643287f, sa9 = 0.05265332f, sa11 = -0.01172120f;

inline float Atan(float x) // scalar
{
    const float xx = x * x;
    float res = sa11;
    res = xx * res + sa9;
    res = xx * res + sa7;
    res = xx * res + sa5;
    res = xx * res + sa3;
    res = xx * res + sa1;
    return x * res;
}

// computes 4x atan
inline vec_t VECTORCALL VecAtan(vec_t x) // vectorized
{
    const vec_t xx = VecMul(x, x);
    vec_t res = VecSet1(sa11);
    res = VecFmadd(xx, res, VecSet1(sa9)); // xx * res + sa9
    res = VecFmadd(xx, res, VecSet1(sa7));
    res = VecFmadd(xx, res, VecSet1(sa5));
    res = VecFmadd(xx, res, VecSet1(sa3));
    res = VecFmadd(xx, res, VecSet1(sa1));
    return VecMul(x, res);
}

Almost the same assembly code is generated for both functions; the vector version simply uses packed instructions instead of scalar ones.

Vector Functions

There are a couple of linear algebra vector functions that I want to show.
The first one is the length function: the length of a vector is computed by multiplying each element by itself, summing the results, and taking the square root.

You can see that inside the square root we are actually doing a dot product of the vector with itself (multiply each element by itself and sum), so we can say:
length = sqrt(dot(v, v));
To normalize a vector we divide each element by its length:
normVec = vec / vecLen;
We can write these functions with SIMD like this:

inline vec_t VecLength(vec_t v) {
    return VecSqrt(VecDot(v, v)); // returns [vLen, vLen, vLen, vLen]
}

inline vec_t VecNorm(vec_t v) {
    return VecDiv(v, VecSqrt(VecDot(v, v))); // v / vLen
    // _mm_div_ps(v, _mm_sqrt_ps(_mm_dp_ps(v, v, 0xff))) // < SSE equivalent
}

// the estimated normalize is faster than the exact normalize
// because it uses rsqrt and mul instead of sqrt and div
inline vec_t VecNormEst(vec_t v) {
    return VecMul(v, VecRsqrt(VecDot(v, v))); // v * (1 / vLen)
}
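Note that VecRsqrt isn't in the macro list earlier in the article; it would simply map to the hardware's reciprocal square root estimate, something like:

#define VecRsqrt(a) _mm_rsqrt_ps(a)    /* SSE: approximate 1/sqrt(a) */
// #define VecRsqrt(a) vrsqrteq_f32(a) /* NEON equivalent */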

Here is how to use the normalize function:


inline FrustumPlanes CreateFrustumPlanes(const Matrix4& viewProjection)
{
    FrustumPlanes result;
    // transpose is done with shuffles in SSE, with vzipq_f32 instructions in NEON
    Matrix4 C = Matrix4::Transpose(viewProjection);
    result.planes[0] = VecNormEst(VecAdd(C.r[3], C.r[0])); // left_plane
    result.planes[1] = VecNormEst(VecSub(C.r[3], C.r[0])); // right_plane
    result.planes[2] = VecNormEst(VecAdd(C.r[3], C.r[1])); // bottom_plane
    result.planes[3] = VecNormEst(VecSub(C.r[3], C.r[1])); // top_plane
    result.planes[4] = VecNormEst(C.r[2]);                 // near_plane
    // result.planes[5] = VecNormEst(VecSub(C.r[3], C.r[2])); // far_plane
    return result;
}
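Once the planes are built, testing whether a point is inside the frustum is a handful of dot products. A rough sketch, assuming each plane is stored as (a, b, c, d) so that dot(plane, (p, 1)) is positive for points on the inside:

// the point's w component must be 1
inline bool PointInFrustum(const FrustumPlanes& f, vec_t point)
{
    for (int i = 0; i < 5; i++)                 // 5 planes filled above, far plane skipped
        if (VecDotf(f.planes[i], point) < 0.0f) // behind this plane -> outside
            return false;
    return true;
}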

Structure of Arrays (SOA)

There is another way to use SIMD optimizations, and that is SoA (Structure of Arrays) instead of AoS (Array of Structures). Instead of storing data in lots of small structs, we can store our data in arrays of elements. Let me give an example.

Let's say we have a lot of enemies in a game. We can store all of the enemies in one struct, keeping each field in its own array, rather than storing each enemy's fields separately. Imagine we have a function that damages nearby enemies: to figure out which ones to damage, we calculate the enemies' distances to the player and then apply the damage.

Consider this distance-based damage function:

struct Enemies
{
    vec_t posX[256]; // or heap allocated arrays, doesn't matter
    vec_t posY[256];
    vec_t posZ[256];
    vec_t health[256]; // each vec_t stores 4x floats
};

Enemies enemies; // 1024 enemies
int numEnemies;

void DamageAllCloseEnemies(vec_t playerPos, float playerDamage, float dist)
{
    // go through enemies 4 by 4 instead of one by one;
    // each iteration calculates the distance of 4 separate enemies to the player
    // and subtracts the player damage from the enemy healths if the enemy is close enough
    for (int i = 0; i < numEnemies / 4; i++)
    {
        vec_t xDiff = VecSub(VecSplatX(playerPos), enemies.posX[i]); // (ax - bx)
        vec_t yDiff = VecSub(VecSplatY(playerPos), enemies.posY[i]); // (ay - by)
        vec_t zDiff = VecSub(VecSplatZ(playerPos), enemies.posZ[i]); // (az - bz)

        xDiff = VecMul(xDiff, xDiff); // xDiff * xDiff
        yDiff = VecMul(yDiff, yDiff); // yDiff * yDiff
        zDiff = VecMul(zDiff, zDiff); // zDiff * zDiff

        // sqrt(xDiff2 + yDiff2 + zDiff2)
        vec_t distances = VecSqrt(VecAdd(xDiff, VecAdd(yDiff, zDiff)));
        // distances < dist
        veci_t isCloser = VecCmpLt(distances, VecSet1(dist));
        // damage = isCloser ? playerDamage : 0.0f;
        vec_t damage = VecBlend(VecZero(), VecSet1(playerDamage), isCloser);
        // enemy.health -= damage;
        enemies.health[i] = VecSub(enemies.health[i], damage);
    }
}

This is something I saw in one of Mike Acton's Data-Oriented Design talks.
We have to design our data differently to use this technique.

Here is the scalar version that does the exact same thing:

struct Enemy
{
    Vector3f pos;
    float health;
};

Enemy enemies[1024];
int numEnemies;

float Vec3Dist(Vector3f a, Vector3f b)
{
    float xDiff = a.x - b.x;
    float yDiff = a.y - b.y;
    float zDiff = a.z - b.z;
    return Sqrt(xDiff * xDiff + yDiff * yDiff + zDiff * zDiff);
}

void DamageAllCloseEnemies(Vector3f playerPos, float playerDamage, float dist)
{
    for (int i = 0; i < numEnemies; i++)
    {
        if (Vec3Dist(playerPos, enemies[i].pos) < dist)
        {
            enemies[i].health -= playerDamage;
        }
    }
}

The simple example above illustrates how using a structure of arrays with SIMD can speed up the code by more than 5x, while also eliminating branches. It may seem like over-engineering for scenarios with only a few enemies, but in my experience developing a lawn-mowing game with millions of grass blades, such optimizations were essential for performance, especially on Android devices.

In situations where you’re dealing with thousands of enemies or similar large datasets, SIMD optimization can be invaluable. It’s important to note that utilizing AVX (Advanced Vector Extensions) could potentially double the performance gain.
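To make that concrete, here is what one iteration of the enemy loop could look like with AVX, processing eight enemies at a time instead of four (a sketch; it requires an AVX-capable CPU):

#include <immintrin.h>

// 8 enemies per iteration: compare distances and apply damage conditionally
__m256 DamageIfClose(__m256 health, __m256 distances, float dist, float playerDamage)
{
    __m256 isCloser = _mm256_cmp_ps(distances, _mm256_set1_ps(dist), _CMP_LT_OQ); // distances < dist
    __m256 damage   = _mm256_blendv_ps(_mm256_setzero_ps(), _mm256_set1_ps(playerDamage), isCloser);
    return _mm256_sub_ps(health, damage);                                         // health -= damage
}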

Understanding the hardware we’re working with is crucial for maximizing code efficiency. For instance, combining SIMD optimization with multithreading can allow handling of millions of enemies efficiently. Without such optimizations, the performance of scalar code would significantly lag behind.

Processor vendors and engineers have been developing better hardware every year. However, it’s important that we utilize this hardware to its fullest potential.

SIMD has been around for over 20 years, yet many programmers don’t know about it or its benefits.
That’s why I wanted to write this text: to explain why SIMD matters and how it can make our software faster.

In conclusion, the utilization of SIMD optimizations offers a powerful means to enhance the performance of our software across a wide range of applications. By understanding the principles and techniques of SIMD programming and carefully optimizing our code, we can unlock significant performance gains, enabling us to tackle even the most demanding computational tasks with efficiency and speed.

Thanks to Mārtiņš Možeiko for his feedback.

