Arm Neon Intrinsics Add Functions (Explained With C)
Did you know that Arm Neon intrinsics have more than 10 different types of vector addition functions? The differences between Vector Add, Vector Long Add, Vector Wide Add, Vector Rounding Halving Add, Vector Saturating Add, Pairwise Add and Add Across Vector might not appear obvious now, but by the end of this short post they will.
You might be thinking: “Why don’t I just go read the official Arm documentation instead of this post?” Because I won’t use pretty graphics and fancy pseudo-code. Yeah, that’s right: what we are going to do is look at some C code I wrote (and tested) that mimics what the intrinsic functions do.
Vector Add (vadd)
The Vector Add function does exactly what you expect: it adds the two vectors lane by lane. The 16x4 version adds four 16-bit lanes, and each sum wraps around if it overflows.
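As a minimal plain-C sketch, this is roughly what the signed 16x4 variant, vadd_s16, computes. The name vadd_s16_ref and the array-based signature are stand-ins chosen for illustration; the real intrinsic takes and returns int16x4_t values.

```c
#include <stdint.h>

/* Plain-C model of vadd_s16: lane-wise add of four 16-bit values.
 * vadd_s16_ref is a hypothetical name, not part of arm_neon.h. */
void vadd_s16_ref(const int16_t a[4], const int16_t b[4], int16_t out[4])
{
    for (int i = 0; i < 4; i++) {
        /* Reduce the sum modulo 2^16 so it wraps like the hardware does.
         * The final cast to int16_t is implementation-defined in ISO C
         * for out-of-range values; GCC and Clang wrap. */
        out[i] = (int16_t)(uint16_t)((uint16_t)a[i] + (uint16_t)b[i]);
    }
}
```

Note that nothing is widened or clamped here: 32767 + 1 comes back as -32768.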
Vector Long Add (vaddl)
The 16-bit version of Vector Long Add widens each 16-bit lane to 32 bits before adding, so the result lanes are twice as wide as the input lanes and the sum cannot overflow.
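A rough plain-C sketch of the signed variant, vaddl_s16, could look like this. As before, vaddl_s16_ref and the array parameters are illustrative stand-ins for the real int16x4_t and int32x4_t types.

```c
#include <stdint.h>

/* Plain-C model of vaddl_s16: both inputs have 16-bit lanes,
 * the output has 32-bit lanes.
 * vaddl_s16_ref is a hypothetical name, not part of arm_neon.h. */
void vaddl_s16_ref(const int16_t a[4], const int16_t b[4], int32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        /* Widen first, then add: the 32-bit sum of two 16-bit
         * values always fits, so nothing wraps or saturates. */
        out[i] = (int32_t)a[i] + (int32_t)b[i];
    }
}
```

With this version, 32767 + 32767 yields 65534 instead of wrapping.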
Vector Wide Add (vaddw)
Notice that for this function, the lanes of parameter a are twice as wide as the lanes of parameter b. In the 16-bit version, a already holds 32-bit lanes, and each 16-bit lane of b gets promoted to 32 bits before the addition.
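Sketched in plain C, the signed variant vaddw_s16 might be modeled as follows. The name vaddw_s16_ref and the array-based signature are assumptions made for illustration only.

```c
#include <stdint.h>

/* Plain-C model of vaddw_s16: a has 32-bit lanes, b has 16-bit lanes.
 * vaddw_s16_ref is a hypothetical name, not part of arm_neon.h. */
void vaddw_s16_ref(const int32_t a[4], const int16_t b[4], int32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        /* Sign-extend b to 32 bits, then add. The addition is done in
         * unsigned arithmetic so a 32-bit overflow wraps instead of
         * being undefined behavior in C. */
        uint32_t wide_b = (uint32_t)(int32_t)b[i];
        out[i] = (int32_t)((uint32_t)a[i] + wide_b);
    }
}
```

This is handy for accumulating narrow values into a wider running total without widening them yourself first.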
Vector Rounding Halving Add (vrhadd)
Although additions are used in the process, what you actually get is an element-wise rounded average.
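The rounded average is (a + b + 1) >> 1, computed without losing the carry bit. Here is a minimal sketch of the unsigned 16x4 variant, vrhadd_u16; I picked the unsigned one so the shift is fully portable C. The name vrhadd_u16_ref and the array parameters are illustrative stand-ins.

```c
#include <stdint.h>

/* Plain-C model of vrhadd_u16: lane-wise rounded average.
 * vrhadd_u16_ref is a hypothetical name, not part of arm_neon.h. */
void vrhadd_u16_ref(const uint16_t a[4], const uint16_t b[4], uint16_t out[4])
{
    for (int i = 0; i < 4; i++) {
        /* Sum in 32 bits so the intermediate cannot overflow; the +1
         * makes a .5 average round up instead of truncating. */
        uint32_t sum = (uint32_t)a[i] + (uint32_t)b[i] + 1u;
        out[i] = (uint16_t)(sum >> 1);
    }
}
```

So averaging 3 and 4 gives 4, not 3, and averaging 65535 with itself still gives 65535.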
Vector Saturating Add (vqadd)
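Whether you call it clipping or saturating, this add function clamps each result to the range of the output type instead of letting it wrap around: to the maximum value on overflow and, for signed types, to the minimum value on underflow.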
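A plain-C sketch of the signed 16x4 variant, vqadd_s16, under the same conventions as the other examples (vqadd_s16_ref and the array-based signature are made up for illustration):

```c
#include <stdint.h>

/* Plain-C model of vqadd_s16: lane-wise add, clamped to int16_t range.
 * vqadd_s16_ref is a hypothetical name, not part of arm_neon.h. */
void vqadd_s16_ref(const int16_t a[4], const int16_t b[4], int16_t out[4])
{
    for (int i = 0; i < 4; i++) {
        /* The exact sum always fits in 32 bits. */
        int32_t sum = (int32_t)a[i] + (int32_t)b[i];

        /* Clamp instead of wrapping. */
        if (sum > INT16_MAX) sum = INT16_MAX;
        if (sum < INT16_MIN) sum = INT16_MIN;

        out[i] = (int16_t)sum;
    }
}
```

Compare with Vector Add: here 32767 + 1 stays at 32767 rather than flipping to -32768.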
Pairwise Add (vpadd)
A beast of an add function. It adds adjacent pairs of lanes within a, then adjacent pairs of lanes within b, and concatenates the two sets of sums into a single result vector.
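The lane shuffling is easier to see in code. This is a minimal plain-C sketch of the signed 16x4 variant, vpadd_s16; the name vpadd_s16_ref and the array parameters are illustrative assumptions.

```c
#include <stdint.h>

/* Plain-C model of vpadd_s16: sums of adjacent pairs.
 * Result layout: { a0+a1, a2+a3, b0+b1, b2+b3 }.
 * vpadd_s16_ref is a hypothetical name, not part of arm_neon.h. */
void vpadd_s16_ref(const int16_t a[4], const int16_t b[4], int16_t out[4])
{
    for (int i = 0; i < 2; i++) {
        /* Low half of the result comes from pairs in a. */
        out[i] = (int16_t)(uint16_t)((uint16_t)a[2 * i] + (uint16_t)a[2 * i + 1]);
        /* High half of the result comes from pairs in b. */
        out[i + 2] = (int16_t)(uint16_t)((uint16_t)b[2 * i] + (uint16_t)b[2 * i + 1]);
    }
}
```

For a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the result is {3, 7, 30, 70}. Like Vector Add, each pair sum wraps on overflow.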
Add Across Vector (vaddv)
Returns the sum of all the lanes of vector a as a single scalar.
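A plain-C sketch of the signed 16x4 variant, vaddv_s16, which is available on AArch64. The name vaddv_s16_ref and the array parameter are stand-ins for illustration; the real intrinsic takes an int16x4_t and returns an int16_t.

```c
#include <stdint.h>

/* Plain-C model of vaddv_s16: horizontal sum of all four lanes.
 * vaddv_s16_ref is a hypothetical name, not part of arm_neon.h. */
int16_t vaddv_s16_ref(const int16_t a[4])
{
    int32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        sum += a[i];
    }
    /* The result has the same width as one lane, so keep only the low
     * 16 bits. The cast to int16_t is implementation-defined in ISO C
     * for out-of-range values; GCC and Clang wrap. */
    return (int16_t)(uint16_t)sum;
}
```

Because the return type is no wider than a lane, large inputs can wrap; if that matters, widen first and reduce afterwards.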
Turns out, I totally wrote this for myself. Hopefully it can also help you make optimal choices when picking Neon add functions.