Introduction to ARM64 NEON assembly
This article was written back in 2013, right after Apple released ARM64-based iPhones and iPads.
If you own a somewhat recent iPhone or iPad, you already own a shiny ARM64 CPU to play with.
Let’s start with a trivial operation: adding two vectors of 32-bit floats.
C++ code:
auto add_to(float *pDst, const float *pSrc, long size) noexcept -> void {
    for (long i = 0; i < size; i++) {
        *pDst++ += *pSrc++;
    }
}
Since we’ll be writing the entire routine in plain assembly (sorry, I hate GCC’s inline syntax), we need to study the architecture and its calling convention a bit before diving in.
From “Procedure Call Standard for the ARM 64-bit Architecture” and “ARMv8 Instruction Set Overview”, we learn the following:
- Access to a larger general-purpose register file with 31 unbanked registers (0–30), with each register extended to 64 bits.
- Floating point and Advanced SIMD processing share a register file, in a similar manner to AArch32, but extended to thirty-two 128-bit registers. Smaller registers are no longer packed into larger registers, but are mapped one-to-one to the low-order bits of the 128-bit register.
- Unaligned addresses are permitted for most loads and stores, including paired…
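To make these points concrete before touching any assembly, here is a rough, portable C++ sketch of what they imply for our loop: a 128-bit register holds four 32-bit floats, so the routine can consume four elements per iteration and finish the remainder one at a time. This is only a model, not NEON code. The names `Vec4f`, `load4`, `store4`, `add4` and `add_to_by4` are hypothetical and exist only for this illustration, and `std::memcpy` stands in for a load or store that does not require 16-byte alignment.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical model of one 128-bit vector register seen as four 32-bit float lanes.
struct Vec4f {
    float lane[4];
};

// Load four consecutive floats; memcpy makes no 16-byte alignment assumption.
inline Vec4f load4(const float *p) noexcept {
    Vec4f v;
    std::memcpy(v.lane, p, sizeof v.lane);
    return v;
}

// Store four consecutive floats, again without any 16-byte alignment assumption.
inline void store4(float *p, const Vec4f &v) noexcept {
    std::memcpy(p, v.lane, sizeof v.lane);
}

// Lane-wise addition: conceptually what a single vector add does for four floats.
inline Vec4f add4(const Vec4f &a, const Vec4f &b) noexcept {
    Vec4f r;
    for (int i = 0; i < 4; i++) {
        r.lane[i] = a.lane[i] + b.lane[i];
    }
    return r;
}

// Same contract as add_to, but processing four floats per iteration.
void add_to_by4(float *pDst, const float *pSrc, long size) noexcept {
    long i = 0;
    // Main loop: whole groups of four elements.
    for (; i + 4 <= size; i += 4) {
        store4(pDst + i, add4(load4(pDst + i), load4(pSrc + i)));
    }
    // Tail loop: the 0 to 3 leftover elements, one by one.
    for (; i < size; i++) {
        pDst[i] += pSrc[i];
    }
}
```

Calling `add_to_by4(dst, src, n)` should give the same result as `add_to(dst, src, n)` for any `n >= 0`. The split into a main loop and a scalar tail is the shape a hand-written vector routine typically needs as well.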