Introduction to ARM64 NEON assembly

Mathieu Garcia
7 min read · May 20, 2018

This article was written back in 2013, right after Apple released ARM64-based iPhones and iPads.

If you own a somewhat recent iPhone or iPad, you already own a shiny ARM64 CPU to play with.

Let’s start with a trivial operation: adding two vectors of 32-bit floats.

C++ code:

auto add_to(float *pDst, const float *pSrc, long size) noexcept -> void {
    for (long i = 0; i < size; i++) {
        *pDst++ += *pSrc++;
    }
}

Since we’ll be writing the entire routine in plain assembly (sorry, I hate GCC inline syntax), we need to study the architecture and its calling convention a bit before diving in.

From the “Procedure Call Standard for the ARM 64-bit Architecture” and the “ARMv8 Instruction Set Overview”, we learn the following:

  • Access to a larger general-purpose register file with 31 unbanked registers (0–30), with each register extended to 64 bits.
  • Floating point and Advanced SIMD processing share a register file, in a similar manner to AArch32, but extended to thirty-two 128-bit registers. Smaller registers are no longer packed into larger registers, but are mapped one-to-one to the low-order bits of the 128-bit register.
  • Unaligned addresses are permitted for most loads and stores, including paired…
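
To make the register width concrete: a 128-bit register holds four 32-bit floats, so a SIMD version of the routine can process four elements per step and finish any leftovers one at a time. The following is only a portable C++ sketch of that loop shape, not the assembly routine itself; the name add_to_by_four and the constant kLanes are made up for this illustration, and whether a compiler turns the inner loop into actual NEON instructions depends on the compiler and its flags.

```cpp
// Illustrative sketch only: add_to_by_four and kLanes are hypothetical names.
// It mirrors the loop structure a 128-bit SIMD routine would use,
// written in plain C++ so it runs on any platform.

static_assert(sizeof(float) == 4, "this sketch assumes 32-bit floats");

// Number of 32-bit lanes in one 128-bit register.
constexpr long kLanes = 4;

auto add_to_by_four(float *pDst, const float *pSrc, long size) noexcept -> void {
    long i = 0;

    // Main loop: one full group of four floats per iteration,
    // the amount a single 128-bit register can hold.
    for (; i + kLanes <= size; i += kLanes) {
        for (long lane = 0; lane < kLanes; ++lane) {
            pDst[i + lane] += pSrc[i + lane];
        }
    }

    // Tail loop: the zero to three elements that do not fill a group.
    for (; i < size; ++i) {
        pDst[i] += pSrc[i];
    }
}
```

The result should match the scalar add_to for any size, including sizes that are not a multiple of four, which is why the tail loop is there.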

Mathieu Garcia

Audio/Music Apps Entrepreneur. I’ve been designing audio apps since before the App Store era and co-created BeatMaker.