Introduction to ARM64 NEON assembly

Mathieu Garcia

Published in

mathieugarcia

May 20, 2018

This article was written back in 2013, right after Apple released ARM64-based iPhones and iPads.

If you own a somewhat recent iPhone or iPad, you already own a shiny ARM64 CPU to play with.

Let’s start with a trivial operation: adding two vectors of 32-bit floats.

C++ code:

auto add_to(float *pDst, const float *pSrc, long size) noexcept -> void {
    for (long i = 0; i < size; i++) {
        *pDst++ += *pSrc++;
    }
}

Since we’ll be writing the entire routine in plain assembly (sorry, I hate GCC’s inline syntax), we need to study the architecture and its calling convention a bit before diving in.

From “Procedure Call Standard for the ARM 64-bit Architecture” and “ARMv8 Instruction Set Overview” we learn the following:

Access to a larger general-purpose register file with 31 unbanked registers (0–30), with each register extended to 64 bits.
Floating point and Advanced SIMD processing share a register file, in a similar manner to AArch32, but extended to thirty-two 128-bit registers. Smaller registers are no longer packed into larger registers, but are mapped one-to-one to the low-order bits of the 128-bit register.
Unaligned addresses are permitted for most loads and stores, including paired…
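To make the points above concrete: a 128-bit register holds four 32-bit floats, so the loop can handle four elements per iteration and finish any leftovers one by one. The block below is only a rough sketch of that shape, written in C++ with NEON intrinsics rather than the plain assembly this article is about. The name add_to_by4 is hypothetical, and the __aarch64__ check assumes a GCC- or Clang-style compiler. On other targets a plain scalar stand-in is compiled instead.

```cpp
#include <cassert>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper with the same contract as add_to:
// pDst[i] += pSrc[i] for every i in [0, size).
// Under the AArch64 procedure call standard, the three arguments
// arrive in x0 (pDst), x1 (pSrc) and x2 (size).
void add_to_by4(float *pDst, const float *pSrc, long size) noexcept {
    assert(size >= 0);
    long i = 0;
#if defined(__aarch64__)
    // Main loop: one 128-bit register holds four 32-bit floats.
    for (; i + 4 <= size; i += 4) {
        float32x4_t d = vld1q_f32(pDst + i);   // load 4 floats, no alignment needed
        float32x4_t s = vld1q_f32(pSrc + i);
        vst1q_f32(pDst + i, vaddq_f32(d, s));  // add the 4 lanes and store them back
    }
#else
    // Portable stand-in with the same four-at-a-time structure.
    for (; i + 4 <= size; i += 4) {
        for (int lane = 0; lane < 4; ++lane) {
            pDst[i + lane] += pSrc[i + lane];
        }
    }
#endif
    // Scalar tail: the 0 to 3 elements that do not fill a whole register.
    for (; i < size; ++i) {
        pDst[i] += pSrc[i];
    }
}
```

Calling add_to_by4(dst, src, 7) would run the four-wide loop once and leave three elements to the scalar tail. A hand-written assembly routine has to deal with the same split.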
