The curious case of unaligned access on ARM

Levente Kurusa
Dec 27, 2016

ARM is an amazing architecture. I have no doubt about that. I have always liked it. It is RISC, so we don’t have a plenitude of redundant instructions, and in most cases there is only one way to achieve a given result. All instructions are 32 bits long, so they have equal width. We get a plethora of general-purpose registers. But this comes at a price. Unaligned access is notoriously hard and hence a trouble we encounter all too often. All the more so because, some time ago, it did not work in the expected way at all.

First of all, what does alignment even mean? What does it mean that addr is aligned to X? Simply put, it means that the condition addr % X == 0 is true, where % represents the integer modulo operation. For instance, 8000 is aligned to 2, 4, 8, etc., but not to 3. It is interesting to note that every address is aligned to 1, which also, amusingly, means it is unaligned. Hence, unaligned access simply means that the memory address being accessed is not aligned to the proper value. Some instructions, like LDRH, require 2-byte alignment, whereas instructions like LDR and STR require 4-byte alignment for optimal performance. More on performance later.
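The definition above translates almost directly into code. The following is a minimal sketch in C; the function names is_aligned and is_aligned_pow2 are made up for this illustration and do not come from any library.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when addr is a multiple of alignment, i.e. addr % alignment == 0.
 * alignment must be non-zero. */
bool is_aligned(uintptr_t addr, uintptr_t alignment)
{
    return addr % alignment == 0;
}

/* Cheaper variant that is only valid when alignment is a power of two:
 * the low bits selected by the mask must all be zero. */
bool is_aligned_pow2(uintptr_t addr, uintptr_t alignment)
{
    return (addr & (alignment - 1)) == 0;
}
```

With these, is_aligned(8000, 4) holds while is_aligned(8000, 3) does not, matching the example in the text. The mask variant is what you would typically see in low-level code, since the alignments that matter here (2, 4, 8) are all powers of two.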

Let’s have a look at how this evolved. Up to and including ARMv5, ARM didn’t support unaligned access in the expected way. For STR, STM and LDM, the requested address was simply rounded down to a multiple of four. LDR was odder: after the address was rounded down, the loaded word was rotated right by 8 times the value in bits [1:0] of the original address. That is, if r1 were unaligned, the instruction LDR r0, [r1] behaved like this software sequence:

  BIC  rT, r1, #3      // bit-clear the bottom 2 bits to align the address
  LDR  r0, [rT]        // load the word at the aligned address
  AND  rT, r1, #3      // select the bottom 2 bits of the original address
  MOV  rT, rT, LSL #3  // multiply by 8 to turn a byte offset into a bit count
  MOV  r0, r0, ROR rT  // rotate the loaded word right by that many bits

(Note that rT is an arbitrary temporary register.)
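To see what that rotation does to actual data, here is a small model of the sequence above in C. It is only a sketch under stated assumptions: memory is a plain little-endian byte array, and the names read_word_le and legacy_ldr are invented for this example.

```c
#include <stdint.h>

/* Read a little-endian 32-bit word starting at a word-aligned offset. */
static uint32_t read_word_le(const uint8_t *mem, uint32_t aligned)
{
    return (uint32_t)mem[aligned]
         | (uint32_t)mem[aligned + 1] << 8
         | (uint32_t)mem[aligned + 2] << 16
         | (uint32_t)mem[aligned + 3] << 24;
}

/* Model of the old LDR behaviour: round the address down to a multiple
 * of four, load that word, then rotate it right by 8 * (addr & 3) bits. */
uint32_t legacy_ldr(const uint8_t *mem, uint32_t addr)
{
    uint32_t aligned = addr & ~UINT32_C(3);      /* BIC rT, r1, #3      */
    uint32_t value = read_word_le(mem, aligned); /* LDR r0, [rT]        */
    uint32_t rot = (addr & 3u) * 8u;             /* AND, then LSL #3    */
    /* ROR; the "& 31" avoids an undefined 32-bit shift when rot is 0. */
    return (value >> rot) | (value << ((32u - rot) & 31u));
}
```

With memory holding the bytes 0x11, 0x22, 0x33, 0x44, 0x55, ..., a load from address 1 yields 0x11443322, whereas a "real" unaligned load would yield 0x55443322. The top byte comes from the wrong end of the same word rather than from the next word.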

What about LDRH? If bit 0 of the address is set during a halfword access, i.e. the address is not 2-byte aligned, then the result is simply unpredictable. Similarly, for LDRD (doubleword transfer), the address has to be aligned to 4 bytes; otherwise, the result is unpredictable.

Beginning with ARMv7, however, unaligned access is properly supported. The core now does the expected thing, i.e. it breaks the access up into multiple smaller reads and builds up the value as a “traditional” x86 CPU would. This takes extra time, however. Note, though, that the LDM and STM instructions still require 4-byte alignment, and if they don’t have it, the result is unpredictable. The SCTLR (System Control Register) also has an A bit with which you can enable alignment checking. Essentially, if this bit is set, then every unaligned access makes the core trap into the handler specified by the exception vector table beginning at address zero.
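The "expected" behaviour can be sketched in C as a word assembled from individual byte reads. This is an illustration only, assuming little-endian memory; the name load_unaligned_u32 is hypothetical, and real hardware may split the access differently.

```c
#include <stdint.h>

/* Build a 32-bit value from four byte reads, starting at any offset.
 * Because each read is a single byte, no alignment is required. */
uint32_t load_unaligned_u32(const uint8_t *mem, uint32_t addr)
{
    uint32_t value = 0;
    for (uint32_t i = 0; i < 4; i++) {
        /* Little-endian: the byte at the lowest address is the least
         * significant byte of the result. */
        value |= (uint32_t)mem[addr + i] << (8u * i);
    }
    return value;
}
```

The same byte-by-byte pattern is also a portable way to read unaligned data in software on cores where unaligned LDR cannot be relied upon, at the cost of four loads instead of one.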

What about ARMv6? ARMv6 was an intermediate ISA in this sense: the designers of the instruction set decided to support both the ARMv5 way and the ARMv7 way. There is a new U bit in the SCTLR that controls whether the ARMv5 behavior or the ARMv7 one is followed.

[1] SCTLR showing the location of the A and U bits
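As a rough sketch of how the two bits interact on ARMv6, the following C snippet classifies what an unaligned LDR does for a given SCTLR value. The bit positions (A is bit 1, U is bit 22) are my reading of the ARM architecture manuals, not something stated in this post, and the enum and function names are made up for the illustration.

```c
#include <stdint.h>

#define SCTLR_A (UINT32_C(1) << 1)   /* alignment checking enable (assumed bit 1) */
#define SCTLR_U (UINT32_C(1) << 22)  /* unaligned access support (assumed bit 22) */

enum unaligned_ldr_behaviour {
    LDR_ROTATED,       /* ARMv5 way: round down, then rotate the word */
    LDR_FAULT,         /* alignment checking traps the access         */
    LDR_UNALIGNED_OK   /* ARMv7 way: the hardware fixes up the access */
};

/* Classify an unaligned word load on ARMv6 from the SCTLR value. */
enum unaligned_ldr_behaviour armv6_unaligned_ldr(uint32_t sctlr)
{
    if (sctlr & SCTLR_A)
        return LDR_FAULT;  /* A set: fault regardless of U */
    return (sctlr & SCTLR_U) ? LDR_UNALIGNED_OK : LDR_ROTATED;
}
```

On real hardware the SCTLR is read and written through coprocessor 15 in a privileged mode; that part is deliberately left out here so the sketch stays portable.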

But why is unaligned access so hard or bad? The problem lies in multiple layers, both in software and in hardware. One of the simplest is the fact that if an unaligned access spans multiple pages (i.e., fixed-size regions of the address space), there may be permission differences. For example, with 4 KiB pages, as a user I may be able to access the byte at 0x00FFF, but not the one at 0x01000. Hence, we need to do multiple permission checks, resulting in immediately degraded performance.
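Whether a given access straddles two pages is easy to compute. Here is a minimal sketch in C that assumes 4 KiB pages (a common size on ARM, but an assumption of this example); PAGE_SIZE_BYTES and crosses_page are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u  /* assumed page size for this sketch */

/* True when the size-byte access starting at addr touches two pages,
 * i.e. its first and last byte live in different pages.
 * size must be at least 1. */
bool crosses_page(uintptr_t addr, size_t size)
{
    uintptr_t first_page = addr / PAGE_SIZE_BYTES;
    uintptr_t last_page = (addr + size - 1) / PAGE_SIZE_BYTES;
    return first_page != last_page;
}
```

A naturally aligned 4-byte access can never cross a page, because the page size is a multiple of four. An unaligned one can, for example a word load from the last two bytes of a page, and then both pages need their own translation and permission check.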

Another one is the fact that on ARMv6 and later cores, in order for the hardware to “fix up” an unaligned access, it splits it up into multiple smaller byte loads. However, these are not atomic! Hence, even a Data Abort can happen between two parts of the value. It is also intuitive that performance degrades when the core is forced to replace an access that was meant to be a single one with several accesses of smaller width. (Note that the Linux kernel has an option in /proc/cpu/alignment that lets the kernel emulate this behavior in software. Yikes at the performance!)
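The lack of atomicity can be shown with a deterministic, single-threaded simulation in C: a reader picks up the first two bytes of an unaligned word, a writer then replaces the whole word, and the reader finishes with the last two bytes. This is only a sketch; the interleaving is hard-coded, and store_u32_le and torn_read_demo are names invented for the example.

```c
#include <stdint.h>

/* Store a 32-bit value byte by byte in little-endian order. */
void store_u32_le(uint8_t *mem, uint32_t off, uint32_t value)
{
    for (uint32_t i = 0; i < 4; i++)
        mem[off + i] = (uint8_t)(value >> (8u * i));
}

/* Simulate a split unaligned read that is interrupted halfway through. */
uint32_t torn_read_demo(void)
{
    uint8_t mem[8] = {0};
    const uint32_t off = 1;                 /* deliberately unaligned */

    store_u32_le(mem, off, 0xAAAAAAAAu);    /* old value in memory */

    /* Reader: first two byte loads see the old value. */
    uint32_t result = (uint32_t)mem[off] | (uint32_t)mem[off + 1] << 8;

    /* Writer sneaks in between the reader's byte loads. */
    store_u32_le(mem, off, 0xBBBBBBBBu);

    /* Reader: remaining two byte loads see the new value. */
    result |= (uint32_t)mem[off + 2] << 16 | (uint32_t)mem[off + 3] << 24;

    return result;  /* 0xBBBBAAAA: neither the old nor the new value */
}
```

The reader ends up with a value that was never stored, which is exactly the kind of bug that is painful to track down on real hardware where the interleaving is not under your control.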

So, why am I writing about unaligned access? We recently built waccOS and before that a 3D game engine in bare-metal ARM11 assembly. Unfortunately, we had a lot of issues with unaligned access, especially with the behavior of LDR! At the end of the day, a post like this would have helped enlighten us on how interesting unaligned access on ARM is.
