Porting OS to Aarch64

Aug 13, 2019

Aarch64 is ARM’s 64-bit architecture (sometimes it’s called arm64). In this article I’ll cover basic differences between Aarch64 and ARM and which parts of the operating system need to be rewritten.

It’s not a detailed guide; it’s more like a review of how this new architecture feels after working with “old” ARM systems. It’s all based on my experience with Embox, which I ported to Aarch64 recently. If you’re going to add Aarch64 support for another system by yourself (or just do more in-depth learning of the topic) you need some technical literature; I’ll leave links for some docs at the end of this article.

Aarch64 looks more like a separate platform: it doesn’t have much in common with ARM. There’s an “intermediate” platform named Aarch32, which is a superset of the ARM instruction set, but I don’t have any experience with it, so I will not go into details.

Brief list of changes (Aarch64 vs. ARM):

General purpose registers are now 64-bit wide, and there are 32 of them.
There is no “coprocessor concept” now; system registers are accessed simply by name, e.g. msr vbar_el1, x0 (compare to old-style mcr p15, 0, %0, c1, c1, 2).
New MMU model (so old ARM virtual memory can’t be used).
A major change to the privilege model. Previously we had unprivileged (USR) and privileged modes (SYS, IRQ, FIQ, ABT, …); now there are 4 levels: EL3, EL2, EL1, EL0.
AdvSIMD replaced NEON; floating-point operations use the new module as well.

Now let’s talk about some of these points in more detail.

Instruction set, registers

r0-r30 are general-purpose registers; each can be accessed either as a 64-bit register (x0-x30) or as a 32-bit register (w0-w30, the least significant 32 bits).
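To make the x/w relationship concrete, here is a small C model of a single register. It is only a sketch: `gpr_t`, `read_w` and `write_w` are hypothetical names I made up for illustration, not anything from Embox or a toolchain. One detail it shows that is easy to miss: a write through the w view zero-extends, so the upper 32 bits of the x register are cleared.

```c
#include <stdint.h>

/* Hypothetical model of one general-purpose register:
 * the x view is the full 64 bits, the w view is the low 32 bits
 * of the same storage. */
typedef struct {
    uint64_t x;
} gpr_t;

/* Reading wN returns the least significant 32 bits of xN. */
static inline uint32_t read_w(const gpr_t *r)
{
    return (uint32_t)r->x;
}

/* Writing wN zero-extends: the upper 32 bits of xN become zero. */
static inline void write_w(gpr_t *r, uint32_t value)
{
    r->x = (uint64_t)value;
}
```

So after something like mov w0, w1 you should not expect the old upper half of x0 to survive.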

The instruction set for Aarch64 is called A64. You can find the instruction set reference here. Basic operations are the same as in ARM:

 mov w0, w1 /* Write w1 value to w0 */
 add x0, x1, 13 /* x0 = x1 + 13 */
 b label /* Jump to label */
 bl label /* Jump to label and store return address to x30 */
 ldr x3, [x1, 0] /* Load x3 from the address in x1 */
 str x3, [x0, 0] /* Store x3 to the address in x0 */

Some commands are different:

Register number 31 is now a zero register (`rzr/xzr/wzr`), which always reads as zero and discards any writes.

 subs xzr, x1, x2 /* Subtract x2 from x1; NZCV flags are updated, but the result of the subtraction is discarded */

There are no multiple-register stores/loads (stmfd sp!, {r0-r3}); registers are stored and loaded in pairs, and the stack pointer must stay 16-byte aligned:

 stp x0, x1, [sp, -16]! /* sp -= 16, then store x0 and x1 at [sp] */
 stp x2, x3, [sp, -16]!

The program counter can’t be accessed directly as a general-purpose register anymore (i.e. mov and ldr into it don’t work); it changes only through branch-type instructions (e.g. ret, bl).
The CPSR register is gone; several other system registers now store the same information: the DAIF register (AIF are exactly the A, I and F bits of CPSR), the NZCV register (it stores the Negative, Zero, Carry and oVerflow flags, the same NZCV bits as in CPSR) and the System Control Register (SCTLR; it controls caches, MMU, endianness and so on).
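As a sketch of how the former CPSR bits are laid out now, the C helpers below decode raw values of the DAIF and NZCV registers. The macro and function names are my own (hypothetical), and the code assumes you have already read the value on the target, e.g. with mrs x0, daif or mrs x0, nzcv. The bit positions are the architectural ones: D, A, I, F are bits 9..6 and N, Z, C, V are bits 31..28.

```c
#include <stdbool.h>
#include <stdint.h>

/* DAIF register: exception mask bits (1 = masked). */
#define DAIF_D (1u << 9) /* Debug exceptions */
#define DAIF_A (1u << 8) /* SError (asynchronous abort) */
#define DAIF_I (1u << 7) /* IRQ */
#define DAIF_F (1u << 6) /* FIQ */

/* NZCV register: condition flags. */
#define NZCV_N (1u << 31) /* Negative */
#define NZCV_Z (1u << 30) /* Zero */
#define NZCV_C (1u << 29) /* Carry */
#define NZCV_V (1u << 28) /* oVerflow */

/* True if IRQs are currently masked according to a raw DAIF value. */
static inline bool daif_irq_masked(uint64_t daif)
{
    return (daif & DAIF_I) != 0;
}

/* True if the Z flag is set in a raw NZCV value
 * (e.g. after comparing two equal registers). */
static inline bool nzcv_is_zero(uint64_t nzcv)
{
    return (nzcv & NZCV_Z) != 0;
}
```

The helpers take plain integers on purpose, so the same code can be exercised on a host machine without any inline assembly.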

This brief introduction to Aarch64 assembly language should be enough to write a simple loader that passes control to the platform-independent code of your system :)

Execution modes

Fundamentals of ARMv8-A is a good paper that describes execution modes and switching between them; here is a short summary:

There are 4 execution levels in Aarch64 (EL for short):

EL3 — Secure Monitor
EL2 — Hypervisor
EL1 — Operating Systems
EL0 — Applications

EL1 can execute both Aarch32 and Aarch64 systems, but an AArch32 OS can’t host an AArch64 application.

The eret instruction is used to lower the EL (SPSR should be prepared properly); interrupts, syscalls and aborts raise the EL.

SPSR, ELR and SP are banked for each EL (i.e. there are different physical registers SPSR_EL1, SPSR_EL2…).

A lot of system registers are banked as well; for example, there are ttbr0_el2 and ttbr0_el1 for the MMU context base address. Accessing a system register without the proper privilege level causes an abort.
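To illustrate what "preparing SPSR properly" means before eret, here is a minimal C sketch that builds an SPSR value and decodes the CurrentEL register. The names are hypothetical, and this is not the Embox code, only the architectural encoding as I understand it: the low four bits of SPSR select the target level and stack pointer (for example 0x5 means EL1 using SP_EL1), bits 9..6 carry the DAIF masks, and CurrentEL keeps the level in bits 3..2.

```c
#include <stdint.h>

/* SPSR_ELx mode field for AArch64 targets (bit 4 is 0 for AArch64). */
#define SPSR_MODE_EL0T 0x0u /* EL0, using SP_EL0 */
#define SPSR_MODE_EL1H 0x5u /* EL1, using SP_EL1 */
#define SPSR_MODE_EL2H 0x9u /* EL2, using SP_EL2 */

/* D, A, I, F mask bits live in SPSR bits 9..6. */
#define SPSR_DAIF_ALL (0xFu << 6)

/* Decode a raw CurrentEL value (as read by mrs x0, CurrentEL). */
static inline unsigned current_el(uint64_t currentel_reg)
{
    return (unsigned)((currentel_reg >> 2) & 0x3u);
}

/* Build the value to write into SPSR_ELx before eret.
 * mask_all != 0 keeps all exceptions masked after the return. */
static inline uint64_t spsr_for_eret(unsigned mode, int mask_all)
{
    uint64_t spsr = mode & 0xFu;
    if (mask_all) {
        spsr |= SPSR_DAIF_ALL;
    }
    return spsr;
}
```

For example, dropping from EL2 to EL1 with everything masked would mean writing spsr_for_eret(SPSR_MODE_EL1H, 1), i.e. 0x3c5, to spsr_el2, putting the entry point into elr_el2 and executing eret.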

MMU

Armv8-A implements MMU ARMv8.2 LPA, which is described in chapter D5 of the ARM Architecture Reference Manual for Armv8, Armv8-A.

This MMU model has 4KiB (4 levels of translation tables), 16KiB (4 levels) and 64KiB pages (3 levels). At any intermediate translation table you can place a block to map a large region of memory.
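Here is a rough C sketch of how a virtual address is split for the 4KiB case. It assumes a 48-bit virtual address space (my assumption, other sizes are configurable), where each of the 4 levels consumes 9 bits of the address and the low 12 bits are the offset inside the page. The function names are hypothetical.

```c
#include <stdint.h>

#define PAGE_SHIFT     12u /* 4KiB pages */
#define BITS_PER_LEVEL 9u  /* 512 entries per translation table */
#define INDEX_MASK     ((1u << BITS_PER_LEVEL) - 1u)

/* Index into the translation table of the given level
 * (0 is the top level, 3 is the level that points to pages). */
static inline unsigned va_index(uint64_t va, unsigned level)
{
    unsigned shift = PAGE_SHIFT + BITS_PER_LEVEL * (3u - level);
    return (unsigned)((va >> shift) & INDEX_MASK);
}

/* Amount of memory covered by one entry at the given level.
 * This is also the size of a block placed at that level. */
static inline uint64_t entry_span(unsigned level)
{
    return 1ull << (PAGE_SHIFT + BITS_PER_LEVEL * (3u - level));
}
```

With these numbers an entry at level 2 covers 2MiB and an entry at level 1 covers 1GiB, which is why a block at an intermediate level is a cheap way to map a large region.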

There are some minor changes: now there are no domains; some status bits were added (e.g. dirty bit).

Apart from memory blocks, this MMU is nothing new, just another unit for virtual memory.

Advanced SIMD

AdvSIMD replaced NEON, and there are major changes both in memory organization and instruction set.

NEON had 16 128-bit registers.

Now there are 32 128-bit registers.

[Figure: Aarch64 registers for FP and SIMD operations]

Reference for SIMD/FP commands is available here.

Basic FP operations:

 fadd s0, s1, s2 /* s0 = s1 + s2 */
 fmul d0, d1, d2 /* d0 = d1 * d2 */

Basic SIMD operations:

 /* Just for comparison: NEON instructions have a suffix to specify the operation width */
 /* q0 = q1 + q2, each register contains four 32-bit integers */
 vadd.i32 q0, q1, q2
 
 /* AdvSIMD: registers have suffixes to specify the access width */
 /* v0 = v1 + v2, each register contains four 32-bit integers */
 add v0.4s, v1.4s, v2.4s
 /* Sum all elements of the v1 vector and write the result to s1 */
 addv s1, v1.4s
 /* Write 0 to every element of v1 vector */
 movi v1.4s, 0x0
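If the difference between a lane-wise add and an add across the vector is not obvious, the following plain C model may help. It is only a sketch of the semantics with made-up names (`vec4s_t`, `vec_add_4s`, `vec_addv_4s`); it doesn't use real SIMD instructions or intrinsics. It models the integer forms, where a register in the .4s arrangement is treated as four 32-bit lanes.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit register viewed as four 32-bit lanes. */
typedef struct {
    uint32_t lane[4];
} vec4s_t;

/* Lane-wise add, like add vD.4s, vN.4s, vM.4s:
 * each lane is added independently and wraps modulo 2^32. */
static inline vec4s_t vec_add_4s(vec4s_t a, vec4s_t b)
{
    vec4s_t r;
    for (int i = 0; i < 4; i++) {
        r.lane[i] = a.lane[i] + b.lane[i];
    }
    return r;
}

/* Add across the vector, like addv with a .4s source:
 * all four lanes are summed into one 32-bit scalar. */
static inline uint32_t vec_addv_4s(vec4s_t a)
{
    return a.lane[0] + a.lane[1] + a.lane[2] + a.lane[3];
}
```

The floating-point variants (fadd and friends) follow the same lane layout, only the per-lane operation differs.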

Platforms

QEMU

QEMU supports Aarch64; `virt` is one of the available platforms. To run it in Aarch64 mode, pass `-cpu cortex-a53` like this:

qemu-system-aarch64 -M virt -cpu cortex-a53 -kernel ./embox -m 1024 -nographic # ./embox — is ELF with OS kernel

Good news: a lot of peripheral devices are the same as on “old” ARM platforms, for example, PL011 for UART, ARM Generic Interrupt Controller and so on. Of course, they have different register base addresses and IRQ numbers, but that’s not a big problem. On Aarch64 systems they still support 32-bit register access, so the drivers work without code changes.
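As an illustration of why such drivers carry over, here is a minimal polling transmit routine for PL011 in C. It is a sketch, not the Embox driver: the function name is hypothetical, and the base address mentioned below (0x09000000 for the UART on QEMU `virt`) is my assumption about the current QEMU memory map, so check it against the device tree. The register offsets (data register at 0x00, flag register at 0x18, TX FIFO full in bit 5) come from the PL011 documentation, and every access is 32 bits wide.

```c
#include <stdint.h>

#define PL011_DR      0x00u     /* Data register offset */
#define PL011_FR      0x18u     /* Flag register offset */
#define PL011_FR_TXFF (1u << 5) /* Transmit FIFO full */

/* Send one character: wait until the TX FIFO has room,
 * then write the byte to the data register.
 * base points to the start of the PL011 register block. */
static inline void pl011_putc(volatile uint32_t *base, char c)
{
    while (base[PL011_FR / 4] & PL011_FR_TXFF) {
        /* busy-wait until there is space in the FIFO */
    }
    base[PL011_DR / 4] = (uint32_t)(unsigned char)c;
}
```

On the target you would call it as pl011_putc((volatile uint32_t *)0x09000000, 'A') with that address mapped as device memory (or with the MMU off). Taking the base as a parameter is a deliberate choice: only the address changes between the 32-bit and 64-bit platforms, and the routine can be tried on a host against a plain array.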

QEMU starts the image in EL1.

i.MX8

i.MX8MQ Nitrogen8M was the initial reason why I ported Embox to Aarch64.

For some reason U-Boot starts the system with the MMU turned on and in EL2; that causes problems with GICv3 configuration and other stuff.

Some drivers from i.MX6 (UART, FEC for Ethernet) worked well without major code changes; GICv3 was the biggest problem. I haven’t provided full support for Aarch64 yet, but basically it works (you can run basic console commands, for example).

Conclusion

It’s not that hard to write Aarch64 modules if you’re familiar with ARM concepts. Of course you can get stuck on some problems, but there are plenty of docs to help you solve them.

You can try out Embox with Aarch64 on QEMU; the source code is available from our GitHub repo.

Useful links

A64 instruction set
Fundamentals of ARMv8-A
ARM Architecture Reference Manual for Armv8, Armv8-A
Aarch64 ABI (call convention)
Migrating code from ARM to ARM64 — good presentation with recommendations for writing portable code