Journey to bare-metal main()

Bao Nguyen
6 min read · May 25, 2023

To learn more about OS development and get some hands-on experience with ARM, especially ARM64, I decided to build a Type-1 hypervisor on a Raspberry Pi. This post summarizes what I learned while getting the boot path as far as C code. My development setup includes:

  • 1 Raspberry Pi 3A+, which has a Broadcom BCM2837B0, Cortex-A53 64-bit SoC
  • MacBook Pro 16 with M1 Pro
  • QEMU 7.2.0 for emulation
  • clang+llvm-16.0.0-arm64-apple-darwin22.0 for cross-compilation and debugging

I mostly followed the design and implementation of Xen on ARM64. Combined with the Raspberry Pi boot process, this resulted in the following high-level steps for booting a bare-metal hypervisor:

  1. Initialize boot CPU
  2. Create boot page tables
  3. Enable MMU
  4. Switch to runtime virtual mapping
  5. Set up a runtime stack
  6. Jump into C

Initialize boot CPU

With the RPi bootloader, only 1 CPU is booted, and it starts executing at EL2. The other cores are halted, so we don’t need to check the CPU ID when starting.

This section is fairly simple, as the main thing we need to do is configure the EL2 Translation Control Register, TCR_EL2. The settings are mostly brought over from Xen:

  • 48-bit VA space
  • PT walks use Inner-Shareable accesses
  • PT walks are write-back, write-allocate in both cache levels
  • Top byte of the address is used in translation (TBI is left clear)

Then SCTLR_EL2 (the System Control Register) is configured with the MMU and D-cache turned off.
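
As a rough sketch, the settings listed above could be composed like this in C. The macro and function names are made up for illustration, the field positions are the non-VHE TCR_EL2 layout from the Arm architecture manual, and the PS (physical address size) field is left at zero for brevity, which a real port would probably want to fill in.

```c
#include <assert.h>
#include <stdint.h>

/* TCR_EL2 fields (illustrative names, non-VHE layout). */
#define TCR_T0SZ(va_bits) ((uint64_t)(64 - (va_bits)) & 0x3f) /* bits [5:0] */
#define TCR_IRGN0_WBWA    (UINT64_C(1) << 8)   /* inner write-back, write-allocate */
#define TCR_ORGN0_WBWA    (UINT64_C(1) << 10)  /* outer write-back, write-allocate */
#define TCR_SH0_IS        (UINT64_C(3) << 12)  /* walks are Inner-Shareable */
#define TCR_TG0_4K        (UINT64_C(0) << 14)  /* 4KB granule */
#define TCR_RES1          ((UINT64_C(1) << 31) | (UINT64_C(1) << 23))
/* TBI (bit 20) is deliberately not set, so the top byte is used. */

/* SCTLR_EL2 bits touched at this stage. */
#define SCTLR_M (UINT64_C(1) << 0) /* MMU enable */
#define SCTLR_C (UINT64_C(1) << 2) /* data cache enable */

/* Value to write to TCR_EL2 for the given virtual address width. */
static inline uint64_t boot_tcr_el2(unsigned va_bits)
{
    assert(va_bits >= 25 && va_bits <= 48);
    return TCR_RES1 | TCR_TG0_4K | TCR_SH0_IS | TCR_ORGN0_WBWA |
           TCR_IRGN0_WBWA | TCR_T0SZ(va_bits);
}

/* Clears the MMU and D-cache enable bits, leaving everything else alone. */
static inline uint64_t boot_sctlr_el2(uint64_t sctlr)
{
    return sctlr & ~(SCTLR_M | SCTLR_C);
}
```

With a 48-bit VA space, T0SZ works out to 64 - 48 = 16. On the real hardware these values would be written with msr instructions from the assembly entry code.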

Create boot page tables

This is where it gets interesting. Let’s start off by trying to understand why we need these page tables.

When the RPi boots up, we’re dealing directly with physical addresses, and there is an offset between where we’re linked and where the bootloader actually loads us. For example, the bootloader loads us at physical address 0x80000, while our linker script places us at a high address such as 0x0000008000280000; let’s call this HYP_VIRT_START. This means there is an offset of (HYP_VIRT_START - 0x80000) between the addresses the software sees and the physical addresses backing them. We want the page tables to map HYP_VIRT_START to 0x80000 so that, after enabling the MMU, we can consistently refer to high virtual addresses only. This is especially important for the Program Counter (PC): it starts out at physical address 0x80000, and at some point we want it to jump into our virtual address range.
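
To make the offset concrete, here is a minimal sketch of the arithmetic. It assumes a load address of 0x80000 (the usual 64-bit Raspberry Pi kernel load address), and the helper names are hypothetical, not taken from the actual hypervisor.

```c
#include <assert.h>
#include <stdint.h>

#define HYP_VIRT_START  UINT64_C(0x0000008000280000) /* link address from the text */
#define HYP_LOAD_PADDR  UINT64_C(0x80000)            /* assumed load address */
#define HYP_PHYS_OFFSET (HYP_VIRT_START - HYP_LOAD_PADDR)

/* Translate an address inside the image from its link-time VA to its PA. */
static inline uint64_t hyp_virt_to_phys(uint64_t va)
{
    assert(va >= HYP_VIRT_START);
    return va - HYP_PHYS_OFFSET;
}

/* Translate a PA inside the loaded image back to its link-time VA. */
static inline uint64_t hyp_phys_to_virt(uint64_t pa)
{
    assert(pa >= HYP_LOAD_PADDR);
    return pa + HYP_PHYS_OFFSET;
}
```

Before the MMU is on, any symbol address the linker hands us has to go through something like hyp_virt_to_phys() before it can be dereferenced.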

Let’s quickly review the translation scheme for AArch64. For simplicity we’ll use a 4KB translation granule, which for a 48-bit address space uses up to 4 levels of translation tables (L0 to L3). The L3 table is the last level, and each of its entries points to a single 4KB page.

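The translation scheme is pretty straightforward. Given an input address, we slice it up to get an index into each of the translation tables: bits [47:39] index the L0 table, bits [38:30] the L1 table, bits [29:21] the L2 table and bits [20:12] the L3 table, while bits [11:0] are the offset within the page. The base physical address of the L0 table is stored in TTBR0_EL2 (Translation Table Base Register).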
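
The slicing can be sketched in a few lines of C. The function names are illustrative, and the code assumes the 4KB granule and 48-bit address space described above.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* 4KB pages */
#define LEVEL_BITS 9  /* 512 entries per table */
#define LEVEL_MASK ((UINT64_C(1) << LEVEL_BITS) - 1)

/* Index into the translation table at `level` (0 = L0 ... 3 = L3). */
static inline unsigned pt_index(uint64_t va, unsigned level)
{
    assert(level <= 3);
    unsigned shift = PAGE_SHIFT + LEVEL_BITS * (3 - level);
    return (unsigned)((va >> shift) & LEVEL_MASK);
}

/* Byte offset of the address within its 4KB page. */
static inline unsigned page_offset(uint64_t va)
{
    return (unsigned)(va & ((UINT64_C(1) << PAGE_SHIFT) - 1));
}
```

For 0x0000008000280000 this gives indices 1, 0, 1 and 128 for L0 through L3.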

Each page table entry is a 64-bit value, and we use 9 bits to index into any page table, so each table holds 512 entries and takes 512 × 8 = 4096 bytes (4KB), which is exactly 1 page. We can therefore reserve some pages in the DATA section for these page tables, which can be done in our linker script.
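
The size arithmetic can be checked at compile time. The sketch below reserves the tables as page-aligned C objects rather than through the linker script, which is an alternative chosen purely for illustration; the type and variable names are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define PAGE_SIZE  4096
#define PT_ENTRIES 512

typedef uint64_t pte_t;

/* One translation table: 512 eight-byte entries, aligned to a page. */
typedef struct {
    alignas(PAGE_SIZE) pte_t entry[PT_ENTRIES];
} page_table_t;

static_assert(sizeof(page_table_t) == PAGE_SIZE,
              "a translation table must fill exactly one 4KB page");

/* Storage for the L0, L1, L2 and L3 boot tables (zero-initialized). */
page_table_t boot_pgtable[4];

/* True if the table starts on a page boundary, as the MMU requires. */
static inline int pt_is_page_aligned(const page_table_t *pt)
{
    return ((uintptr_t)pt % PAGE_SIZE) == 0;
}
```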

We want to map HYP_VIRT_START to 0x80000 with these page tables. Using the above translation scheme with HYP_VIRT_START equal to 0x0000008000280000, we populate them like this:

  • L0 table: slot 1 (counting from 0) → L1 table physical address
  • L1 table: slot 0 → L2 table physical address
  • L2 table: slot 1 → L3 table physical address
  • L3 table: slots 128 to 511 → physical pages from 0x80000 to 0x1FF000
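
Here is a hedged sketch of filling in those four tables. The descriptor bits (valid, table/page type, access flag, shareability) follow the Armv8-A long-descriptor format, but memory attribute indices and access permissions are left out, and every name here is invented for the example. The physical addresses of the L1 to L3 tables are passed in as parameters, since where they live depends on the linker script.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE     UINT64_C(4096)
#define PT_ENTRIES    512
#define PTE_VALID     (UINT64_C(1) << 0)
#define PTE_TABLE     (UINT64_C(1) << 1)  /* L0-L2: next table, L3: page */
#define PTE_SH_INNER  (UINT64_C(3) << 8)  /* Inner-Shareable */
#define PTE_AF        (UINT64_C(1) << 10) /* access flag, avoids AF faults */
#define PTE_ADDR_MASK UINT64_C(0x0000fffffffff000)

/* Descriptor pointing at the next-level table. */
static inline uint64_t table_desc(uint64_t next_table_pa)
{
    assert((next_table_pa & ~PTE_ADDR_MASK) == 0);
    return next_table_pa | PTE_TABLE | PTE_VALID;
}

/* L3 descriptor mapping one 4KB page. */
static inline uint64_t page_desc(uint64_t page_pa)
{
    assert((page_pa & ~PTE_ADDR_MASK) == 0);
    return page_pa | PTE_AF | PTE_SH_INNER | PTE_TABLE | PTE_VALID;
}

/* Fill the slots listed above: L0[1], L1[0], L2[1] and L3[128..511]. */
static inline void populate_boot_tables(uint64_t *l0, uint64_t *l1,
                                        uint64_t *l2, uint64_t *l3,
                                        uint64_t l1_pa, uint64_t l2_pa,
                                        uint64_t l3_pa)
{
    l0[1] = table_desc(l1_pa);
    l1[0] = table_desc(l2_pa);
    l2[1] = table_desc(l3_pa);
    for (unsigned i = 128; i < PT_ENTRIES; i++)
        l3[i] = page_desc(UINT64_C(0x80000) + (i - 128) * PAGE_SIZE);
}
```

Slots 128 to 511 cover 384 pages, so the last page mapped starts at 0x80000 + 383 × 0x1000 = 0x1FF000.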

That’s about it! Well, actually not yet… As mentioned above, an interesting aspect of turning on the MMU and using these page tables is the Program Counter. How do we switch it from our physical range at around 0x80000 to our virtual range at HYP_VIRT_START? Recall that immediately after switching on the MMU, the low address range is unmapped, so any access to it (including the next instruction fetch) will fault.

So to support the PC until it has jumped to our VA mapping, we need another temporary mapping called the identity (or 1:1) mapping. It simply maps the low VA range directly onto the same low PA range, so that after turning on the MMU we can still fetch from the low addresses and continue execution until we jump to the high VA range. Since this should only take a few instructions, we only need 1 page for the identity map.
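
One practical question is whether the identity mapping can share tables with the runtime mapping. The sketch below (hypothetical helper names, and assuming the code runs from physical address 0x80000) finds the first table level at which two addresses use different slots.

```c
#include <assert.h>
#include <stdint.h>

/* Index into the table at `level` (0..3) for a 4KB granule. */
static inline unsigned pt_index(uint64_t va, unsigned level)
{
    assert(level <= 3);
    return (unsigned)((va >> (12 + 9 * (3 - level))) & 0x1ff);
}

/*
 * Returns the first level (0..3) at which the two addresses index
 * different slots, or 4 if they fall within the same 4KB page.
 */
static inline unsigned first_diverging_level(uint64_t va_a, uint64_t va_b)
{
    for (unsigned level = 0; level <= 3; level++) {
        if (pt_index(va_a, level) != pt_index(va_b, level))
            return level;
    }
    return 4;
}
```

For 0x80000 against 0x0000008000280000 the answer is level 0: the identity page sits under L0 slot 0 and the runtime mapping under L0 slot 1. Under those assumptions the identity map needs its own L1, L2 and L3 tables, with the page itself in L3 slot 128.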

Set up fixmap

A small side note before we enable the MMU. On the Raspberry Pi we interact with hardware devices (UART, SPI, etc.) via memory-mapped registers, known as Memory-Mapped IO (MMIO). These devices are mapped at specific addresses in the physical address space, and we read from and write to those addresses to control them.

Similar to our text section, after enabling the MMU, we will not have access to these MMIO devices as their physical addresses are unmapped. We need to reserve a virtual address region to map these devices.

This region is called fixmap, and we can reserve 2MB for it right after our text, data, bss sections. Currently I only map the UART device in this region, but it can be extended to support other things. This is important if we want to initialize and use UART early in the boot phase for debugging purposes.
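
A minimal sketch of the address arithmetic for such a region follows. Several things here are assumptions rather than details from the actual code: the region is rounded up to the next 2MB boundary after the image so that it gets an L3 table of its own, each device gets one 4KB slot, and the UART is the PL011 at 0x3F201000 in the BCM2837 peripheral window.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE       UINT64_C(0x1000)
#define FIXMAP_SIZE     UINT64_C(0x200000)   /* 2MB, i.e. 512 slots */
#define FIXMAP_UART     0u                   /* hypothetical slot number */
#define RPI3_PL011_BASE UINT64_C(0x3F201000) /* assumed UART physical base */

/* First 2MB-aligned virtual address at or after the end of the image. */
static inline uint64_t fixmap_base(uint64_t image_end_va)
{
    return (image_end_va + FIXMAP_SIZE - 1) & ~(FIXMAP_SIZE - 1);
}

/* Virtual address of the start of a fixmap slot. */
static inline uint64_t fixmap_addr(uint64_t base, unsigned slot)
{
    assert(slot < FIXMAP_SIZE / PAGE_SIZE);
    return base + slot * PAGE_SIZE;
}

/* Virtual address of a device register block mapped through a slot. */
static inline uint64_t fixmap_device_va(uint64_t base, unsigned slot,
                                        uint64_t device_pa)
{
    return fixmap_addr(base, slot) + (device_pa & (PAGE_SIZE - 1));
}
```

The L3 entry for a slot would point at the device’s physical page and should use a device memory attribute rather than normal cacheable memory.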

Enable MMU

This step is pretty straightforward, at least on paper… This is also the step where I discovered the most bugs (mostly from previous steps). I will document some errors I encountered after enabling the MMU and before jumping to C-world at the end of the post.

To enable the MMU, all we need to do is point TTBR0_EL2 to the base physical address of the L0 table, then set the M bit (bit 0) of SCTLR_EL2 to turn on the MMU, with some TLB flushing shenanigans along the way, and we’re done.
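
The ordering of those steps can be sketched as follows. In the real boot path this lives in assembly, since there is no stack yet, so the inline-assembly wrapper is only there to show the sequence. It is compiled only when the made-up HYP_BARE_METAL macro is defined, and all function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define SCTLR_M (UINT64_C(1) << 0) /* MMU enable is bit 0 of SCTLR_EL2 */

/* TTBR0_EL2 value for a page-aligned top-level (L0) table. */
static inline uint64_t ttbr0_value(uint64_t l0_table_pa)
{
    assert((l0_table_pa & 0xfff) == 0);
    return l0_table_pa;
}

/* SCTLR_EL2 value with the MMU turned on, other bits untouched. */
static inline uint64_t sctlr_with_mmu_on(uint64_t sctlr)
{
    return sctlr | SCTLR_M;
}

#if defined(__aarch64__) && defined(HYP_BARE_METAL)
/* Must run at EL2 from code that is covered by the identity mapping. */
static inline void enable_mmu(uint64_t l0_table_pa)
{
    uint64_t sctlr;

    /* Drop any stale EL2 TLB entries before installing the tables. */
    __asm__ volatile("tlbi alle2\n\tdsb nsh" ::: "memory");
    __asm__ volatile("msr ttbr0_el2, %0\n\tisb"
                     :: "r"(ttbr0_value(l0_table_pa)) : "memory");
    __asm__ volatile("mrs %0, sctlr_el2" : "=r"(sctlr));
    sctlr = sctlr_with_mmu_on(sctlr);
    /* Make sure table writes are visible, then flip the enable bit. */
    __asm__ volatile("dsb sy\n\tmsr sctlr_el2, %0\n\tisb"
                     :: "r"(sctlr) : "memory");
}
#endif
```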

Switch to runtime virtual mapping

Now we’ll actually do the PC jump that we touched upon above. This is also straightforward, unless we messed up our page tables :) We simply load the virtual address of a function into a register, then branch to it.

After we’re in the virtual mapping, we don’t need the 1:1 mapping anymore and we can remove it.
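
Removing the 1:1 mapping amounts to clearing the entry that roots it and invalidating the TLB. A minimal sketch, assuming the identity map hangs off its own slot in the L0 table; the names and the HYP_BARE_METAL guard are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PT_ENTRIES 512

/*
 * Clears the L0 slot that roots the identity mapping. On real hardware
 * the EL2 TLB is invalidated afterwards so no stale translation survives.
 */
static inline void remove_identity_map(uint64_t *l0, unsigned id_slot)
{
    assert(id_slot < PT_ENTRIES);
    l0[id_slot] = 0;
#if defined(__aarch64__) && defined(HYP_BARE_METAL)
    __asm__ volatile("dsb nshst\n\t" /* table write completes first */
                     "tlbi alle2\n\t" /* invalidate all EL2 entries */
                     "dsb nsh\n\t"
                     "isb" ::: "memory");
#endif
}
```

This must only run once the PC is already in the high mapping, otherwise the very next instruction fetch faults.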

Set up a runtime stack

One last thing before jumping into C: we need to prepare a stack, since compiled C code makes heavy use of the Stack Pointer (SP) register.

Similar to the page tables, we can reserve a DATA portion for our stack using the linker script. I started with 1 page for the stack size.
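
One detail that is easy to get wrong is which end of the reserved page goes into SP. The stack grows downwards on AArch64, so SP starts at the top of the region, and the procedure call standard expects it to be 16-byte aligned. A sketch, with the stack reserved as a C array instead of through the linker script purely for illustration:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_SIZE 4096 /* 1 page, as in the text */

/* Hypothetical boot stack storage, page-aligned. */
alignas(4096) unsigned char boot_stack[STACK_SIZE];

/* Initial SP value: one past the highest byte of the stack region. */
static inline uintptr_t stack_top(const void *base, size_t size)
{
    uintptr_t top = (uintptr_t)base + size;
    assert((top & 15) == 0); /* SP must stay 16-byte aligned */
    return top;
}
```

The assembly entry code would then move stack_top(boot_stack, STACK_SIZE), or the equivalent linker symbol, into sp before branching to C.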

Jump into C

And… that’s it! Now we just need to load the address of our main() function, and jump to it!

If only everything went just as smoothly… Below are some of the issues I encountered along the way, mostly related to setting up the page tables:

  • The instruction right after enabling the MMU fails:
    This means that no mapping covers the low physical address range that the PC started in. Double check that you have a correct 1:1 mapping for this range.
  • Jumping to the runtime virtual mapping fails:
    This most likely means that the virtual mapping is not working as expected. Take the virtual address you’re jumping to, calculate its translation table indices, and verify that the page table entries at those indices are correct, most importantly in the L3 table.
  • Jumping to C fails:
    This might mean we didn’t set up the stack properly. Double check that the SP holds a usable value that is within a mapped VA range.
    This could also mean the text section containing main() was not mapped. Double check where it is placed by the linker script and ensure that range is mapped in the page tables.
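
For this kind of debugging, a small software table walker can help, for example run on the host or from early boot code before the MMU is on, when table physical addresses can be dereferenced directly. The sketch below is not part of the original code. It reports the level at which translation breaks and treats block entries at L1/L2 as failures, since the tables above never use them.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_TYPE_MASK UINT64_C(3)
#define PTE_TABLE_OR_PAGE UINT64_C(3) /* valid bit + table/page bit */
#define PTE_ADDR_MASK UINT64_C(0x0000fffffffff000)

/*
 * Walks the tables for `va` starting from the L0 table. Returns -1 and
 * stores the physical address in *pa_out on success, otherwise returns
 * the level (0..3) whose entry is missing or of an unexpected type.
 * Assumes table addresses stored in the entries can be dereferenced.
 */
static inline int walk_va(const uint64_t *l0, uint64_t va, uint64_t *pa_out)
{
    const uint64_t *table = l0;

    assert(l0 != 0 && pa_out != 0);
    for (int level = 0; level <= 3; level++) {
        unsigned idx = (unsigned)((va >> (12 + 9 * (3 - level))) & 0x1ff);
        uint64_t desc = table[idx];

        if ((desc & PTE_TYPE_MASK) != PTE_TABLE_OR_PAGE)
            return level;
        if (level == 3) {
            *pa_out = (desc & PTE_ADDR_MASK) | (va & 0xfff);
            return -1;
        }
        table = (const uint64_t *)(uintptr_t)(desc & PTE_ADDR_MASK);
    }
    return 3; /* not reached */
}
```

Calling it with the address you are about to branch to, or with the initial SP minus 16, shows quickly whether the mapping is missing and at which table.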

That’s it for now! After writing this I realized it sounds much simpler on paper than it actually was… But it was a fun learning experience. Now onto the next steps of building a hypervisor!
