Morello and memory pools

Kev Jackson · THG Tech Blog · Sep 14, 2023 · 10 min read

The team has been working on a project as part of UKRI’s Digital Security by Design research. Our work has focused on the mitigation of network-level attacks, and our software stack is primarily a C project (with eBPF programs to take advantage of zero-copy XDP (AF_XDP) in newer versions of the Linux kernel).

An Arm Morello processor implementing the CHERI architecture

At the time of writing, the most well-developed OS for the Morello processor is a fork of FreeBSD called CheriBSD. Unfortunately for our project, FreeBSD doesn’t contain the eBPF and AF_XDP support needed to run the entire system. (A port of Linux to the Morello architecture is underway, but incomplete at the time of writing.)

Linello — Debian on Morello

However, we can still utilise the Morello architecture and compile much of our code on this processor to understand the CHERI extended hardware protection and, crucially, to confirm that our code is as robust as it can be on both Morello and x86_64/amd64 ISAs.

CHERI

CHERI is an acronym for “Capability Hardware Enhanced RISC Instructions” (which ties in nicely with Arm naming its CPU “Morello”, a variety of cherry).

What this means in practice is that the hardware itself has the concept of a “capability”. Hardware capabilities are currently implemented for two ISAs: Arm’s Morello and, more recently, RISC-V. Key to this implementation is the fact that a pointer in CHERI is 129 bits, a 128-bit capability plus a 1-bit validity tag, instead of the industry-standard 64 bits. This allows a CHERI pointer to contain not only the address of something, but also the capabilities associated with that address (and the tag bit).

validity tag, permissions and bounds along with normal address
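A quick way to make this concrete is to check the size of a pointer in a purecap build. This is a minimal sketch; the value you should see is noted in a comment rather than claimed as output.

#include <stdio.h>

int main(void)
{
    /* Under the purecap ABI a pointer is a 128-bit capability in
     * memory (the validity tag is stored out of band), so sizeof
     * reports 16 here; a conventional 64-bit ABI reports 8. */
    printf("sizeof(void *) = %zu\n", sizeof(void *));
    return 0;
}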

As the size of a pointer on CHERI-enabled hardware is different, the C compiler needs to be able to produce different object code. Pointers themselves become capabilities carrying these additional data, and a new integer type, ptraddr_t, is added to hold a raw address on its own. The LLVM/Clang toolchain has been modified by the CHERI team to work with these new pointers.

Limitations

Over the course of our research with the prototype hardware and the provided toolchain, we have come across a few limitations. A major problem is that the additional CHERI capabilities, bounds information and validity tag are only useful on CHERI-enabled hardware: obvious really, right?

The issue with this restriction is that any accelerator card (GPU, FPGA, etc.) added to a CHERI host doesn’t understand these special pointer types. Worse still, the capabilities are, by necessity, stripped out as any data is copied to or from an add-on accelerator card. There were early inklings that there would be limits in this area, but the full implications didn’t sink in until we had a software solution that we wanted to port to an FPGA (to compare results against an FPGA-accelerated version) and the team were digging into CCIX and CXL.

The only workaround involved lots of copies of the data to and from the FPGA, removing any performance advantage it might have provided, and we would still have lost the protection of the CHERI pointers!

But surely there wouldn’t be any issues with a user-space application running entirely on a CHERI enabled CPU?

CHERI pointers and protection

We’ve already established that a CHERI pointer carries additional information, including “capabilities” and bounds. These allow the hardware itself to signal a fault if an address is accessed when it shouldn’t be, for example when a read-only pointer is used for a write. This is an excellent mechanism for preventing an entire class of software bugs: use-after-free, out-of-bounds access, incorrect pointer arithmetic and similar errors make up a large fraction of the security vulnerabilities found in native code.

If we compile our code for aarch64c with the purecap ABI enabled as a compiler option, then the Morello hardware will raise SIGPROT when/if our code violates these rules.

19/06/23 06:58:00 BST mempooltest | 00 | INFO | Compiled with --enable-mempool-alt-free and --enable-mempool-bound-enforcement:
19/06/23 06:58:00 BST mempooltest | 00 | INFO | allocating stuff
19/06/23 06:58:00 BST mempooltest | 00 | INFO | finished allocating junk; num of blobs on list[1000000]
19/06/23 06:58:00 BST mempooltest | 00 | INFO | releasing time!
In-address space security exception (core dumped)

Here you can see example log output from some test harness code that was incorrect when freeing some memory: the hardware protection kicked in and the process was aborted instead of continuing with incorrectly freed memory.

The above test harness code was designed to exercise specific alloc/free patterns we needed to test (which we’ll get to later); however, simply running our code on CheriBSD also revealed some previously unknown errors, which were caught by the hardware protection.

Example: incorrect sizeof

In a different example, we were accidentally allocating some memory based on the size of a function, rather than the size of the struct of the same name:

kevent function — sizeof is size of pointer to function
struct kevent — sizeof is correct
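As a minimal sketch of the shape of this bug (our real code differs, and the helper below is purely illustrative): in C, struct tags live in their own namespace, so kevent the function and struct kevent coexist, and sizeof applied to the wrong one silently under-allocates.

#include <stdlib.h>
#include <sys/event.h>  /* declares both the kevent() function and struct kevent */

struct kevent *alloc_events(size_t nevents)
{
    /* Bug: &kevent denotes the kevent() function, so sizeof(&kevent)
     * is only the size of a function pointer, far smaller than
     * sizeof(struct kevent); later accesses run off the end.
     *
     *   struct kevent *evs = malloc(nevents * sizeof(&kevent));
     */

    /* Fix: name the struct type explicitly. */
    struct kevent *evs = malloc(nevents * sizeof(struct kevent));
    return evs;
}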

The code in this example is part of our BSD build, as we do not use kqueue in our Linux builds. FreeBSD did not notice this error when we later started reading and writing some uninitialised memory. CheriBSD running on the Morello hardware, however, threw a SIGPROT, because these later reads and writes were out of bounds of the pointer we had received from our malloc routine.

This is just a small example of how the CHERI hardware can help to prevent memory safety bugs causing further problems in later processing.

Slabs, mempools and allocators

The application we are working on relies on extremely low-latency, high-performance code. We spend a lot of time and effort rewriting parts of the code to achieve our goals. One of the techniques we have used is the concept of a “memory pool”.

Memory pools allow us to pre-allocate a large amount of contiguous memory to store a pool of structs/types. We can then take a struct directly from the memory pool on a fast path, without paying for malloc (and any system call it may make under the hood). When releasing memory back to the pool (our equivalent of free) we likewise avoid that cost.
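From the caller’s point of view, the pool looks something like this. The function and type names here are hypothetical, sketched only to show the intended usage pattern.

#include <stddef.h>

/* Hypothetical API, for illustration only. */
typedef struct mempool mempool_t;
mempool_t *mempool_new(size_t elem_size);
void *mempool_alloc(mempool_t *pool);
void mempool_release(mempool_t *pool, void *elem);

struct my_msg { int id; char payload[64]; };

void example(void)
{
    /* one pool per type: every allocation is the same size */
    mempool_t *pool = mempool_new(sizeof(struct my_msg));

    /* fast path: pointer bookkeeping only, no malloc */
    struct my_msg *m = mempool_alloc(pool);
    m->id = 1;

    /* "free" returns the slot to the pool, not to the OS */
    mempool_release(pool, m);
}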

Memory pools are nothing new, and the Linux Kernel (and other OSs) ship with different implementations of memory pools — and the techniques often go hand-in-hand with selecting specific allocator algorithms.

The Linux kernel actually has support for SLAB, SLUB and SLOB; however, SLUB has been the default allocator for some time, SLOB is intended for memory-constrained environments, and the original SLAB allocator is rarely used these days. All of these are variants of slab allocators.

A (toy) memory pool implementation

In our case, we’ve implemented a memory pool based on a simplified version of an allocator in the Tor Project, which is in turn similar to the storage allocator described in K&R.

We have doubly linked lists to maintain the pool’s blocks: blocks with space to allocate from, blocks in use and, finally, a list of full blocks. Our implementation is a little different in that we consider a much simpler case: every allocation from our memory pool is an identical size. Essentially, our allocations are sized to a single struct (or type).

Obviously a SLAB/SLUB/SLOB allocator for the kernel needs to take varying allocation sizes into account, but we can ignore this requirement for our use case, which simplifies the code.

The structs for our toy memory pool

An interesting trick in the elem_alloc struct is to use a union holding either a pointer to the next free allocation or the element data itself.
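Reconstructed from the description (the original listing was a screenshot), the structs look roughly like this; any field not named in the text is a guess.

#include <stddef.h>

typedef struct memblock memblock_t;

/* One allocation slot: a back-pointer to the owning block, then a
 * union that is either the free-list link (slot unused) or the start
 * of the caller's element data (slot in use). */
typedef struct elem_alloc {
    memblock_t *block;                 /* block this slot belongs to */
    union {
        struct elem_alloc *next_free;  /* next slot on the free list */
        char elem[1];                  /* element data starts here */
    } mem;
} elem_alloc_t;

/* A block: a node on one of the pool's doubly linked lists, holding
 * a contiguous run of slots and its own free list. */
struct memblock {
    memblock_t *prev, *next;
    size_t n_allocated;                /* slots currently handed out */
    size_t n_slots;                    /* total slots in this block */
    elem_alloc_t *first_free;          /* head of the slot free list */
    char slots[];                      /* the slots themselves */
};

/* The pool itself: the three block lists plus sizing information. */
typedef struct mempool {
    memblock_t *empty_blocks;
    memblock_t *used_blocks;
    memblock_t *full_blocks;
    size_t requested_elem_size;        /* caller-visible element size */
    size_t slot_size;                  /* element + metadata, aligned */
} mempool_t;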

When we allocate from our pool, there are three scenarios:

  1. There is a block with a free slot on the used_blocks list: we can just grab the data pointer from it and use it.
  2. Otherwise, we need to check our empty_blocks list for an empty block.
  3. If we don’t have an empty block either, we allocate a new block and adjust the housekeeping variables and lists.

Here’s roughly what this looks like in code. This is a sketch using the structs above; the list-bookkeeping helpers (move_to_used, move_to_full) are declared but their bodies are omitted.
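static memblock_t *block_new(mempool_t *pool);            /* the only malloc */
static void move_to_used(mempool_t *pool, memblock_t *b);
static void move_to_full(mempool_t *pool, memblock_t *b);

void *mempool_alloc(mempool_t *pool)
{
    memblock_t *block;

    if (pool->used_blocks != NULL) {
        /* 1. a partially used block still has a free slot */
        block = pool->used_blocks;
    } else if (pool->empty_blocks != NULL) {
        /* 2. otherwise reuse an empty block */
        block = pool->empty_blocks;
        move_to_used(pool, block);
    } else {
        /* 3. nothing free anywhere: allocate a fresh block */
        block = block_new(pool);
        move_to_used(pool, block);
    }

    /* pop a slot off the block's free list */
    elem_alloc_t *alloc = block->first_free;
    block->first_free = alloc->mem.next_free;
    block->n_allocated++;

    if (block->first_free == NULL)
        move_to_full(pool, block);     /* block is now exhausted */

    return alloc->mem.elem;
}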

The result of this is that we only ever call malloc (in the block_new function), and potentially pay for a system call, if there is no free space in any of the in-use blocks or in the empty blocks. If we have free space, we simply return a pointer to it and adjust the housekeeping lists and variables appropriately.

The next step is how to handle freeing memory (or, rather, releasing it back to the memory pool).

One feature of this style of memory management is that “freeing” involves moving blocks of previously allocated memory between the full/used/empty lists, not actually returning the memory to the OS. This could have dire consequences (a memory leak) if our use case weren’t bounded in some other fashion. It would be simple to adjust the pool to properly free blocks in a background thread based on a TTL, or to free once a fixed size is reached.
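A sketch of that release path, again reconstructed and reusing the structs from earlier; the list-migration helpers are hypothetical.

#include <stddef.h>

static void move_to_used(mempool_t *pool, memblock_t *b);
static void move_to_empty(mempool_t *pool, memblock_t *b);

void mempool_release(mempool_t *pool, void *elem)
{
    /* recover the slot header from the element pointer */
    elem_alloc_t *alloc =
        (elem_alloc_t *)((char *)elem - offsetof(elem_alloc_t, mem.elem));
    memblock_t *block = alloc->block;
    int was_full = (block->first_free == NULL);

    /* push the slot back onto the block's free list */
    alloc->mem.next_free = block->first_free;
    block->first_free = alloc;
    block->n_allocated--;

    if (was_full)
        move_to_used(pool, block);     /* full -> used */
    else if (block->n_allocated == 0)
        move_to_empty(pool, block);    /* used -> empty */
}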

Morello idiosyncrasies

Based on our working memory pool/block allocation, we built our software and then started adapting the code to take advantage of Morello features. First of all, we compiled in purecap and gained a wonderful, smug sense of security that our code was properly respecting address bounds, as the CPU would signal a fault otherwise.

Further reading of the CHERI C programming guide, however, turned up the following:

7.6 Implications for memory-allocator design

“One use case of these APIs is high-performance applications that contain custom memory allocators and wish to narrow the bounds of returned pointers.”

This is exactly our use case, and upon re-reading the guide we realised that our assumed memory safety was built on a false sense of security!

Bounds restrictions

Hmm this isn’t what we intended at all!

When we request an allocation of an element (struct) from our memory pool, the pointer we’re given back has its upper bound set to the maximum extent of the containing block. This flaw essentially means that we can easily address memory outside of the struct (up to the upper bound of the block), breaking one of the supposed security features of Morello!
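You can see the problem by inspecting the capability’s bounds on a freshly allocated element. cheri_length_get comes from cheriintrin.h; the pool calls are our own, as sketched above.

#include <stdio.h>
#include <cheriintrin.h>

void show_bounds(mempool_t *pool)
{
    void *elem = mempool_alloc(pool);

    /* Without narrowing, the capability's length covers the whole
     * containing block rather than the requested element size. */
    printf("requested %zu bytes, capability length %zu\n",
           pool->requested_elem_size, cheri_length_get(elem));
}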

To fix this issue during the allocation path, we need to extract the element from the element container struct (elem + metadata):

void *elem = &(alloc->mem.elem);

Then we need to call the appropriate CHERI API to reset the upper bounds of this pointer:

elem = cheri_bounds_set(elem, pool->requested_elem_size);

As we store additional metadata in our containing struct, we need to ensure that we set the bounds to the original requested size, not the size of the elem_alloc_t.

This container holds either the element or a pointer to the next free allocation and, importantly, a pointer back to the block that contains this element…

More like it

With the allocation path now fixed to properly respect the correct upper bounds for an element retrieved from our memory pool, it was time to consider the free path.

To correctly “free” an element, we would need to know which block it was allocated from. Oh: that pointer to the block is now outside of the bounds available to us when we hold a pointer to the element. Well, that sucks!

We need to work out which block our element was originally allocated from, so that we can use the cheri_address_set function to re-derive a pointer with the block’s bounds at the element’s address.

As our allocation strategy is contiguous memory, we can use the element address and check to see if it lies within the address space of each of the blocks on either our used (in-use) list or our full list. In our production code we know that these lists of blocks rarely reach double figures, so a linear walk is not too expensive — although we have the option to swap this out for a table lookup in the future if we notice performance problems associated with this strategy.
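A sketch of that walk, using the struct layout assumed earlier (the block-size arithmetic is therefore also an assumption):

#include <stddef.h>
#include <cheriintrin.h>

/* Walk the used and full lists looking for the block whose address
 * range contains elem. */
static memblock_t *find_elem_block(mempool_t *pool, void *elem)
{
    ptraddr_t addr = cheri_address_get(elem);
    memblock_t *lists[2] = { pool->used_blocks, pool->full_blocks };

    for (int i = 0; i < 2; i++) {
        for (memblock_t *b = lists[i]; b != NULL; b = b->next) {
            ptraddr_t base = cheri_address_get(b);
            ptraddr_t end  = base + sizeof(memblock_t)
                                  + pool->slot_size * b->n_slots;
            if (addr >= base && addr < end)
                return b;
        }
    }
    return NULL;  /* not an element from this pool */
}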

Upon finding the containing block, we have effectively recovered the pointer-to-block metadata that we lost when setting the bounds at “allocation” time, so now we can re-derive the element’s address from the containing block’s capability:

memblock_t *container_block = find_elem_block(pool, elem);
/* the raw address held by the (narrowly bounded) element pointer */
ptraddr_t elem_cap_addr = cheri_address_get(elem);
/* same address, re-derived from the block: block-wide bounds again */
elem = cheri_address_set(container_block, elem_cap_addr);

Re-deriving our elem pointer with the block’s bounds allows us to release the element:
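Putting the pieces together, the CHERI-aware release path looks roughly like this (a reconstruction; the bookkeeping at the end is the toy release path from earlier):

#include <stddef.h>
#include <cheriintrin.h>

void mempool_release(mempool_t *pool, void *elem)
{
    memblock_t *container_block = find_elem_block(pool, elem);

    /* same address, but the block's (wider) bounds once more */
    elem = cheri_address_set(container_block, cheri_address_get(elem));

    /* the slot header is now back in bounds */
    elem_alloc_t *alloc =
        (elem_alloc_t *)((char *)elem - offsetof(elem_alloc_t, mem.elem));

    /* ... free-list and block-list bookkeeping as in the toy pool ... */
    (void)alloc;
}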

Conclusion

With both the allocation and release flows adjusted to take into account the need to adjust the memory bounds of the returned pointers, our final memory pool/block allocator works with CheriBSD on Arm Morello hardware.

We then refactored the calls to the CHERI API out of the simple test harness and introduced our own library code that handled the abstraction between CPU architectures, providing no-op variants of CHERI API calls when running on x86_64/amd64 or Arm (without capabilities).

A fragment of our platform (OS/uarch) abstraction libraries
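It looks something like this; the plat_* macro names are hypothetical, while __CHERI_PURE_CAPABILITY__ is the compiler’s predefined macro for purecap builds.

#if defined(__CHERI_PURE_CAPABILITY__)
#include <cheriintrin.h>

#define plat_bounds_set(ptr, len)   cheri_bounds_set((ptr), (len))
#define plat_address_get(ptr)       cheri_address_get((ptr))
#define plat_address_set(cap, addr) cheri_address_set((cap), (addr))

#else /* x86_64/amd64 or Arm without capabilities: no-ops */
#include <stdint.h>

#define plat_bounds_set(ptr, len)   (ptr)
#define plat_address_get(ptr)       ((uintptr_t)(ptr))
#define plat_address_set(cap, addr) ((void *)(uintptr_t)(addr))

#endif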

Complete example memory pool code can be found here:
