LLVM’s garbage collection facilities and SBCL’s generational GC

Intro, overview

This document comes in three parts:

  • an introduction to basic tradeoffs in GC design (speed in different places, RAM usage, safety, etc.)

Tradeoffs to think about

Keeping it brief:

  • speed versus space, as always

Mainline speed versus GC speed is a highly complex question. There are many applications where you can accept long GC times if they speed up mainline code.

The major example is query-oriented services. Just serve the query without GCing (unless it’s an unusual query), then GC between queries while nobody is waiting on the result. Obviously you want to reduce user-visible latency here by making mainline code as fast as possible, with the least overhead for helping with GC bookkeeping. Just as obviously, total CPU time used by the server over the day can go up dramatically (which also means the power bill). ITA/Google’s QPX search engine is a classic example: it is (mostly) query-oriented, and in some uses users are waiting.

Another example where you want to keep for-GC bookkeeping overhead in mainline code low is when you can “GC-by-fork”, which is basically throwing away processes when they would need to GC, resuming operations from an already positioned server (via Unix fork(2)). Here you only run the slow GC code on unusual queries, and you reduce most queries’ latency by making mainline code fast.

GC-by-fork worked well for me in QPX. However, you cannot use it anymore once you depend on libraries that start threads but do not offer an API to park those threads in a fork-safe position. There are also complex interactions between virtual memory mappings, physical pages, and fork(2) speed in different operating systems. SBCL uses VM page protection to help the GC, and hence creates a large number of individual VM mappings, which can slow down fork on many Linux versions.

Or think of an interactive application sitting there waiting for user input. Why not use that time to GC, and if the user stirs during GC, abandon that GC run? Or fork, GC in the forked process, and then, depending on the timing of user and GC, continue your application in either parent or child. In all those cases you would pick mainline speed over GC speed.

I don’t want to go too deep. I hope I explained why “one size fits all” does not really work in garbage collection. I wrote more on this topic here: https://medium.com/@MartinCracauer/generational-garbage-collection-write-barriers-write-protection-and-userfaultfd-2-8b0e796b8f7f


This is about the slightly conservative generational GC that SBCL uses on i386 and amd64. SBCL has other GCs as well, but gencgc is the most tuned one since it is used in pretty much all production uses of SBCL.

LLVM has several different mechanisms for GC support (a plugin loader and a bunch of other choices), and things change quite rapidly. It also turns out that basic memory allocation in SBCL isn’t what LLVM expects right now (LLVM wants allocation to go through a function, but SBCL allocates by incrementing pointers; see https://github.com/sbcl/sbcl/blob/master/src/compiler/x86-64/alloc.lisp#L89).

There are very few garbage-collected languages on LLVM right now, and they have driven much of the interface design so far.

First, a description of how SBCL’s generational GC works:

  • written for CMUCL by Peter Van Eynde @pvaneynd (ETA: Peter says it wasn’t him, Douglas Crosher maybe?) when CMUCL was ported to x86, later used when SBCL split off from CMUCL and amd64 support was implemented

This GC scheme ends up with the following properties:

  • fast memory writes. There is no overhead to instructions that write to memory. Since we are using OS facilities for the write barrier there is no need to annotate each memory write with a second write doing bookkeeping for the write barrier.

Other memory management in SBCL:

  • SBCL-compiled code does not use an allocation function for most allocations. Only large allocations, or allocations hitting the end of an allocation area, call a function. All memory allocation in between is done by atomically incrementing a free-space pointer, with multiple threads using the same allocation area and the same free-space pointer (that works because they use compare-and-swap to claim their allocated space).

This is potentially tricky to preserve, but I would like to keep it. The speed at which some garbage can be generated is enormous, and it all adds up.

SBCL’s GC is copying (moving) by default, which enables this scheme. There is no fragmentation. You just blast new objects into place, one after another.

There is no malloc-like function that has to keep track of fragments to maybe fill them, or maintain heap pools in lists or other collection classes. A malloc function for all practical purposes needs either thread-specific heap pools or thread locking. A malloc in C is several orders of magnitude more expensive than an allocation in SBCL code, more so in multithreaded code. In SBCL code there is a lock-free way to generate a bunch of stuff on the heap with no function calls.

There is also no zero-initialization of allocated space. Common Lisp does not allow uninitialized variables, so the allocated space is simply overwritten with the initialization value before use (as opposed to first zeroing it, then writing it again with the initialization value).

All this makes various LLVM mechanisms look slow: LLVM usually expects you to go through functions for allocation, and it zeros memory.

Compiler properties of SBCL:

  • no aliasing of pointers in generated code (we come to that later)

To elaborate a bit on the basics: CMUCL and SBCL are native compilers. They compile to machine code, and they do their own code loading and starting. They do not use the OS’s compiler, assembler, or linker (except for the executable image). CMUCL/SBCL run through various transformation phases into Lisp-like abstractions, then an abstract machine language, then a very simple and linear step from abstract machine language to the requested architecture’s binary code.

My interest in LLVM results from that last step. No optimization is done during that last phase (abstract assembly to machine assembly), and no optimization happens on machine code. The resulting code can look repetitive and wasteful when a human reads it. LLVM is precisely what I need: an assembly-to-assembly optimizing machine. If I could just replace the last one or two steps in SBCL with LLVM I would get much better looking machine code.

Now, it is debatable, and was debated inside Google, whether the current state of affairs actually does much harm. Code is just a little bit bigger; I was more concerned about the needless instructions. To test this I made C functions that did roughly the same as my Lisp functions, compiled the C to assembly, edited the assembly file to be wasteful in the same way that SBCL’s last stage is wasteful, and benchmarked. Modern x86 CPUs blast through a couple of needless or inefficient instructions with amazing speed. Since the storage dependencies are obviously quite favorable in the instruction stream (we are talking about repeated manipulations of the same locations, using the same read data), the difference was barely noticeable. So I decided against “selling” this huge project to be done at work; the outcome was too uncertain. Now I’m on my own time and I want to know more.

LLVM GC requirements:

  • LLVM might alias pointers, calling them “derived pointers”, as part of optimizations. A moving GC that needs to find all pointers to an object it moved must be informed of such copies and where they live, so that the copies can be adjusted, too. This is ordinarily not difficult if you just place the copies in the heap (where scavenging will find them), but that isn’t necessarily how it is done (it would require wrapping them). This needs thinking over.

What I don’t pay for now (in SBCL):

  • In SBCL right now I am not forced to make memory writes more complicated. I would like to direct LLVM’s facilities for pointer aliasing to retain that. Depending on how userfaultfd(2) works out I might switch to a bitmap scheme for the write barrier, but right now I expect the VM page approach to be better for me, given proper OS support. I do not want to make memory writes slower.

Final words

LLVM’s current facilities allow you to write a good garbage collector. However, your choices to pick specific tradeoffs are limited.

It is very difficult to estimate how this works out in practice, since to my knowledge LLVM does not currently have a language that optionally generates code as compact as C and does garbage collection the way SBCL does. Benchmarking GC schemes for total time, much less for mainline-versus-GC time overhead, cannot be done in languages that use heap memory in many more places than C or SBCL-generated code does.

It is very rare to see two equally tuned GCs in the same language implementation (no, the JVM’s three garbage collectors are really just one, with a couple of options on how to run it). The requirement that most fast GCs need mainline code to do some bookkeeping makes it very annoying to switch between GCs, not to mention correctly implementing them both in the first place.

In the case of SBCL: as people know, my pet project is to use LLVM as a replacement for SBCL’s last stage, abstract assembly to machine assembly. That stage does no optimization in SBCL right now, and LLVM is all about those specific optimizations. I got discouraged about the potential speed gains when I benchmarked hand-written assembly code doing the stupid things SBCL does in its last stage against code that does not. The result of those tests was that modern CPUs don’t care. They blast through the slightly inefficient assembly almost as fast as through the optimized code. Obviously, the optimizations I miss in the last stage of SBCL, and that LLVM has, are to a significant extent implemented in the CPUs, too.

(I need to clarify that this only holds for optimizations past program flow rearrangement, such as when you already have abstract assembly and generate machine assembly. All earlier optimization stages are critical for speed, and the CPU’s smart dependency-graph games cannot help if you run through a lot more code on a scale that isn’t covered by the microarchitecture’s look-ahead. For example, needlessly using a lot of heap space will always kill you, no matter how long you hold that heap memory.)

Overall I left this with mixed feelings. LLVM’s GC facilities are not bad, and the restrictions of what you can do as far as GC design is concerned are not unreasonable from the perspective of a generalist.

At the same time I am sure that if you had run ITA/Google’s QPX engine with the restrictions LLVM imposes during the time I was part of providing services on it (2001–2017), you would have seen a lot of grief, grief that is unhealthy for startups. Memory management on all levels (heap, stack, VM, RAM) was a key factor in ITA’s success. Thanks to the airlines filing lots of crazy fare rules, we needed the ability to bend memory to our will.

Appendix 1 — leaky GC

I just noticed that I didn’t talk about precision, or even correctness.

Unwanted retention: obviously, a GC that is in any way conservative risks holding on to memory when it should not. An integer floating around in a conservatively scanned area that happens to look like a pointer, and there you are. Theoretically this could hold down very large amounts of physical memory. Imagine that such a false-alias integer points into a connected graph of considerable size. That as such is not too unlikely, especially in a query-oriented system. Imagine you build a large graph on each query and want to discard it at the beginning of the next query. If a conservative GC nails down such a graph from a previous query on a regular basis, you could multiply memory usage.

On the other hand, in a 64-bit VM address space you can move the Lisp heap high enough that false aliases become less likely (integers in computer systems are statistically much more likely to be small). Keep in mind SBCL is not heap-to-heap conservative: the large heaps are not sources of possible conservative pointers (neither the Lisp nor the C heap). Stacks are the biggest source for SBCL. This is in contrast to e.g. the Boehm collector for C, which is always fully conservative: heap-to-heap, and stacks, registers, etc., too.

As my friends know, SBCL had a misfeature in that nailing memory down from one conservative pointer could spread to 4095 more pointers (all in one GC card), which caused some trouble for my toy at work and led to a talk at the ELS 2015 Lisp conference in London.

Losing memory: don’t laugh; from my perspective of query-oriented systems, and being used to fast cloning of processes (fork(2), or quick start from mmap(2)ed Lisp images), I can imagine a use for a GC that sometimes collects too much. It would have to bribe me with some major advantage in speed or some such, of course. The way it would work is that freed memory is replaced with memory areas that are not mapped, giving me a segfault when something tries to access memory that is now incorrectly gone. That segfault is the same mechanism SBCL now uses to keep track of generations that have not been modified since they were last scanned and marked read-only. It is not an exotic concept; virtual machines use it, too. It is very expensive right now, which is where userfaultfd(2) should help. https://www.cons.org/cracauer/cracauer-userfaultfd.html

Anyway, so I get notified that some piece of code had a leftover pointer to data that should have been preserved but was not. I can decide how I want to handle that. Imagine again that I have a system that is (a) query-oriented and (b) creates new instances by fork. I can simply throw away the child and restart the query in the parent (or in a new child), this time with the GC told to be precise. Depending on how much speedup you get from the fast-n-loose GC, this can very well pay off. If you have plenty of CPU you can also start the fast and slow versions at the same time, using the slow version’s result if the child doesn’t work out, or cancelling the slow version when the fast one made it.

Think of it as a system-wide option to use weak pointers, rarely, where they buy speed, with weak-pointer reactions either handled by the system (as opposed to the code directly using the pointers) or by code that can deal with the situation as it sees fit. You would fine-tune that a bit; e.g. you probably don’t want to throw away code that is still in use.

Reference picture of SBCL GC cards:

Writing about programmer productivity, teambuilding and enabling a wide variety of different engineering personalities. CTO at www.thirdlaw.tech #lisp #freebsd
