LLVM’s garbage collection facilities and SBCL’s generational GC
This document comes in three parts:
- an introduction into basic tradeoffs in GC design (speed in different places, RAM usage, safety etc)
- a description of what SBCL on amd64 does right now, what LLVM offers (or wants to offer) and how they map to each other
- me mumbling that LLVM’s current constraints allow me to implement a good garbage collection, it also prevents me from pickup many constructs that I have used to provider better GC in specific situations (namely in ITA/Google’s QPX search engine in SBCL)
Tradeoffs to think about
Keeping it brief:
- speed versus space, as always
- “speed” of the garbage collector as a tradeoff to speed during mainline code (non-GC code). Almost all GC schemes require mainline code to do things for the benefit of the GC.
- for the GC itself total CPU time (over all CPUs) during GC, GC real time and mainline code pauses (“GC pauses”)
- for mainline code it is mostly whether: the overhead for for-GC housekeeping is CPU time or system time; if it is system time, are those cheap or expensive system calls (VM map changes etc); does the code get bigger (threatening to hijack more L1 code cache lines); does the extra data for GC’s benefit have significant size?
Mainline speed versus GC speed it a highly complex question. There are many applications where you can accept long GC times if it speeds up mainline code.
The major example are query-oriented services. Just serve the query without GCing (unless it’s an unusual query), then GC between queries while nobody is waiting on the result. Obviously you want to reduce user-visible latency here by making mainline code as fast as possible, with the least overhead for helping with GC bookkeeping. Also obviously, total CPU time used for the server over the day can go up dramatically (which also means the power bill). ITA/Google’s QPX search engine is a classic example, it is (mostly) query-oriented and in some uses users are waiting.
Another example where you want to keep for-GC bookkeeping overhead in mainline code low is if you can “GC-by-fork”, which is basically throwing away processes as they would need to GC, resuming operations from an already positioned server (via Unix fork(2). Here you will only run the slow GC code on unusual queries, and you want to reduce most queries’ latency by making mainline code fast.
GC-by-fork worked well for me in QPX, however you cannot use it anymore when you have to use libraries that start threads but do not offer an API to tell this library to park its threads in a fork-safe position. There also also complex interactions between virtual memory mappings, physical pages and how they affect fork(2) speed in different operating systems. SBCL uses VM page protection for GC help, and hence creates a large number of individual VM mappings, which can slow down fork on many Linux versions.
Or think of an interactive application sitting there waiting for user input. Why not use that time to GC, and if the user stirs during GC abandon that GC run? Or fork, GC in the forked process, and then depending on timing of user and GC continue your application either in parent or child. In all those cases you would pick mainline speed over GC speed.
I don’t want to go too deep. I hope I explained why “one size fits all” does not really do it in garbage collection. I wrote more on this topic here: https://medium.com/@MartinCracauer/generational-garbage-collection-write-barriers-write-protection-and-userfaultfd-2-8b0e796b8f7f
LLVM and SBCL
This is about the slightly conservative generational GC that SBCL uses on i386 and amd64. SBCL has other GCs as well, but gencgc is the most tuned one since it is used in pretty much all production uses of SBCL.
LLVM has several different mechanisms for GC support, a plugin loader and a bunch of other choices, and things change quite rapidly. It also turns out that just basic memory allocation in SBCL isn’t what LLVM expects right now (LLVM wants to go through a function, but SBCL allocates by incrementing pointers, see https://github.com/sbcl/sbcl/blob/master/src/compiler/x86-64/alloc.lisp#L89).
There are very few garbage collecting languages on LLVM right now, and they have driven much of the interface design so far.
First, a description of how SBCL’s generational GC works:
- written for CMUCL by Peter Van Eynde @pvaneynd (ETA: Peter says it wasn’t him, Douglas Crosher maybe?) when CMUCL was ported to x86, later used when SBCL split off CMUCL and amd64 was implemented
- generational, with pages/segments within generations (I call them “GC cards” to avoid confusion with VM pages)
- slightly conservative. Due to shortage of registers on x86 the code generator for x86 decided not to split the register set into tagged and untagged ones (on RISC one half of the registers could only hold non-pointers, the other half only pointers. So the GC would specifically know what is a pointer and what is not). SBCL’s genCGC is not conservative when it comes to heap-to-heap, only for potential pointer sources such as registers, the stack (avoidable but done for now), signal contexts and a couple other places.
- Being conservative with regards to the stack. That creates a major clash with LLVM which expects a stack map. A stack map declares which data type can be found where on the stack for a function (as in every invocation of that function needs to comply to it), whereas SBCL’s compiler can and does generate code that just places anything anywhere. This makes the stack a lot more compact in some call patterns.
- supports a write barrier but no read barrier. This means that parts of the heap can be excluded from scavenging (scanning) when it can be proven that they once did not point to our to-be-gced area and have not been modified since then.
- the write barrier is implemented by VM page protection and using SIGSEGV. When I get to do some more coding the faster userfaultfd(2) in Linux should be used. The “soft-dirty bit” mechanism might also be suitable. I wrote extensively about this here: https://medium.com/@MartinCracauer/generational-garbage-collection-write-barriers-write-protection-and-userfaultfd-2-8b0e796b8f7f https://www.cons.org/cracauer/cracauer-userfaultfd.html
- it is copying, however other passes that just punch holes into existing cards (to wipe pointers that are not pointed to anymore without moving a lot of memory) have and will be added. Conservative GC and use-from-GC are implemented by protecting a whole GC card from moving, then GCing inside it punching holes (which leaves the pinned objects in place).
This GC scheme ends up with the following properties:
- fast memory writes. There is no overhead to instructions that write to memory. Since we are using OS facilities for the write barrier there is no need to annotate each memory write with a second write doing bookkeeping for the write barrier.
- awesome performance could be had if your overall system did transient garbage during queries, resets all that garbage before the next query, as long as you don’t have older things point into it (which sadly is not what QPX is).
- Looking at memory management overhead during GC time and non-GC time you will see that non-GC code suffers very little (just some updates to VM protection for GC cards, and that can be tuned via granularity). Fast non-GC time can be a huge advantage for query-oriented systems, because you can do GC between the queries, so that customer visible queries do not add latency from GC. You can also do fuller GCs between the queries and quick GCs during and still win. Of course overall CPU consumption for the workload of query and non-query activities goes up when you play these games.
- Write barrier GCs that maintain their own bitmaps via annotating writes have the flexibility to teach the system more tricks and do more bookkeeping during non-GC code, slowing down queries some more but save overall real or CPU time. Users of the VM pages protection system have little to play with, but userfaultfd(2) will give SBCL something to play with.
- SBCL’s GC can potentially run multi-threaded, but not concurrently. That means it needs to stop the world, then could use multiple threads and CPUs to do GC (unimplemented), but it cannot do GC in the background while the rest of the world is running. Generally you need read barriers to do this. It is unknown to me at this time whether VM page protection schemes can implement read barriers that are sufficient for concurrent GC, and if so whether the unavoidable granularity allows for sufficient performance. These are equivalent with Java’s 3 modes of GCing. The JVM uses annotated writes with lots of tweaks for its write barrier.
Other memory management in SBCL:
- SBCL compiled code does not use an allocation function for most allocs. Only large allocs or allocs hitting the end of an alloc area call a functions. All memory allocation in between is doing by atomically incrementing a free-space pointer, multiple threads using the same allocation area and the same free-space pointer (that works because they use compare-and-swap to get their alloced place).
This is potentially tricky to preserve, but I would like to keep it. The speed at which some garbage can be generated is enormous, and it all adds up.
SBCL’s GC is copying (moving) by default, which enables this scheme. There is no fragmentation. Just just blast new objects into place in a simple one-after-another manner.
There is no malloc-like functions which has to keep track of fragments to maybe fill them, to maintain heap pools in lists or other collection classes. A malloc function for all practical purposes needs to either use thread-specific heap pools or use thread locking. Doing a malloc in C is several orders of magnitude more expensive than in SBCL code. More so in multithreaded code. In SBCL code there is a lock-free way to generate a bunch of stuff on the heap with no function calls.
There also is no zero-intialization of allocated space. Common Lisp does not allow uninitialized variables, so the allocated space is overwritten before use with the initialization value (as opposed to first zeroing it, then writing again with the initialization value).
All this makes various LLVM mechanism look slow, namely LLVM usually expects you to go through functions for allocation, and it zeros memory.
Compiler properties of SBCL:
- no aliasing of pointers in generated code (we come to that later)
- a write barrier implementation using a bitmap was available as a patch for SBCL by Paul Khuong @pkhuong. I benchmarked it against the VM page implementation. The untweaked real-world application did not show overall performance differences. It obviously was dominated by doing the actual GC work (especially memory scans) and it didn’t really matter how you do the write barrier, and both mechanisms apparently did an equal job of excluding GC cards from scavenging. I might be able to dig up some numbers about the resulting compiled code size. I have SBCL versions of that time around if you want to play with those two.
- very decent stack allocation abilities. This did tremendously reduce overall heap management time for my toy at work. Unlike Golang it is not possible in Lisp to automatically determine stack-allocatablity for nontrivial code, you have to tell the compiler (like in C) and hope you are not wrong (the Lisp language would allow us to use safe-mode compilation to actually make this detectable during regression tests, but this has not been done).
To iterate a bit on the basics: CMUCL and SBCL are native compilers. They compile to machine code, they do their own code loading and starting. They do not use OS compiler, assembler or linker (except for the executable image). CMUCL/SBCL have various transformation phases into Lisp-like abstracts, then an abstract machine language, then a very simple and linear step from abstract machine language to the requested architecture’s binary code.
My interest in LLVM results from that last step. No optimization is done during that last phase (abstract assembler to machine assemble), and no optimization happens on machine code. The resulting code can look repetitive and wasteful when human-reading it. LLVM is precisely what I need: an assembly-to-assembly optimizing machine. If I could just replace the last 1 or 2 steps in SBCL with LLVM I would get much better looking machine code.
Now, it is debatable and was debated inside Google whether the current state of affairs is actually doing much harm. Code is just a little bit bigger, I was more concerned about the needless instructions. To test this I made C functions that did roughly the same as my Lisp functions, then compiled C to assembler, edited the assembler file to be wasteful the same way that SBCL’s last stage is wasteful, and benchmarked. Modern x86 CPUs blast through a couple of needless or inefficient instructions with amazing speed. Since the storage dependencies are obviously quite favorable in the instruction stream (we are talking about repeated manipulations of same locations, using the same read data), the difference was barely noticeable. I had decided against “selling” this huge project to be done at work. The outcome is uncertain. Now I’m on my own time and I want to know more.
LLVM GC requirements:
- LLVM might alias pointers, calling them “derived pointers”, as part of optimizations. A moving GC that needs to find all pointers to an object that it moved must be informed of such copies and where they live, so that the copy can be adjusted, too. This is ordinarily not difficult if you just place the copies in the heap (where scavenging will find them), but that isn’t necessarily done that way (would require wrapping them). This needs thinking over.
- What is worse is that such derived pointers might not actually be copies of pointers to (the beginning of) an object, they might point into the middle. The existing SBCL GC has some facilities to deal with that, but it is a pain and a slowdown. You need to determine the surrounding object on scavenging.
- LLVM is language-neutral and naturally assumes that in a mixed language system (say Lisp and C) both languages can call each other freely. That is a change from SBCL where calling Lisp from C is only supported if you wrap it into routines that tell the system about it (e.g. so that Lisp objects known to be pointed to by C are not moved, this is simply folded into the “slightly conservative” mechanism).
What I don’t pay for now (in SBCL):
- In SBCL right now I am not forced to make memory writes more complicated. I would like to direct the LLVM facilities for pointer aliasing to retain that. Depending on how userfaultfd(2) works out I might switch to a bitmap scheme for the write barrier, but right now I expect the VM page approach to be better for me, given proper OS support. I would not like to make memory writes slower.
- My systems are Lisp-based. C is used in an auxiliary manner. I rarely give out pointers into the Lisp heap to C. When I do there is a straightforward mechanism available to protect such data from moving. Safety of that mechanism could be improved with SBCL, such as VM-protecting Lisp heap areas against reads when such area is vacated by the GC. I would not like to pay a code speed price for automatic mechanisms to deal with this, especially not a speed price during non-GC code.
- Even if I were to learn to like bitmap write barriers better than I do now, VM protection games are used elsewhere such as for guard pages and other safety. You can easily end up in a situation where some overhead for GC and safety can be shared, thereby reducing the cost of a VM based GC optimizations.
LLVM’s current facilities allow you to write a good garbage collector. However, your choices to pick specific tradeoffs are limited.
It is very difficult to estimate how this works out in practice, since LLVM does not have a language right now (to my knowledge) that optionally generates code as compact as C and does garbage collection, the way SBCL does. Benchmarking GC schemes for total time, much less for mainline versus GC time overhead, cannot be done in languages that use heap memory in a lot more places than C or SBCL-generated code do.
It is very rare to see two equally tuned GCs in the same language implementation (no, the JVM’s three garbage collectors are just one, with a couple options on how to run it). The requirement that most fast GCs need mainline code to do some bookkeeping makes it very annoying to switch between GC, not to mention correctly implement them both in the first place.
In the case of SBCL. Well, as people know my pet project is to use LLVM as a replacement for SBCL’s last stage, which is abstract assembly to machine assembly. That stage does no optimization in SBCL right now and LLVM is all about those specific optimizations. I got discouraged about the potential speed gains when I tested hand-written assembly code to do the stupid things SBCL does in its last stage versus code that does not. The result of those tests was that modern CPUs don’t care. They blast through the slightly inefficient assembly code almost the same as through the optimized code. Obviously, the optimizations I miss in the last stage of SBCL and that LLVM has are to a significant extent implemented in the CPUs, too.
(I need to clarify that this only works for optimizations past program flow rearrangement, such as when you already have abstract assembly and generate machine assembly. All previous optimization stages are critical for speed and the CPU’s smart dependency graph games cannot help if you just run through a lot more code on a scale that isn’t covered by the microarchitecture look-ahead. For example, needlessly using a lot of heap space will always kill you, no matter how long you hold that heap memory)
Overall I left this with mixed feelings. LLVM’s GC facilities are not bad, and the restrictions of what you can do as far as GC design is concerned are not unreasonable from the perspective of a generalist.
At the same time I am sure that if you had ran ITA/Google’s QPX engine with the restrictions LLVM imposes through the time that I was part of providing services on it (2001–2017) you would see a lot of grief, grief that is unhealthy for startups. Memory management on all levels (heap, stack, VM, RAM) was a key factor in ITA’s success. Thanks to the airlines filing lots of crazy fare rules we needed the ability to bend memory to our will.
Appendix 1 — leaky GC
I just noticed that I didn’t speak about precision and even correctness.
Unwanted retention: Obviously, a GC that is in any way conservative has a risk of holding on to memory when it should not. An integer that floats around in a conservatively scanned area that could be a pointer and you are there. Theoretically this could hold down very large amounts of physical memory. Imagine that such a false-alias integer points into a connected graph of connected size. That as such is not too unlikely, especially in a query-oriented system. Imagine you build a large graph on each query and want to discard it at the beginning of the next query. If you have a conservative GC nail down such a graph from a previous query on a regular basis you could multiply memory usage.
On the other hand, in a 64 bit VM address space you can move the Lisp heap high enough that false aliases become less likely (integers in computer systems are statistically much more likely to be small). Keep in mind SBCL is not heap-to-heap conservative. The large heaps are not sources of possible conservative pointers (neither the Lisp nor the C heap). Stacks are the biggest source for SBCL. This is in contrast to e.g. the Boehm collector for C, which is always fully conservative, heap-to-heap and stacks, registers etc, too.
As my friends know, SBCL had a misfeature in that nailing memory down from one conservative pointer could spread to 4095 more pointers (all in a GC card), which caused some trouble for my toy at work and lead to a talk on the ELS 2015 Lisp conference in London.
Losing memory: Don’t laugh, from my perspective of query-oriented systems and being used to fact cloning of processes (fork(2), or quick start from mmap(2)ed Lisp images) I can imagine a use for a GC that sometimes collects too much. It would have to bribe me with some major advantages in speed or somesuch of course. The way it would work is that freed memory is replaced with memory areas that are not mapped, giving me a segfault when something tries to access memory that is now incorrectly gone. That segfault is the same thing that SBCL now uses for keeping track of generations that have not been modified since they were last scanned and marked readonly. It is not an exotic concept, virtual machines use it, too. It is very expensive right now, this is where userfaultfd(2) should help. https://www.cons.org/cracauer/cracauer-userfaultfd.html
Anyway, so I get notified that I had a piece of code that had a leftover pointer to a piece of data that should have been preserved but was not. I can decide how I want to handle that. Imagine again that I have a system that is (a) query-oriented and (b) creates new instances by fork. I can simply through away the child and restart the query in the parent (or in a new child), this time with GC told to be precise. Depending on how much speedup you get from the fast-n-loose GC this can very well pay off. If you have plenty CPU you can also start fast and slow version at the same time, using the slow version’s result if the child doesn’t work out, or cancelling the slow version when the fast one made it.
Think of it as a systemwide option to use weak pointers, rarely, when they would create speed, and weak pointer reactions are either done by the system (as opposed to the code directly using the pointers), or code that can deal with the situation can do as it sees fit. You would fine tune that a bit, e.g. you probably don’t want to throw away code that is still in use.
Reference picture of a SBCL GC cards: