Generational Garbage Collection, Write Barriers/Write Protection and userfaultfd(2)

In a generational garbage collector you have opportunity for an optimization. If you know that certain parts of the heap cannot point to younger generations you can skip scavenging those parts when GCing those younger generations. Now, before we jump in too deep, let’s clarify terms.

First of all, I’m doing this for SBCL, which is a to-native-code Common Lisp compiler. [on x86 and amd64] it uses a generational GC relying on VM page protection for optimization. SBCL derived from CMU’s Common Lisp system. I use CMUCL/SBCL since ca. 1993 for myself and at work, including for ITA software’s QPX flight search engine for the last 15 years.

Let’s move on to collecting garbage. (UPDATE: since this has generated a lot of interest I will update this with links to introductions, and a few diagrams. I added the basic in a few new paragraphs below)

Iterating over the heap comes in two “variants”:

  • “collecting” a part of the heap refers to that part of the heap that you currently want to remove unused parts from. In a generational GC that is only a part of the heap, only one of several generations.
  • “scavenging” a part of the heap is an actual pointer-by-pointer iteration over the entire (subject to optimizations) heap. You go through the entire heap to look for pointers to the generation you are correctly GCing.

The way it works is that you “scavenge” the entire heap (and all other possible root of pointers to objects such as stack, registers etc) and if a pointer goes out from your scavenged part to the “to-be-collected” part of the heap then you know that pointed-to-object is in use. It gets quite complicated in the details such as the to-be-collected space needing to collect itself etc. There are various ways to then use that “in-use?” information to free up unused space after scavenging. Common ones are moving all valid objects into a new area (adjusting the pointers, obviously that doesn’t work in C) and throwing the old area away; to punching holes by wiping unused data so that it can’t point elsewhere, and obviously you should consider reusing those holes. What to do after scavenging isn’t part of this writeup (I give conference talks on that but I don’t think there is a video). This document describes the cost associated with a common optimization in generational GC, the tradeoffs of different approaches, and how the OS kernel can help save time and memory bandwidth.

Generational GC means you only “collect” a fraction of the heap, but not the rest of the heap. After surviving one collection such data is “promoted” to the next generation, so your different generations have increasingly old data, with young data being in the most frequently collected area until it has proven to be somewhat persistent. But you still have to “scavenge” the entire heap even though you only “collect” a part.

So if you want to GC (“collect”) generation number 5 you have to “scavenge” generations 0–4 (and 5 against itself in a recursive manner). This goes contrary to what is sometimes assumed to happen in generational GC. You “scavenge” all generations, not just the one you currently want to GC, although you “collect” only one. The first speedup in generational GC comes from the youngest generation being more likely to have discarded objects (objects with short life are much more common) and from not copying around data from the other generations. Those older generations you only scavenge but don’t collect, so data in there doesn’t move. This is assuming that you have a copying GC that compacts the generation that is GCed into new GC cards. You can also have a “hole-puncher” leaving live objects in place and doing something else to the dead objects.

Now, there is another optimization here. If you are collecting generation 5, and you know that generation 2 only has pointers to generations 0–2, but no pointers to 5, then you only have to scavenge generation 3–4 (and 5). You can skip scavenging generation 0–2. You can make that determination during an earlier full(er) GC pass where you did scavenge generation 2 and found it had no pointers to younger generations.

GC cards

That is nice but it is unlikely that an entire generation is free of such pointers to younger generations. To improve this you can fragment generations into smaller parts. I call them “GC cards” after the term in CMUCL/SBCL. Some call them “GC pages” but I don’t like that term since it can be confused with VM pages. In many implementations the size of a GC card equals a VM page but that isn’t the case for anything I work on.

So you have those GC cards, fragments of the heap. The size is typically 4 KB at a minimum (which is a common VM page size), my toy at work uses 32 KB. What the best size is depends on how you implement the rest of this document. If you have a standard benchmark for your project it is easy to try a couple values and find a local minimum. I am also thinking about a slightly variable size for GC cards, but more about that later.

When old generation 2 has been scavenged the last time you remember (for every GC card inside that generation individually) whether it has any pointers to younger generations at that time. Later on the next GC you can skip scavenging those GC cards — if you can be sure that they *still* don’t have pointers to younger generations. The trick now is to get notification when that card is being written to by regular (“mainline”) code between GCs.

There are two major ways of getting write notifications:

  • have the compiler generate code that associates every write to a pointer with a second write that marks the GC card the pointer is in (as opposed to what the pointer points to) as “dirty”.
  • use the VM system’s write protection to get a notification from the kernel. After determining that a card did not have pointers to younger generations in the previous scavenge you mark the page as readonly with mprotect(2). When mainline code writes to it that will give a segmentation fault, which you can catch, mark the card as dirty, change VM protection to read-write and continue.

Both schemes typically use one boolean per GC card, simply saying “points to any younger generation yes/no”. More specific mapping of pointer properties is possible but doesn’t matter for this discussion.

Upon reading these two options most people instantly think of one way as “obviously faster”. People with a programming language/compiler (but not GC) background often cringe at the additional writes. People with an operating system background often cringe at more system calls and VM map modifications. To open evaluation this way, Paul Khuong @pkhuong provided a patchset to SBCL that uses the write-bitmap scheme, whereas SBCL usually uses VM protection. Although the properties of the two runs were very different the total runtimes for larger workloads for my toy at work were virtually identical. It’s all in the details, apart from the obvious domination of memory bandwidth on GC time (unless you get more things right than I currently do at work).

The basic properties of the two solutions are this:

  • the write-bitmap scheme adds a second write to every write. That is expensive. The second write path also nails down at least one more L1 data cache line. It makes the code bigger. You generally cannot have variable GC card sizes with this (as the write path can’t conditionalize without further slowdown). If your bitmap is actually using single bits then you have a read-modify-write cycle, and bit operations are relatively more expensive on modern CPUs that do flatter code much faster. If your “bit”map is on words instead then your bitmap can get very large for big heaps with small GC cards, with the associated problems of using even more L1 data cache lines, killing TLB entires etc etc. If you use large GC cards instead you are scavenging more (since every card has a much higher chance of being “hit” by a single write which sentences the entire GC card to be scavenged). Using large GC cards offers no speedup of mainline code (the non-GC “mainline” code, subject to me not knowing about optimizations here).
  • The VM page protection scheme’s weakness is the load on the hardware and the OS. System calls to change protection on VM pages are expensive for the application. Changes to the VM map might have to be propagated to other CPUs. They invalidate the TLB. Using a SIGSEGV as a general purpose mechanism is very, very expensive. The kernel builds up a lot of stuff in preparation to delivering the signal. It is insane, especially on Linux which provides a very rich environment to a userland signal handler. Linus most certainly doesn’t look favorably on using this for non-exception code. Then there are multiple kernel/userland context switches. You use a separate execution environment with a separate stack blah blah blah. It is terrible. However, once a card has been marked as “was clean and now is not” you will not do any of that again until the next GC, for that card. Obviously larger cards will speed up mainline code (non-GC code) in this scheme as no single card will be marked twice.
  • Using VM page protection also has limitations. A GC card cannot be smaller than a VM page. x86 and am64 offer 4 KB and 2 MB as usable page sizes. 4 KB is too small, you get flooded with system calls. That is why my toy at work (and subsequently SBCL) use 32 KB, always changing 8 VM pages of 4 KB each at a time with a single mprotect(2) system call. Using hugepages (2 MB) is problematic since you now mark 2 MB of space for future scavenging even if only a single pointer in there ever pointed to a younger generation. On the upside since you rarely invoke the protection mechanism you can use different GC card size over the heap (messing around with the code indirections to find stuff is affordable in this rarely taken code path). I will probably change to a larger card size for young generations and a smaller one for older generations.

Let’s compare a bit more.

Using VM page protection isn’t quite as terrible as it looks since you will at a maximum invoke the “I wrote to that card” machinery once between GCs for each card. For cards that are and stay free of pointers to younger generations you have no overhead at all from this scheme. The bitmap mechanism on the other hand never stops writing the bitmap even if the card is already marked either way (there might be optimizations I am not aware of).

My toy at work is a query oriented system (for the most part). We have the option of only GCing between such queries and not GCing during queries. Or we can proactively GC between queries and although we don’t forbid it have a lot less GCing to do during the query.

Both schemes slow down code paths with no GC (“mainline code”). The VM protection mechanism slows down in mainline code when a card needs to be marked as “written to”, and that is expensive because of said system calls, signal delivery and VM map propagation requirements which all happen outside GC, too. Seriously, if you were to heatmap your system when that happens you’d see hardware pieces glow in pretty colors all over the place. GC itself is more expensive in the VM protection mechanism since you need to set page protection for GC cards you find to be free of pointers to younger generations (but not for cards that were free of them before and after).

GC itself is not slowed down in any particular way during the bitmap scheme, but the overhead from the secondary writes always applies during mainline code. You always have that secondary writer going.

As a result, if you have a system like my toy at work you gain a relative advantage for the VM protection scheme since we care about query latency. The mainline code runs faster (especially with large GC cards) and we either don’t care at all about GC time or at least we price between-query time as less expensive than during-query time.

Now, there is a new kid of the block. userfaultfd.

Userfaultfd is a new mechanism currently implemented in the Linux kernel which gives an API to be notified about VM events. The functionality currently implemented in the -mm kernel is sufficient for what we need for a VM page protection GC helper, and also for other tricks such as on-demand paging from compressed heap image pages.

I have an example program demonstrating both mechanisms here: https://www.cons.org/cracauer/cracauer-userfaultfd.c

I also wrote some clarification for userfaultfd’s documentation here: https://www.cons.org/cracauer/userfaultfd-doc.diff

If I ever find out how to push to the -aa kernel I’ll submit the doc diff. [UPDATE: merged to aa tree]

Userfaultfd went through a lot of changes before release. It started out as a simple thing using the madvise(2) API with minimal additions. I would have liked that since it would have minimized the number of system calls I’d have to make for my GC. The API got pretty complicated. And as you can see in my example program requires more system calls now. This has been done to satisfy a variety of needs. The people pressing for userfaultfd the most are virtual machine hackers which use this to transition running virtual machine instances between physical machines. To my knowledge I am the only one currently testing userfaultfd for GC purposes. I was very happy to report that everything I needed worked fine in a low-stress system. The known bugs would mean a slowdown for me (false positives), not break me. I didn’t find any unknown ones. Tricky parts are around paging and I generally don’t page.

I did not have time to hack up the SBCL GC to use it since I have a more pressing project. From what I can see this API is still much faster than using mprotect and SIGSEGV. Sending and handling SIGSEGV is insanely expensive, at least in Linux. You can fit a lot of the new required ioctl(2) and read(2) system calls in that budget, and my understanding is that propagating changes to notification requests in userfaultfd is cheaper than propagating mprotect(2). Not sure why that is supposed to be, but we’ll see Real Soon Now(tm).

On BSDcan 2016 I had a quick look at both FreeBSD and OpenBSD with a couple people. The simpler VM system (compared to Linux) seems to have reasonably straightforward entry points to make a partial implementation of userfaultfd, at least the part I need for GC. That’ll be a bit of an uphill battle since I would have to take more bits in VM system kernel structs that people are highly protective of. And I would have to brain-shuttle my train of thought how I arrived at being happy with the much more complicated API when everybody’s first thought is to just hack up madvise(2) a bit.

I plan to wrap all this into a C library (libwritenotify.so or something) to use userfaultfd when available and mprotect when not. Remember than a binary can be compiled on a machine that has the header files and the libc for userfaultfd but the kernel that binary runs on might have compiled it out, or you run that binary on a different machine.

Random appendixes.

Clarifying points frequently misunderstood.

  • GC cards are generally numbered and looked up by virtual address. A higher number GC card usually has a higher virtual address.
  • on the other hand the different generations in the heap do not have contiguous virtual addresses. Any random GC card can end up in any random generation. In any random order inside that generation (now that I’m mentioning it, I should probably sort this in SBCL before scavenging).
  • Typical numbers: VM page 4 KB or 2 MB. GC card 4 KB to 32 KB. Number of generations 6 (toy at work runs fewer since there is a sharp U shape of long-living and short-living data). Size of total heap can be gigabytes but not all of the Lisp heap is GCed heap. Optimizations are done such as data from the on-disk heap image not being GCed, using data in Lisp-usage layout in directly mmap’ed areas and the like. The free-floating (not file backed) Lisp heap should probably be kept under a gigabyte for a heap that is moving data in the to-be-gced generation at least. I can’t betray any overly specific numbers from work here since this varies a whole lot depending on how we are torturing QPX today.
  • GC time percentage of total runtime in real-world programs varies a lot. 10% is a good number to give as a general idea. That sounds bad, but you have to keep in mind that using non-GCed heap isn’t free either. In a non-moveable-heap language like C you have no choice than to deal with fragmentation by finding previously free(3)ed holes and giving them out to new malloc(3) requests. That usually involves large amounts of machinery including hashtables or other search, and that ensures that malloc(3) in practice is always a full function call. To compare, a simple memory allocation for a small object in SBCL is only a single pointer increment and a comparison to a barrier. Two instructions. Two atomic instructions. So you can produce garbage a whole lot faster than C can, and I mean at least 10x. Keep in mind that those SBCL allocation instructions are atomic, so there is no locking for multithreading and no requirements for per-thread heaps. C heap allocation often uses per-thread heaps now but free(3) in C can still be tricky in a multi-threaded environment since there is no guarantee that the same thread frees the data that malloced it. Much of this is hidden to casual profiling and hard to measure and visualize, for all languages and heap management styles.
  • As I mentioned above, mainline code (not memory management related) is slowed down by both schemes for GC write notification support. But in C mainline code can suffer from fragmentation, whereas data in moving-heap languages is allocated in sequence, and later compacted again into sequence. It isn’t clear and not easy to research what exactly the slowdown here is for GCed and non-GCed languages . I learned long ago not to knee-jerk assumptions here.
  • There also is the soft-dirty bit mechanism in the Linux kernel. I am not enthusiastic about it since it doesn’t invoke any user (my) code. I wouldn’t be able to play additional tricks with pages (compression, sharing) and, more importantly, I can’t use it to gather useful statistics. For example, nothing keeps me from having a userfaultfd(2) handling thread be called on every write access to a page or card, not just the first write. It also gets the fault address (the word, not just the page). That way, if you take the performance hit for the measuring, you can draw a picture about which GC cards got “hit” with writes and hence need to be scavenged with how many writes into which words and you can find out the type of objects in the SBCL heap, just knowing the address.

I should also clarify about multithreading:

  • both VM page protection based write barrier and compiled-code/bitmap write barriers can be used when implementing a multithreaded GC. That means a GC where the mainline code threads still stop, but multiple CPUs (or multiple memory banks) work on GCing during the pause.
  • however, only the bitmap-writing scheme can be used to implement concurrent GC. Concurrent GC collects garbage at the same time as mainline code is running. While it cannot be free of pauses the system will generally run without GC pauses except for very small hickups. (this paragraph is subject to me not knowing about some smart thing to overcome this problem for the page protection scheme)
  • I believe but have been too lazy to proof so far that the VM page protection scheme has an edge for multithreaded GC since you don’t have to worry about the bitmap. I am not sure whether you need locks to deal with the bitmap, but again, writing into one more place on a contiguous basis costs cache lines and TLB entries, and you do that during GC where you need memory bandwidth the most. I also want to multithread GC so that the being-scavenged GC card is always scavenged using a thread running on a CPU that uses the same NUMA memory bank that the GC card is in. Obviously if your associated bitmap is splattered across the NUMA memory banks then it gets expensive both for reading and for writing with propagation to the other bank. If you put the bitmap where the GC card is, physically, then you don’t have a virtual-address contiguous large chunk of heap which would complicated SBCL allocation.

Random useful and interesting links (thanks, Stelian):

JVM using mprotect although it normally uses bitmap marking: http://jcdav.is/2015/11/09/More-JVM-Signal-Tricks/