Generational Garbage Collection, Write Barriers/Write Protection and userfaultfd(2)

  • “collecting” a part of the heap means reclaiming the unused objects from that part. In a generational GC that is only ever a portion of the heap at a time, one of several generations.
  • “scavenging” a part of the heap is an actual pointer-by-pointer iteration over the entire heap (subject to optimizations). You go through the entire heap looking for pointers into the generation you are currently GCing. A sketch of such a pass follows below.
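    As a rough illustration only (the names and layout are made up, and a real scavenger also has to distinguish pointers from immediate data), a scavenge pass over a region boils down to this in C:

        #include <stddef.h>
        #include <stdint.h>
        #include <stdbool.h>

        /* Hypothetical bounds of the generation being collected. */
        extern uintptr_t young_start, young_end;

        extern void note_live(uintptr_t obj);   /* hypothetical: keep the target alive */

        static bool points_into_collected_gen(uintptr_t word)
        {
            return word >= young_start && word < young_end;
        }

        /* Walk one region word by word, looking for pointers into the
           generation currently being collected. */
        static void scavenge_words(const uintptr_t *start, size_t nwords)
        {
            for (size_t i = 0; i < nwords; i++)
                if (points_into_collected_gen(start[i]))
                    note_live(start[i]);
        }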
  • have the compiler generate code that pairs every pointer write with a second write that marks the GC card containing the written-to location (as opposed to the card the pointer points to) as “dirty”. See the sketch below.
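    In C, that barrier amounts to one extra store into a card table. This is a minimal sketch with made-up names, not SBCL's actual generated code:

        #include <stdint.h>

        #define CARD_SHIFT 15            /* log2 of the card size, e.g. 32 KB cards */

        extern uint8_t card_table[];     /* one dirty byte per GC card */
        extern uintptr_t heap_base;      /* start of the card-covered heap */

        /* Every compiled pointer store gets this companion store: it dirties
           the card containing the written-to slot, not the pointee's card. */
        static inline void barriered_store(void **slot, void *value)
        {
            *slot = value;
            card_table[((uintptr_t)slot - heap_base) >> CARD_SHIFT] = 1;
        }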
  • use the VM system’s write protection to get a notification from the kernel. After determining that a card did not have pointers to younger generations in the previous scavenge, you mark its pages read-only with mprotect(2). When mainline code writes to it, that raises a segmentation fault, which you can catch; you mark the card as dirty, change the VM protection back to read-write, and continue. A minimal sketch of that fault path follows below.
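    A minimal sketch of that fault path, assuming 32 KB cards aligned to their size and a hypothetical mark_card_dirty(); real code needs more care (for example, checking that the fault actually came from the GCed heap):

        #include <signal.h>
        #include <stdint.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/mman.h>

        #define CARD_SIZE (32 * 1024)

        extern void mark_card_dirty(void *card);   /* hypothetical */

        static void segv_handler(int sig, siginfo_t *si, void *ctx)
        {
            (void)sig; (void)ctx;
            /* si_addr is the faulting address; round down to the card base. */
            void *card = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(CARD_SIZE - 1));
            mark_card_dirty(card);
            /* Reopen the card for writing; the faulting store is then retried. */
            if (mprotect(card, CARD_SIZE, PROT_READ | PROT_WRITE) != 0)
                abort();
        }

        static void install_write_barrier_handler(void)
        {
            struct sigaction sa;
            memset(&sa, 0, sizeof sa);
            sa.sa_sigaction = segv_handler;
            sa.sa_flags = SA_SIGINFO;
            sigaction(SIGSEGV, &sa, NULL);
        }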
  • the write-bitmap scheme adds a second write to every write. That is expensive. The second write path also nails down at least one more L1 data cache line. It makes the code bigger. You generally cannot have variable GC card sizes with this (the write path can’t conditionalize without further slowdown). If your bitmap actually uses single bits then every mark is a read-modify-write cycle, and bit operations are relatively more expensive on modern CPUs, which run flat straight-line code much faster. If your “bit”map uses whole bytes or words instead (see the comparison below), then it can get very large for big heaps with small GC cards, with the associated problems of using even more L1 data cache lines, killing TLB entries and so on. If you use large GC cards instead, you end up scavenging more (every card has a much higher chance of being “hit” by a single write, which sentences the entire card to be scavenged), and large cards offer no speedup of the non-GC “mainline” code in this scheme (subject to me not knowing about optimizations here).
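    For comparison, marking in a true bit map is a read-modify-write, while a byte map is a single plain store (a sketch with made-up names):

        #include <stddef.h>
        #include <stdint.h>

        extern uint8_t card_bits[];    /* one bit per card: dense, but needs RMW */
        extern uint8_t card_bytes[];   /* one byte per card: 8x larger, plain store */

        static inline void mark_card_bit(size_t card)
        {
            card_bits[card >> 3] |= (uint8_t)(1u << (card & 7));  /* load, OR, store */
        }

        static inline void mark_card_byte(size_t card)
        {
            card_bytes[card] = 1;                                 /* single store */
        }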
  • The VM page protection scheme’s weakness is the load on the hardware and the OS. System calls to change protection on VM pages are expensive for the application. Changes to the VM map might have to be propagated to other CPUs. They invalidate the TLB. Using SIGSEGV as a general-purpose mechanism is very, very expensive: the kernel builds up a lot of state in preparation for delivering the signal. It is insane, especially on Linux, which provides a very rich environment to a userland signal handler. Linus most certainly doesn’t look favorably on using this for non-exceptional code. Then there are multiple kernel/userland context switches, a separate execution environment with a separate stack, blah blah blah. It is terrible. However, once a card has been marked as “was clean and now is not” you will not do any of that again until the next GC, for that card. Obviously larger cards speed up mainline (non-GC) code in this scheme, since no single card will be marked twice.
  • Using VM page protection also has limitations. A GC card cannot be smaller than a VM page. x86 and amd64 offer 4 KB and 2 MB as usable page sizes. 4 KB is too small; you get flooded with system calls. That is why my toy at work (and subsequently SBCL) uses 32 KB cards, always changing 8 VM pages of 4 KB each at a time with a single mprotect(2) system call (see below). Using hugepages (2 MB) is problematic since you then mark 2 MB of space for future scavenging even if only a single pointer in there ever pointed to a younger generation. On the upside, since you rarely invoke the protection mechanism, you can use different GC card sizes across the heap (messing around with the extra indirection to find things is affordable in this rarely taken code path). I will probably change to a larger card size for young generations and a smaller one for older generations.
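    With that layout, cleaning one card after a scavenge is a single system call covering its eight pages; a sketch:

        #include <sys/mman.h>

        #define CARD_SIZE (32 * 1024)   /* 8 VM pages of 4 KB per GC card */

        /* The previous scavenge found no pointers to younger generations in
           this card, so write-protect it; the next mutator store re-dirties it. */
        static int protect_clean_card(void *card)
        {
            return mprotect(card, CARD_SIZE, PROT_READ);
        }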
  • GC cards are generally numbered and looked up by virtual address. A higher-numbered GC card usually has a higher virtual address.
  • on the other hand, the different generations in the heap do not have contiguous virtual addresses. Any random GC card can end up in any random generation, in any random order inside that generation (now that I’m mentioning it, I should probably sort the cards in SBCL before scavenging).
  • Typical numbers: VM page 4 KB or 2 MB. GC card 4 KB to 32 KB. Number of generations: 6 (the toy at work runs fewer since there is a sharp U shape of long-living and short-living data). The total heap can be gigabytes, but not all of the Lisp heap is GCed heap. Optimizations are done such as not GCing data from the on-disk heap image, and using data in Lisp-usage layout in directly mmap(2)ed areas and the like. The free-floating (not file backed) Lisp heap should probably be kept under a gigabyte, at least for a heap that is moving data in the to-be-GCed generation. I can’t reveal any overly specific numbers from work here since this varies a whole lot depending on how we are torturing QPX today.
  • GC time as a percentage of total runtime in real-world programs varies a lot. 10% is a good number to give as a general idea. That sounds bad, but you have to keep in mind that using non-GCed heap isn’t free either. In a non-moving-heap language like C you have no choice but to deal with fragmentation by finding previously free(3)ed holes and handing them out to new malloc(3) requests. That usually involves a lot of machinery, including hashtables or other searches, and it ensures that malloc(3) in practice is always a full function call. To compare, a simple memory allocation for a small object in SBCL is only a pointer increment and a comparison against a limit. Two instructions. Two atomic instructions (sketched below). So you can produce garbage a whole lot faster than C can, and I mean at least 10x. Keep in mind that those SBCL allocation instructions are atomic, so there is no locking for multithreading and no requirement for per-thread heaps. C heap allocation often uses per-thread heaps now, but free(3) in C can still be tricky in a multi-threaded environment since there is no guarantee that the same thread frees the data that malloc(3)ed it. Much of this is hidden from casual profiling and hard to measure and visualize, for all languages and heap management styles.
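    The shape of that fast path, expressed as a C sketch (SBCL actually emits the equivalent instructions inline; the names here are illustrative):

        #include <stddef.h>
        #include <stdint.h>

        extern uintptr_t alloc_pointer;   /* next free byte in the open region */
        extern uintptr_t alloc_limit;     /* end of the region (the "barrier") */

        extern void *alloc_slow_path(size_t nbytes);  /* hypothetical: GC or new region */

        static inline void *allocate(size_t nbytes)
        {
            uintptr_t new_free = alloc_pointer + nbytes;  /* the pointer increment */
            if (new_free > alloc_limit)                   /* the comparison */
                return alloc_slow_path(nbytes);
            void *obj = (void *)alloc_pointer;
            alloc_pointer = new_free;
            return obj;
        }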
  • As I mentioned above, mainline (non-memory-management) code is slowed down by both schemes for GC write notification support. But in C, mainline code can suffer from fragmentation, whereas data in moving-heap languages is allocated in sequence, and later compacted into sequence again. It isn’t clear, and not easy to research, what exactly the slowdown here is for GCed and non-GCed languages. I learned long ago not to make knee-jerk assumptions here.
  • There is also the soft-dirty bit mechanism in the Linux kernel. I am not enthusiastic about it since it doesn’t invoke any user (my) code. I wouldn’t be able to play additional tricks with pages (compression, sharing) and, more importantly, I can’t use it to gather useful statistics. For example, nothing keeps me from having a userfaultfd(2) handling thread be called on every write access to a page or card, not just the first write. It also gets the fault address (the word, not just the page). That way, if you take the performance hit for the measuring, you can draw a picture of which GC cards got “hit” by writes (and hence need to be scavenged), with how many writes into which words, and, just from the address, you can find out the type of the objects written to in the SBCL heap. A sketch of such a handler thread follows below.
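    A compressed sketch of such a handler thread, assuming a kernel whose userfaultfd(2) supports write-protect faults (UFFDIO_REGISTER_MODE_WP, Linux 5.7 or newer); error handling and the registration boilerplate are elided, and record_write() is a hypothetical statistics hook:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <linux/userfaultfd.h>
        #include <stdint.h>
        #include <sys/ioctl.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        #define PAGE_SIZE 4096UL

        extern void record_write(uint64_t addr);   /* hypothetical statistics hook */

        /* Runs in its own thread. Setup (not shown): obtain the fd with
           syscall(SYS_userfaultfd, ...), negotiate UFFDIO_API, register the
           heap with UFFDIO_REGISTER_MODE_WP, then arm it via UFFDIO_WRITEPROTECT. */
        static void handle_write_faults(int uffd)
        {
            struct uffd_msg msg;

            while (read(uffd, &msg, sizeof msg) == sizeof msg) {
                if (msg.event != UFFD_EVENT_PAGEFAULT ||
                    !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
                    continue;

                /* We get the exact faulting word, not just the page. */
                record_write(msg.arg.pagefault.address);

                /* Lift the write protection so the mutator can resume.  To count
                   every write rather than only the first, re-protect the page later. */
                struct uffdio_writeprotect wp = {
                    .range = { .start = msg.arg.pagefault.address & ~(PAGE_SIZE - 1),
                               .len   = PAGE_SIZE },
                    .mode  = 0,   /* clear the WP bit */
                };
                ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
            }
        }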
  • both the VM-page-protection-based write barrier and the compiled-code/bitmap write barrier can be used when implementing a multithreaded GC. That means a GC where the mainline code threads still stop, but multiple CPUs (or multiple memory banks) work on GCing during the pause.
  • however, only the bitmap-writing scheme can be used to implement a concurrent GC. A concurrent GC collects garbage at the same time as mainline code is running. While it cannot be entirely free of pauses, the system will generally run without GC pauses except for very small hiccups. (This paragraph is subject to me not knowing about some smart trick to overcome this problem for the page protection scheme.)
  • I believe, but have been too lazy to prove so far, that the VM page protection scheme has an edge for multithreaded GC since you don’t have to worry about the bitmap. I am not sure whether you need locks to deal with the bitmap, but again, writing into one more place on a continuous basis costs cache lines and TLB entries, and you do that during GC, which is when you need memory bandwidth the most. I also want to multithread GC so that the GC card being scavenged is always scavenged by a thread running on a CPU that uses the same NUMA memory bank the card is in. Obviously, if your associated bitmap is splattered across the NUMA memory banks then it gets expensive both for reading and for writing, with propagation to the other bank. If you put the bitmap where the GC card is, physically, then you don’t have a virtually contiguous large chunk of heap, which would complicate SBCL allocation.
