Explaining garbage collection algorithms by keeping your fridge tidy
Learn some commonly-used GC algorithms, work to manage your leftovers.
In every place I’ve lived, I’ve had the same main responsibility: cleaning out the fridge. This life calling began around middle school and my skills developed in parallel with my ability to assemble a meal consisting only of leftover takeout and the last dregs from condiment jars. By the time I’d made it to college and begun advancing my study of computer science, this ability to conjure sustenance from unlikely combinations made the multi-day marathon stays in the laboratory sustainable (if less tasty). Plus, my living spaces were always pleasantly free of the refrigerator science experiment odors which plagued neighboring dorm kitchens, something which my roommates appreciated. Sure, the process of cleaning things out often got contentious — as mocked by George Carlin’s now famous skit, The Ice Box Man — but, on the macro level, it was mostly me just eating leftovers.
However, like all science geeks fascinated with how the world works, I eventually asked myself the question of “can I do this more efficiently?” and was pleased to find the answer: a resounding yes. In fact, cleaning out fridges provides an excellent metaphor for explaining the family of programming’s automatic memory management tools known as garbage collection.
What is garbage collection?
Unexplored by many who consider automatic memory management the purview of compiler implementors and language designers, garbage collection is the process by which a program’s unused memory gets automatically reclaimed. This can happen via either the environment (like the JVM) or by the language itself (like in Lisp). In contrast to programs written in languages like C and C++ — whose source code can often resemble a battleground of malloc, free, new and delete calls — garbage collected languages free programmers from such mundane chores as the allocation and reclamation of memory segments.
Just like a cook putting a dish in the refrigerator, programmers regularly stash values in data structures of various types. If you want to hold on to information during program execution, you need memory in which to store it. Once the program has a block of memory, the organization of data in the block itself — whether stored in a hash table, array, tree or some programmer-defined type — is immaterial. If managing memory manually, the program author takes responsibility for returning blocks of memory to the system when no longer needed. Failure to do so leads to memory leaks, situations wherein a program keeps gobbling up memory without giving any back. Programs which leak memory in this fashion run the very real risk of consuming all available memory and getting killed, usually with an Out Of Memory (OOM) error. Compare this with garbage collected memory wherein the collector will try to identify those regions of memory which the program will no longer use and return them to the system.
Kitchen as computer, refrigerator as memory
Following in computer science’s longstanding tradition of imperfect metaphors (a monad is like a burrito!), let us introduce the idea that, at a high enough level, both kitchens and computer share many operational semantics. A computer has an operating system kernel while a kitchen has a chef. A computer possesses hardware components (CPU, memory, hard disk, network interface) for computational tasks while a kitchen boasts a variety of appliances (stove, oven, freezer, blender) used for food preparation. As you might guess, we will draw a parallel between a cook’s usage of a refrigerator for food storage and a program’s usage of memory regions for storing data.
Tracing garbage collection
Tracing garbage collection strategies underly such a broad category of algorithms that many refer to tracing and garbage collection interchangeably. In it’s purest form, tracing garbage collection refers to contiguous blocks of memory as objects and involves very few steps:
- The collector traverses the memory graph by following all references.
- For any object in memory visited by the traversal, the collector deems it to be reachable and therefore live.
- Any object in memory not visited by the traversal must not have any references pointing to it so the collector considers it unreachable and dead.
- The collector condemns dead objects and their space become eligible for reclamation.
Typically, tracing garbage collection runs periodically in collection cycles. Depending on the underlying collector implementation, different phases of work may happen on different collection cycles but, in the naive implementation, the collector would identify dead objects and reclaim them in the same cycle.
Imagine going through every shelf in your fridge and, for every thing you find, you scream out “Hey! Does anyone want this?”. If someone hollers back “Yeah, I’ll eat that with my lunch tomorrow!”, then you can reasonably assume that that orange (or whatever) should stay and it’s continued presence merits the use of real estate. Otherwise, you throw it out and make space in your fridge.
You realize this approach causes you to continually unpack and repack everything so you decide maybe you can break things up into stages rather than doing everything at once.
Mark and sweep
In 1960, John McCarthy, the father of Lisp and progenitor of garbage collection, broke up collection cycles into two parts: the mark phase and the sweep phase. The mark phase consists solely of using tracing to determine object reachability, leaving behind an actual mark (often just a toggled bit) on every visited (live) object. Then, during the sweep phase, every unmarked (dead) object gets reclaimed. Since traversing reference trees and toggling bits during the mark phase incurs significantly less cost than actually freeing memory, a collector could opt to delay sweeping until convenient.
This equates to you simply reaching around and putting a sticker on everything someone wants. Later, when the kitchen is quiet, you can throw out stuff without a sticker.
Having gotten hoarse from all the yelling back and forth to figure out who wants the day-old takeout, you decide to farm out work to others.
A reference counting garbage collector keeps a dedicated count of references to an object, incrementing and decrementing the count as references are established and forgotten. When the counter reaches zero, the collector declares the object dead (condemned) and ripe for reclamation. This requires the added step of managing the reference count as part of initializing or destroying a reference which can cause problems. For example, if a given object gets passed around with great frequency, this could result in performance overhead as the counter changes often. In a multi-threaded program, multiple threads continually trying to modify the same counter could require counter synchronization, increasing program complexity and introducing the possibility of lock contention. Not to mention the problems of needing to maintain counters somewhere (itself using more memory) and the possibility of needing to increment counters beyond the size allocated for them.
Back in your kitchen, you finally got fed up and bought all of your housemates their own sheet of stickers, telling them that if they want to hold on to something for a future meal, they should leave a sticker on it. When they no longer want something, they peel off the sticker and you throw out every stickerless item. Unfortunately, the discussion about who wants what leads to a realization that everyone can pool resources and share food to save money. This leads to a constant flurry of activity and bumped elbows at mealtime as everyone jockeys for a chance to move their stickers around. Might make them a little nuts and slow down snack retrieval but it sure means less work for you!
Generational garbage collectors
Another type of garbage collection algorithm, called generational collection, relies on a principle called the generational hypothesis. This hypothesis relies on the idea that the longer an object stays alive, the more likely it is to survive going forward. Consider the case of buffers used to store data during processing. Such objects get used and then discarded in relatively short order whereas longer-lived objects, like database connections, might stay around for the entire duration of a program’s execution! Taking advantage of this principle, a generational collector could reduce the overhead during scanning phases (whether tracing or something fancier) by focusing more on newer objects.
Imagine a collector which puts objects into two categories, the Young and the Old, called generations. The first few collection cycles might look something like this:
- Scan all objects and condemn the dead ones.
- Reclaim all dead objects.
- All objects which have been around for awhile get promoted to the Old generation.
- For the next few cycles, the collector ignores the Old generation and focuses on the Young generation.
- Every n cycles or so, perform a complete scan.
If the characteristics of your program uphold the generational hypothesis, then a generational garbage collector could improve your GC performance greatly. Many collectors will create a few different generations, each scanned less frequently than the one preceding it.
In your fridge, you could certainly rearrange your food periodically to move older food to a different shelf but, in my home, this happens naturally! It has been my observation that condiments tend to rank among the longest-lived items. Most folks I know store condiments on the door. Items which routinely die young, like leftover pizza and half-eaten containers of Indian takeout, usually end up on the bottom shelf and disappear within a day. Vegetables and fruit go into their special drawers but get eaten up within the week. That bottle of sriracha, though, has been on the door for awhile and will continue it’s blissfully-spicy existence for quite some time.
Now, this doesn’t always work out perfectly and sometimes objects will end up getting promoted to an older generation by mistake. Think about that one bowl of rice which gets stuck on a shelf and pushed so far back that you forget about it until it turns into a science experiment. Consider, for example, a generational collector with three generations, somewhat similar to the JVM (which happens to have two extra non-heap generations used for internal matters). We’ll use the same nomenclature here, in case you want to read up further, of the Eden, Survivor and Tenured generations.
Sure, that one bowl of rice will eventually get discovered and thrown out, but in the meantime, it takes up space much longer than it should.
Why stop the world?
If you stuck with our chosen metaphor, you may have noticed that all scenarios discussed so far require physically standing in front of the fridge, thus making it impractical or inconvenient for others to access it at the same time. This very problem has led computer scientists to classify such blocking garbage collection strategies as stop the world algorithms since they require the main program itself to halt while garbage collection takes place. Often, the garbage collection process will consume more CPU than the regular execution itself, leading to undesirable resource hogging by the program. The specifics differ from implementation to implementation but, since reference counting allows for immediate reclamation if so desired, many OS kernels use reference counting for internal bookkeeping. Some modern garbage collectors, like the JVM’s new G1 collector, rely on a strategy called partitioning wherein the collector will suspend access to a limited portion of memory which could prevent the program itself from halting entirely.
I may never know why I was born with the gift of cooler cleanliness, but I will likely spend the rest of my life in fate’s chosen household role. It may further remain unclear why more people don’t have an appreciation or interest in learning about garbage collection algorithms. Afterall, it’s as easy as cleaning out your fridge.
Aside from the obligatory link to the Memory management article at The Wikipedia, if this topic interests you, I highly recommend the resources at MemoryManagement.org which provides great material regarding the differences between types of memory management as well as an exceptional glossary. For the die hards, there’s always The Garbage Collection Book by Jones, Hosking and Moss, which I fully intend to read as soon as they release an e-reader edition. Moreover, from the oldie-but-goodie department, the c2 wiki has some exceptional material on garbage collection, complete with comments from veteran developers on the low-level details. Finally, Paul R. Wilson from the University of Texas has written a very accessible paper [PDF] providing an introduction to various GC topics.