Garbage Collector

This post cursory looks over some definition and aspects about Garbage Collectors. It serves to provide beginners useful insights but in no way is complete and therefore correct.

Why Garbage Collector is useful

A lot of languages allow us to allocate memory dynamically and use them. This memory needs to be released back to the system at certain point when not in use otherwise we can run out of memory. It is important to note the following:
* we end up using a lot of libraries and frameworks that are developed by others.
* most of the memory needed is short lived.
If there is no garbage collector then we need to decide a protocol to “how to free memory”. Is it the allocator who is responsible or there is some reference count support and who will look out to free circular references etc. We might end up creating a simpler version of collector ourselves.

Definitions

  • Concurrent Garbage Collectors — They run in parallel with the application itself.
  • Parallel Garbage Collectors — They use more than one thread for garbage collection.
  • Mutator — part of application that changes memory/objects.
  • Safe-point — Its that point or range of code where no mutator is running.
  • GC safe-point — when all the threads are at safe-point.

Note that code might still be running for reasonable amount of time at safe-point e.g. JNI calls. 
It is easy to see that garbage collection is much easier problem when we are at GC safe-point when developing a non-concurrent and single threaded collector. Also, note that we would prefer to have as many safe-points as possible. However, one should not consider injecting the safe-points artificially.

GC Approaches

  1. Mark/Sweep/Compact 
    Mark: Starting with statics and Thread stack roots, one can visit and mark all the objects that is reachable. The time spent would be O(active heap size)
    Sweep: freeing
    up all the unmarked objects. The time spent would be O(whole heap size).
    Compact: relocate and remap objects.
    The time spent would be O(active heap size).
    Note that compaction is needed since over the time the memory gets fragmented but not every GC cycle may do it.
    As one can observe that compaction slows down the approach to a great extent. Also if you noticed, if one doubles the heap size, the frequency of GC cycles will reduce but when GC occurs then the “stop the world” pause time would be doubled as well.
  2. Copy Collector
    Weak Generational Hypothesis: Active/live heap will be a small % of whole heap space. 85% of objects die fast.
    This approach considers objects either in young generation or in old. Each object’s age (e.g. no. of GC cycle seen) along with some other meta-data is maintained. The idea is that mostly we will do GC in young generation and if needed arises then in old generation.
    The young generation can be seen as divided into two spaces. In each GC cycle, we move all the reachable objects from first space to the other. Its a monolithic process that can take space of 2 * size of live heap size. Remembered sets are used to track all references in a young generation for each thread stack which helps avoid scanning the whole heap.
    Usually we expect, a lot of objects to be collected and therefore we keep less than 2x space of live heap. This might end up placing few objects in old generation in case of space crunch.
    However, after few cycles, we might need to GC old generation as well which could be mark/compact or mark/sweep/compact.

Moving objects
An object might be referenced by multiple objects so moving an object might mean a lot of updates. This becomes more of a challenge if the objects are getting changed often (high mutation rate). Thus one might observe the hardness in designing concurrent collectors. 
One more challenge is differentiating between fields like an integer and a memory/object reference. How can a collector know it? In order to achieve this, someone needs to provide lot of extra information to the collector which usually is a compiler who provides oopmaps. Thus this introduces two categories of collectors:
* Precise collectors — can differentiate between each and every object reference and fields at least at safe-points.
* Conservative collector — cannot always differentiate.

Notes

  1. A young generation GC (Minor GC) is always triggered when VM is not able to allocate memory for a new object. So, it is a function of allocation rate.
  2. Due to copying, no fragmentation actually takes place in young generation space.
  3. The young generation GC will result usually in small “stop the world” pauses.
  4. Repeating again — doubling heap size can decrease the frequency of GC cycles but old Gen pause might just double as well.
  5. A truly concurrent collector will not have “stop-the-world” pauses. (related read)

Please note that depending on your VM, the behavior might change for good or bad but this should give you a good idea.