Python: Under The Hood [Garbage Collection]

Vasu Pal
Analytics Vidhya
Published in
5 min readMay 7, 2020

--

Hi, welcome to the 2nd part of Python: Under The Hood series, in this article, we’ll look at what garbage collection is and how it works in the context of Python language.

Let’s first talk about What is Garbage Collection? and What does it do?

Garbage collection is a technique in computer science, which is used to clear up memory occupied by the objects that are no longer required by the program by removing them from the memory.

All HLL programming languages implement garbage collection internally as the core of the language so that developers don’t have a deal with managing memory allocation and de-allocation explicitly.

Now, coming to question What does it do?

Majorly it does two things:

  1. Memory clean up
  2. Heap fragmentation

Okay, I understand memory clean up but what is heap fragmentation?

Programs in python require contiguous memory blocks, and when an object is removed during the execution, a specific memory block gets freed and ready to be occupied by another object.

The only issue here is that not every object in python requires a single memory block. Some might take up multiple memory blocks such as a list in python which requires contiguous memory blocks.

So, what heap fragmentation does is,

It aggregates all the free blocks inside memory to form a contiguous memory block.

Let’s understand this with an example

Say, our program during execution has a memory structure like this.

Heap memory representation
Heap memory representation

Let’s say our garbage collector has removed the 3rd object (300) from the memory and then, we try to create a list object containing 3 integer values.

Cannot add list due to unavailability of contiguous memory

Even though there are 3 blocks available to contain 3 integer objects but since the list requires contiguous memory blocks and only 2 blocks are available thus this operation cannot be performed on that specific memory block.

Clearly, this isn’t the most efficient way to manage memory. Thus, here comes heap fragmentation to the rescue.

add list operation after fragmentation

GC after deleting the object performs fragmentation and puts the free memory blocks contiguously. Thus allowing the “add” list operation to be performed.

Cool! Now let’s talk about how python implements garbage collection?.

Though are multiple garbage collection techniques or garbage collectors GCs. You can take a look at these types here.

Interestingly, Python uses 2 garbage collectors named Reference Garbage Collector and Generational Garbage Collector.

Reference Garbage Collector

Every PyObject in python maintains a RefCount which counts the number of references it has i.e how many variables are pointing to that object.

In my first article, I discussed how RefCount changes during the program execution state.

Reference GC is the main garbage collection mechanism in python, and there is no way to disable or replace the implementation. However, we can use getrefcount(<variable>) from sys module to count the number of references an object holds.

import sys
a = "xyz"
sys.getrefcount(a)
>>2

Notice that there are two references. One is from creating the variable at line 2 and second is when we pass the variable “a” to the sys.getrefcount() function at line 3.

Despite having such robust implementation, it holds some drawbacks as well.

  • It does not detect cyclical references.

Cyclical references occur when two object references each other. Thus, the RefCount of these objects never reaches Zero.

Cyclical references between objects
  • It’s not thread-safe.

It is plausible in a multi-threaded environment that two threads can access/modify the same object simultaneously. And Reference GC has no mechanism to deal with it.

Generational Garbage Collector

The key concept behind generational GC is that most of the new objects become unreachable quickly. Thus, we can separate the objects based on their life and allocate in different generations.

Python has 3 generation of objects. One object residing in only one generation at a time.

During the execution, whenever an interpreter encounters a new object, it puts the object into generation-0.

For each generation, the garbage collector holds a limited number of objects i.e the threshold. When the number of objects exceeds that threshold, the garbage collection process is executed. If an object survives this process, the object moves up into the next generation (or stays in the same generation in the case of generation-3 objects).

Unlike reference GC, you can modify the generational GC behavior such as to modify the threshold, manually trigger the garbage collection process, or disable it altogether. Click here to check out the GC docs to know more.

Alright! but does it solve the issue of the cyclical reference?

Yes, Generational GC keeps track of the object references across generations by observing the creation and overwriting of references and is able to detect objects that are unreachable without having to traverse the entire reference tree.

And what about the threading issue?

Well, Python solves this issue by dealing with multi-threading in a very interesting way. Python implements a mechanism called Global Interpreter Lock or GIL.

In the most basic sense,

GIL allows only one thread that can be in a state of execution at any point in time.

Thus, no need for a thread-safe environment.

We’ll learn more about GIL is my next article.

--

--

Vasu Pal
Analytics Vidhya

Founding member @hypd.store. Building tech around creator-commerce. Empowering creators to host their content, products & merchandise using easy access stores.