Garbage Collection as a Memory Management Technique in Python

Demystifying the concept of Garbage Collection in Python

Joseph Osoo
9 min read · Jul 31, 2023
An image of a robot trash collector used to introduce the article on the garbage collection concept in Python
Trash collector robot image from https://www.freepik.com

Introduction

Memory management is an important aspect of programming, as it determines how efficiently a program can use the available resources and avoid errors or crashes. It affects the performance, efficiency, and reliability of applications. In some languages, such as C and C++, you have to manage memory manually, which can be tedious and error-prone. In Python, however, memory management is mostly handled automatically by the interpreter, using a technique called garbage collection. This memory management strategy frees the memory occupied by objects that are no longer needed, which simplifies your code and reduces the risk of memory leaks or crashes.

This article takes an in-depth dive into what garbage collection is, how it works in Python, and how you can interact with it using some built-in functions and modules. It also discusses some best practices and common pitfalls to avoid when dealing with memory management in Python.

What is Garbage Collection?

Garbage collection (GC) is a process of identifying and reclaiming the memory used by objects that are no longer reachable or referenced by the program. In other words, GC cleans up the memory that is not being used anymore, and then makes it available for new objects.

Garbage collection is different from manual memory management, where the programmer has to explicitly allocate and deallocate memory for each object. Although manual memory management can be more efficient and precise, it requires more effort and attention from the programmer. It is also prone to errors such as memory leaks (when memory is not freed after use) and dangling pointers (when a pointer points to an invalid or freed memory location).

Garbage collection, on the other hand, simplifies the programming task by automating the memory management. However, it also introduces some overhead and complexity. The interpreter has to periodically run the GC algorithm and pause the execution of the program. Moreover, GC does not guarantee that all unused memory will be reclaimed immediately or at all, as some objects may be kept alive by circular references or external references.

How does Garbage Collection work in Python?

Python uses a hybrid approach of reference counting and generational garbage collection to manage memory. Reference counting is a simple technique that keeps track of how many references (or aliases) an object has. When an object is created, its reference count is set to one. Subsequently, when another variable or data structure refers to the same object, its reference count is incremented. Conversely, deleting or reassigning a reference to an object decrements its reference count. When an object’s reference count reaches zero, the object is no longer referenced by any part of the program, and it is immediately deleted from memory.
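
To make the counting concrete, here is a minimal sketch that watches an object's reference count change using sys.getrefcount(). The Node class and the variable names are hypothetical and exist only for this illustration. Because sys.getrefcount() itself holds a temporary reference to its argument, the sketch compares differences rather than absolute values.

```python
import sys


class Node:
    """A hypothetical class used only for this illustration."""


obj = Node()
# getrefcount() includes the temporary reference held by its own argument,
# so only the differences between readings are meaningful here.
baseline = sys.getrefcount(obj)

alias = obj                       # a second name bound to the same object
after_alias = sys.getrefcount(obj)

holder = [obj]                    # a container holding one more reference
after_container = sys.getrefcount(obj)

del alias                         # removing a name drops one reference
holder.clear()                    # emptying the container drops another
after_cleanup = sys.getrefcount(obj)

print(after_alias - baseline)      # 1
print(after_container - baseline)  # 2
print(after_cleanup - baseline)    # 0
```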

Reference counting works well for most cases, but it has a limitation: it cannot detect circular references. A circular reference occurs when two or more objects refer to each other directly or indirectly, creating a cycle of references that prevents them from being garbage collected. Here’s a Python code snippet to understand circular references:

# create two objects
a = [1, 2, 3]
b = [4, 5, 6]
# create a circular reference
a.append(b)
b.append(a)
# delete the references
del a
del b

In the above code block, after deleting a and b, the two lists still exist in the memory, but they are not reachable by any variable or name. However, they still refer to each other, so their reference counts are not zero. Therefore, reference counting alone cannot free them from the memory.
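
The effect can be observed with a small experiment. This is only a sketch: it swaps the lists for instances of a hypothetical Node class, because plain lists cannot be weakly referenced, and it uses a weak reference as a probe that does not keep the object alive. Automatic collection is paused for the duration so that the timing of the check is predictable.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for the lists above (lists cannot be weakly referenced)."""

    def __init__(self):
        self.other = None


was_enabled = gc.isenabled()
gc.disable()                  # keep the automatic collector out of the experiment
try:
    a = Node()
    b = Node()
    a.other = b               # a -> b
    b.other = a               # b -> a, completing the cycle

    # A weak reference lets us observe the object without keeping it alive.
    probe = weakref.ref(a)

    del a
    del b
    # Reference counting alone cannot free the pair: each still refers to the other.
    alive_after_del = probe() is not None

    gc.collect()              # run the cycle detector explicitly
    alive_after_collect = probe() is not None
finally:
    if was_enabled:
        gc.enable()

print(alive_after_del, alive_after_collect)  # True False
```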

To solve this problem, Python also uses generational garbage collection, which is a more advanced technique that divides objects into generations based on how many collection cycles they have survived. The idea is that most objects are short-lived and die soon after creation, while some objects are long-lived and survive for a long time. By grouping the objects into generations, Python can focus on the younger objects, which are more likely to be garbage, and examine the older and more stable objects less often.

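Python has three generations of objects: generation 0, generation 1, and generation 2. Generation 0 contains the youngest and most recently created objects. Generation 1 contains the objects that survived one garbage collection cycle, while generation 2 contains the objects that survived two or more GC cycles. Each generation has a threshold that controls when it is collected. The threshold for generation 0 is compared with the number of allocations minus the number of deallocations since the last collection, while the thresholds for generations 1 and 2 are compared with the number of times the next younger generation has been collected since the older one was last examined. At the time of writing, the default thresholds are 700 for generation 0, 10 for generation 1, and 10 for generation 2 (the values can differ between Python versions), as shown below: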

import gc
gc.get_threshold()
Output:
(700, 10, 10)

Reference counting does not wait for a garbage collection cycle: as soon as an object’s reference count drops to zero, the object is deleted from memory. The generational collector exists only to find reference cycles, so it tracks just the container objects (such as lists, dictionaries, and class instances) that can take part in one. When the generation 0 threshold is exceeded, the collector examines generation 0. Objects that turn out to be unreachable are deleted, and the objects that survive are moved to generation 1.

To decide which objects are unreachable, CPython uses a cycle-detection algorithm that is similar in spirit to mark-and-sweep, except that it works from reference counts rather than from a set of root objects. The algorithm works as follows:

  • Mark: For every tracked object in the generation being collected, the algorithm copies the reference count and subtracts the references that come from other objects in the same set. Any object whose adjusted count is still above zero is referenced from outside the set, so it and everything reachable from it are marked as alive.
  • Sweep: The algorithm deletes the objects that are not marked. These unmarked objects are considered garbage since they are only referenced by each other. The ones that are marked are moved to the next older generation.

Finally, Python collects generation 1 and generation 2 using the same algorithm, and a collection of an older generation also covers all younger generations. However, these collections run less frequently than those of generation 0 because the older generations contain the most stable objects.
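
One way to picture the mark and sweep phases is the toy model below, which assumes the collector works from reference counts in the way CPython's cycle detector does. It is a pure-Python simulation on a hypothetical object graph, not the interpreter's actual implementation: the function name find_unreachable, the graph dictionary, and the external_refs dictionary are all made up for illustration.

```python
def find_unreachable(graph, external_refs):
    """Toy model of cycle detection by reference-count subtraction.

    graph: maps each tracked object's name to the names it refers to.
    external_refs: number of references each object receives from outside
        the tracked set (for example, from variables).
    Both inputs are hypothetical; a real collector inspects live objects.
    """
    # A real collector only knows each object's total reference count,
    # so rebuild that total first: external plus internal references.
    refcount = {name: external_refs.get(name, 0) for name in graph}
    for targets in graph.values():
        for target in targets:
            refcount[target] += 1

    # Copy the counts, then subtract every reference that originates
    # from inside the tracked set.
    adjusted = dict(refcount)
    for targets in graph.values():
        for target in targets:
            adjusted[target] -= 1

    # Mark: objects with a positive adjusted count are referenced from
    # outside, and everything reachable from them is alive as well.
    alive = set()
    stack = [name for name, count in adjusted.items() if count > 0]
    while stack:
        name = stack.pop()
        if name in alive:
            continue
        alive.add(name)
        stack.extend(graph[name])

    # Sweep: whatever was never marked is only referenced by other garbage.
    return sorted(set(graph) - alive)


# "a" and "b" form an isolated cycle; "c" is held by a variable and refers to "d".
graph = {"a": ["b"], "b": ["a"], "c": ["d"], "d": []}
external_refs = {"c": 1}

garbage = find_unreachable(graph, external_refs)
print(garbage)  # ['a', 'b']
```

Note that "d" survives even though nothing outside the set refers to it directly, because it is reachable from "c".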

How can we interact with Garbage Collection in Python?

Python provides some built-in functions and modules that allow us to interact with the garbage collection process and control some aspects of it. Let’s dive into some of these modules and their common functions in Python:

gc: This is a module that provides access to the GC operations and statistics. Some of the useful functions in this module are:

  • gc.enable(): This function enables the automatic garbage collection, which is the default behavior in Python.
  • gc.disable(): This function disables the automatic garbage collection, which can be useful for performance reasons or debugging purposes. Only the cycle collector is switched off; reference counting keeps working. However, this also means that the programmer has to manually invoke the GC when reference cycles need to be reclaimed, using gc.collect().
  • gc.collect(generation=2): This function performs a full or partial GC cycle, depending on the optional argument generation. If generation is not specified, it performs a full GC cycle on all generations. If generation is 0, 1, or 2, it performs a GC cycle on the specified generation and all younger generations. The function returns the number of unreachable objects it found, that is, the sum of collected and uncollectable objects.
  • gc.get_threshold(): This function returns a tuple of three integers representing the current thresholds for each generation.
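  • gc.get_count(): This function returns a tuple of three integers (count0, count1, count2) representing the current collection counts. These are the values that are compared against the thresholds, not the number of objects in each generation.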
  • gc.get_objects(generation=None): This function returns a list of all objects tracked by the GC, optionally filtered by generation. If generation is not specified or is None, it returns all objects. If generation is 0, 1, or 2, it returns only the objects in the specified generation.
  • gc.get_referrers(*objs): This function returns a list of objects that directly refer to any of the objects in objs. Only objects tracked by the GC are searched.
  • gc.get_referents(*objs): This function returns a list of objects that are directly referred to by any of the objects in objs.
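
The following sketch exercises several of these functions together. The Node class and the holder and wrapper variables are hypothetical and exist only for this illustration; the exact numbers printed by gc.get_threshold() and gc.get_count() depend on the Python version and on what else the program has allocated, so no expected values are shown for them.

```python
import gc


class Node:
    """Hypothetical class used only for this illustration."""

    def __init__(self):
        self.other = None


gc.collect()                      # start from a clean slate
was_enabled = gc.isenabled()
gc.disable()                      # pause automatic cycle collection
try:
    a, b = Node(), Node()
    a.other, b.other = b, a       # build a reference cycle
    del a, b                      # the cycle is now unreachable
    found = gc.collect()          # number of unreachable objects found
finally:
    if was_enabled:
        gc.enable()               # restore the previous setting

print(found >= 2)                 # True: at least the two Node instances

holder = {"key": "value"}
wrapper = [holder]

# Objects that directly refer to `holder`; the list is among them.
referrers = gc.get_referrers(holder)
# Objects that `wrapper` directly refers to; the dictionary is among them.
referents = gc.get_referents(wrapper)

print(any(r is wrapper for r in referrers))  # True
print(any(r is holder for r in referents))   # True

# Current thresholds and collection counts (values vary by version and workload).
print(gc.get_threshold(), gc.get_count())
```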

sys: This is a module that provides access to some system-specific parameters and functions. Some of the useful functions in this module related to memory management are:

  • sys.getsizeof(object[, default]): This function returns the size of an object in bytes. Only the memory directly attributed to the object is counted, not the memory of the objects it refers to, and the garbage collector overhead is added for objects that the GC tracks. If the object does not provide a way to retrieve its size, the value of default is returned when it is given; otherwise a TypeError is raised.
  • sys.getrefcount(object): This function returns the reference count of an object. The result is generally one higher than you might expect, because it includes the temporary reference created by passing the object as an argument to the function.
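
A short sketch can show what sys.getsizeof() does and does not measure. The variable names are arbitrary, and the exact byte counts differ between Python versions and platforms, so the example only compares sizes rather than printing them as expected values.

```python
import sys

big = "x" * 10_000
container = [big]

size_of_string = sys.getsizeof(big)
size_of_list = sys.getsizeof(container)

# getsizeof() is shallow: the list's size covers its own header and its
# pointer to the string, not the 10,000-character string itself.
print(size_of_list < size_of_string)        # True

# A longer list needs more room for pointers, so its own size grows.
empty = []
filled = [0] * 1000
print(sys.getsizeof(empty) < sys.getsizeof(filled))  # True

# A rough "deep" size has to add the referenced objects by hand.
approx_total = size_of_list + sum(sys.getsizeof(item) for item in container)
print(approx_total > size_of_string)        # True
```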

weakref: This module provides support for creating weak references to objects. A weak reference is a reference that does not increase the reference count of an object, and does not prevent it from being garbage collected. A weak reference can be used to create caches or mappings to objects without affecting their lifetime. Some of the useful classes and functions in this module are:

  • weakref.ref(object[, callback]): This class creates a weak reference to an object. The optional argument callback is a function that will be called when the object is about to be finalized (i.e., deleted from memory). The weak reference object can be called like a function to retrieve the original object; it returns None if the object has been garbage collected.
  • weakref.proxy(object[, callback]): This function returns a proxy object that holds a weak reference to another object. The proxy behaves like the original object in most respects, so you can use it without calling it first, but it is a separate object with its own id, and it is not hashable even if the original object is. The optional argument callback is a function that will be called when the original object is about to be finalized. Once the original object has been garbage collected, using the proxy raises a ReferenceError.
  • weakref.WeakKeyDictionary([dict]): This class creates a mapping whose keys are held by weak references, while the values can be any Python object. An entry is automatically removed from the dictionary when the object used as its key is garbage collected. The optional argument dict is a regular dictionary that can be used to initialize the weak key dictionary.
  • weakref.WeakSet([iterable]): This class creates a set of weak references to objects. The elements are weak references to objects, and they will be automatically removed from the set when the objects they refer to are garbage collected. The optional argument iterable is an iterable that can be used to initialize the weak set.
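
As a rough illustration of how these four tools behave once the last strong reference disappears, consider the sketch below. The Resource class, the events list, and the other variable names are hypothetical. Instances of user-defined classes are used because many built-in types, such as list and dict, do not support weak references directly.

```python
import gc
import weakref


class Resource:
    """Hypothetical class; instances of user-defined classes support weak references."""

    def __init__(self, name):
        self.name = name


events = []
res = Resource("db")

ref = weakref.ref(res, lambda _ref: events.append("finalized"))
proxy = weakref.proxy(res)
metadata = weakref.WeakKeyDictionary({res: "extra info"})
registry = weakref.WeakSet([res])

print(ref() is res, proxy.name, len(metadata), len(registry))  # True db 1 1

del res        # drop the only strong reference
gc.collect()   # not required under CPython's reference counting, but harmless

print(ref() is None)   # True: the weak reference is now dead

try:
    proxy.name
except ReferenceError:
    proxy_dead = True  # the proxy refuses to work once its target is gone
else:
    proxy_dead = False

print(proxy_dead)                            # True
print(len(metadata), len(registry), events)  # 0 0 ['finalized']
```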

What are some best practices and common pitfalls to avoid when dealing with memory management in Python?

As we have seen, Python automates most of the memory management, but it is still up to the programmer to follow good practices. Here are some tips and recommendations to follow when working with memory management in Python:

  • Avoid creating unnecessary or temporary objects, as they consume memory and trigger GC cycles. Use generators, generator expressions, or iterators instead of fully built lists or tuples when possible, as they produce values lazily and on demand.
  • Avoid creating circular references, as they prevent objects from being garbage collected by reference counting. Use weak references or break the cycles manually if needed.
  • Avoid using global variables or long-lived objects that hold references to other objects, as they may prevent them from being garbage collected. Use local variables or function arguments instead, or clear the references when they are no longer needed.
  • Avoid using large or complex data structures that consume a lot of memory, such as nested lists or dictionaries. Use specialized modules or libraries that provide more efficient or compact representations, such as array, collections, numpy, and pandas.
  • Monitor and measure the memory usage and performance of your program, using tools such as memory_profiler, tracemalloc, and objgraph. Identify and optimize the parts of your code that consume the most memory or trigger the most GC cycles.
  • Be careful when using external libraries or modules that may create or manipulate objects outside of Python’s control, such as C extensions, databases, and GUI frameworks. They may have their own memory management mechanisms that may not be synchronized with Python’s GC. Read their documentation and follow their guidelines on how to properly use and release their resources.
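
Two of the tips above can be sketched with the standard library alone: preferring lazy generators over fully built lists, and measuring memory with tracemalloc. The workload and the variable names below are made up for illustration, and the reported byte counts will vary between machines and Python versions.

```python
import sys
import tracemalloc

n = 100_000
squares_list = [i * i for i in range(n)]   # every value is held in memory
squares_gen = (i * i for i in range(n))    # values are produced on demand

list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(gen_size < list_size)                # True: the generator stays small

# Both forms yield the same values when consumed.
total_from_gen = sum(squares_gen)
print(total_from_gen == sum(squares_list))  # True

# Measure the memory allocated by a piece of code with tracemalloc.
tracemalloc.start()
data = [str(i) for i in range(n)]          # the workload being measured
current, peak = tracemalloc.get_traced_memory()
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

print(f"current={current} bytes, peak={peak} bytes")

# Show the source lines responsible for the largest allocations.
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```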

Conclusion

In this article, we have learned about garbage collection as a memory management technique in Python. We have seen how Python uses reference counting and generational garbage collection to identify and reclaim unused memory. We have also learned how to interact with the GC process using some built-in functions and modules. The article has also provided some tips and tricks on how to avoid some common pitfalls and improve memory usage and performance in Python.

