Demystifying Python Garbage Collection

Oleksandr Pryimak
Thumbtack Engineering

Introduction

At Thumbtack, we use machine learning to help customers find professionals for their home services and to solve problems such as recommending home service categories (like Plumbing or House Cleaning) to customers. Machine Learning (ML) model inference is the process of taking model inputs (features) and calculating model outputs. To provide a great customer experience we need to deliver inference results in a timely fashion, which means running ML inference within a latency budget of 50–150 milliseconds.

I work on the ML Infrastructure team. One thing we do is help our client teams ensure that their model returns results in time for >=99.999% of requests (fewer than 10 timeouts per million). Typically we use libraries with Python bindings to implement the inference. Different ML models require different runtimes (e.g. TensorFlow vs PyTorch vs ONNX). On a few occasions we saw much worse long tail latency than our benchmarks or a load test suggested.

For example, when we deployed our first PyTorch-based model to canary we noticed an unacceptable number of timeouts (requests taking more than 100 ms to execute), even though both benchmarks and our artificial load test suggested inference should typically take 20–30 ms with occasional spikes to 50 ms. One thing that was very easy to blame for these timeouts is garbage collection (GC): it fits this pattern of timeouts that appear despite successful load tests, because GC behavior can change with sustained uptime. After all, it has a bad reputation.

In this article we will show how to measure Python GC performance in production. More importantly, we will show how we attribute GC delays to changes in request latency.

How does garbage collection work in Python?

Python uses reference counting for most of its memory management. It is a cheap and efficient method which does not require garbage collection (GC) unless objects form reference cycles. Python uses GC to ensure that memory is not leaked when there are cycles. For the purpose of this post, we will treat garbage collection mechanics as a black box. The only thing to know is that it is a “stop the world” garbage collector, which means no other code can execute while GC is in progress.

In other words: if GC happens during processing a user request, the request will be delayed for whatever amount of time Python needs to finish garbage collection.
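To make this concrete, here is a minimal, illustrative snippet (not from our production code) showing a reference cycle that reference counting alone cannot reclaim:

import gc

class Node:
    def __init__(self):
        self.other = None

# Two objects pointing at each other form a reference cycle.
a, b = Node(), Node()
a.other, b.other = b, a

# Dropping our names does not free the objects: each still holds
# a reference to the other, so their refcounts never reach zero.
del a, b

# The cyclic garbage collector finds and reclaims them.
print(gc.collect())  # prints the number of unreachable objects found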

How do we measure GC impact on latency?

The Python standard library includes the gc module, which (among other things) exposes the `gc.callbacks` list of functions. Python will call every function from this list before and after each GC. That is how we can measure the duration of GC!

Note that the callback is called with two parameters:

  1. `phase`, which can be either `'start'` or `'stop'`
  2. `info`, which contains all other information

To illustrate the idea, take a look at this code:

import time
import gc
from typing import Mapping

start_time = None
time_in_gc = []

def gc_callback(phase: str, info: Mapping[str, int]) -> None:
    global start_time
    global time_in_gc
    if phase == 'start':
        # this indicates the function is called before garbage collection
        start_time = time.time()
    else:
        # phase has only 2 possible values: 'start' and 'stop'
        duration = time.time() - start_time
        start_time = None
        time_in_gc += [duration]

gc.callbacks += [gc_callback]
print(time_in_gc)
gc.collect()
print(time_in_gc)

Our solution

In our production system, we don’t just measure GC duration, but also attribute it to slowing down the processing of a user request. We found great inspiration in the `GcLogger` class from the mypy project:
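Below is a rough sketch of what such a context manager can look like. It is loosely modeled on mypy’s `GcLogger` but simplified, so details differ from the actual mypy code:

import gc
import time
from typing import Mapping, Optional

class GcLogger:
    """Context manager that records time spent in GC while the block runs."""

    def __enter__(self) -> 'GcLogger':
        self.gc_start_time: Optional[float] = None
        self.gc_time = 0.0
        self.gc_calls = 0
        self.start_time = time.time()
        gc.callbacks.append(self.gc_callback)
        return self

    def gc_callback(self, phase: str, info: Mapping[str, int]) -> None:
        if phase == 'start':
            self.gc_start_time = time.time()
        elif self.gc_start_time is not None:
            # phase == 'stop': accumulate time spent in this collection
            self.gc_calls += 1
            self.gc_time += time.time() - self.gc_start_time
            self.gc_start_time = None

    def __exit__(self, *args: object) -> None:
        # Runs even if the wrapped code raised: the callback is always removed.
        while self.gc_callback in gc.callbacks:
            gc.callbacks.remove(self.gc_callback)

    def get_stats(self) -> Mapping[str, float]:
        return {
            'gc_time': self.gc_time,
            'gc_calls': self.gc_calls,
            'total_time': time.time() - self.start_time,
        }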

And this class can be used like this:

with gclogger.GcLogger() as gc_logger:
    result = self.do_get_prediction(getPredictionInput)

Note that this class is a context manager: a class which implements the `__enter__` and `__exit__` methods.

The `__enter__` method is called when entering the `with` block, and `__exit__` is called when leaving it. More importantly, `__exit__` is called even if an exception is raised!

Using a context manager makes it possible to add this callback only for the duration of the critical code and remove it afterwards. In our case we wrapped our ML inference code with this context manager.

Attributing GC-introduced delay to latency

After the `with GcLogger()` block from above, one is left with the `gc_logger` variable and can call `gc_logger.get_stats()` to get all needed stats. We typically send this information to InfluxDB, a scalable metrics datastore. This permits us to store the data for each user request and query it later interactively using Grafana, the dashboard solution of our choice.

In our Grafana dashboard we count the percentage of requests for which GC added ≥ 16 ms to the ML inference duration. If this is 1% or more, it does affect P99 latency. If it is 0.1%, it will still affect long tail performance, but much less.
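As a rough sketch of how these pieces fit together (the threshold constant and metric field names below are illustrative, the `do_get_prediction` call stands in for our inference code, and the actual write to InfluxDB is omitted):

GC_DELAY_THRESHOLD_SECONDS = 0.016  # 16 ms

with GcLogger() as gc_logger:
    result = do_get_prediction(prediction_input)  # hypothetical inference call

stats = gc_logger.get_stats()
request_metrics = {
    'gc_time_seconds': stats['gc_time'],
    'gc_calls': stats['gc_calls'],
    # Flag requests where GC added at least 16 ms, so the dashboard
    # can chart the percentage of affected requests.
    'gc_delay_over_threshold': stats['gc_time'] >= GC_DELAY_THRESHOLD_SECONDS,
}
# In production these per-request metrics would be written to InfluxDB
# and aggregated in Grafana.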

Measuring GC delays is useful

Now let’s come back to our example from the introduction. When we deployed our first PyTorch model to canary we observed really poor long tail latency. But the Grafana board we built showed that for <0.1% of queries GC added ≥ 16 ms to the latency. This told us that in this case GC was not to blame! The vast majority of timeouts were not due to GC; something else was at play. Instead of wasting engineering time on optimizing object allocation, we pursued a different way to solve this latency problem, but that is another story on its own.

Our experience recording GC-caused delays showed that GC is almost never the reason for timeouts. That being said, on one occasion GC was the problem, and our measurements led us to discover this very quickly after deploying to canary.

Conclusion

The Python standard library provides the gc module, which makes it possible to measure GC-caused delays. By measuring them and attributing them to the corresponding request latency measurements, Thumbtack engineers can answer the question “Can excessive GC explain poor long tail latency for my application?” In turn, this helps us deliver a better customer experience.

Acknowledgement

I would like to thank Navneet Rao and Richard Demsyn-Jones for their feedback on this post. I would like to thank the entire ML Infra team at Thumbtack for their support throughout this project.
