Debugging and preventing memory errors in Python

André Menck
Brex Tech Blog
May 2, 2022

At Brex, we use Python extensively for Data Science and Machine Learning applications. On the Data Platform team, we often need to write applications that can execute Python code written by our Data Science (DS) team, like on-demand feature computation or inference for predictive models.

Because these Python services interact with real-life data, there can be significant variance in the amount of data processed per request. For example, a request to “run inference to predict customer X’s future spending” might have very different memory usage requirements, depending on how much data is available on customer X’s past spending patterns.

This set the stage for one of our most recent reliability challenges: dealing with out-of-memory errors (OOMs) in our model-serving services, equipped with the knowledge that some of the requests will not necessarily be “well-behaved.” This one was quite an interesting challenge to debug, so we figured it was worthwhile to share some of our learnings with the broader community.

Our basic setup

Let’s start off with some basic info about our setup. Everything that follows (unless specifically called out) has been tested on this specific setup:

  • Python version: 3.7.3 (I know, upgrading is on our list)
  • All of our Python services get built into docker images using Bazel (after being packaged into a binary executable)
  • Our services run in a Kubernetes cluster on AWS
  • All of our Python services are running on a standard gRPC server process
  • You will see some references to async workloads — the basic setup for these is Python Celery workers consuming from a queue backed by Redis

Why do we need to address OOMs?

Perhaps the answer to this is quite obvious (i.e., OOMs cause services to fail), but it’s worth digging into what exactly happens when a Python process “runs out of memory.” The Python process itself is blissfully unaware of the memory limit imposed on it by the OS.¹ Instead of self-limiting the amount of memory it will allocate, the Python process will simply attempt to allocate more memory to do whatever it needs to do (for example, to load in some more data from a database) — only to immediately receive a SIGKILL from the OS when it hits its memory limit.

This not only stops the process from completing the task(s) it was running, but it also rules out any kind of “graceful” shutdown. This is probably the worst possible thing that could happen to your process in production — in addition to preventing any ongoing requests from completing, it is also extremely difficult to debug. After all, the inability to shut down gracefully prevents any server-side errors, metrics, or traces from being generated to aid in service observability.

To be more concrete, consider the following two different cases where we saw real-life OOMs happen at Brex.

OOMs in synchronous requests on a server

This case is essentially the same case outlined in the intro: a server runs out of memory while processing a request. The dangerous thing here is that our servers are almost always handling more than one request at a time — when a single one of these requests causes a process to run out of memory, all requests being handled by that process will immediately fail. Even worse: since a graceful shutdown is not possible, we cannot control how they fail, which means we can’t even ensure the client receives a retriable error message.

Worker processes running out of memory

In addition to processing requests synchronously, we also have Python processes that act as workers consuming tasks from a queue. In this case, each task is performed sequentially by each worker, so the problem above does not exist. However, because the SIGKILL makes a graceful shutdown impossible, these tasks simply remain unacknowledged by the worker. For our specific Celery setup, this means we end up hitting the visibility timeout, so that our workers repeatedly attempt to consume these OOM-inducing tasks ad-infinitum (more elaborate async setups will, of course, have additional retry redundancies, which ensure this does not happen).
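For reference, the two Celery settings at play here look roughly like the following in a Redis-backed setup (the values shown are illustrative, not our production configuration):

from celery import Celery

app = Celery("workers", broker="redis://localhost:6379/0")

# Acknowledge tasks only after they complete, so a SIGKILLed worker leaves
# its in-flight task unacknowledged on the queue.
app.conf.task_acks_late = True

# Redis re-delivers any unacknowledged task once this timeout expires, which
# is exactly how an OOM-inducing task ends up being retried forever.
app.conf.broker_transport_options = {"visibility_timeout": 3600}  # seconds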

How can we debug these errors?

All this doom and gloom brings us to the question: how do we even begin to address this problem? Our initial intuition here was fairly straightforward — some combination of the following hypotheses must be causing the service to run out of memory:

  1. There was a memory leak in our server implementation, causing it to use more and more memory in each request and eventually run out of memory.
  2. While none of the individual requests were using too much memory, the combination of all requests at any given time would use too much memory in aggregate.
  3. There are requests which, on their own, legitimately use more memory than a single process has available.²

In a server that handles a relatively constant number of requests over time, you might think that a memory leak would be quite apparent as simply monotonically increasing memory usage. You would be correct. In this case, however, we were dealing with a server that has very bursty workloads, so there was no opportunity to spot the traditional slowly increasing memory usage over time.

Solving this issue came down to testing these hypotheses against what we saw in reality. In a perfect world, we could distinguish between these hypotheses by measuring:

A. The amount of allocated memory directly attributable to handling each request

B. The percentage of (A) deallocated after the process finishes responding to each request

The world we live in, however, is d̶a̶r̶k̶ ̶a̶n̶d̶ ̶f̶u̶l̶l̶ ̶o̶f̶ ̶t̶e̶r̶r̶o̶r̶s̶ not perfect, and these two numbers are not readily available. This is mainly because these OOMs tended to happen at times when our service was handling a large number of concurrent requests. This is not surprising. After all, any of the three hypotheses laid out would cause the probability of an OOM to increase when the server is handling a large number of concurrent requests. Each request in our server is handled by a separate long-lived thread (spawned from a ThreadPoolExecutor) — a fairly standard setup for Python gRPC servers. Because we cannot measure the memory used by each individual thread², we cannot easily measure the amount of memory used in each individual request.
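For context, the thread-per-request setup described above is just the stock grpc.server construction (the worker count below is illustrative):

from concurrent import futures
import grpc

# Each in-flight request is handled on one of these long-lived threads;
# turning the server's concurrency down to 1 (discussed below) amounts to
# setting max_workers=1 here.
server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
# ...servicers get registered here...
server.add_insecure_port("[::]:50051")
server.start()
server.wait_for_termination()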

Our solution to this problem was to essentially force our server to execute each request one at a time, so that we could more easily measure the amount of memory used per request. The easiest way to do this would be to simply turn down the concurrency of the gRPC server to 1.³ If this were at all a reasonable option for us, we would have taken it without hesitation. However, this would have meant that any requests coming in beyond our concurrency capacity would queue until previous requests were processed. This might have helped us debug our OOM problem, but it would have definitely made our overall reliability worse.

In this case, we had to get more creative: in addition to executing each request on a thread off of the main server process, we would also execute each request entirely asynchronously in an isolated worker process, as in the diagram below.⁴

As you can see, the actual results from the asynchronous computation are not used at all — we simply use the async execution to run each request in an isolated manner.⁵ In our asynchronous worker, we can simply measure the amount of memory used before and after each request is executed (we used psutil.Process.memory_full_info to obtain the resident memory used). This allows us to gain a very detailed understanding of the memory allocated for (A) and deallocated⁶ after (B) each request so we can effectively distinguish between hypotheses (1) and (2). In addition, because requests were running without any contention for resources, we could directly attribute any OOMs that occurred to the specific requests being processed on that worker at that point in time. This, in addition to measuring the memory used before a request was processed, allowed us to differentiate between (1) and (3).
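In sketch form, the per-task bookkeeping on the worker looked something like this (the names here are illustrative, not our actual worker code):

import psutil

def run_and_measure(task_fn, *args, **kwargs):
    proc = psutil.Process()
    # Resident memory of this worker process before and after the task runs.
    rss_before = proc.memory_full_info().rss
    result = task_fn(*args, **kwargs)
    rss_after = proc.memory_full_info().rss
    print(f"task retained roughly {rss_after - rss_before} bytes of resident memory")
    return result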

So, what was our conclusion? Are you holding your breath? At the end of the day, hypothesis (3) was the main cause of our memory issues in this specific instance. This became quite obvious when we saw the asynchronous behavior: a single request was repeatedly causing a worker process to OOM and being continuously retried, as Celery was never able to acknowledge the task after consuming it from the Redis queue.

Attempts at memory profiling

We should briefly mention that we attempted to profile the memory used by this server in order to identify which bits of code were allocating memory that was causing our OOM. We attempted to use tracemalloc, largely following the steps outlined in this blog post, creating an asynchronous thread which would report the top memory allocation tracebacks in our code.

Unfortunately, after deploying this piece of instrumentation to production we almost immediately saw a significant increase in the number of OOMs reported for our service. From the few logs we were able to recover after this attempt, it was clear that the vast majority of the memory allocation was in fact coming from the usage of tracemalloc itself. At the end of the day, using this profiling tool ended up creating far too much overhead to be useful.

Memory limits to the rescue (sort of)

The process above for debugging allowed us to understand when and how requests cause OOMs, but it comes with a couple of drawbacks:

  1. It is inherently reactive, as it allows the OOM to continue to happen on the gRPC server process, and
  2. As outlined, this would require quite a significant amount of upkeep for what ends up being just a debugging tool at the end of the day.

In an ideal scenario, we would be able to prevent the OOM error in the first place, which essentially requires stopping the Python process from allocating more memory than it is allowed. From an implementation perspective, the desired behavior would be that the Python interpreter would, instead of attempting to allocate more than its fair share of OS memory, simply raise a MemoryError in the codepath attempting such an allocation. Then, at the very top level of the execution stack in the gRPC server, we can handle such errors and return a retriable error code to the client (in this case the RESOURCE_EXHAUSTED gRPC code seems appropriate).

In a world where we can trust clients to be well-behaved (thankfully, we live in such a world on the Data Platform team at Brex), clients can back off and retry, hopefully reaching either (a) the same exact process at a later point in time, when there is less contention for memory or (b) another instance entirely of the same service which is using less memory (provided a load balancer is appropriately employed).

Of course, we wouldn’t be discussing this scenario unless there were a reasonable method to implement it. Enter: the Python resource module.

I was quite surprised to find that there were very few instances documenting this use case for the Python resource module (aside from one excellent and succinct blog post by Carlos Becker, which much of this discussion will build on). After spending some time working with the resource module, I now have some suspicions as to why there is scarce information on this use case: imposing resource limits without a thorough understanding of Python’s memory management system can lead to unexpected behavior (more on this in a bit). Nevertheless, this module does allow us to set a limit on the memory a process is allowed to access, effectively turning OOMs (which affect the entire process) into MemoryErrors (which only affect a single execution path/thread, and can be handled appropriately by the caller). In fact, setting up memory limits is as simple as running the following function at the very top of your process:
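A sketch of such a function, assuming a cgroup v1 memory controller that exposes memory.limit_in_bytes (cgroup v2 exposes memory.max instead):

import resource

CGROUP_MEM_LIMIT_FILE = "/sys/fs/cgroup/memory/memory.limit_in_bytes"  # cgroup v1

def set_memory_limit() -> None:
    """Cap this process's heap at the memory limit imposed by its cgroup."""
    with open(CGROUP_MEM_LIMIT_FILE) as f:
        limit = int(f.read().strip())
    # RLIMIT_DATA bounds the data segment (heap), which is where Python
    # allocates the vast majority of its objects.
    resource.setrlimit(resource.RLIMIT_DATA, (limit, limit))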

The snippet above will ensure the process’s heap (where Python allocates the vast majority of its data) does not grow beyond the limits imposed by its cgroup.⁷ If you are running your Python process on a Kubernetes cluster, this will correspond to the memory resource limit for the container running said process.

That was easy enough, wasn’t it? Oh, what is that? Your process is now raising MemoryErrors if you look at it funny? Ok, maybe I missed something here…

Thread stack size

When we initially tested out this fix, it almost immediately crashed and burned in our development environment. The problem ended up being simple: we were not accounting for the fairly massive default Python stack size on Linux systems (roughly 10 MB by default). This is the size of the C stack, which gets allocated on the heap of the process as soon as any thread is created. If there is insufficient memory to spawn new threads, then your multi-threaded program will have an absolutely miserable time. Considering the Very Large amount of memory used by a single Python thread, and the relatively low memory limits we were dealing with (originally, each process was set up to have a 2 GB limit), this meant that a process would very quickly reach its memory limit due to the threads it was spawning alone.

The solution to this? Well, quite simply: set the stack size to some lower and more reasonable value (in our case, we ended up with 2 MB). This, of course, does not come without its dangers — if the Python process is making calls which result in the C stack size growing significantly, then this can be problematic. For our purposes, the 2 MB limit is quite comfortable (we could probably get away with going far lower than that):
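The adjustment itself is a one-liner via threading.stack_size, which must run before any worker threads are spawned (a sketch, using the 2 MB value mentioned above):

import threading

# Shrink the stack allocated for each newly created thread from the ~10 MB
# platform default down to 2 MB.
threading.stack_size(2 * 1024 * 1024)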

Behavior when limiting address space

It is worth noting for posterity here (and for others that happen to stumble upon this problem) that the choice to use RLIMIT_DATA was not a simple one. In fact, we initially attempted to set RLIMIT_AS (as in the original post that described this solution), but we found that this created some fundamentally weird problems. To get a flavor of what I’m talking about, you can try running this Python snippet yourself:⁸
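A sketch of that experiment: cap the address space at roughly 1 GB, then spawn threads until something breaks (the thread count and the sleep are just there to keep the stacks alive):

import resource
import threading
import time

# Cap the total address space at roughly 1 GB.
resource.setrlimit(resource.RLIMIT_AS, (1_000_000_000, resource.RLIM_INFINITY))

threads = []
for i in range(100):
    print(f"spawning thread {i}")
    t = threading.Thread(target=lambda: time.sleep(10))
    t.start()
    threads.append(t)

for t in threads:
    t.join()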

The snippet above will error out once it gets to thread 18 — this is far lower than the memory limit of 1 GB should allow for. Through some experimentation and learning about how Python actually manages its memory (it essentially allocates any interesting bits of memory used on the heap), we ended up concluding that using RLIMIT_DATA was more appropriate. In practice, this proved to be fairly well-behaved: the code above runs swimmingly when we swap out RLIMIT_AS for RLIMIT_DATA — in fact, in that case you only begin to see failures when we attempt to spawn hundreds of threads, which is consistent with the default stack size of 10 MB discussed above.

Bringing it all Together

After the long process of learning and debugging that led to the info above, we can now summarize exactly how we changed our gRPC servers to behave more nicely under memory contention. For starters, at the very top of our entry point, we added:
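In sketch form, the entry-point addition combines the two pieces above: the cgroup-derived RLIMIT_DATA cap and the smaller per-thread stacks.

import resource
import threading

# Reduce the per-thread stack size before any threads are created.
threading.stack_size(2 * 1024 * 1024)

# Cap the heap at the cgroup memory limit (cgroup v1 path shown).
with open("/sys/fs/cgroup/memory/memory.limit_in_bytes") as f:
    _cgroup_limit = int(f.read().strip())
resource.setrlimit(resource.RLIMIT_DATA, (_cgroup_limit, _cgroup_limit))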

Finally, in our actual server code, we can simply wrap all of our endpoints with the following decorator:⁹
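A sketch of such a decorator, assuming standard gRPC servicer method signatures:

import functools
import grpc

def convert_memory_errors(handler):
    """Turn a MemoryError raised inside an endpoint into a retriable gRPC status."""
    @functools.wraps(handler)
    def wrapper(self, request, context):
        try:
            return handler(self, request, context)
        except MemoryError:
            context.abort(
                grpc.StatusCode.RESOURCE_EXHAUSTED,
                "Ran out of memory while handling this request; please retry.",
            )
    return wrapper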

And that’s that! Well, almost… this solution will correctly handle most allocation failures — for example, if one attempts to load in too much data from a database inside of an endpoint handler. However, there are some potential edge cases that should also be addressed, if one’s intent is to be thorough (which we, of course, all are). Specifically, there are some errors that are caused by memory allocation failures but do not come up as MemoryErrors:

  1. Exceptions that are raised from MemoryErrors, but which do not themselves preserve the exception type. It is relatively straightforward to account for these, by simply recursing down into an exception’s __cause__ and checking whether it is a MemoryError instance.
  2. Exceptions raised when the interpreter cannot start new threads due to memory contention (i.e., if there is insufficient memory available to allocate a new stack). It is easy to reproduce this error yourself using a Python interpreter:
Python 3.9.7 (default, Sep  3 2021, 20:10:26) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import resource
>>> from threading import Thread
>>>
>>> resource.setrlimit(resource.RLIMIT_DATA, (10_000_000, -1))
>>> t = Thread(target=lambda: 1 + 1)
>>> t.start()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/threading.py", line 892, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

Accounting for both of these types of exceptions is relatively straightforward, if a bit awkward for the latter one:
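In sketch form (the string match on the thread-start failure is the awkward part):

def is_memory_related(exc: BaseException) -> bool:
    """Return True if this exception, or anything in its __cause__ chain,
    ultimately stems from memory exhaustion."""
    if isinstance(exc, MemoryError):
        return True
    # Case 2: the interpreter could not allocate a stack for a new thread.
    if isinstance(exc, RuntimeError) and "can't start new thread" in str(exc):
        return True
    # Case 1: a MemoryError wrapped in some other exception type.
    if exc.__cause__ is not None:
        return is_memory_related(exc.__cause__)
    return False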

Finally, we can slightly change our decorator above to use this new error handler:
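Again in sketch form, with the same caveats as above:

import functools
import grpc

def convert_memory_errors(handler):
    """Convert anything memory-related into a retriable RESOURCE_EXHAUSTED status."""
    @functools.wraps(handler)
    def wrapper(self, request, context):
        try:
            return handler(self, request, context)
        except Exception as exc:
            if is_memory_related(exc):
                context.abort(
                    grpc.StatusCode.RESOURCE_EXHAUSTED,
                    "Server is under memory pressure; please retry.",
                )
            raise
    return wrapper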

This solution, of course, will not get rid of all your memory problems. If you operate in a memory-constrained environment (which all atom-bound computation does, at the end of the day), you will always have some point at which you run out of memory. However, this solution will allow you to (a) better isolate failures due to badly behaving requests and (b) fail gracefully whenever your application does run into memory contention.

What about memory leaks?

Going back to our initial hypotheses when debugging these OOMs: the solution above will help whenever cases 2 and 3 are hit, but in the case of memory leaks (our first hypothesis), this would not be very useful. In fact, the changes above might actually reduce a service’s reliability if said service has a significant memory leak. Consider the example of a server that leaks some memory on every request made to it. After some time, that process will eventually consume enough memory that it cannot process any new requests. If the solution above is applied, this process will stay alive in a “bad state” indefinitely. On the other hand, if we remove the memory limits set via the resource module, the resulting OOMs will kill the process (at which point whatever orchestration method you are using — in our case, Kubernetes — should restart the process with low memory usage). How, then, can we ensure that memory leaks don’t end up negatively affecting our reliability?

It is important to start right off the bat here with a lecture: the way to avoid operational issues due to memory leaks is to fix them. I have seen many cases where production services rely on an orchestration mechanism to restart them periodically in order to avoid leaking memory. While this might prove a workable short-term solution, it is hardly an engineering standard we should build towards. Instead, one should aim to (a) be alerted whenever a potential memory leak occurs and (b) have some mechanism to limit the negative effects of a memory leak in a production environment, when it does occur.

With that out of the way, how does the solution proposed here perform for our criteria (a) and (b)? Well, (a) is already handled pretty well: if a server is leaking memory, then this solution will lead to the server eventually being unable to respond to most requests (which ought to trigger some of your automated alerting, right?). For (b), however, the existing solution is woefully insufficient. The fix is to modify the server’s liveness and readiness checks to take used and available memory into account. While we will not go into the details of the code to do so (it should be relatively straightforward… plus, you’re probably getting tired of reading about OOMs by now), we can briefly describe a couple of potential solutions here.

  1. Monitor the memory available to the process in the readiness check — if the memory is consistently below a certain threshold (i.e., the threshold you estimate is necessary to process each marginal request), then have the readiness check fail. If this state persists for an extended amount of time, then fail the liveness check as well (which will cause Kubernetes to restart your pod/process). See the sketch after this list.
  2. Keep track of the number of times your server encounters an “allocation error” over some recent time interval. If that value exceeds some percentage threshold of all requests, then have the liveness check return a failed response.
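A sketch of option (1), with purely illustrative helper names and thresholds (the wiring into your actual health-check endpoints is left out):

import time
import psutil

LOW_MEMORY_HEADROOM_BYTES = 200 * 1024 * 1024  # illustrative estimate per marginal request
LIVENESS_GRACE_SECONDS = 300                   # illustrative

_low_memory_since = None

def readiness_ok(memory_limit_bytes: int) -> bool:
    """Fail readiness whenever headroom under the memory limit gets too small."""
    global _low_memory_since
    used = psutil.Process().memory_full_info().rss
    if memory_limit_bytes - used < LOW_MEMORY_HEADROOM_BYTES:
        _low_memory_since = _low_memory_since or time.time()
        return False
    _low_memory_since = None
    return True

def liveness_ok() -> bool:
    """Fail liveness only once readiness has been failing for a while."""
    return _low_memory_since is None or time.time() - _low_memory_since < LIVENESS_GRACE_SECONDS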

Of course, both of these methods suffer from the issue that they will kill your application, potentially interrupting requests that would otherwise have succeeded! This is why, at the end of the day, all you can hope for is to detect and fix any potential memory leaks in order to ensure your server is reliable!

[1] By default. We will explore ways to change this in a later part of this post.

[2] In fact, “thread memory usage” is not even a well-defined quantity, let alone an easily measurable one.

[3] This could be accomplished by simply setting the max number of workers for the ThreadPoolExecutor mentioned above to 1.

[4] Ok, this is a half-truth. The asynchronous workflow was built independently to handle similar requests to the synchronous workflow, and that ended up having the nice consequence of making OOM errors much easier to debug. Nevertheless, the strategy we present here is a perfectly sound strategy for cases in which one wishes to debug such errors while still using a completely synchronous workflow.

[5] Note that, in order to execute the asynchronous request in the first place, we had to deploy additional resources in our Kubernetes cluster. This might seem counter-intuitive, since we were attempting to address an issue that arises from a resource-constrained environment. The reality here is that we were not resource constrained in the absolute sense: we are far below our instance limits, so it is quite easy to deploy new instances to handle the async requests. However, the resources were constrained on each single node, which is what generated the OOMs in the first place.

[6] This is also a half-truth. Python’s memory management system is more complex than this, since it manages memory allocation differently for objects under 512 bytes. Specifically, the memory allocator will not necessarily return memory back to the OS when small objects are deleted. In our particular case (and I suspect this is fairly generalizable), the bulk of the memory allocation was coming from large objects, so the measurements of resident memory before vs. after a request was processed were quite representative of the “true” memory allocated/deallocated by the Python process during the request lifecycle.

[7] If you are running an application in a docker container, the cgroup memory limit is the same as the limit imposed by Docker on the container.

[8] We originally tested this out using the python:3.7.3 docker image, but have since confirmed the same behavior on 3.9.7.

[9] In practice, we use a gRPC server interceptor here — the decorator should serve the same purpose, though, and is significantly easier to follow if you are not very familiar with interceptors.
