Monitoring memory usage of a running Python program
At Survata, we do a lot of data processing using Python and its suite of data processing libraries like pandas and Scikit-learn. This means we use a lot of cloud computing resources, and as a result, our monthly hosting bill can be… hefty.
One way to trim the amount you spend on cloud resources is to make sure you don’t ask for more resources than you actually use. Cloud providers make it really easy to spin up a multiple-GB-of-RAM server — but if your actual running process only uses a fraction of that memory, you’re wasting resources — and that means money!
However, you can’t optimize the resources you use if you don’t know what you’re actually using.
Option 1: Ask the operating system
The easiest way to track memory usage is to use the operating system itself. You can use top to provide an overview of the resources you’re using over time. Alternatively, if you want a spot inspection of resource usage, you can use the ps command with its -m and -o flags.
The -m flag instructs ps to show results in order of which processes are using the most memory (this is how the BSD and macOS ps behaves; the Linux ps uses --sort=-%mem for this instead). The -o flag controls which properties of each process are displayed; in this case, the percentage of CPU being used, the percentage of system memory being consumed, and the command line of the process being executed. The CPU percentage counts one full CPU core as 100% usage, so if you have a 4-core machine, it’s possible to see a total of up to 400% CPU usage. There are other output options to display other process properties, and other flags to ps to control which processes are displayed.
Combined with some creative shell scripting, you could write a monitoring script that uses ps to track memory usage of your tasks over time. Most hosting providers will also provide dashboards for monitoring machine-level resource usage.
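As a rough illustration of that kind of monitoring script, here is a minimal sketch in Python rather than shell. The function names are hypothetical, and the sketch assumes a ps that accepts -A and the %cpu, %mem and command output keywords. It sorts in Python instead of relying on -m, since that flag is not portable.

```python
import subprocess


def parse_ps_output(text):
    """Parse the output of `ps -o %cpu,%mem,command` into a list of dicts."""
    rows = []
    # The first line is the column header, so skip it.
    for line in text.strip().splitlines()[1:]:
        # Split into at most 3 fields so the command line keeps its spaces.
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        cpu, mem, command = parts
        rows.append({"cpu": float(cpu), "mem": float(mem), "command": command})
    return rows


def top_memory_processes(limit=5):
    """Return the `limit` processes using the largest share of system memory."""
    output = subprocess.run(
        ["ps", "-A", "-o", "%cpu,%mem,command"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    rows = parse_ps_output(output)
    return sorted(rows, key=lambda row: row["mem"], reverse=True)[:limit]
```

Calling top_memory_processes() in a loop with a sleep between calls, and logging the rows you care about, would give you a crude record of memory usage over time.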
There are also profilers like py-spy that can be used to wrap the execution of a Python process and measure its memory and CPU usage. These profilers use operating system calls, combined with a knowledge of how Python code executes, to take periodic measurements of your program as it runs, and identify which parts of your code are using resources.
Unfortunately, this approach isn’t always viable for data pipeline tasks. In our situation, we’re using AWS Batch as a host for our compute tasks, which obscures the operating system-level interface. Each deployed task is wrapped in a Docker container; that task then nominates how much memory and CPU it needs to run.
This containerization process obscures how much memory is being used inside the container. From the hosting provider’s perspective, a Docker container that allocates 8GB of RAM is using all that memory, even if the code running inside the container only allocates a fraction of that amount.
So — we need to monitor memory usage inside the container.
Your first inclination might be to use the same operating system techniques, but inside the container. While this does technically work, general advice is that a Docker container should run a single process — so running a second monitoring process inside a container isn’t a good option.
Measuring memory usage from outside the running process also makes it harder to collect metrics that would allow you to correlate memory usage with properties of the data being analyzed. For example, does memory usage scale with the number of records in the data set? Or is it related to the complexity of the analysis performed? When analyzing at the level of the operating system, it may be difficult to collect metrics on the operation of the underlying analysis.
What we need is a way to monitor the memory usage of a running Python process, from inside that process.
Option 2: tracemalloc
The Python interpreter has a remarkable number of hooks into its operation that can be used to monitor and introspect into Python code as it runs. These hooks are used by pdb to provide debugging; they’re also used by coverage to provide test coverage. They’re also used by the tracemalloc module to provide a window into memory usage.
tracemalloc is a standard library module added in Python 3.4 that tracks every individual memory block allocated by the Python interpreter. tracemalloc is able to provide extremely fine-grained information about memory allocations in the running Python process.
Calling tracemalloc.start() starts the tracing process. While tracing is underway, you can ask for details of what has been allocated; for example, tracemalloc.get_traced_memory() returns just the current and peak memory allocation. Calling tracemalloc.stop() removes the hooks and clears any traces that have been gathered.
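The start, query and stop sequence described above can be sketched as follows. The helper measure_with_tracemalloc and the build_table workload are hypothetical names used only for illustration; the tracemalloc calls themselves are the standard library API.

```python
import tracemalloc


def build_table(rows):
    """Hypothetical stand-in for a real analysis step."""
    return [list(range(100)) for _ in range(rows)]


def measure_with_tracemalloc(fn, *args):
    """Run fn(*args) under tracemalloc; return (result, current, peak) in bytes."""
    tracemalloc.start()
    try:
        result = fn(*args)
        # Both values are sizes, in bytes, of blocks traced since start().
        current, peak = tracemalloc.get_traced_memory()
    finally:
        # Always remove the hooks, even if fn raises.
        tracemalloc.stop()
    return result, current, peak
```

For example, result, current, peak = measure_with_tracemalloc(build_table, 10_000) returns the table along with the two byte counts, which you can then log next to the size of the input.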
There’s a price to be paid for this level of detail, though. tracemalloc injects itself deep into the running Python process, which, as you might expect, comes with a performance cost. In our testing, we observed a 30% slowdown when using tracemalloc on an analysis run. This might be OK when profiling an individual process, but in production, you really don’t want a 30% performance hit just so you can monitor memory usage.
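If you want to check the overhead on your own workload, one possible approach is to time the same function with and without tracing. This is only a sketch: time_call and workload are hypothetical names, and a single timing like this is noisy, so you would want to repeat it several times before trusting the ratio.

```python
import time
import tracemalloc


def workload(rows):
    """Hypothetical allocation-heavy task standing in for a real analysis."""
    return [list(range(100)) for _ in range(rows)]


def time_call(fn, *args, trace=False):
    """Return the wall-clock seconds taken by fn(*args), optionally traced."""
    if trace:
        tracemalloc.start()
    try:
        start = time.perf_counter()
        fn(*args)
        return time.perf_counter() - start
    finally:
        if trace:
            tracemalloc.stop()
```

Comparing time_call(workload, 100_000) against time_call(workload, 100_000, trace=True) gives a rough slowdown figure for that particular workload.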
Option 3: Sampling
Luckily, the Python standard library provides another way to observe memory usage: the resource module. The resource module provides basic controls for resources that a program allocates, including memory usage.
The call to resource.getrusage() returns the resources used by the program. The constant RUSAGE_SELF indicates that we’re only interested in the resources used by this process, not its children. The object returned is a structure that contains a range of operating system resources, including CPU time, signals, context switches and more; but for our purposes, we’re interested in ru_maxrss, the maximum Resident Set Size, which is the largest amount of memory the process has held in RAM so far. The value is reported in kilobytes on Linux and in bytes on macOS.
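A minimal sketch of that call is shown below. The helper name peak_rss_bytes is hypothetical, and the unit conversion assumes the kilobytes-on-Linux, bytes-on-macOS convention; the resource module is only available on Unix-like systems.

```python
import resource
import sys


def peak_rss_bytes():
    """Return the peak resident set size of this process, in bytes."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    # Assumption: ru_maxrss is in bytes on macOS and in kilobytes elsewhere.
    scale = 1 if sys.platform == "darwin" else 1024
    return usage.ru_maxrss * scale
```

Dividing the result by 1024 ** 2 gives a figure in MiB, which is usually precise enough for sizing a container.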
However, unlike the tracemalloc module, the resource module doesn’t record a history of usage; each call only reports a single value at the moment it is made. So, we need to implement a way to sample memory usage over time.
First, we define a class to perform the memory monitoring.
When you invoke measure_usage() on an instance of this class, it will enter a loop, and every 0.1 seconds, it will take a measurement of memory usage. Any increase in memory usage will be tracked, and the maximum memory allocation will be returned when the loop exits.
But what tells the loop to exit? And where do we call the code being monitored? We do that in a separate thread.
A ThreadPoolExecutor gives us a convenient way to submit tasks to be executed in a thread. We submit two tasks to that executor: the monitor, and my_analysis_function (if the analysis function requires additional arguments, they can be passed in with the submit call). The call to fn_thread.result(), where fn_thread is the future returned for the analysis function, will block until the analysis function completes and its result is available, at which point we can notify the monitor to stop, and get the maximum memory. The try/finally block ensures that if the analysis function raises an exception, the monitor thread will still be terminated.
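Putting the monitor class and the executor together, a sketch of the approach might look like the following. The names measure_usage and fn_thread come from the description above; MemoryMonitor, stop(), interval, mem_thread and run_with_memory_monitor are hypothetical names chosen for this example. Using a threading.Event to signal the monitor is one possible design, not necessarily the one used in our pipeline.

```python
import resource
import threading
from concurrent.futures import ThreadPoolExecutor


class MemoryMonitor:
    """Samples ru_maxrss at a fixed interval until told to stop."""

    def __init__(self, interval=0.1):
        self.interval = interval  # seconds between samples
        self._stop = threading.Event()

    def stop(self):
        """Ask the sampling loop to exit."""
        self._stop.set()

    def measure_usage(self):
        """Loop until stopped; return the largest ru_maxrss value seen."""
        max_usage = 0
        while True:
            sample = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            max_usage = max(max_usage, sample)
            # wait() sleeps for the interval, but returns True as soon as
            # stop() is called, so the loop exits promptly.
            if self._stop.wait(self.interval):
                return max_usage


def run_with_memory_monitor(fn, *args, **kwargs):
    """Run fn in a worker thread while a second thread samples memory."""
    monitor = MemoryMonitor()
    with ThreadPoolExecutor(max_workers=2) as executor:
        mem_thread = executor.submit(monitor.measure_usage)
        try:
            fn_thread = executor.submit(fn, *args, **kwargs)
            result = fn_thread.result()  # blocks until fn completes
        finally:
            # Stop the monitor even if fn raised an exception.
            monitor.stop()
            max_usage = mem_thread.result()
    return result, max_usage
```

A call such as result, max_usage = run_with_memory_monitor(my_analysis_function) returns the analysis result together with the largest sample, in the units ru_maxrss uses on your platform. Exposing interval as a constructor argument leaves room to tune the sampling rate.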
Using this approach, we’re effectively sampling memory usage over time. Most of the work will be done in the main analysis thread; but every 0.1s, the monitor thread will wake up, take a memory measurement, store it if memory usage has increased, and go back to sleep.
The performance overhead of this sampling approach is minimal. Although sampling every 0.1 seconds might sound like a lot, it’s an eternity in CPU time, and as a result, there is a negligible impact on overall processing time. This sampling rate can be tuned, too; if you do see an overhead, you can increase the pause between samples; or, if you need more precise data, you can decrease the pause.
The downside is that the sampling-based monitoring approach is imprecise. You’re only sampling memory usage, so short-lived memory allocation spikes will be lost in this analysis. However, for the purposes of optimizing cloud resource allocation, we only need rough numbers. We are only looking to answer whether our process is using 8GB or 10GB of RAM, not differentiate at the byte (or even megabyte) level.
Conclusion
It’s impossible to improve something you aren’t measuring. Armed with more information about the memory usage of our analysis tasks, we’re now in a much better position to optimize our resource usage. And, we’ve been able to collect that information with relatively little code and relatively little performance overhead.