Memory efficiency of parallel IO operations in Python

Jakub Wolf
Published in code.kiwi.com
Mar 13, 2018 · 5 min read

Python allows for several different approaches to parallel processing, and the main challenge is knowing the limitations of each. Broadly, we parallelise one of two kinds of work: IO-bound operations or CPU-bound tasks like image processing. The first use case is what we focused on at the recent Python Weekend*, and this article summarises what we came up with.

Before Python 3.5, there were two common ways of parallelising IO-bound operations. The native method was multithreading, and the non-native method involved frameworks like Gevent that schedule concurrent tasks as lightweight pseudo-threads. Python 3.5 then brought native coroutine support with the async/await syntax and the asyncio module. I was curious to see how each of these approaches would perform in terms of memory footprint. Find out the results below 👇

Prepare a testbed

For this purpose, I created a simple script. Even though it does not have a lot of functionality, it demonstrates a real use case: it downloads bus ticket prices from a webpage for the next 100 days and prepares them for processing. Memory usage was measured with the memory_profiler module. The code is available in this GitHub repository.
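The structure of such a benchmark can be sketched as follows. This is a minimal sketch, not the actual repository code: `fetch_price` is a hypothetical helper, and the HTTP request is simulated with `time.sleep` so the sketch runs without network access.

```python
import time

DAYS = 100  # the script fetches prices for the next 100 days


def fetch_price(day):
    # Stand-in for one HTTP request to the ticket-price webpage.
    time.sleep(0.01)
    return {"day": day, "price": 42.0}  # dummy payload


def run_sequential():
    # One request after another: total runtime is roughly DAYS * latency.
    return [fetch_price(day) for day in range(DAYS)]
```

Running a script like this under memory_profiler (for example with `mprof run script.py`) produces memory-usage curves like the ones shown below.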

Let’s test!

Synchronous

I executed a single-threaded version of the script to act as a benchmark for the other solutions. The memory usage was pretty stable throughout the execution, and the obvious drawback was the execution time. Without any parallelism, the script took about 29 seconds.

Sequential memory usage

ThreadPoolExecutor

Multithreading is part of the standard library toolbox. Since Python 3.2 it has been easily accessible via concurrent.futures.ThreadPoolExecutor, which provides a very simple API for parallelising existing code. However, using threads comes with some drawbacks, and one of them is higher memory usage. On the other hand, a significant increase in execution speed is the reason we'd want to use it in the first place. This test ran in ~17 seconds, a big difference compared to ~29 seconds for synchronous execution. The size of the gap depends on the speed of the IO operations, in this case network latency.
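The threaded variant of the sketch above needs only a small change: hand the same download function to a pool. As before, `fetch_price` and the simulated latency are stand-ins for the real HTTP call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DAYS = 100


def fetch_price(day):
    time.sleep(0.01)  # stand-in for one HTTP request
    return {"day": day, "price": 42.0}


def run_threaded(workers=20):
    # The pool runs up to `workers` requests concurrently;
    # map() returns results in submission order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_price, range(DAYS)))
```

With 20 workers, the 100 simulated requests overlap in batches of 20, which is where the speed-up over the sequential version comes from.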

ThreadPoolExecutor memory usage with 20 threads.

Correction

nickcw from Hacker News pointed out that when the max_workers parameter is None, the number of threads is equal to the number of processors on the machine multiplied by 5. In my case, that means 20 threads. After setting max_workers to 100 (to match the number of requests), the memory usage is much higher:
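That default can be computed directly. Note that this formula applies to Python 3.5–3.7; from Python 3.8 the default changed to `min(32, os.cpu_count() + 4)`.

```python
import os

# ThreadPoolExecutor's default when max_workers is None (Python 3.5-3.7):
# number of processors times 5.
default_workers = (os.cpu_count() or 1) * 5
```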

ThreadPoolExecutor memory usage with 100 threads.

Gevent

Gevent is an alternative approach to parallelisation that brings coroutines to pre-Python-3.5 code. Under the hood, it relies on small, independent pseudo-threads called greenlets, but it also spawns a few threads for internal needs. The overall memory footprint is very similar to that of multithreading.
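A gevent version of the sketch looks like this. It requires the third-party gevent package; `monkey.patch_all()` replaces blocking standard-library IO (sockets, sleep) with cooperative versions so greenlets can switch while waiting. As in the earlier sketches, `fetch_price` simulates the HTTP request.

```python
from gevent import monkey
monkey.patch_all()  # must run before other imports that use sockets

import time
import gevent

DAYS = 100


def fetch_price(day):
    time.sleep(0.01)  # patched by gevent: yields to other greenlets
    return {"day": day, "price": 42.0}


def run_gevent():
    # Spawn one greenlet per request, wait for all, collect in order.
    jobs = [gevent.spawn(fetch_price, day) for day in range(DAYS)]
    gevent.joinall(jobs)
    return [job.value for job in jobs]
```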

Pseudo-thread memory usage

Asyncio

Since the release of Python 3.5, coroutines are supported natively by the asyncio module, which is part of the standard Python library. To take advantage of asyncio, I used aiohttp instead of requests. aiohttp is an async equivalent of requests with the same functionality and a similar API.

Library support is a point to consider before starting an async project, although most of the popular IO-related packages (requests, redis, psycopg2) have equivalents in the async world.
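The coroutine version of the sketch is below. The article's actual script used aiohttp for the real HTTP requests; here `asyncio.sleep` stands in for the network call so the sketch stays self-contained.

```python
import asyncio

DAYS = 100


async def fetch_price(day):
    # Stand-in for an aiohttp GET; awaiting lets other coroutines run.
    await asyncio.sleep(0.01)
    return {"day": day, "price": 42.0}


async def run_async():
    # Schedule all 100 requests concurrently on one event loop thread;
    # gather() returns results in the order the coroutines were passed.
    return await asyncio.gather(*(fetch_price(day) for day in range(DAYS)))


prices = asyncio.run(run_async())  # Python 3.7+; on 3.5-3.6 use loop.run_until_complete
```

All 100 coroutines share a single OS thread, which is why the memory curve below stays close to the sequential baseline.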

Coroutine memory usage (asyncio)

With asyncio, memory usage is significantly lower than with the previous methods. It is very close to the single-threaded version of the script without any parallelisation.

So should we start using asyncio?

Parallelism is a very efficient way of speeding up an application with a lot of IO operations. In my case, it brought a ~40% speed increase compared to sequential processing. Once the code runs in parallel, the speed difference between the parallel methods is very small: IO operations depend heavily on the performance of other systems (network latency, disk speed, etc.), so the execution-time difference between the parallel methods is negligible.

ThreadPoolExecutor and Gevent are very powerful tools that can speed up an existing application, and one major advantage is that in most cases they require only minor changes to the codebase. When it comes to overall performance, the best-performing tool is asyncio with its coroutines: the memory footprint is much lower than with the other parallel methods, without impacting the overall speed. It comes with a price, though: the codebase and its dependencies have to be specifically designed for use with asyncio. This is something to consider when moving a codebase to coroutines.

At Kiwi.com, we use asyncio in high-performance APIs where we want to achieve speed with a low memory footprint on our infrastructure. An example of an "asyncio service" running at Kiwi.com is our public API for geographical location data. You can try the service yourself, and the documentation is available here.

*Kiwi.com Python weekend

Several times a year, we organise an intense coding Dojo for anyone who knows a bit about Python and wants to try using it in real applications. During the weekend we help participants build applications. We show them our approach to software development, some of the useful tools we use and how they can move forward with Python.
