Every Python Programmer Should Know the Not-So-Secret ThreadPool
You are just a few lines of code away from speeding up your I/O-bound code by an order of magnitude with multithreading
I first ran into the need to parallelize my Python code when I had to run hundreds of external update operations on our CRM system without the option of batching them.
Each update operation would be submitted via an API call and then take about two to three seconds to process. Those updates would trigger processes in the CRM and sometimes throw errors.
The possibility of errors meant that I had to go through the motions countless times to make sure that everything finished to my satisfaction.
What made this endeavor take so excruciatingly long was the fact that after every single API call, my script would have to wait for a response before submitting the next API request.
A situation like this is a typical use case where multithreading (one form of concurrency available in Python) comes in very handy! In Python, there are, in essence, three forms of concurrency:
- Multithreading — pre-emptive, via threading (or, as we'll see, multiprocessing.pool.ThreadPool)
- Cooperative multitasking — via asyncio
- Multiprocessing — via multiprocessing
The general advice is to use multiprocessing for CPU-bound problems (i.e., computationally intensive) and multithreading/multitasking for I/O-bound problems (i.e., waiting for input/output to finish).
Of course, there might be exceptions and ultimately, it comes down to the individual case at hand. In my experience, it does make sense to look into all options as soon as performance becomes critical.
I set up a web API (AWS API Gateway + Lambda) that spits out motivational quotes, which we can “DoS” for benchmarking purposes.
Here’s a sample!
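To give you an idea, here's a minimal way to grab a single quote with requests (the endpoint URL below is a placeholder, not the actual API):

```python
import requests

# Placeholder endpoint standing in for the quote API (not the real URL)
URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/quote"

response = requests.get(URL)
print(response.text)  # one motivational quote per request
```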
Let's first have a look at Pool from the multiprocessing library. I find Pool to have an amazingly easy-to-use API. In essence, you just use Pool in a with block and you have already parallelized your code.
Ridiculously easy, if you ask me. Other implementations of concurrency are much more involved: they require you to actively manage workers, tasks, executors, queues, coroutines, or whatnot. A little bit too much cognitive overhead for my taste. I really like the simplicity of Pool.
“Simple is better than complex.” — The Zen of Python, by Tim Peters
Let's quickly dive into what exactly a Pool does. We first instantiate the pool with a specific number of processes (five, in our example). As a rule of thumb (for CPU-bound tasks), use roughly as many processes as you have CPU cores.
Next, we have p.map(<func>, <iterable>), which takes a function and an iterable, pretty much like the regular built-in map. However, the main difference compared to map is that we now have multiple processes working on the iterable in parallel. As soon as a process is done with its current element from the iterable, it goes back to the iterable and grabs the next one to apply the function to.
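Here's what that looks like in code, using a toy square function as a minimal, illustrative sketch (not the article's benchmark workload):

```python
from multiprocessing import Pool

def square(x):
    # Stand-in work function; each worker process applies it to
    # elements it grabs from the iterable
    return x * x

if __name__ == "__main__":  # required guard for multiprocessing on some platforms
    # Five worker processes, as in the example above; for CPU-bound work,
    # os.cpu_count() processes is a common rule of thumb
    with Pool(5) as p:
        results = p.map(square, range(20))
    print(results)
```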
The “problem” with multiprocessing is that it comes with some overhead, which brings us to multithreading. Threads are lightweight compared to processes, come with significantly less overhead, and make it easier to share memory with one another.
The Secret “from multiprocessing.pool import ThreadPool”
Why secret, you might ask? Well, the thing is, ThreadPool is not really documented. However, the interface is the same as that of multiprocessing.Pool (note that the import is from multiprocessing.pool and not just from multiprocessing).
Let’s start fetching quotes from the web API in a multithreaded manner!
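A sketch of what that can look like, assuming a requests-based fetch function and a placeholder URL:

```python
from multiprocessing.pool import ThreadPool  # note: multiprocessing.pool, not multiprocessing

import requests

# Placeholder endpoint standing in for the quote API
URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/quote"

def fetch_quote(_):
    # Each call blocks on network I/O, which is exactly where threads shine
    return requests.get(URL).text

# Same interface as multiprocessing.Pool, but backed by threads
with ThreadPool(8) as pool:
    quotes = pool.map(fetch_quote, range(100))
print(len(quotes), "quotes fetched")
```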
For Completeness’ Sake — Asyncio
Asyncio seems to be the kinda-sorta-maybe new rave (not to be confused with “ay, se cayó” — Spanish for: “oh, he fell”).
Asyncio was introduced in Python 3.4 but has since evolved quite a bit. I really, really dislike the syntax, but I wanted to include it in the benchmarking.
It's different from multithreading and multiprocessing in the sense that it uses only one process and one thread but executes its code asynchronously.
The crucial point here is that these asynchronous routines can pause while waiting for their result and hand control over to other routines (the executing program determines the timing of the context switches). An event loop facilitates all of this.
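For the curious, here's a minimal sketch of the asyncio variant; it assumes the third-party aiohttp client and, again, a placeholder URL:

```python
import asyncio

import aiohttp  # third-party async HTTP client; pip install aiohttp

# Placeholder endpoint standing in for the quote API
URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/quote"

async def fetch_quote(session):
    # Awaiting the response hands control back to the event loop,
    # which can run other coroutines in the meantime
    async with session.get(URL) as response:
        return await response.text()

async def main(n):
    async with aiohttp.ClientSession() as session:
        # gather() schedules all coroutines on the event loop concurrently
        return await asyncio.gather(*(fetch_quote(session) for _ in range(n)))

quotes = asyncio.run(main(100))
```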
And the Winner Is!
Time for benchmarking!
- The sequential version (i.e., one by one) is by far the slowest; every 1,000 samples add about 80 seconds.
- It is followed by asyncio, which is roughly three to five times faster than the sequential version.
- Eight threads already get us a fivefold speed increase.
- 64+ threads get us a whopping 10x! (This is the point where my Mac starts to cap out. But given the right circumstances, like hardware, a large enough sample size, internet connection, and a server that can handle your requests, you could go even higher and see positive results.)
Going from 160 seconds to ~15 seconds is quite the achievement, I’d say!
It's worthwhile to note that on my local machine, I was able to bring the time it took to fetch 2,000 samples down to roughly 12 seconds via multiprocessing, which might seem counterintuitive at first (given that this is an I/O-bound problem). But multiprocessing can use all available cores and is not restricted to a single core. If you are curious about the overall performance of multiprocessing, check out PEP 371.
If you want to reproduce the results, feel free to have a go at it with the below code. (I only ask you to be a little mindful and not machine-gun the web API into oblivion.)
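Here's a minimal sketch of such a benchmark (the endpoint URL is a placeholder, and the thread counts and sample size are just examples to tweak):

```python
import time

import requests
from multiprocessing.pool import ThreadPool

# Placeholder endpoint standing in for the quote API
URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/quote"

def fetch_quote(_):
    return requests.get(URL).text

def benchmark(n_samples, n_threads):
    # Time how long n_threads take to fetch n_samples quotes
    start = time.perf_counter()
    with ThreadPool(n_threads) as pool:
        pool.map(fetch_quote, range(n_samples))
    return time.perf_counter() - start

if __name__ == "__main__":
    for n_threads in (1, 8, 16, 32, 64):
        elapsed = benchmark(1000, n_threads)  # keep n_samples modest out of courtesy
        print(f"{n_threads:>3} threads: {elapsed:.1f} s")
```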
Thanks for reading!