Parallelism in Python starter pack
Doing many things at the same time is considered highly efficient, at least when we’re talking about computing. If you are a software engineer, you will want to take advantage of this so the applications you build can run faster.
When you run a Hello World file in a terminal window, it runs in a single operating system process on one CPU core of your computer. Most likely the computer you're using has multiple cores, but the program you coded will run on just one of them.
Now suppose our program does something more complex than printing Hello World: making external requests for data, writing large files to disk, or running expensive mathematical calculations like machine learning. Here we have a much higher probability of hitting something called blocking. This happens when the program stops everything in our logic and waits for something to happen, like a network response or a disk write. We can use that waiting time for doing other things instead, reducing the total execution time and improving digital sobriety (the less time it takes for a program to run, the less electricity it uses, and the smaller the carbon footprint we leave).
We have all heard the words Parallelism and Multitasking. Although one might think they mean the same thing, that's far from the truth. We will explain each of them before jumping into the code.
Parallelism, also called Multiprocessing, is just like opening a new terminal window to run your program again, maybe with a different set of arguments to get a different output. Every program runs isolated in a different process, at the same time, on a different CPU core of the same machine. We can achieve this programmatically.
Concurrency or Multitasking, also known as Multithreading, is when the program takes care of running different things (in different threads) at the same time, giving the perception of parallelism within a single process thanks to inexpensive context switches. Some implementations, such as CPython (the most popular implementation of Python), allow multiple threads but prevent them from executing Python bytecode in parallel because of the Global Interpreter Lock (GIL).
Now that we've covered the basics of each type of asynchronous programming, we are ready to see some sample implementations. Although the examples are written in Python, the logic can be implemented in any programming language that supports the designated feature. There is tons of literature if you want to dig deeper into these topics.
To start, we will implement the main logic we want to parallelize: a simple function that calls the public GitHub API to get the list of repositories for a given username, using the popular requests library.
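A sketch of this function might look like the following (it assumes the requests library is installed; the timeout value and error handling are our own choices, not requirements):

```python
import requests  # third-party: pip install requests

def get_repos(username):
    """Return the public repository names for a GitHub user."""
    url = f"https://api.github.com/users/{username}/repos"
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return [repo["name"] for repo in response.json()]

# Example usage (hits the network):
# print(get_repos("torvalds"))
```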
Here we use the threading package to create multiple threads that execute our function to print repositories "at the same time". Let's remember that because of the GIL, only one thread per process actually runs Python code at any given moment (unless you decide to use a Python implementation without a GIL, such as Jython or IronPython). The join method is used to block the program flow until all the tasks are finished.
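A minimal sketch of the threading approach; `print_repos` here is a hypothetical stand-in for a call to `get_repos`, with `time.sleep` simulating the blocking network wait so the example runs offline:

```python
import threading
import time

def print_repos(username):
    # Stand-in for get_repos(username); the sleep simulates
    # the blocking wait on the network request.
    time.sleep(1)
    print(f"{username}: repos fetched")

usernames = ["torvalds", "gvanrossum", "kennethreitz"]

threads = [
    threading.Thread(target=print_repos, args=(name,))
    for name in usernames
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # block until every thread has finished
```

Because the threads spend their time waiting (not computing), the three 1-second waits overlap and the whole run takes roughly 1 second instead of 3.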
The usage is pretty similar to multithreading, except we use the multiprocessing package. Here you can programmatically start multiple processes that run the function on different CPU cores. Please take the number of cores into account: you don't want to run 1,000 processes at the same time on an 8-core processor. If that is your case, you may want to review the Pool class.
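A minimal sketch using the Pool class; `fetch_repos` and `fetch_all` are hypothetical names, and the network call is stubbed out so the example runs offline:

```python
import multiprocessing

def fetch_repos(username):
    # Stand-in for get_repos(username); a real worker would
    # call the GitHub API here.
    return f"{username}: repos fetched"

def fetch_all(usernames):
    # Pool defaults its worker count to os.cpu_count(), so we
    # never spawn more processes than the machine can run at once.
    with multiprocessing.Pool() as pool:
        return pool.map(fetch_repos, usernames)

if __name__ == "__main__":
    print(fetch_all(["torvalds", "gvanrossum"]))
```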
A real way to accomplish concurrency in a single process is with the asyncio package (available since Python 3.4). AsyncIO gives you the power of an event loop to execute multiple tasks in a pretty straightforward way, using the async and await keywords. It's worth mentioning that we need an HTTP library that supports asynchronous requests, so the requests-based get_repos function won't serve this purpose. For this, we can recommend aiohttp. Let's see how a get_repos_async function can be implemented with aiohttp:
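One possible implementation, assuming aiohttp is installed; sharing a single ClientSession and fanning the requests out with asyncio.gather is one common structure, not the only one:

```python
import asyncio
import aiohttp  # third-party: pip install aiohttp

async def get_repos_async(session, username):
    """Async counterpart of get_repos, reusing a shared session."""
    url = f"https://api.github.com/users/{username}/repos"
    async with session.get(url) as response:
        response.raise_for_status()
        data = await response.json()
        return [repo["name"] for repo in data]

async def main(usernames):
    async with aiohttp.ClientSession() as session:
        # gather schedules all coroutines on the event loop, so the
        # network waits overlap instead of running one after another.
        tasks = [get_repos_async(session, name) for name in usernames]
        return await asyncio.gather(*tasks)

# Example usage (hits the network):
# asyncio.run(main(["torvalds", "gvanrossum"]))
```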
This is a special mention that we didn't want to leave out of this post. Task Queues are implementations that allow the distribution of tasks (or jobs) across different processes, and even different machines, through dedicated Workers. They rely on a message broker to manage the queues, often a centralized in-memory key-value store; the most popular one is Redis. There are a lot of Task Queues out there, and you can find the one that fits your needs. If you want a recommendation to start testing this, check out Celery.
We visited a variety of ways to achieve parallelism and concurrency. They won't solve every performance problem an application can have, but it's important to know these concepts and when to apply them. All of the above solutions can also bring one or more downsides. The most important one to take into consideration is the race condition, where multiple tasks try to operate on a single resource at the same time, causing inconsistencies or unexpected changes. If you handle these kinds of edge cases, then you can enjoy the benefits of a highly performant solution.