Why Parallelism In Python Pandas May Be Hurting The Performance Of Your Programs

Published in affinityanswers-tech · 5 min read · May 31, 2021

Python Pandas is a very popular and powerful tool used in data analysis and manipulation. We at AffinityAnswers use it extensively in Data Analytics and Data Science in order to clean datasets as well as use the data in various statistical and analytical operations.

The focus of this story is the combination of Python multiprocessing with the Pandas module. Pandas has a complicated relationship with multiprocessing, since a large part of Pandas’ repertoire consists of “heavy-duty” functions: functions that require a large amount of processing power and memory (how much, of course, depends on the amount of data the function is working on).

To speed things up, Pandas chooses to internally parallelise certain operations. This seems like a good thing, right? Generally speaking, yes, this does help speed up our programs and get the best performance out of the system. However, what happens when we want to parallelise our own program that uses Pandas?

When we parallelise our program, and Pandas further splits each of these parallel processes into multiple threads, the total number of threads can exceed the computing capacity of the system (the number of CPU cores). This leads to a phenomenon known as thrashing, where all the threads compete for processing time and thus bring down the performance of the program.

In the following diagram, you can see the relationship between the degree of multiprocessing (number of threads or processes) and CPU Utilization.

This can result in no improvement in compute times for parallelised programs or, in some cases, can actually increase the time a parallelised program takes to run. This is the problem of performance plateauing that is sometimes caused by implicit parallelism in Python Pandas.
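To make the arithmetic concrete, here is a minimal sketch of how the thread count multiplies. The function name and the 4-core machine are hypothetical, chosen only for illustration; the point is that the total is the product of our own worker processes and the threads each one creates implicitly.

```python
import os


def oversubscription_factor(n_processes, threads_per_process, n_cores=None):
    """Return total runnable threads divided by available cores.

    A value above 1.0 means there are more threads than cores, so the
    threads have to take turns on the CPU.
    """
    if n_cores is None:
        # os.cpu_count() can return None on unusual platforms; fall back to 1.
        n_cores = os.cpu_count() or 1
    total_threads = n_processes * threads_per_process
    return total_threads / n_cores


# Hypothetical 4-core machine: 4 worker processes, each implicitly
# split into 4 threads by the library underneath.
print(oversubscription_factor(4, 4, n_cores=4))  # 16 threads on 4 cores

# The same 4 workers with implicit threading switched off.
print(oversubscription_factor(4, 1, n_cores=4))  # 4 threads on 4 cores
```

In the first case every core has four threads competing for it; in the second, each worker gets a core to itself.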

There is a very simple solution to this problem. All we have to do is tell Pandas to stop internally parallelising operations. The easiest way to do this is to set the following environment variable in the runtime environment of your program.

OMP_NUM_THREADS=1
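If you would rather not touch the shell, the variable can also be set from inside the program. The sketch below is one way to do it, not the only one: the helper name `limit_implicit_threads` is hypothetical, and it assumes the threaded library reads the variable when it is first loaded, which is why the call is placed before any Pandas or NumPy import.

```python
import os


def limit_implicit_threads(n_threads=1, variable="OMP_NUM_THREADS"):
    """Set the thread-limit variable for this process and return its value."""
    os.environ[variable] = str(n_threads)
    return os.environ[variable]


# Call this at the very top of the script, before pandas or numpy is imported.
limit_implicit_threads(1)

# import pandas as pd   # import only after the variable has been set
```

Worker processes started afterwards, for example with the multiprocessing module, inherit the parent's environment, so the limit should carry over to them as well.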

Here are a few links to help figure out how to set environment variables in your OS:

Here is an illustration of how this solution works, as shown by user “stuarteberg” on GitHub. You can see their full post using the link at the end of this story. They used the following code snippet in order to test performance:

# test.py
import numpy as np
import pandas as pd

a = np.random.randint(10, size=20_000_000)
s = pd.Series(a)

for _ in range(50):
    c = s.value_counts(sort=True)

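Here is the output of the above file when run with different values of OMP_NUM_THREADS. As shown in the following snippet, with OMP_NUM_THREADS=10 the “user” time (CPU time summed over all threads) rises from about 6.7 s to about 58.2 s, while the “real” (wall-clock) time barely changes, going from about 8.7 s to about 7.9 s. This gap between the “real” and “user” times shows us that the OMP_NUM_THREADS environment variable can control the implicit parallelism in Pandas.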

$ time OMP_NUM_THREADS=1 python test.py

real	0m8.714s
user	0m6.715s
sys	0m1.653s

$ time OMP_NUM_THREADS=10 python test.py

real	0m7.921s
user	0m58.198s
sys	0m7.345s
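One way to read these numbers is to divide the CPU time (user + sys) by the wall-clock time (real), which gives a rough estimate of how many cores were busy on average. The helper names below are hypothetical; the figures are copied from the two runs above.

```python
def to_seconds(stamp):
    """Convert a `time`-style stamp such as '0m8.714s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def average_busy_cores(real, user, system):
    """CPU time (user + sys) divided by wall-clock time (real)."""
    return (to_seconds(user) + to_seconds(system)) / to_seconds(real)


# Figures copied from the two runs above.
one_thread = average_busy_cores("0m8.714s", "0m6.715s", "0m1.653s")
ten_threads = average_busy_cores("0m7.921s", "0m58.198s", "0m7.345s")

print(f"OMP_NUM_THREADS=1:  {one_thread:.2f} cores busy on average")
print(f"OMP_NUM_THREADS=10: {ten_threads:.2f} cores busy on average")
```

The single-threaded run keeps just under one core busy, while the ten-thread run keeps a little over eight cores busy for a wall-clock time that is only about 9% shorter.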

In conclusion, solving the problem of implicit parallelism in Pandas takes a single step. However, there’s another aspect to the problem that I will be taking a look at.

While solving a problem may seem like a hard enough task, sometimes it is even harder to figure out that a problem actually exists. Implicit parallelism is a prime example of this. While easy to solve once you locate the issue, figuring out that implicit parallelism is the issue is not as simple.

There are a number of factors that may cause implicit parallelism to be an issue. Instead of taking a look at each of these, it is easier to simply see if your program is being limited by the presence of too many threads.

Here is a link to a post on the GitHub repository for Python Pandas which lists the Pandas functions that cause implicit parallelism to be used. If your program uses any of these functions, adding your own parallelism on top of them is likely to run into the limit described above.

If you’ve written a program and don’t want to go through a bunch of functions to figure out if it will be affected by parallelisation, there is another method to check whether your program is resulting in thrashing. This involves checking the number of processes or threads created internally by your program. This method only works on Linux and macOS, although there are similar methods for other operating systems too.

The tool that we will be using is called “htop”. This tool can easily be installed using:

sudo apt install htop   # Debian/Ubuntu
sudo yum install htop   # RHEL/CentOS

The interface of the htop command is as shown above. It is similar to the Task Manager on Windows, with a couple of extra features. At the top, it shows the number of CPU cores that your system has along with a usage gauge for each, as well as the amount of RAM being used and available. Below this, you can see information for all currently running processes. Finally, there is a menu to help you with other operations.

We can use htop in order to figure out how many processes or threads are running on the system. If the number of threads/processes being run by your program is greater than what was intended, it is likely that Pandas is implicitly parallelising your program and causing a reduction in performance.
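The same check can be done from inside the program. The sketch below is Linux-only and makes one assumption: that the `/proc` filesystem is available, in which case the kernel reports a process's thread count in the `Threads:` field of `/proc/<pid>/status`. The helper name is hypothetical. Note that `threading.active_count()` would not help here, since it only counts threads started through Python's own `threading` module.

```python
import os


def native_thread_count(pid=None):
    """Return the OS-level thread count of a process, or None if unavailable.

    Reads the "Threads:" field of /proc/<pid>/status (Linux only).
    """
    if pid is None:
        pid = os.getpid()
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("Threads:"):
                    return int(line.split()[1])
    except OSError:
        # No /proc filesystem (e.g. macOS) or the process has gone away.
        pass
    return None


print(native_thread_count())
```

Calling this before and after a heavy Pandas operation, once with OMP_NUM_THREADS unset and once with it set to 1, should give a rough idea of whether extra threads are being created behind your back.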

As is visible in the above image, there are five processes being run under the command “python3 testpandas.py”. One of these is the master process, and has spawned the other four processes in order to perform some function. The two processes in green are monitoring threads, and can be ignored. The four processes that are performing the operations of testpandas.py each use one core of my machine, and therefore, I am able to use 100% of my system power. However, if this number went up, that would result in competition among the processes and, hence, thrashing.

Check out this guide if you want more information on htop: https://spin.atomicobject.com/2020/02/10/htop-guide/

Also, if you want more information on Pandas and implicit parallelism in Pandas, check out this issue on their GitHub repository:
https://github.com/pandas-dev/pandas/issues/23139

Written by Shreyas Sridhar