Boosting Python Pandas Performance: Harnessing the Power of Parallel Computing
Python Pandas can be accelerated with parallel computation using Pandarallel
Introduction
As discussed in our previous blog article, one reason Pandas is less efficient than some other Python libraries, such as Polars, is that it cannot use multiple cores in parallel for its calculations. So, if your system has multiple CPU cores, can we accelerate Pandas with parallel computation?
Of course! Pandarallel can help.
Pandarallel provides a simple way to parallelize your Pandas operations on all your CPUs by changing only one line of code. It also displays progress bars.
Installation & Initiation
Similar to many other Python packages, you can use “pip install” to install Pandarallel:
pip install pandarallel
After installing Pandarallel, import Pandas, NumPy, Pandarallel, and the time module into your Python project. NumPy is needed to generate the random test data used in the examples below.
import time
import numpy as np
import pandas as pd
from pandarallel import pandarallel
To enable Pandarallel to use parallel computing, you need to initialize it first:
pandarallel.initialize()
For instance, if your system has 10 cores, you can set the number of workers to 10 through the nb_workers argument of initialize(). Pandarallel will then spread your Pandas operations across all 10 cores, which can significantly reduce the time needed to complete them.
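As a minimal sketch of the worker setting described above, the snippet below caps a requested worker count at the number of CPUs the machine reports and passes it to pandarallel.initialize() through nb_workers. Setting progress_bar=True turns on the progress bars mentioned earlier. The helper name pick_worker_count is hypothetical and not part of Pandarallel, and the import is guarded so the snippet still runs where Pandarallel is not installed.

```python
import os


def pick_worker_count(requested=None):
    """Return a worker count between 1 and the CPU count reported by the OS.

    Hypothetical helper for illustration; it is not part of Pandarallel.
    """
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    if requested is None:
        return available
    return max(1, min(requested, available))


nb_workers = pick_worker_count(10)  # ask for 10 workers, as in the example above

try:
    from pandarallel import pandarallel
except ImportError:
    print("pandarallel is not installed; run `pip install pandarallel` first")
else:
    # nb_workers sets how many worker processes are used;
    # progress_bar=True displays a progress bar while the work runs.
    pandarallel.initialize(nb_workers=nb_workers, progress_bar=True)
```

Note that os.cpu_count() reports logical CPUs, so on machines with hyper-threading it may be larger than the number of physical cores.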
Example & Speedtest
Pandarallel can be applied to both DataFrame and Series objects in Pandas. In this article, we focus on DataFrames. Specifically, we explore how Pandarallel's parallel versions of apply() and groupby.apply() can be used to parallelize Pandas operations.
In addition to apply() and groupby.apply(), Pandarallel can also be used with other operations such as applymap(), groupby.rolling.apply(), and groupby.expanding.apply(). These let you apply parallel computing to a range of common Pandas operations, including rolling-window and expanding-window calculations, as well as more complex groupby() operations.
apply()
Before we can apply any Pandarallel functions, we’ll need to define the function we want to apply to our DataFrame.
The function specifies the calculation or operation we want to perform on our data. Then, we can use Pandarallel to parallelize its application across multiple cores, significantly reducing the time required to execute the operation.
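Here, we create a DataFrame with 100,000 rows and define the function to apply:
# Build a test DataFrame: random integers in [1, 8) and random floats in [0, 1)
df_size = int(1e5)
df = pd.DataFrame(dict(a=np.random.randint(1, 8, df_size),
                       b=np.random.rand(df_size)))

# With axis=1, x is one row of the DataFrame (a Series holding a and b)
def func(x):
    return x**2 + x*3 + x/2 + 5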
With the function defined, we can compare the time needed by the normal Pandas apply() and by Pandarallel's parallel_apply().
This benchmark shows whether parallel computing pays off for this specific operation once the additional overhead of setting up parallelization is included. The %%time cell magic used below is available in Jupyter notebooks.
%%time
res = df.apply(func, axis=1)
%%time
res_parallel = df.parallel_apply(func, axis=1)
In our benchmark, Pandarallel gave a significant performance boost for this operation. The normal Pandas apply() took 12.3 seconds, while the same operation with Pandarallel took only 1.51 seconds, roughly 8 times faster.
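The %%time magic only works inside IPython or Jupyter. As a hedged sketch for plain Python scripts, the time module imported earlier can do the same job. The helpers timed and speedup are hypothetical names introduced here for illustration, and the figures passed to speedup are the two apply() timings reported in this article.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(baseline_seconds, parallel_seconds):
    """Return how many times faster the parallel run was."""
    return baseline_seconds / parallel_seconds


# With df and func defined as above, usage would look like:
#   res, t_serial = timed(df.apply, func, axis=1)
#   res_parallel, t_parallel = timed(df.parallel_apply, func, axis=1)
#   print(speedup(t_serial, t_parallel))

# Using the timings reported in this article:
print(f"apply(): {speedup(12.3, 1.51):.1f}x faster")  # prints "apply(): 8.1x faster"
```

Timings from a single run can vary noticeably between machines and runs, so repeating the measurement a few times gives a more reliable comparison.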
groupby.apply()
As with apply(), we can also compare the performance of the normal Pandas groupby.apply() with Pandarallel's parallel version.
By benchmarking both methods, we can determine whether parallel computing provides a performance boost for this specific operation, and assess the potential benefits of using Pandarallel more broadly in our data science work.
df_size = int(2e8)
df = pd.DataFrame(dict(a=np.random.randint(1, 1000, df_size),
                       b=np.random.rand(df_size)))
def func(df):
    dum = 0
    for item in df.b:
        dum += item**2
    return dum / len(df.b)
%%time
res = df.groupby("a").apply(func)
%%time
res_parallel = df.groupby("a").parallel_apply(func)
In our benchmark, Pandarallel also sped up the groupby.apply() operation, although by a smaller margin. The normal Pandas version took 36.2 seconds, while the same operation with Pandarallel took 27.5 seconds, roughly 1.3 times faster.
Together, the apply() and groupby.apply() results highlight the potential benefits of parallel computing for large-scale data operations, and also show that the size of the gain depends on the operation: about 8x for apply() but only about 1.3x for groupby.apply() in our tests. By using multiple cores, Pandarallel lets us run complex calculations and analyses more efficiently.
The same comparison can be made between Pandarallel's groupby.rolling.apply() and groupby.expanding.apply() and their normal Pandas counterparts to evaluate the benefits of parallel computing for these operations.
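Because func walks through each group with a Python loop, it is worth confirming what it computes before timing it: the mean of the squared values of b within each group. The sketch below checks this on a small DataFrame using plain Pandas only (no Pandarallel). The reduced row count, the fixed random seed, and the vectorized comparison are assumptions added for this illustration and are not part of the benchmark above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the check is repeatable
df_small = pd.DataFrame(dict(a=rng.integers(1, 5, 1_000),
                             b=rng.random(1_000)))


def func(df):
    # Same function as in the benchmark: mean of b squared within one group
    dum = 0
    for item in df.b:
        dum += item**2
    return dum / len(df.b)


# Loop-based result: one value per distinct value of column "a"
looped = df_small.groupby("a").apply(func)

# Equivalent vectorized calculation, used here only as a cross-check
vectorized = (df_small["b"] ** 2).groupby(df_small["a"]).mean()

print("results match:", bool(np.allclose(looped.to_numpy(), vectorized.to_numpy())))
```

Recent Pandas versions may emit a warning about grouping columns when groupby().apply() is called this way; it does not affect the result here because func only reads column b.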
Conclusion
Now that we’ve explored how to use Pandarallel with Pandas, you should have a better understanding of how parallel computing can help to optimize your Pandas operations!
By leveraging the power of multiple cores, you can significantly reduce the time required to execute complex calculations and analyses, allowing you to work more efficiently and effectively with large datasets.
With Pandarallel, you can unlock the full potential of Pandas, and take your data science skills to the next level.