Speed Up Pandas Performance
Pandas is a popular tool for data analysts and scientists working with large datasets. It is known for its simplicity, efficiency, and clarity, and it provides a high-performance, easy-to-use interface through the DataFrame, which lets users manipulate and analyze tabular data seamlessly.
When dealing with large datasets, pandas excels at handling complex data types and performing operations such as filtering, grouping, and aggregation with remarkable speed.
Speaking of speed, we often use apply() or a plain loop to run a set of cleaning or aggregation instructions. I ran a couple of experiments on how to speed up pandas performance.
Here we have three options:
- Series, a native pandas type
- NumPy ndarray
- the Just-In-Time (JIT) compiler Numba
When we iterate using apply(), we indirectly iterate over rows or columns to aggregate. If the dataset is small to medium-sized, the result may appear fast, almost instant. But if you are dealing with millions of rows and tens of columns, you start waiting for the process to complete.
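As a tiny illustration (the DataFrame and numbers here are made up for this example), row-wise apply() calls your Python function once per row, while the vectorized equivalent does the same work in a single call into optimized C code:

```python
import pandas as pd

# Small illustrative DataFrame
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Row-wise apply: pandas calls the Python function once per row
row_sums_apply = df.apply(lambda row: row.sum(), axis=1)

# Vectorized equivalent: one call, no per-row Python overhead
row_sums_vec = df.sum(axis=1)

print(row_sums_apply.equals(row_sums_vec))  # True
```

On a few rows the difference is invisible; on millions of rows the per-row Python call overhead dominates.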
I began the experiment with randomly generated numbers of shape (100000, 5), applying the rolling() window function and a custom softmax function to the data.
Import Libraries
import pandas as pd
import numpy as np
import numba
Check module version
Generate random numbers: 100K rows and 5 columns
X = np.random.rand(100000, 5)
d_X = pd.DataFrame(X, columns=[f'c_{i}' for i in range(5)])
Apply the rolling() window function
roll = d_X.rolling(5)
Then we create a moving-average function
def moving_avg(x):
return x.mean()
Here’s the entire code to benchmark pandas iteration, Numba, and Cython
Pandas Series
%timeit -n 1 -r 1 roll.apply(moving_avg)
Pandas using NumPy
%timeit -n 1 -r 1 roll.apply(moving_avg, raw=True)
Numba (run twice: the first run includes JIT compilation time, the second reuses the cached compiled function)
%timeit -n 1 -r 1 roll.apply(moving_avg, engine='numba', raw=True)
%timeit -n 1 -r 1 roll.apply(moving_avg, engine='numba', raw=True)
Numba with multiple threads (again timed twice)
numba.set_num_threads(4)
%timeit -n 1 -r 1 roll.apply(moving_avg, engine='numba', raw=True, engine_kwargs={"parallel": True})
%timeit -n 1 -r 1 roll.apply(moving_avg, engine='numba', raw=True, engine_kwargs={"parallel": True})
Cython
%timeit -n 1 -r 1 roll.apply(moving_avg, engine='cython')
%timeit -n 1 -r 1 roll.apply(moving_avg, engine='cython', raw=True)
The result is eye-opening. By passing raw=True, you save a lot of time, almost 17 times faster in my run.
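The reason raw=True helps is that the function receives each window as a plain NumPy ndarray instead of a pandas Series, skipping the Series-construction overhead per window. A small sketch to see this (the recording function here is made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
seen_types = []

def record_type(x):
    # Remember what type each window arrives as
    seen_types.append(type(x))
    return x.mean()

# Default: each window is passed as a pandas Series
s.rolling(2).apply(record_type)
# raw=True: each window is passed as a plain NumPy ndarray
s.rolling(2).apply(record_type, raw=True)

print(seen_types[0], seen_types[-1])
```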
Custom Function
I created a custom function applying softmax, which is commonly used in deep learning as an activation function.
def softmax(logit):
e_x = np.exp(logit)
result = e_x / e_x.sum()
return result.argmax()
%timeit -n 1 -r 1 d_X.apply(softmax, axis=1)
%timeit -n 1 -r 1 d_X.apply(softmax, axis=1, raw=True)
@numba.njit
def softmax(logit):
e_x = np.exp(logit)
result = e_x / e_x.sum()
return result.argmax()
@numba.njit
def apply_softmax(arr):
    # Preallocate a typed array; Numba deprecates reflected Python lists
    labels = np.empty(arr.shape[0], dtype=np.int64)
    for i in range(arr.shape[0]):
        labels[i] = softmax(arr[i])
    return labels
%timeit -n 1 -r 1 apply_softmax(d_X.to_numpy())
Using Numba is about twice as fast.
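For this particular task, it is worth noting that a fully vectorized NumPy version, with no per-row loop at all, is also possible; a sketch (function name is my own):

```python
import numpy as np

def softmax_argmax_vectorized(arr):
    # Exponentiate every element, then normalize each row so it sums to 1
    e_x = np.exp(arr)
    probs = e_x / e_x.sum(axis=1, keepdims=True)
    # Index of the largest probability in each row
    return probs.argmax(axis=1)

X = np.random.rand(1000, 5)
labels = softmax_argmax_vectorized(X)
print(labels.shape)  # (1000,)
```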
Numba with Parallel
I tried to parallelize the function, but unfortunately it raised a warning. I haven’t figured out why, or how to resolve it.
Conclusion
This article shows you how to speed up pandas performance in a brief and straightforward way. I hope you can reproduce what I did here and explore other ways to process DataFrames faster.
I enjoy exploring this kind of thing: speed-ups, rarely used function parameters, and so on. If you found this article helpful, hit the clap button and follow me for more. 🍻