Speed Up Pandas Performance

AC
Data Folks Indonesia
Dec 4, 2023
(Cover image generated by SDXL)

Pandas is a popular tool for data analysts and scientists working with large datasets. It has a rich set of features and is known for its simplicity, efficiency, and clarity. It provides a high-performance, easy-to-use interface built around the DataFrame, which lets users manipulate and analyze tabular data seamlessly.

When dealing with large datasets, pandas excels at handling complex data types and performing operations such as filtering, grouping, and aggregation with remarkable speed.

Speaking of speed, we often use apply() or a plain loop to run a set of instructions for cleaning or aggregating. I ran a couple of experiments on how to speed up Pandas performance.

Here we have 3 options (illustrated in the sketch after this list):

  1. Series, the type native to Pandas
  2. NumPy ndarray
  3. A Just-In-Time (JIT) compiler called Numba
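For orientation, here is a tiny sketch (my own illustration, not from the benchmark) of what each option looks like in code:

import pandas as pd
import numpy as np
import numba

s = pd.Series([1.0, 2.0, 3.0])  # 1. pandas Series
a = s.to_numpy()                # 2. NumPy ndarray backing the Series

@numba.njit                     # 3. Numba JIT-compiles this function to machine code
def total(arr):
    return arr.sum()

total(a)  # the first call compiles; later calls reuse the compiled code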

When we iterate using apply(), we indirectly iterate over rows or columns to aggregate. The result may feel fast, instant, within the blink of an eye, if the dataset is small to medium sized.

But if you are dealing with millions of rows and tens of columns, you start to wait for the process to complete.

I began the experiment with randomly generated numbers of shape (100000, 5), applying the rolling() window function and a custom softmax function to the data.

Import Libraries

import pandas as pd
import numpy as np
import numba

Check the module versions
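The original snippet for this step isn't shown; a minimal way to do it, assuming the imports above, is:

print(pd.__version__)
print(np.__version__)
print(numba.__version__)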

Generate random numbers, here 100K rows and 5 columns

X = np.random.rand(100000, 5)
d_X = pd.DataFrame(X, columns=[f'c_{i}' for i in range(5)])

Apply rolling() window function

roll = d_X.rolling(5)

Then we create a moving-average function. (Note that for this particular case the built-in roll.mean() would be much faster; we use apply() here precisely to benchmark user-defined functions.)

def moving_avg(x):
    return x.mean()

Here’s the entire code to benchmark apply() using a Pandas Series, raw NumPy arrays, Numba, and Cython.

Pandas Series

%timeit -n 1 -r 1 roll.apply(moving_avg)

Pandas using Numpy

%timeit -n 1 -r 1 roll.apply(moving_avg, raw=True)

Numba

The benchmark is run twice because the first call includes Numba’s JIT compilation time; the second call reuses the compiled function.

%timeit -n 1 -r 1 roll.apply(moving_avg, engine='numba', raw=True)

%timeit -n 1 -r 1 roll.apply(moving_avg, engine='numba', raw=True)

Numba multithreads

Again, the benchmark is run twice so that the second measurement excludes compilation time.

numba.set_num_threads(4)

%timeit -n 1 -r 1 roll.apply(moving_avg, engine='numba', raw=True, engine_kwargs={"parallel": True})

%timeit -n 1 -r 1 roll.apply(moving_avg, engine='numba', raw=True, engine_kwargs={"parallel": True})

Cython (the default engine when none is specified)

%timeit -n 1 -r 1 roll.apply(moving_avg, engine='cython')

%timeit -n 1 -r 1 roll.apply(moving_avg, engine='cython', raw=True)

The result is eye-opening. Simply by passing raw=True you save a lot of time, almost 17 times faster, because pandas hands your function a plain NumPy ndarray instead of constructing a Series for every window.
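You can see what raw=True changes by checking the type each window arrives as (a small check of my own, not from the original benchmark):

s = pd.Series(np.arange(10.0))

# raw=False (default): each window is passed as a pandas Series
s.rolling(2).apply(lambda x: float(isinstance(x, pd.Series)))             # all 1.0

# raw=True: each window is passed as a plain NumPy ndarray
s.rolling(2).apply(lambda x: float(isinstance(x, np.ndarray)), raw=True)  # all 1.0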

Custom Function

I created a custom function applying softmax, which is commonly used in deep learning as an activation function: it maps a vector of logits x to probabilities exp(x_i) / Σ_j exp(x_j). Here the function returns the index of the largest probability.

def softmax(logit):
    e_x = np.exp(logit)
    result = e_x / e_x.sum()
    return result.argmax()

%timeit -n 1 -r 1 d_X.apply(softmax, axis=1)

%timeit -n 1 -r 1 d_X.apply(softmax, axis=1, raw=True)

@numba.njit
def softmax(logit):
    e_x = np.exp(logit)
    result = e_x / e_x.sum()
    return result.argmax()

@numba.njit
def apply_softmax(arr):
    # Collect the argmax label for each row inside compiled code
    labels = []
    for row in arr:
        label = softmax(row)
        labels.append(label)

    return labels

%timeit -n 1 -r 1 apply_softmax(d_X.to_numpy())

Using Numba gives a further 2x speedup.

Numba with Parallel

I tried to parallelize the function, but unfortunately it raised a warning. I haven’t figured out why, or how to resolve it.
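One likely culprit (my guess, not confirmed in the original experiment) is the Python list built inside the njit function: Numba commonly warns that no parallel transformation was possible when the loop carries a reflected list. A sketch that preallocates a NumPy array and uses numba.prange instead, which usually lets the parallel transform succeed:

@numba.njit(parallel=True)
def apply_softmax_parallel(arr):
    # Preallocate the output so each iteration writes to its own slot,
    # letting Numba split the row loop across threads
    labels = np.empty(arr.shape[0], dtype=np.int64)
    for i in numba.prange(arr.shape[0]):
        labels[i] = softmax(arr[i])
    return labels

%timeit -n 1 -r 1 apply_softmax_parallel(d_X.to_numpy())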

Conclusion

This article shows you how to speed up Pandas performance in a brief and straightforward way. I hope you can reproduce what I did here, and keep learning other ways to process DataFrames faster.

I enjoy exploring this kind of thing: speedups, rarely used function parameters, and so on. If you find this article helpful, hit the clap button and follow me for more. 🍻
