6 ways to significantly speed up Pandas with a couple lines of code. Part 1

In this article I will tell you about six tools that can significantly speed up your pandas code. For most tools, just install the module and add a couple lines of code.

Magomed Aliev
Published in The Startup
6 min read, May 25, 2020


Pandas has long been an indispensable tool for any developer thanks to its simple, understandable API and its rich set of tools for cleaning, exploring and analyzing data. That would be enough, except that when the data does not fit into RAM or requires complex calculations, pandas performance falls short.

In this article, I will not describe qualitatively different approaches to data analysis, such as Spark or DataFlow. Instead, I will describe six interesting tools and demonstrate the results of their use:

Chapter 1:

  • Numba
  • Multiprocessing
  • Pandarallel

Chapter 2:

  • Swifter
  • Modin
  • Dask

Numba

This tool accelerates Python code itself. Numba is a JIT compiler that works best on loops, mathematical operations and NumPy, the library pandas is built on. Let’s check in practice what advantages it gives.

We will simulate a typical situation: you need to add a new column by applying a function to an existing one with the apply method.

import pandas as pd
import numpy as np
import numba

# create a table of 100,000 rows and 4 columns filled with random integers from 0 to 99
df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 4)), columns=['a', 'b', 'c', 'd'])

# function for creating the new column
def multiply(x):
    return x * 5

# optimized version of this function
@numba.vectorize
def multiply_numba(x):
    return x * 5

As you can see, you do not need to change anything in your code. Just add a decorator. Now look at the results:

# our function
In [1]: %timeit df['new_col'] = df['a'].apply(multiply)
23.9 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops
