6 ways to significantly speed up Pandas with a couple lines of code. Part 1
Pandas has long been an indispensable tool for developers thanks to its simple, understandable API and its rich set of tools for cleaning, exploring, and analyzing data. That would be enough, except that when the data does not fit into RAM or requires complex calculations, pandas performance falls short.
In this article, I will not cover qualitatively different approaches to data analysis, such as Spark or DataFlow. Instead, I will describe six interesting tools and demonstrate the results of using them.
Numba directly accelerates Python itself. It is a JIT compiler that likes loops, mathematical operations, and NumPy, the library at the core of Pandas. Let’s check in practice what advantages it gives.
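To make the "likes loops" point concrete, here is a minimal sketch of the kind of code Numba is designed for: an explicit Python loop over a NumPy array. The function name `sum_of_squares` and the sample data are illustrative and not part of the original article. The fallback decorator is only an assumption so that the snippet still runs where Numba is not installed (without any speed-up in that case).

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback (assumption for portability): a no-op decorator, so the
    # function runs as ordinary Python when Numba is unavailable.
    def njit(func):
        return func


@njit
def sum_of_squares(values):
    # An explicit loop with simple arithmetic on a NumPy array:
    # slow in plain Python, but a good fit for JIT compilation.
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * values[i]
    return total


data = np.arange(1_000, dtype=np.float64)
result = sum_of_squares(data)  # with Numba, the first call triggers compilation
print(result)
```

Because compilation happens on the first call, any timing of a JIT-compiled function should exclude that first call.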
We will simulate a typical situation: you need to add a new column by applying some function to an existing one using the apply method.
import pandas as pd
import numpy as np
import numba

# create a table of 100,000 rows and 4 columns filled with random integers from 0 to 99
df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 4)), columns=['a', 'b', 'c', 'd'])

# function for creating the new column
def multiply(x):
    return x * 5

# optimized version of this function
# assumption: numba.vectorize is the decorator, and multiply_numba is an illustrative name
@numba.vectorize
def multiply_numba(x):
    return x * 5
As you can see, you do not need to change anything in your code. Just add a decorator. Now look at the results:
# our function
In : %timeit df['new_col'] = df['a'].apply(multiply)
23.9 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops…
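The `%timeit` magic only works inside IPython or Jupyter. As a rough sketch, the same kind of measurement can be approximated in a plain script with the standard `timeit` module. The `repeat=7` and `number=10` values mirror the "7 runs, 10 loops" shown in the output above. The variable names `timings` and `per_loop_ms` are illustrative, and the numbers you get will depend on your machine.

```python
import timeit

import numpy as np
import pandas as pd


def multiply(x):
    return x * 5


# Same shape of data as in the article: 100,000 rows, 4 integer columns.
df = pd.DataFrame(
    np.random.randint(0, 100, size=(100_000, 4)),
    columns=["a", "b", "c", "d"],
)

# Each of the 7 runs executes the statement 10 times and returns the total time.
timings = timeit.repeat(lambda: df["a"].apply(multiply), repeat=7, number=10)

# Convert each run's total to milliseconds per single execution.
per_loop_ms = [t / 10 * 1000 for t in timings]
print(f"{np.mean(per_loop_ms):.1f} ms ± {np.std(per_loop_ms):.2f} ms per loop")
```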