Is Pandas really that slow?
Pandas is one of the most useful tools for data science: it makes data exploration, understanding, and manipulation intuitive and easy.
That being said, Pandas seems to have one big drawback: it’s SLOW. Or at least that is how it seems. While taking my first steps into data science, I found Pandas to be very useful for data exploration but almost unusable for heavy computational tasks. After understanding Pandas more thoroughly and gaining some experience, I figured out that in most cases, Pandas is anything but slow.
In this article I am going to go through a simple but common example that makes Pandas look slow, but only because it is being used incorrectly.
What makes Pandas appear slow?
As shown in the article above, which I recommend reading, some basic operations, such as indexing, can be 50-100 times slower in Pandas than in NumPy, one of Pandas' main alternatives.
That is somewhat surprising, considering Pandas is built on top of NumPy.
Like NumPy, Pandas is not really a Python package at its core. Pandas is written in Cython and C, which are significantly faster than Python, and it mostly uses Python as an API. Treating Pandas dataframes as ordinary Python data structures therefore throws away the advantages of its optimized C code.
Usual approaches such as for loops and while loops are inefficient because they repeatedly switch between Python and C code. Pandas is slow at retrieving individual values, and Python, generally, is slow at mathematical computation. Constantly retrieving data from Pandas and computing on it in Python means using both tools exactly where they are weak.
What does optimized Pandas code look like?
Optimized Pandas code does as much computation as possible with the Pandas and NumPy toolsets, using techniques such as the apply method and vectorization.
So let’s talk about the apply method, one of Pandas' most useful methods. It takes a function and applies it along an axis of the dataframe: to every row, to every column, or, when called on a single column, to every value.
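As a quick illustration of the axis parameter, here is a minimal sketch on a toy dataframe (not the benchmark dataframe used below):

import pandas as pd

tiny = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
tiny.apply(lambda col: col.sum())           # axis=0 (default): the function receives each column -> a: 3, b: 7
tiny.apply(lambda row: row.sum(), axis=1)   # axis=1: the function receives each row -> 0: 4, 1: 6
tiny['a'].apply(lambda value: value * 10)   # on a single column, it receives each value -> 10, 20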
To demonstrate, I will show how using optimized code can reduce the run time of a simple operation from ~22.8 seconds to ~0.001 seconds, and even shorten the code, without using any third-party tools.
We will use a dataframe with 3 columns and 100,000 rows, containing only integers:
import pandas as pd
df = pd.DataFrame(data={'a': list(range(100000)),
                        'b': list(range(100000, 200000)),
                        'c': list(range(200000, 300000))})
df.shape  # returns (100000, 3)
Let's do a simple operation: creating a fourth column that contains the result of multiplying the other three columns together.
First, let's use the usual Python way, a loop:
for i in df.index:
    df.loc[i, 'd'] = df.loc[i, 'a'] * df.loc[i, 'b'] * df.loc[i, 'c']
This code works perfectly fine and is fairly simple. Running it takes 22.8 seconds, which is quite slow for such a simple operation on a sizable but not unusually large dataset.
Now let's do the same operation using a lambda function and the apply method.
If you are unfamiliar with lambda functions, I highly recommend researching them, but in essence they are anonymous functions that behave just like normal input-output functions.
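For example, these two hypothetical definitions behave the same way:

def triple(x):
    return x * 3

triple_lambda = lambda x: x * 3
triple(5) == triple_lambda(5)  # True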
df['d'] = df.apply(lambda x: x['a'] * x['b'] * x['c'], axis=1)
This code is also pretty simple, but instead of a 22.8-second runtime, it only takes 0.95 seconds.
Pandas sends one row at a time (a pd.Series object) to the lambda function, which in turn performs the calculation and returns an integer that is assigned to the new column. The difference is that Pandas does this iteration within its own optimized Cython code.
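You can verify what the lambda receives with a quick throwaway check (not part of the benchmark):

df.head(1).apply(lambda x: print(type(x)), axis=1)
# prints: <class 'pandas.core.series.Series'>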
0.95 seconds is reasonable for this operation, but still far from optimal.
Now we go into one of Pandas' most powerful tools, vectorization. Pandas has the capability to run calculations on an entire vector instead of single values.
A vector is a one-dimensional array of numbers or objects; in this case, a row or a column. Using vector operations lets Pandas do the entire calculation within its own optimized library code.
Here is the same operation from above but in the vectorized form:
df['d'] = df['a'] * df['b'] * df['c']
This code is so simple that it looks like it shouldn't work, and if we were dealing with plain Python lists it wouldn't, but with Pandas it does.
On the same dataframe as before, this calculation took:
- iterating through indexes: ~22.8 seconds
- apply method with lambda function: ~0.9 seconds
- vector operation: ~0.001 seconds
Vectorized computation can be taken much further than this simple use case. For example, vectors can be used only on rows that match certain criteria. Let's do the same operation as before, but only on rows where 'a' is even.
import numpy as np

df['d'] = np.nan  # the column's default value
mask = df['a'] % 2 == 0  # choose the rows where 'a' is even
df.loc[mask, 'd'] = df.loc[mask, 'a'] * df.loc[mask, 'b'] * df.loc[mask, 'c']
This operation takes ~0.006 seconds.
mask = df['a'] % 2 == 0
The line above is the crucial part. It returns a Pandas Series object with the same length and index as our dataframe, holding boolean values: True in rows that meet the condition and False in rows that don't. This is commonly called a boolean mask. When the mask is fed into df.loc[], it returns only the rows where the mask is True. In this case, the same calculation as before is applied:
df['d'] = df['a'] * df['b'] * df['c']
But instead of being applied to the entire dataframe, it is applied only to the rows that meet the predetermined condition ('a' is an even number).
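To see what the mask actually looks like, you can inspect its first few values (illustrative output for the dataframe above):

mask = df['a'] % 2 == 0
mask.head()
# 0     True
# 1    False
# 2     True
# 3    False
# 4     True
# Name: a, dtype: bool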
Those masks can also work with bitwise operators to check for multiple conditions:
mask = (df['a'] % 2 == 0) & (df['b'] > 150000) | (df['c'] % 10 == 2)
This mask chooses rows in which either ‘a’ is an even number and ‘b’ is greater than 150,000, or ‘c’ leaves a remainder of 2 when divided by 10. The symbols & and | do the same thing as the logical operators and, or, but element-wise on a vector. Just as earlier, all rows that meet those conditions have the value True in the mask, and the ones that don't have the value False.
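One caveat worth knowing: & and | bind more tightly than comparison operators, so the parentheses around each condition are required. A sketch of what goes wrong without them:

# Without parentheses, Python parses `0 & df['b']` first, and the resulting
# chained comparison then tries to evaluate a whole Series as a single boolean:
mask = df['a'] % 2 == 0 & df['b'] > 150000
# raises ValueError: The truth value of a Series is ambiguous...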
If you wish to replicate the results, here is the code to do so:
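(The snippet below consolidates the code from this article into one runnable script; I've added simple timing with time.perf_counter. Absolute numbers will vary by machine.)

import time

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'a': list(range(100000)),
                        'b': list(range(100000, 200000)),
                        'c': list(range(200000, 300000))})

# 1. Plain Python loop
start = time.perf_counter()
for i in df.index:
    df.loc[i, 'd'] = df.loc[i, 'a'] * df.loc[i, 'b'] * df.loc[i, 'c']
print('loop:      ', time.perf_counter() - start)

# 2. apply with a lambda
start = time.perf_counter()
df['d'] = df.apply(lambda x: x['a'] * x['b'] * x['c'], axis=1)
print('apply:     ', time.perf_counter() - start)

# 3. Vectorized multiplication
start = time.perf_counter()
df['d'] = df['a'] * df['b'] * df['c']
print('vectorized:', time.perf_counter() - start)

# 4. Vectorized multiplication on masked rows only
start = time.perf_counter()
df['d'] = np.nan
mask = df['a'] % 2 == 0
df.loc[mask, 'd'] = df.loc[mask, 'a'] * df.loc[mask, 'b'] * df.loc[mask, 'c']
print('masked:    ', time.perf_counter() - start)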
How to speed up Pandas even further?
There are cases in which optimizing the Pandas code is not sufficient. Some calculations cannot be vectorized and must be done by Python itself, with either a custom function or another library. When you need such computations, the biggest bottleneck in the code might be Pandas' slow indexing.
The to_numpy method allows quick access to the underlying NumPy array, which can be useful since NumPy is generally faster than Pandas, especially at indexing.
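A minimal sketch of the idea (the speedup is indicative, not measured here):

arr = df['a'].to_numpy()  # the column's underlying NumPy array

# Repeated scalar access is much cheaper on the ndarray than on the dataframe:
x = df.loc[50000, 'a']  # Pandas indexing
y = arr[50000]          # NumPy indexing, typically many times faster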
If that still proves insufficient, it might be time to look for other solutions, for example parallelizing the apply method with third-party tools like Modin. Keep in mind that parallelizing does not make the code itself more efficient; it just makes the computer put more resources into running it, and it comes with downsides of its own (overhead, memory usage, and added complexity), so the benefit depends on the workload.
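For reference, Modin is designed as a drop-in replacement, so switching is typically a one-line change (this sketch assumes an installed execution engine such as Ray; see Modin's docs for setup details):

import modin.pandas as pd  # instead of `import pandas as pd`

# The rest of the code stays the same; operations are distributed across cores:
df = pd.DataFrame({'a': range(100000)})
df['b'] = df['a'] * 2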
Conclusion
The point of this article is to show that, in most cases, if Pandas is too slow for you, you are probably using it incorrectly.
Pandas can run many types of dataset operations at a reasonably large scale with highly optimized performance and only a few lines of code, but it is not just another Python package and shouldn't be used as one.
There are many guides online on using Pandas correctly and taking full advantage of its feature set. Here are a few of the ones I found the most useful:
- Link: cheat sheet and simple API usage
- Link: enhancing performance, an optimization example by the Pandas dev team
- Link: a deeper dive into vectorization in Pandas and NumPy
- Link: a 30-minute PyCon lecture about optimizing Pandas code
- Link1, Link2, Link3: tricks, tips, and third-party tools to speed up Pandas