6 Steps to Make this Pandas Dataframe Operation 100 Times Faster
Cython for Data Science: Combine Pandas with Cython for an incredible speed improvement
In this article you’ll learn how to speed up Pandas’ df.apply() function by more than 100x. We take the standard dataframe.apply function and upgrade it with a bit of Cython, cutting execution time from 3 minutes to under 2 seconds. At the end of this article you’ll:
- understand why df.apply() is slow
- understand how to speed up the apply with Cython
- know how to replace the apply by passing the array to Cython
- be able to multi-process your Cython function to squeeze out the maximum amount of speed
- annoy your coworkers with the fact that your code is always so much faster than theirs
Before we begin, I highly recommend reading this article about why Python is so slow; it helps you understand the type of problem we’re trying to solve here. You can also check out this article about getting started with Cython.
Isn’t Pandas already pretty fast?
True, it’s built on NumPy, which is written in C, so it’s very fast. df.apply, however, applies a Python…