A Rising Library Beating Pandas in Performance

Comparing the performance of pypolars and pandas

Ezz El Din Abdullah
Nov 17 · 4 min read
Image for post
Image for post
Photo by shiyang xu on Unsplash

pandas was initially released in 2008 written in Python, Cython, and C. Today, we’re comparing the performance of this well-known library with pypolars, a rising DataFrame library written in Rust. We compare the two while sorting and concatenating a 25Mil-record data and also when joining two CSVs.

Downloading Reddit Usernames data

Let’s first download a CSV file that contains ~26 million reddit usernames from Kaggle: https://www.kaggle.com/colinmorris/reddit-usernames

And let’s form another CSV file that we will use, you can create it with your favorite text editor or through the command line:

$ cat >> fake_users.csv
author,n
x,5
z,7
y,6

Sorting

Now, let’s compare the sorting algorithm of the two:

pandas

$ python sort_with_pandas.py 
Time: 34.35690135599998

pypolars

$  python sort_with_pypolars.py 
Time: 23.965840922999973

~24 seconds meaning pypolars here is 1.4x faster than pandas

Concatenation

We’ll see now how each will perform while concatenating two data frames and stacking them into one data frame

pandas

18 seconds taken by pandas

$ python concat_with_pandas.py 
Time: 18.12720185799992

pypolars

Here pypolars is 1.2x faster

$ python concat_with_pypolars.py 
Time: 15.001723207000055

Joining

Downloading COVID Tracking data

Downloading data from COVID Tracking Project from this command:

$ curl -LO https://covidtracking.com/data/download/all-states-history.csv

will get you the latest data of the coronavirus spread across all the US in this all-states-history.csv file

Downloading states data

This is a CSV file indicating the abbreviations of each state since we need it to join with the previous CSV which has only the abbreviations listed in state column. Let’s get this data from this command:

$ curl -LO https://gist.githubusercontent.com/afomi/8824ddb02a68cf15151a804d4d0dc3b7/raw/5f1cfabf2e65c5661a9ed12af27953ae4032b136/states.csv

This will output states.csv, the file that has two columns: State and Abbreviation

pandas

$ python join_with_pandas.py 
Time: 0.9691885059996821

Let’s use csvcut to filter out this resulted joined_pd.csv file:

$ csvcut -c date,state,State joined_pd.csv | head | csvlook 
| date | state | State |
| ---------- | ----- | ----------- |
| 2020-11-16 | AK | ALASKA |
| 2020-11-16 | AL | ALABAMA |
| 2020-11-16 | AR | ARKANSAS |
| 2020-11-16 | AS | |
| 2020-11-16 | AZ | ARIZONA |
| 2020-11-16 | CA | CALIFORNIA |
| 2020-11-16 | CO | COLORADO |
| 2020-11-16 | CT | CONNECTICUT |
| 2020-11-16 | DC | |

Looks like the join is working and it’s left join. If you’re curious why the associated State values of AS and DC are nulls, that’s because there are no abbreviations existing in the states.csv file itself. If you look into the Abbreviation column, you’ll find no values for AS nor DC.

Here no AS abbreviations:

$ grep AS states.csv 
ALASKA,AK
ARKANSAS,AR
KANSAS,KS
MASSACHUSETTS,MA
NEBRASKA,NE
TEXAS,TX
WASHINGTON,WA

and here no values for DC:

$ grep DC states.csv

P.S. if it’s not clear what csvcut is used for; we have some tutorials on csvkit (the command-line utility that contains csvcut and some other useful command-line tools for cleaning, processing, and analyzing CSVs).

pypolars

$ python join_with_pypolars.py 
Time: 0.41163978699978543

Let’s see now how the joined data frame looks like:

$ csvcut -c date,state,State joined_pl.csv | head | csvlook 
| date | state | State |
| ---------- | ----- | ----------- |
| 2020-11-16 | AK | ALASKA |
| 2020-11-16 | AL | ALABAMA |
| 2020-11-16 | AR | ARKANSAS |
| 2020-11-16 | AZ | ARIZONA |
| 2020-11-16 | CA | CALIFORNIA |
| 2020-11-16 | CO | COLORADO |
| 2020-11-16 | CT | CONNECTICUT |
| 2020-11-16 | DE | DELAWARE |
| 2020-11-16 | FL | FLORIDA |

so it looks like here pypolars missed the null values for the column it joined on, but that’s because the default join is inner join unlike pandas’ default join which is left join. To get the same result as pandas you need to change line 8 to:

df_all_states_history.join(df_states, left_on=”state”, right_on=”Abbreviation”, how=”left”).to_csv(“joined_pl.csv”)

which on my machine got ~317ms meaning here:

pypolars is 3x faster in left join

Final thoughts

In the end, we’ve found how performant pypolars is compared to pandas. Of course, pandas is more mature since it’s been 12 years now and the community is still investing in it, but if more collaborations are made on pypolars; this rising library will rock!

The Startup

Medium's largest active publication, followed by +729K people. Follow to join our community.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app