PYTHON PROGRAMMING TIPS

Pandas 2.0 vs Polars: The Ultimate Battle

An in-depth analysis of Pandas 2.0 and Polars 0.17.0 in terms of syntax, speed, and usability

Priyanshu Chaudhary
CueNex


Polars vs Pandas 2.0. Image by author

Introduction

When dealing with enormous datasets, most of us have experienced the agony of sitting for hours while our Pandas code runs. This is where the Polars library comes in handy.

Polars is a lightning-fast library that can handle data frames significantly more quickly than Pandas.

We’ve seen major releases of both libraries in the last month that claim significant speed gains. Pandas 2.0, which was just launched, supports Apache Arrow, leading to performance enhancements; several fundamental operations, however, continue to be faster on NumPy arrays. Polars 0.17.0, released last week, has also seen performance improvements [1].

But what makes Polars faster and more efficient than Pandas? Image by author
  1. Written in Rust: Polars is implemented in Rust. Rust compiles directly to machine code, with no interpreter in between, which can make it much faster than pure Python.
  2. Parallelization: Polars leverages multithreading, allowing vectorized operations to be executed in parallel across multiple CPU cores.
  3. Python interface: Polars can be used as a Python library, providing an easy-to-use interface for data processing while leveraging the performance benefits of Rust.
  4. Lazy evaluation: Polars supports two APIs: lazy as well as eager evaluation (the mode used by pandas). In lazy evaluation, a query is executed only when required, whereas in eager evaluation a query is executed immediately.

In this post, I’ll be spilling the beans on

  1. Comparing the speed of Pandas 2.0 (with NumPy and PyArrow as backends) and Polars 0.17.0.
  2. How to write simple to complex Pandas code effortlessly in the groundbreaking Polars library.

We’ll be conducting an exciting performance showdown between the two libraries on a 4-core CPU with 32 GB of RAM. Get ready to take your data analysis skills to new heights!

Getting Started

In case you don’t have the libraries on your local machine, they can be installed via pip:

pip install polars==0.17.0 # Latest version

pip install pandas==2.0.0 # Latest pandas version

To assess performance, we will use a synthetic dataset of 30 million rows and 15 columns, composed of 8 categorical and 7 numerical features. The dataset can be accessed here.

A sample of the dataset is shown below

# Pandas
train_pd.head()

#Polars
train_pl.head()
Sample dataset for speed and syntax analysis

Import the Relevant libraries

import pandas as pd
import polars as pl
import numpy as np
import time

Reading the dataset

Let’s compare the parquet file reading times of both libraries. I used the code below, with the %%time magic to measure execution time.

train_pd=pd.read_parquet('./train.parquet') #Pandas dataframe

train_pl=pl.read_parquet('./train.parquet') #Polars dataframe
Reading time comparison. Image by author

When it comes to reading parquet files, Polars and Pandas 2.0 (with the PyArrow backend) perform similarly in terms of speed. However, Pandas with the NumPy backend takes about twice as long as Polars to complete this task.

Aggregation Operations

To evaluate aggregation functions, consider the code below, which gives a generic pattern for scenarios where several aggregations (min, max, mean, etc.) must be calculated over many columns.

# Pandas query
train_pd[nums].agg(['min','max','mean','median','std'])
train_pd[cats].agg(['nunique'])

# Polars query
train_pl.with_columns([
    pl.col(nums).min().suffix('_min'),
    pl.col(nums).max().suffix('_max'),
    pl.col(nums).mean().suffix('_mean'),
    pl.col(nums).median().suffix('_median'),
    pl.col(nums).std().suffix('_std'),
    pl.col(cats).n_unique().suffix('_unique'),
])
Aggregation function time comparison. Image by author

Pandas is definitely better in syntax and performance for simple queries, though the performance difference is quite minimal. Note that Polars can apply the same aggregation to a whole list of columns in a single expression, whereas the pandas code above cannot; hence I have defined the two queries separately for comparison.
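If matching Polars’ suffixed, single-row output in pandas matters, one option (a sketch on a toy frame, not part of the original benchmark) is to flatten the aggregated result:

```python
import pandas as pd

df = pd.DataFrame({"num_7": [1.0, 2.0, 3.0], "num_8": [4.0, 5.0, 6.0]})

# agg over a DataFrame gives a stat-by-column table...
agg = df[["num_7", "num_8"]].agg(["min", "max", "mean"])

# ...which can be flattened into suffixed names, mirroring Polars' .suffix()
flat = {f"{col}_{stat}": agg.loc[stat, col]
        for stat in agg.index for col in agg.columns}

print(flat["num_7_mean"])  # 2.0
print(flat["num_8_max"])   # 6.0
```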

Filtering and Selection operations

Selection and filter operations involve specifying a condition to extract the rows and columns of interest from the data frame. To understand this, consider the queries below.

Query 1: Count unique values for categorical columns when num_8 is less than or equal to 10.

# Polars filter and select
train_pl.filter(pl.col("num_8") <= 10).select(pl.col(cats).n_unique())

# Pandas filter and select
train_pd[train_pd['num_8']<=10][cats].nunique()

Query 2: The mean of all numerical columns when cat_1 equals 1.

# Polars filter and select
train_pl.filter(pl.col("cat_1") == 1).select(pl.col(nums).mean())

# Pandas filter and select
train_pd[train_pd['cat_1']==1][nums].mean()
Selection and filtering time comparison. Image by author

In terms of performance, Polars is 2–5 times faster for numerical filter operations, whereas Pandas requires less code to be written. It must be noted that Pandas is slower when it comes to working with strings (categorical features).

Grouping Operations

Grouping is one of the essential operations in machine learning: it is used to create aggregation features and to understand the statistics of the data.

I have tested the performance of these libraries on the functions defined below.

Function 1: Count-aggregated feature for cat_1

Function 2: Mean feature for num_7

Function 3: Mean-aggregated features for all numerical columns

Function 4: Count-aggregated features for categorical columns

nums=['num_7','num_8', 'num_9', 'num_10', 'num_11', 'num_12', 'num_13', 'num_14','num_15']
cats=['cat_1', 'cat_2', 'cat_3', 'cat_4', 'cat_5', 'cat_6']

# Pandas Functions
Function_1= train_pd.groupby(['user'])['cat_1'].agg('count') #Function 1
Function_2= train_pd.groupby(['user'])['num_7'].agg('mean') #Function 2
Function_3= train_pd.groupby(['user'])[nums].agg('mean') #Function 3
Function_4= train_pd.groupby(['user'])[cats].agg('count') #Function 4


# Polars Functions
Function_1= train_pl.groupby('user').agg(pl.col('cat_1').count()) #Function 1
Function_2= train_pl.groupby('user').agg(pl.col('num_7').mean()) #Function 2
Function_3= train_pl.groupby('user').agg(pl.col(nums).mean()) #Function 3
Function_4= train_pl.groupby('user').agg(pl.col(cats).count()) #Function 4
Comparison of groupby functions. Image by author

Polars appears to be the best choice for grouping and aggregations. It should also be observed that Pandas 2.0 with the PyArrow backend is noticeably slower in all circumstances than both Polars and Pandas 2.0 with the NumPy backend.

Next, let’s consider how the time changes as the number of grouping variables is raised from one to five, as coded below. Pandas 2.0 with PyArrow support is not shown in the diagram since the expression evaluation takes more than 1000 seconds.

# PANDAS: TESTING GROUPING SPEED ON 5 COLUMNS
cols=[]  # grouping columns accumulated one at a time
for cat in ['user', 'cat_1', 'cat_2', 'cat_3', 'cat_4']:
    cols+=[cat]
    st=time.time()
    temp=train_pd.groupby(cols)['num_7'].agg('mean')
    en=time.time()
    print(cat,':',en-st)


# POLARS: TESTING GROUPING SPEED ON 5 COLUMNS
cols=[]  # grouping columns accumulated one at a time
for cat in ['user', 'cat_1', 'cat_2', 'cat_3', 'cat_4']:
    cols+=[cat]
    st=time.time()
    temp=train_pl.groupby(cols).agg(pl.col('num_7').mean())
    en=time.time()
    print(cat,':',en-st)
del temp
Grouping for one or more than one variable. Image by author

Polars has won again! Pandas 2.0 (NumPy backend) evaluates grouping functions more slowly, whereas Pandas 2.0 with PyArrow support takes more than 1000 seconds.

Note that Pandas by default removes null values while grouping, whereas Polars doesn't.

Sorting Operation

We can quickly sort a data frame based on one or more columns, either in ascending or descending order, using the below codes.

cols=['user','num_8'] # columns to be used for sorting

#Sorting in Polars
train_pl.sort(cols,descending=False)

# Sorting in Pandas
train_pd.sort_values(by=cols,ascending=True)
Sorting time comparison. Image by author

As the bar charts show, Polars is still the fastest library for complex cases like sorting and grouping. While Pandas can take minutes simply to sort the data frame, even complex sorting functions are evaluated in no more than 15 seconds in Polars.

Conclusion

In conclusion, choosing the right tool for working with large datasets is essential for data scientists. Our exploration of the performance differences between Pandas and Polars highlights the benefits and drawbacks of each library. While Pandas is more syntactically appealing, Polars offers better throughput when working with larger data frames.

It’s important to note that since Polars is a newer library, some people may find it challenging to transition from Pandas. However, the decision to use either library ultimately depends on the size of your data and how crucial performance is for your work. By understanding the differences between these two powerful tools, data scientists can choose the best option for their specific needs and achieve more efficient and effective data analysis.

References:

[1] https://github.com/pola-rs/polars/releases/tag/py-0.17.0

[2] https://www.pola.rs/

[3] https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html
