Pandas vs Polars: Is learning Polars worth the performance boost?

Mochamad Kautzar Ichramsyah
Published in CodeX
7 min read · Jan 9, 2025
Illustration: Pandas vs Polars. Generated by ChatGPT

Introduction

In this article, we’ll explore how Pandas and Polars compare in terms of performance, usability, and practicality, covering data loading, data cleaning, aggregation, and joining. We’ll dive into real-world benchmarks to help us decide which library is better suited for our needs.

Pandas has been the primary option for data analysis in Python for over a decade. It provides an intuitive, user-friendly interface for data manipulation, and its DataFrame and Series objects let analysts work easily with structured, tabular data. However, Pandas was designed at a time when data sizes were typically smaller and single-threaded performance was adequate.

Polars, on the other hand, is a relatively new library designed to address modern data processing challenges. Written in Rust, Polars leverages parallel processing to maximize speed and efficiency.

Setup

  • Python version: 3.11.9
  • Pandas version: 2.2.3
  • Polars version: 1.19.0
  • Hardware: MacBook Pro 14-inch, 2021. Apple M1 Pro, Memory 16 GB, MacOS Sequoia 15.1.1
  • Code Editor: Visual Studio Code 1.96.2
import sys
print("Python version: ", sys.version)

import pandas as pd
print("Pandas version: ", pd.__version__)

import polars as pl
print("Polars version: ", pl.__version__)

Performance comparison

We’ll compare the two libraries across six dimensions:

  1. Data loading
    Efficient data loading is critical when working with large datasets.
  2. Data transformation
    Common operations like filtering, grouping, joining, and aggregating were tested.
  3. Lazy evaluation
    Polars’s lazy evaluation is one of its standout features.
  4. Multi-threading
    Modern processors thrive on parallelism, and Polars leverages this effectively.
  5. Memory usage
    Efficient memory usage is crucial when working with large datasets to avoid crashes or slowdowns.
  6. Ease of use
    While performance is important, usability often determines adoption in practice.
Summary of the comparison results, based on various sources.

Real-world use cases

Dataset

We’ll use the NYC Yellow Taxi Trip dataset, a real-world dataset available at the NYC Taxi & Limousine Commission. It contains millions of rows and includes trip details like pickup/dropoff locations, timestamps, trip distance, and fare amount.

The initial dataset has 3 million+ rows and 19 columns, with memory usage of 444.6+ MB.

Use case 1: Data loading and cleaning.

Task: Load the dataset, filter trips longer than 10 miles, and remove rows with missing values.

import pandas as pd
import polars as pl
import time

file_path = "yellow_tripdata_2023-01.parquet"

pandas_times = []
arrow_times = []
polars_times = []

# Run the tests 10 times
for _ in range(10):
    # Pandas execution
    start_time = time.time()
    df_pd = pd.read_parquet(file_path)
    df_pd_cleaned = df_pd[df_pd["trip_distance"] > 10].dropna()
    pandas_times.append(time.time() - start_time)

    # Pandas with PyArrow backend execution
    start_time = time.time()
    df_ar = pd.read_parquet(file_path, dtype_backend="pyarrow")
    df_ar_cleaned = df_ar[df_ar["trip_distance"] > 10].dropna()
    arrow_times.append(time.time() - start_time)

    # Polars execution
    start_time = time.time()
    df_pl = pl.read_parquet(file_path)
    df_pl_cleaned = df_pl.filter(pl.col("trip_distance") > 10).drop_nulls()
    polars_times.append(time.time() - start_time)

comparison_table = {
    "Trial": list(range(1, 11)),
    "Pandas Time (s)": pandas_times,
    "pyArrow Time (s)": arrow_times,
    "Polars Time (s)": polars_times
}

comparison_df = pd.DataFrame(comparison_table)

comparison_df["Difference Multiplier Pandas to Polars"] = (
    comparison_df["Pandas Time (s)"] / comparison_df["Polars Time (s)"]
).round(2)

comparison_df["Difference Multiplier pyArrow to Polars"] = (
    comparison_df["pyArrow Time (s)"] / comparison_df["Polars Time (s)"]
).round(2)

comparison_df["Faster Package"] = comparison_df.apply(
    lambda row: min(
        ("Pandas", row["Pandas Time (s)"]),
        ("Polars", row["Polars Time (s)"]),
        ("pyArrow", row["pyArrow Time (s)"]),
        key=lambda x: x[1]
    )[0],
    axis=1
)

comparison_df
The data loading and cleaning comparison result table

Based on the data loading and cleaning comparison, Polars is, on average, 2.27 times faster than Pandas and 1.57 times faster than pyArrow when loading and cleaning a dataset with over 3 million rows and a memory footprint exceeding 444.6 MB.

Use case 2: Aggregations.

Task: Calculate the average fare amount per passenger count.

import pandas as pd
import polars as pl
import time

pd_times_groupby = []
pl_times_groupby = []

# Run the tests 10 times
for _ in range(10):
    # Pandas execution
    start_time = time.time()
    avg_fare_pd = df_pd.groupby("passenger_count")["fare_amount"].mean()
    pd_times_groupby.append(time.time() - start_time)

    # Polars execution
    start_time = time.time()
    avg_fare_pl = df_pl.group_by("passenger_count").agg(pl.col("fare_amount").mean())
    pl_times_groupby.append(time.time() - start_time)

comparison_table_groupby = {
    "Trial": list(range(1, 11)),
    "Pandas Time (s)": pd_times_groupby,
    "Polars Time (s)": pl_times_groupby
}

comparison_df_groupby = pd.DataFrame(comparison_table_groupby)

comparison_df_groupby["Difference Multiplier"] = (
    comparison_df_groupby["Pandas Time (s)"] / comparison_df_groupby["Polars Time (s)"]
).round(2)

comparison_df_groupby["Faster Package"] = comparison_df_groupby.apply(
    lambda row: "Polars" if row["Polars Time (s)"] < row["Pandas Time (s)"] else "Pandas",
    axis=1,
)

comparison_df_groupby
The aggregation comparison result table

Based on the aggregation comparison, Polars is, on average, 204.7 times faster than Pandas.

Use case 3: Joining datasets.

Task: Join the NYC Yellow Taxi data with a second dataset containing trips filtered on pickup location ID 132.

In this case, we take only .head(100) of the filtered dataset, because joining against the full datasets would require too much memory.

import pandas as pd 
import polars as pl
import time

df_pd_pu132 = df_pd[df_pd['PULocationID']==132].head(100)
df_pl_pu132 = df_pl.filter(pl.col("PULocationID") == 132).head(100)

pd_times_join = []
pl_times_join = []

# Run the tests 10 times
for _ in range(10):
    # Pandas execution
    start_time = time.time()
    join_pd = pd.merge(df_pd, df_pd_pu132, on="PULocationID")
    pd_times_join.append(time.time() - start_time)

    # Polars execution
    start_time = time.time()
    join_pl = df_pl.join(df_pl_pu132, on="PULocationID")
    pl_times_join.append(time.time() - start_time)

comparison_table_join = {
    "Trial": list(range(1, 11)),
    "Pandas Time (s)": pd_times_join,
    "Polars Time (s)": pl_times_join
}

comparison_df_join = pd.DataFrame(comparison_table_join)

comparison_df_join["Difference Multiplier"] = (
    comparison_df_join["Pandas Time (s)"] / comparison_df_join["Polars Time (s)"]
).round(2)

comparison_df_join["Faster Package"] = comparison_df_join.apply(
    lambda row: "Polars" if row["Polars Time (s)"] < row["Pandas Time (s)"] else "Pandas",
    axis=1,
)

comparison_df_join
The join comparison result table

Based on the join comparison, Polars is, on average, 13.05 times faster than Pandas when performing joins.
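A side note on methodology: the benchmark loops above use time.time(), which works, but time.perf_counter() is the more robust clock for this purpose. A minimal sketch of a reusable timing helper (the function name and repeat count are our own choices, not from the benchmarks above):

```python
import time

def bench(fn, repeats=5):
    # perf_counter() is monotonic and high-resolution, unlike time.time(),
    # which can jump when the system clock is adjusted.
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    # The minimum over repeats is the least noisy summary for micro-benchmarks.
    return min(times)

best = bench(lambda: sum(range(100_000)))
print(f"best of 5: {best:.6f}s")
```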

Limitations

Pandas

  1. Performance on large datasets: Pandas struggles with very large datasets due to its single-threaded processing and high memory usage.
  2. Memory inefficiency: Pandas’ default NumPy-backed memory model makes eager copies and stores strings as Python objects, which is less memory-efficient than Polars’ Arrow-based columnar storage for many analytical tasks.
  3. Lack of parallelism: Most operations in Pandas are single-threaded, which means it does not fully utilize modern multi-core processors.
  4. Steep performance degradation for complex operations: Operations like grouping, joining, or applying custom functions over large datasets can be significantly slower compared to more optimized libraries like Polars.

Polars

  1. Smaller ecosystem: Polars is a relatively new library compared to Pandas and does not yet have as extensive an ecosystem or community support. Some functionalities available in Pandas, such as direct integration with certain libraries (e.g., Matplotlib, Scikit-learn), may not be fully supported in Polars.
  2. Limited documentation and tutorials: While Polars’ documentation is improving, it is not as comprehensive or beginner-friendly as Pandas’ documentation.
  3. Customization and flexibility: Although Polars provides excellent performance for standard operations, it may lack the flexibility Pandas offers for highly customized workflows, particularly those requiring complex custom functions.
  4. Compatibility with older systems: Polars is optimized for modern hardware and requires a relatively recent Python version and dependencies. It might not be ideal for legacy systems.

Conclusion

The choice between Pandas and Polars ultimately depends on the specific needs of the data analysis workflow. Both libraries have unique strengths, and understanding their differences can help you make the best decision for each use case.

Key takeaways

  1. Performance and scalability: Polars is a clear winner when it comes to handling large datasets, thanks to its multi-threaded processing, memory-efficient architecture, and columnar storage.
  2. Ease of use: Pandas is ideal for beginners and users who value simplicity and a vast ecosystem of supporting libraries. Its rich documentation and widespread community make it accessible and beginner-friendly.
  3. Advanced features: Polars introduces capabilities such as lazy evaluation, enabling optimization of complex queries before execution. Pandas, while lacking these features, provides a flexible and well-tested environment for the most common data manipulation tasks.
  4. Ecosystem and integration: Pandas seamlessly integrates with Python’s data science stack, making it the preferred choice for workflows involving visualization, machine learning, and statistical modeling. Polars, though improving, has a smaller ecosystem and might not yet fit all workflows requiring extensive library integrations.

Final thoughts

Rather than viewing Pandas and Polars as competitors, they can complement each other. For instance, we might use Polars for the heavy lifting on large datasets and switch to Pandas for its compatibility with downstream analysis or visualization libraries.

References and Resources

A comprehensive comparison like this would not be possible without valuable documentation, community contributions, and datasets. Below are the references and resources you can explore to deepen your understanding of Pandas, Polars, and data processing:

  1. Pandas Official: https://pandas.pydata.org/docs/
  2. Pandas GitHub: https://github.com/pandas-dev/pandas
  3. Polars Official: https://www.pola.rs/
  4. Polars GitHub: https://github.com/pola-rs/polars
  5. ChatGPT: https://chatgpt.com/
