Pandas vs Polars: Is learning Polars worth the performance boost?
Introduction
In this article, we’ll explore how Pandas and Polars compare in terms of performance, usability, and practicality, covering data loading, data cleaning, aggregation, and joining. We’ll dive into real-world benchmarks to help us decide which library is better suited for our needs.
Pandas has been the primary option for data analysis in Python for over a decade. It provides an intuitive, user-friendly interface for data manipulation. Its DataFrame and Series objects allow analysts to easily work with structured and tabular data. However, Pandas was designed at a time when data sizes were typically smaller and single-threaded performance was adequate.
Polars, on the other hand, is a relatively new library designed to address modern data processing challenges. Written in Rust, Polars leverages parallel processing to maximize speed and efficiency.
Setup
- Python version: 3.11.9
- Pandas version: 2.2.3
- Polars version: 1.19.0
- Hardware: MacBook Pro 14-inch, 2021. Apple M1 Pro, 16 GB memory, macOS Sequoia 15.1.1
- Code Editor: Visual Studio Code 1.96.2
import sys
print("Python version: ", sys.version)
import pandas as pd
print("Pandas version: ", pd.__version__)
import polars as pl
print("Polars version: ", pl.__version__)
Performance comparison
There are six aspects we want to compare:
- Data loading: Efficient data loading is critical when working with large datasets.
- Data transformation: Common operations like filtering, grouping, joining, and aggregating were tested.
- Lazy evaluation: Polars’s lazy evaluation is one of its standout features.
- Multi-threading: Modern processors thrive on parallelism, and Polars leverages this effectively.
- Memory usage: Efficient memory usage is crucial when working with large datasets to avoid crashes or slowdowns.
- Ease of use: While performance is important, usability often determines adoption in practice.
Real-world use cases
Dataset
We’ll use the NYC Yellow Taxi Trip dataset, a real-world dataset available at the NYC Taxi & Limousine Commission. It contains millions of rows and includes trip details like pickup/dropoff locations, timestamps, trip distance, and fare amount.
Use case 1: Data loading and cleaning.
Task: Load the dataset, filter trips longer than 10 miles, and remove rows with missing values.
import pandas as pd
import polars as pl
import time
file_path = "yellow_tripdata_2023-01.parquet"
pandas_times = []
arrow_times = []
polars_times = []
# Run the tests 10 times
for _ in range(10):
    # Pandas execution
    start_time = time.time()
    df_pd = pd.read_parquet(file_path)
    df_pd_cleaned = df_pd[df_pd["trip_distance"] > 10].dropna()
    pandas_times.append(time.time() - start_time)

    # Pandas with PyArrow backend execution
    start_time = time.time()
    df_ar = pd.read_parquet(file_path, dtype_backend="pyarrow")
    df_ar_cleaned = df_ar[df_ar["trip_distance"] > 10].dropna()
    arrow_times.append(time.time() - start_time)

    # Polars execution
    start_time = time.time()
    df_pl = pl.read_parquet(file_path)
    df_pl_cleaned = df_pl.filter(pl.col("trip_distance") > 10).drop_nulls()
    polars_times.append(time.time() - start_time)
comparison_table = {
"Trial": list(range(1, 11)),
"Pandas Time (s)": pandas_times,
"pyArrow Time (s)": arrow_times,
"Polars Time (s)": polars_times
}
comparison_df = pd.DataFrame(comparison_table)
comparison_df["Difference Multiplier Pandas to Polars"] = (
comparison_df["Pandas Time (s)"] / comparison_df["Polars Time (s)"]
).round(2)
comparison_df["Difference Multiplier pyArrow to Polars"] = (
comparison_df["pyArrow Time (s)"] / comparison_df["Polars Time (s)"]
).round(2)
comparison_df["Faster Package"] = comparison_df.apply(
lambda row: min(
("Pandas", row["Pandas Time (s)"]),
("Polars", row["Polars Time (s)"]),
("pyArrow", row["pyArrow Time (s)"]),
key=lambda x: x[1]
)[0],
axis=1
)
comparison_df
Based on the data loading and cleaning comparison, Polars is, on average, 2.27 times faster than Pandas and 1.57 times faster than pyArrow when processing data loading and cleaning on a dataset with over 3 million rows, 18 columns, and a memory footprint exceeding 444.6 MB.
Use case 2: Aggregations.
Task: Calculate the average fare amount per passenger count.
import pandas as pd
import polars as pl
import time
pd_times_groupby = []
pl_times_groupby = []
# Run the tests 10 times
for _ in range(10):
    # Pandas execution
    start_time = time.time()
    avg_fare_pd = df_pd.groupby("passenger_count")["fare_amount"].mean()
    pd_times_groupby.append(time.time() - start_time)

    # Polars execution
    start_time = time.time()
    avg_fare_pl = df_pl.group_by("passenger_count").agg(pl.col("fare_amount").mean())
    pl_times_groupby.append(time.time() - start_time)
comparison_table_groupby = {
"Trial": list(range(1, 11)),
"Pandas Time (s)": pd_times_groupby,
"Polars Time (s)": pl_times_groupby
}
comparison_df_groupby = pd.DataFrame(comparison_table_groupby)
comparison_df_groupby["Difference Multiplier"] = (
comparison_df_groupby["Pandas Time (s)"] / comparison_df_groupby["Polars Time (s)"]
).round(2)
comparison_df_groupby["Faster Package"] = comparison_df_groupby.apply(
lambda row: "Polars" if row["Polars Time (s)"] < row["Pandas Time (s)"] else "Pandas",
axis=1,
)
comparison_df_groupby
Based on the aggregation comparison, Polars is, on average, 204.7 times faster than Pandas.
Use case 3: Joining datasets.
Task: Join the NYC Yellow Taxi data with a dataset containing information based on pickup location IDs (132).
In this case, we only take .head(100) of the filtered dataset, because joining the full datasets on a many-to-many key like PULocationID would multiply row counts and require too much memory.
import pandas as pd
import polars as pl
import time
df_pd_pu132 = df_pd[df_pd['PULocationID']==132].head(100)
df_pl_pu132 = df_pl.filter(pl.col("PULocationID") == 132).head(100)
pd_times_join = []
pl_times_join = []
# Run the tests 10 times
for _ in range(10):
    # Pandas execution
    start_time = time.time()
    join_pd = pd.merge(df_pd, df_pd_pu132, on="PULocationID")
    pd_times_join.append(time.time() - start_time)

    # Polars execution
    start_time = time.time()
    join_pl = df_pl.join(df_pl_pu132, on="PULocationID")
    pl_times_join.append(time.time() - start_time)
comparison_table_join = {
"Trial": list(range(1, 11)),
"Pandas Time (s)": pd_times_join,
"Polars Time (s)": pl_times_join
}
comparison_df_join = pd.DataFrame(comparison_table_join)
comparison_df_join["Difference Multiplier"] = (
comparison_df_join["Pandas Time (s)"] / comparison_df_join["Polars Time (s)"]
).round(2)
comparison_df_join["Faster Package"] = comparison_df_join.apply(
lambda row: "Polars" if row["Polars Time (s)"] < row["Pandas Time (s)"] else "Pandas",
axis=1,
)
comparison_df_join
Based on the data joining comparison, Polars is, on average, 13.05 times faster than Pandas when joining.
Limitations
Pandas
- Performance on large datasets: Pandas struggles with very large datasets due to its single-threaded processing and high memory usage.
- Memory inefficiency: Pandas frequently materializes intermediate copies, and its NumPy-backed representation (notably the object dtype for strings) uses far more memory than Polars’ Arrow-based columnar format.
- Lack of parallelism: Most operations in Pandas are single-threaded, which means it does not fully utilize modern multi-core processors.
- Steep performance degradation for complex operations: Operations like grouping, joining, or applying custom functions over large datasets can be significantly slower compared to more optimized libraries like Polars.
Polars
- Smaller ecosystem: Polars is a relatively new library compared to Pandas and does not yet have as extensive an ecosystem or community support. Some functionalities available in Pandas, such as direct integration with certain libraries (e.g., Matplotlib, Scikit-learn), may not be fully supported in Polars.
- Limited documentation and tutorials: While Polars’ documentation is improving, it is not as comprehensive or beginner-friendly as Pandas’ documentation.
- Customization and flexibility: Although Polars provides excellent performance for standard operations, it may lack the flexibility Pandas offers for highly customized workflows, particularly those requiring complex custom functions.
- Compatibility with older systems: Polars is optimized for modern hardware and requires a relatively recent Python version and dependencies. It might not be ideal for legacy systems.
Conclusion
The choice between Pandas and Polars ultimately depends on the specific needs of the data analysis workflow. Both libraries have unique strengths, and understanding their differences can help you make the best decision for each use case.
Key takeaways
- Performance and scalability: Polars is a clear winner when it comes to handling large datasets, thanks to its multi-threaded processing, memory-efficient architecture, and columnar storage.
- Ease of use: Pandas is ideal for beginners and users who value simplicity and a vast ecosystem of supporting libraries. Its rich documentation and widespread community make it accessible and beginner-friendly.
- Advanced features: Polars introduces capabilities such as lazy evaluation, enabling optimization of complex queries before execution. Pandas, while lacking these features, provides a flexible and well-tested environment for handling the most common data manipulation tasks.
- Ecosystem and integration: Pandas seamlessly integrates with Python’s data science stack, making it the preferred choice for workflows involving visualization, machine learning, and statistical modeling. Polars, though improving, has a smaller ecosystem and might not yet fit all workflows requiring extensive library integrations.
Final thoughts
Rather than viewing Pandas and Polars as competitors, they can complement each other. For instance, we might use Polars for the heavy lifting on large datasets and switch to Pandas for its compatibility with downstream analysis or visualization libraries.
References and Resources
A comprehensive comparison like this would not be possible without valuable documentation, community contributions, and datasets. Below are the references and resources you can explore to deepen your understanding of Pandas, Polars, and data processing:
- Pandas Official: https://pandas.pydata.org/docs/
- Pandas GitHub: https://github.com/pandas-dev/pandas
- Polars Official: https://www.pola.rs/
- Polars GitHub: https://github.com/pola-rs/polars
- ChatGPT: https://chatgpt.com/