Why you should migrate from pandas to Polars

Benoit de Menthière
Published in iNex Blog · 3 min read · Jul 16, 2024

In the ever-evolving landscape of data processing, choosing the right tools can significantly impact efficiency and productivity. After careful consideration, we decided to migrate to Polars, a DataFrame library designed for speed and ease of use. Here are the key reasons behind our decision:

Leveraging Rust’s Capabilities

Polars is crafted in Rust, a language celebrated for its low-level memory management without sacrificing safety, thanks to its compile-time memory safety checks. Rust’s approach to concurrency is also noteworthy; it enables safe and efficient parallel data processing, which is pivotal for handling voluminous datasets swiftly. These characteristics contribute significantly to Polars’ performance, making it an excellent tool for time-sensitive data operations.

At the core of Polars’ architecture lies its integration with Apache Arrow, an in-memory data format designed for efficient data processing. While pandas 2.0 now supports pyarrow, it is still built on top of numpy which is less efficient for non-float data types, leading to performance bottlenecks. Polars uses Arrow’s columnar memory format natively. This alignment drastically reduces memory overhead and speeds up data operations.

Lazy evaluation

Polars excels in computational efficiency through intelligent query optimization. It minimizes redundant computations and reorders operations to enhance execution speeds. This proactive optimization ensures that data processing is not only faster but also more resource-efficient.

For instance, when dealing with Sirene files containing 38 million rows, traditional tools like pandas struggle to handle the data in memory. Our previous solution involved loading data in batches, which was not scalable and made some operations, like a simple group-by median, very complicated. Polars, with its efficient data handling, eliminates these issues.


# Lazily scan the 38M-row Sirene CSV: nothing is read yet.
pl.scan_csv(
    path_sirene_file,
).filter(
    pl.col("trancheEffectifsEtablissement").is_in(list(dic_tranche_employes))
).group_by(
    pl.col("activitePrincipaleEtablissement").str.slice(0, length=5)
    .alias("code_ape")
).agg(
    pl.col("trancheEffectifsEtablissement")
    .replace(dic_tranche_employes, default=None)
    .median()
    .alias("median_employes")
).explain()  # show the optimized query plan instead of executing it

output:
AGGREGATE
[
col("trancheEffectifsEtablissement")
.replace([Series, Series, null])
.median().alias("median_employes")
]
BY [
col("activitePrincipaleEtablissement").str.slice([0, 5]).alias("code_ape")
]
FROM

Csv SCAN /home/data/sirene/StockEtablissement_utf8.csv
PROJECT 2/53 COLUMNS
SELECTION: col("trancheEffectifsEtablissement").is_in([Series])

Only the required columns and rows are loaded! Their benchmark gives a good idea of what these optimizations bring.

Fail fast is a best practice in data engineering; we don’t want to wait hours before encountering an issue like a missing column. Polars’ LazyFrame defers execution until the result is actually needed. This approach has significant performance advantages and is why the Lazy API is preferred in most cases. For example, it won’t load columns that are unnecessary for the calculation, and it raises a missing-column exception immediately, before any heavy computation starts. The stricter Polars schema also reduces schema-related bugs.

API

Polars offers an API that combines expressiveness with simplicity, enabling users to perform complex data manipulations with fewer lines of code and greater clarity. This not only improves code readability but also reduces the potential for errors.

Polars expressions are quite powerful and flexible, so there is much less need for custom Python functions compared to pandas. Check their documentation for the list of available expressions.

Consistency

Missing data

Polars provides a more consistent approach to handling missing data, distinguishing clearly between null and NaN, which simplifies data cleaning and preprocessing steps.

Syntax

Finally, I never understood pandas naming: sometimes there is an underscore, sometimes not, and the same goes for the plural “s”: nunique, is_unique, dropna, hasnans, drop_duplicates. Polars’ syntax is more consistent, allowing faster development.

Also, the absence of an index makes things simpler: no more reset_index calls and fewer errors!

Conclusion

As you can see, Polars has many advantages compared to pandas. One of the few reasons I see to keep pandas is its rich ecosystem, as it is compatible with many libraries, including visualization and machine learning ones. For instance, we use geopandas a lot to handle geospatial data efficiently. While geopolars is still under development, we use duckdb and its spatial extension to accelerate our geospatial data processing. But if you really need pandas, you can still easily convert a Polars DataFrame to a pandas DataFrame. For those interested in trying Polars, we recommend this exercise, which we have used internally.



Lead data engineer at iNex #GreenTech. Passionate about cutting-edge technology and solving real-world problems.