Polars DataFrame plugins and extensibility: getting started

Coussement Bruno · Published in datamindedbe · 7 min read · Nov 16, 2023

Image generated by Author with Bing Image Generator

Although Pandas has been (and still is) one of the core workhorses for my data engineering and data science projects over the past years, I cannot hide the fact that I’ve bumped into some pitfalls. Some of them were solved with the introduction of Pandas 2.0, such as support for Arrow and nicer handling of indexes. At the same time, Polars is getting increasingly popular. Recently, it improved the ability to extend the core DataFrame API, which I believe is a real killer feature.

In this article, I want to show the extensibility feature using a couple of examples. Afterward, you should have sufficient background to start experimenting on your own.

The ecosystem and plugins around data frames

Polars’ ecosystem is further enhanced by the emergence of plugins designed to extend data frame functionalities. These plugins are a game-changer, enabling users to tailor the data frame experience to specific use cases without having to start from scratch.

For example, functime helps you with time-series forecasting and can explain its forecasts using LLMs:

import polars as pl
from functime.cross_validation import train_test_split
from functime.feature_extraction import add_fourier_terms
from functime.forecasting import linear_model
from functime.preprocessing import scale
from functime.metrics import mase

# Load commodities price data
y = pl.read_parquet("https://github.com/neocortexdb/functime/raw/main/data/commodities.parquet")
entity_col, time_col = y.columns[:2]

# Time series split
y_train, y_test = y.pipe(train_test_split(test_size=3))

# Fit-predict
forecaster = linear_model(freq="1mo", lags=24)
forecaster.fit(y=y_train)
y_pred = forecaster.predict(fh=3)

# functime ❤️ functional design
# fit-predict in a single line
y_pred = linear_model(freq="1mo", lags=24)(y=y_train, fh=3)

# Score forecasts in parallel
scores = mase(y_true=y_test, y_pred=y_pred, y_train=y_train)

# Forecast with target transforms and feature transforms
forecaster = linear_model(
    freq="1mo",
    lags=24,
    target_transform=scale(),
    feature_transform=add_fourier_terms(sp=12, K=6),
)

# Forecast with exogenous regressors!
# Just pass them into X
X = (
    y.select([entity_col, time_col])
    .pipe(add_fourier_terms(sp=12, K=6))
    .collect()
)
X_train, X_future = X.pipe(train_test_split(test_size=3))
forecaster = linear_model(freq="1mo", lags=24)
forecaster.fit(y=y_train, X=X_train)
y_pred = forecaster.predict(fh=3, X=X_future)

So far nothing special, but then you can call y_pred.llm.analyze directly on the forecast data frame. Boom.

import polars as pl
import functime.llm

dataset_context = "This dataset comprises historical commodity prices between 1980 and 2022."

# Analyze trend and seasonality for two commodities
analysis = y_pred.llm.analyze(
    context=dataset_context,
    basket=["Aluminum", "Banana, Europe"]
)
print("📊 Analysis:\n", analysis)

# Compare two baskets of commodities!
basket_a = ["Aluminum", "Banana, Europe"]
basket_b = ["Chicken", "Cocoa"]
comparison = y_pred.llm.compare(
    basket=basket_a,
    other_basket=basket_b
)
print("📊 Comparison:\n", comparison)

Pandas does not offer this out of the box, although you could achieve something similar yourself, for example by registering a custom accessor.
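For illustration, here is a minimal sketch of what such an accessor could look like with Pandas’ extension API. The accessor name, the commodity column, and the analyze body are hypothetical stand-ins of my own, not part of any library:

import pandas as pd

@pd.api.extensions.register_dataframe_accessor("llm")
class LlmAccessor:
    def __init__(self, df: pd.DataFrame):
        self._df = df

    def analyze(self, context: str, basket: list[str]) -> str:
        # Hypothetical: select the requested commodities and hand them to an LLM.
        subset = self._df[self._df["commodity"].isin(basket)]
        return f"{context} ({len(subset)} rows selected for analysis)"

Registering the accessor makes df.llm.analyze(...) available on every DataFrame, but you have to build and maintain all of the plumbing yourself.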

Another example is polars-business, a Polars expression plugin focused on business-day utilities. In the example below, it adds five business days to a column of dates. This type of processing is not trivial to implement yourself, but is often needed in a business context.

from datetime import date

import polars as pl
import polars_business as plb


df = pl.DataFrame(
    {"date": [date(2023, 4, 3), date(2023, 9, 1), date(2024, 1, 4)]}
)
result = df.with_columns(
    date_shifted=plb.col("date").bdt.offset_by(
        '5bd',
        weekend=('Sat', 'Sun'),
    )
)
print(result)

I believe this is the start of an entire ecosystem of data frame plugins, one that never existed around Pandas.

Start creating your own data frame extensions

Polars provides a unique avenue for customization through its extensive API, allowing you to extend the DataFrame functionalities yourself.

While the previous examples may feel somewhat arbitrary, let’s make it a bit more realistic. Say I want to tag a date with 1 if it is a Belgian public holiday, and 0 otherwise. In other words, we need to check whether a date is in a collection of dates. We’ll assume this collection is finite for simplicity.
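As a baseline, this particular check is simple enough to express with built-in Polars expressions. A minimal sketch, with the holiday list abbreviated:

import polars as pl
from datetime import date

belgian_holidays = [date(2023, 1, 1), date(2023, 5, 1), date(2023, 7, 21)]

df = pl.DataFrame({"dates": [date(2023, 1, 1), date(2023, 2, 14)]})

# is_in performs the membership check; the cast turns the boolean into 1/0
result = df.with_columns(
    is_holiday=pl.col("dates").is_in(belgian_holidays).cast(pl.Int32)
)

The point of the extension mechanisms below is not that the check is otherwise impossible, but that you can package such logic behind a reusable, discoverable namespace.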

Via pyo3-polars

For the fastest processing possible, you can use pyo3-polars. You avoid Python’s global interpreter lock (GIL), which locks the Python interpreter to a single thread and prevents today’s multi-core hardware from being used to its full potential. The downside is that you will need to know Rust to provide an actual implementation of your extension. ChatGPT can certainly help you there, but you will need to dive into some of Rust’s nitty-gritty details, such as:

  1. Ownership and Borrowing: Rust’s ownership model is unique, involving rules about how references to data work, which ensures memory safety without a garbage collector. Understanding how ownership, borrowing, and lifetimes work is crucial when writing Rust code, especially when passing data between Rust and Python. https://doc.rust-lang.org/book/ch04-00-understanding-ownership.html
  2. Cargo and Crates: Learn how to use Cargo, Rust’s build system and package manager, as well as how to manage dependencies with crates. You will likely use the pyo3 crate, which is a bridge between Python and Rust, to create Python bindings. https://doc.rust-lang.org/book/ch01-03-hello-cargo.html
  3. pyo3 Library: Deep dive into the pyo3 library. This is essential for interfacing Rust with Python. Understand how to define Python modules and functions in Rust, work with Python objects, and convert between Rust and Python types.
  4. Error Handling: Rust uses Result<T, E> for returning and propagating errors. Understanding this pattern is important because it's likely you'll need to handle errors that could occur during the data processing and conversion between Python and Rust. https://doc.rust-lang.org/book/ch09-02-recoverable-errors-with-result.html
  5. Multithreading: Rust’s approach to concurrency is also unique due to its ownership model. If your extensions need to handle concurrent operations, understanding Rust’s concurrency model, especially the use of Arc, Mutex, and Rayon for parallelism, will be useful. https://doc.rust-lang.org/book/ch16-01-threads.html
  6. Traits and Generics: Rust’s trait system is used to define shared behavior. Generics are used to build flexible, reusable code. When working with Polars, which is generic over the data type, understanding how to work with these features in Rust is beneficial. https://doc.rust-lang.org/book/ch10-00-generics.html
  7. FFI (Foreign Function Interface): While pyo3 abstracts much of this away, it may still be useful to understand how FFI works in Rust to troubleshoot any complex issues that arise when interfacing with Python. https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html
  8. Macros: Rust macros are powerful and can be used to write code that writes other code, which is particularly handy for reducing boilerplate when binding Rust to Python. https://doc.rust-lang.org/book/ch19-06-macros.html

The list goes on, so this route is not trivial for the majority of data engineers and scientists.

For the example below, I tried to keep it simple and used ChatGPT. What came out worked out of the box, and was comprehensible enough for me (a Rust novice).

Here is holidays.rs, which defines the Rust logic for tagging the holidays.

use chrono::NaiveDate; // extra crate dependency, used to compute day offsets
use polars::prelude::*;
use pyo3_polars::derive::polars_expr;

// Polars' Date dtype is physically an i32: the number of days since 1970-01-01.
fn days_since_epoch(year: i32, month: u32, day: u32) -> i32 {
    let epoch = NaiveDate::from_ymd_opt(1970, 1, 1).unwrap();
    let date = NaiveDate::from_ymd_opt(year, month, day).unwrap();
    (date - epoch).num_days() as i32
}

#[polars_expr(output_type=Int32)]
fn tag_belgian_holidays(inputs: &[Series]) -> PolarsResult<Series> {
    let ca = inputs[0].date()?;

    let belgian_holidays = vec![
        days_since_epoch(2023, 1, 1),  // New Year's Day
        days_since_epoch(2023, 5, 1),  // Labour Day
        days_since_epoch(2023, 7, 21), // Belgian National Day
        // Add other Belgian holidays
    ];

    // Map each date to 1 (holiday) or 0, keeping nulls as nulls.
    let tagged_holidays: Int32Chunked = ca
        .into_iter()
        .map(|opt_date| opt_date.map(|date| i32::from(belgian_holidays.contains(&date))))
        .collect();

    Ok(tagged_holidays.into_series())
}

You would then compile this Rust code into a shared library by running cargo build --release (pyo3 projects are also commonly built with maturin). This produces a .so (or .dll on Windows) file in the target/release directory.

On the Python side, you would then register the function in holidays.py.

import polars as pl
from polars.utils.udfs import _get_shared_lib_location

lib_path = _get_shared_lib_location(__file__)

# Register the Rust function under a custom expression namespace
@pl.api.register_expr_namespace("holiday")
class BelgianHolidays:
    def __init__(self, expr: pl.Expr):
        self._expr = expr

    def tag_belgian_holidays(self) -> pl.Expr:
        return self._expr._register_plugin(
            lib=lib_path,
            symbol="tag_belgian_holidays",
            is_elementwise=True,  # each date is tagged independently
        )

To call it, simply:

from datetime import date

df = pl.DataFrame({
    "dates": [date(2023, 1, 1), date(2023, 2, 14), date(2023, 5, 1)],
})

result = df.with_columns(
    is_holiday=pl.col("dates").holiday.tag_belgian_holidays()
)

print(result)
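Since the plugin is registered as a regular expression, it should also compose with lazy queries like any built-in expression. A small sketch:

# The same expression inside a lazy query; nothing runs until collect()
lazy_result = (
    df.lazy()
    .with_columns(is_holiday=pl.col("dates").holiday.tag_belgian_holidays())
    .collect()
)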

Via plain Python

You can also define extensions in plain Python. It’s simpler (no Rust knowledge required), but slower. This is how I would probably start, just to have something working quickly, and iterate afterward.

Here is an alternative holiday_python.py that defines the logic in plain Python.

Note the difference with the pyo3 way: I extended the data frame namespace instead of registering a custom expression. It will act on the entire data frame. More info: https://pola-rs.github.io/polars/py-polars/html/reference/api.html.

import polars as pl
from datetime import date

@pl.api.register_dataframe_namespace("holiday")
class BelgianHoliday:
    def __init__(self, df: pl.DataFrame):
        self._df = df

    def tag_belgian_holidays(self, date_col: str) -> pl.DataFrame:
        belgian_holidays = {
            date(2023, 1, 1),
            date(2023, 5, 1),
            date(2023, 7, 21),
            # Add other Belgian holidays, e.g. via the holidays PyPI package
        }

        def tag(d: date) -> int:
            # apply passes each element (a datetime.date) to this function
            return 1 if d in belgian_holidays else 0

        return self._df.with_columns(
            pl.col(date_col)
            .apply(tag, return_dtype=pl.Int32)
            .alias("is_holiday")
        )

To call it, simply:

df = pl.DataFrame({
    "dates": [date(2023, 1, 1), date(2023, 2, 14), date(2023, 5, 1)],
})

result = df.holiday.tag_belgian_holidays("dates")
print(result)
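As the comment in the class hints, you could populate the set from the holidays package on PyPI instead of hard-coding dates. A sketch, assuming the package is installed:

import holidays

# All Belgian public holidays for 2023, as a set of datetime.date objects
belgian_holidays = set(holidays.Belgium(years=[2023]).keys())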

Why bother?

While both Polars and Pandas offer extensible APIs, the philosophy and execution differ significantly. Pandas is an established library with a large, active community and a wide array of functionalities already built-in. While you can extend Pandas, the approach is often less straightforward, and the documentation surrounding it is not as extensive as its core functionalities. Additionally, extending Pandas often means diving into its deep codebase, which might be daunting for many users.

In contrast, Polars is designed with extensibility at the forefront. The library encourages you to create custom extensions and integrates this into its ecosystem more seamlessly. Polars’ API is more modern and streamlined, and the library provides thorough documentation specifically aimed at those looking to extend its functionalities. This makes it easier to start extending Polars even if you’re not deeply familiar with its internal workings.

Moreover, Polars’ built-in support for plugins and extensions aligns well with modern software engineering practices like modularity and code reusability. The Polars ecosystem is growing to include specialized plugins for tasks like time-series analysis, business day calculations, and parallel processing, which are either directly built by the community or promoted by Polars itself. This sort of focused, community-driven development around extensibility is something that is just starting to pick up in Pandas.

Lastly, due to its newer design, Polars is better suited for cloud-based and distributed computing environments. It’s easier to integrate custom Polars extensions into large-scale, cloud-native data pipelines compared to Pandas, which was originally designed for single-machine data analysis.

Conclusion

In summary, while both libraries offer the capability for API extension, Polars makes it easier, more efficient, and more aligned with modern data engineering and data science practices.

As a practitioner, before implementing custom extensions yourself, check whether someone has already built and released something that fits your need. If you are out of luck, start with the plain Python approach. If you know (or are willing to learn) your way around Rust, use pyo3-polars instead.

Join the discussion

Share your insights in the comments to enrich our community dialogue. Let’s explore this tech journey together!
If you enjoyed the content, give it a thumbs up, share it, and tag your colleagues. For more about Data Minded, visit our website. Your engagement fuels our collaborative learning environment.
