Polars, Partitions and Performance with Delta Lake

Jimmy Jensen
Feb 10, 2024 · 3 min read


A little something on efficiently reading Delta tables

A friendly polar bear | Photo by Hans-Jürgen Mager on Unsplash

TL;DR

If you read a partitioned Delta table with Polars, use pyarrow_options.

Introduction

Polars and Delta Lake (delta-rs) work great together.

There are many ways to read partitioned Delta tables with the Polars DataFrame library.

import polars as pl
from deltalake import DeltaTable

# Simple DataFrame with Polars filter
df = pl.read_delta("table_name").filter(pl.col("is_current") == True)

# Simple LazyFrame with Polars filter
df = pl.scan_delta("table_name").filter(pl.col("is_current") == True).collect()

# DataFrame with partitions pyarrow_options
df = pl.read_delta(
    "table_name",
    pyarrow_options={
        "partitions": [
            ("is_current", "=", "true")
        ]
    }
)

# LazyFrame with partitions pyarrow_options
df = pl.scan_delta(
    "table_name",
    pyarrow_options={
        "partitions": [
            ("is_current", "=", "true")
        ]
    }
).collect()

# Delta table with filters - convert to Polars DataFrame
df = pl.from_arrow(
    DeltaTable("table_name").to_pyarrow_table(
        filters=[
            ("is_current", "==", True)
        ]
    )
)

# Delta table with partitions - convert to Polars DataFrame
df = pl.from_arrow(
    DeltaTable("table_name").to_pyarrow_table(
        partitions=[
            ("is_current", "=", "true")
        ]
    )
)

Which one do you prefer syntactically? I prefer option 1 or 2, as they are the most concise in my opinion. However, let’s look at the performance of each.

Testing

I made some sample data with 5 columns and 100,000 records, and then ran 11 tests. The first test had 0 percent of records in the is_current=true partition, the second had 10 percent, and so on.
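For illustration, here’s a minimal sketch of how such a test table could be created and timed. The column names and value distribution are placeholders I made up; only the is_current partition column and the record counts match the actual tests.

import time

import polars as pl
from deltalake import write_deltalake

# Build 100,000 records with 5 columns; `pct_current` controls how many
# records land in the is_current=true partition (0%, 10%, ... per test).
# Columns `a`, `b`, `c` are made-up filler data.
n, pct_current = 100_000, 10
df = pl.DataFrame(
    {
        "id": pl.int_range(0, n, eager=True),
        "a": pl.int_range(0, n, eager=True) % 7,
        "b": pl.int_range(0, n, eager=True) % 13,
        "c": pl.int_range(0, n, eager=True) % 17,
        "is_current": pl.int_range(0, n, eager=True) < n * pct_current // 100,
    }
)
write_deltalake("table_name", df.to_arrow(), partition_by=["is_current"])

# Each read method was then timed roughly like this:
start = time.perf_counter()
pl.read_delta(
    "table_name",
    pyarrow_options={"partitions": [("is_current", "=", "true")]},
)
print(f"{(time.perf_counter() - start) * 1000:.1f} ms")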

Read 100,000 records with 5 columns from a Delta table. Values in milliseconds

Unsurprisingly, the read times when all the data is in one partition (100 percent) are nearly identical. It is noticeable, however, that the smaller the partitions are, the bigger the gap between the filter methods and the partition methods.

Let’s add some more columns to the data and run the tests again.

Read 100,000 records with 10 columns from a Delta table. Values in milliseconds

Once again, reading with the partition options beats the filter options. It also seems that deltatable_with_partitions_to_dataframe is ever so slightly ahead.

One last test is to bump the records up to 2,000,000 and double the number of columns.

Read 2,000,000 records with 20 columns from a Delta table. Values in milliseconds

What happened here? When the number of records is relatively small, all the data is compressed into a single Parquet file. As the number of records grows, so does the number of Parquet files.

There is a significant performance boost from using the partitions pyarrow_options when there are many files.
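This is easy to verify: the Delta transaction log records which files belong to which partition, so partition-aware readers can skip the rest entirely. A small sketch, assuming a deltalake version where files() accepts partition filters:

from deltalake import DeltaTable

dt = DeltaTable("table_name")

# All Parquet files in the table vs. only those in the is_current=true
# partition; the partition-based readers only ever open the second list.
all_files = dt.files()
current_files = dt.files([("is_current", "=", "true")])
print(len(all_files), len(current_files))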

Conclusion

While the simple_lazyframe_with_polars_filter method is faster than its sibling DataFrame method for small partitions, there is an even faster way to read partitioned Delta tables: the pyarrow_options. With smaller partitions that seems to amount to a 2–4x speedup in this example, but it is highly dependent on other factors too, such as the number of records, the number of columns, and the number of files in the partitions being read.

As such, these tests are only preliminary and run on sample data, so I encourage you to do your own tests on the data you’re working with.

Even though reading the Delta table with the Delta Lake library and converting it to a Polars DataFrame seems to be the absolute fastest, for me and my projects the gain is not big enough to care about.

I much prefer the Polars syntax as far as it gets me. But if performance is your top priority, you could look into using the DeltaTable method in your project.

Going forward, I’ll make sure to favor the partitions pyarrow_options over filter.
