You can use a supercomputer to send an email, but should you?

Discover the next evolution in data processing with DuckDB and Polars

Niels Claeys
datamindedbe
6 min read · Mar 12, 2024


Spark is marking its tenth birthday this year, which is a good moment to reflect on its relevance as a data processing engine. Even after a decade, Spark remains the de facto standard in distributed processing — a remarkable achievement considering the surge in competitors over the years.

For me, there are two crucial advancements that fueled the competition:

  • The resources available on a single cloud instance have increased significantly. For instance, the maximum memory of an AWS instance has grown from 64 GB to 4 TB.
  • Many companies have become more pragmatic in designing their data pipelines, realizing that not every pipeline necessitates a distributed engine.

As a result, the pendulum swung back towards single-node processing tools, such as DuckDB and Polars, rather than focusing only on distributed processing engines. In this post, I will delve into both engines and explain how they can complement Spark.

This blog post is derived from a recent talk that I presented together with Jan Vanbuel. If you prefer to watch the recording, you can find it here.

Data processing landscape

When examining data processing engines, I find it useful to categorize them based on three key characteristics: scalability, simplicity, and cost efficiency. Every data processing engine positions itself on one edge of the triangle, as it focuses primarily on two of the three characteristics.

Positioning of data processing engines with respect to simplicity, scalability, and cost efficiency

The different edges can be described as follows:

  • MapReduce and Spark on Kubernetes reside on the edge combining cost efficiency and scalability. These technologies enable companies to scale out using commodity hardware, which is typically a lot cheaper than scaling vertically.
  • Products like Snowflake and Databricks focus on making it as easy as possible to scale out your workloads. They make distributed processing a lot more accessible for companies as they handle the cluster management for you.
  • On the edge between cost efficiency and simplicity, Pandas stood alone for quite some time, with limited evolution from 2010 to 2022.

DuckDB and Polars fill a void left by Pandas 1.0, which failed to win over many data engineers and analysts. This was primarily due to issues such as inefficient memory management, a confusing API, and slow string processing. Unlike data science workloads, data pipelines operate primarily on strings, which made the last issue a far bigger problem.

DuckDB 101

For me the best description of DuckDB is the following:

We are the SQLite for online analytical processing

DuckDB is an in-process database, which means that the database runs in the same process as your code. This makes it extremely easy to get started with, as you only need to install a single binary. DuckDB is first and foremost a database, so it has all the features you expect from one: transactional guarantees, a rich SQL dialect, support for different file formats, and more.

Query using DuckDB that shows the top 10 airlines that flew from JFK in 2015
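As a rough sketch of what such a query looks like from Python, assuming a local Parquet file and the column names of the public 2015 flights dataset (both are illustrative, not the exact setup from the talk):

```python
import duckdb

# Query a Parquet file directly; no database setup or schema definition needed.
# File and column names (flights.parquet, ORIGIN_AIRPORT, AIRLINE, YEAR) are assumptions.
top_airlines = duckdb.sql("""
    SELECT AIRLINE, count(*) AS nr_flights
    FROM 'flights.parquet'
    WHERE ORIGIN_AIRPORT = 'JFK' AND "YEAR" = 2015
    GROUP BY AIRLINE
    ORDER BY nr_flights DESC
    LIMIT 10
""")
print(top_airlines)
```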

I really like DuckDB for the following scenarios:

  • Since you don’t have to configure a database before running SQL queries, DuckDB is my go-to tool for SQL tutorials and training sessions. If you do not believe me, note that even dbt uses DuckDB in its getting-started project.
  • The combination of dbt with DuckDB is a good alternative for Spark data pipelines, which I have discussed in a previous blog post.

DuckDB uses a vectorized query engine, enabling it to perform transformations on a vector of input values instead of one value at a time. Additionally, DuckDB supports out-of-core processing, allowing it to handle datasets larger than the available memory by storing intermediate results on disk. These features make DuckDB an excellent choice for single-node processing, offering both speed and the ability to handle a variety of datasets.
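As a sketch of what the out-of-core behaviour looks like in practice, you can cap DuckDB's memory usage and point the spill location at a directory on disk; the values and paths below are illustrative, not recommendations:

```python
import duckdb

con = duckdb.connect()
# Limit working memory and tell DuckDB where to spill intermediate results.
con.sql("SET memory_limit = '4GB'")
con.sql("SET temp_directory = '/tmp/duckdb_spill'")

# An aggregation over data larger than the memory limit spills to the
# temp directory instead of failing. The glob path is a placeholder.
con.sql("""
    SELECT AIRLINE, avg(DEPARTURE_DELAY) AS avg_delay
    FROM 'flights/*.parquet'
    GROUP BY AIRLINE
""").show()
```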

A final aspect of DuckDB is its extensibility through plugins. Thanks to this feature, the community can help make DuckDB more accessible. The most important plugins are the integrations with the different cloud providers to read data from blob storage. If you want to write your own plugin, you can do so in C++ or Rust.
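For example, reading Parquet files straight from S3 goes through the httpfs extension. A minimal sketch, assuming a hypothetical bucket and that access credentials are configured separately (e.g. via DuckDB's S3 settings):

```python
import duckdb

con = duckdb.connect()
# Install and load the extension that adds HTTP/S3 file system support.
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")
con.sql("SET s3_region = 'eu-west-1'")  # region is illustrative

# Bucket and path are placeholders.
con.sql("""
    SELECT count(*)
    FROM read_parquet('s3://my-bucket/flights/*.parquet')
""").show()
```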

Polars 102

The tagline of Polars is:

A dataframe front-end for an analytics engine

Polars adopted a distinct approach from DuckDB by initially focusing on a functional dataframe API similar to that of Apache Spark. Getting started with Polars is as simple as running pip install polars.

Query using Polars that shows the top 10 airlines that flew from JFK in 2015
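A minimal sketch of what such a query looks like with the Polars dataframe API, using the same hypothetical file and column names as the DuckDB example above; it is written against a recent Polars release (older versions spell some methods differently, e.g. groupby and pl.count):

```python
import polars as pl

top_airlines = (
    pl.scan_parquet("flights.parquet")  # lazy scan, nothing is loaded yet
    .filter((pl.col("ORIGIN_AIRPORT") == "JFK") & (pl.col("YEAR") == 2015))
    .group_by("AIRLINE")
    .agg(pl.len().alias("nr_flights"))
    .sort("nr_flights", descending=True)
    .head(10)
    .collect()  # execute the optimized query plan
)
print(top_airlines)
```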

Polars also uses a vectorized engine and supports streaming execution for larger-than-memory workloads. However, its out-of-core processing is currently less mature than DuckDB's.
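A sketch of the streaming mode, assuming the same hypothetical dataset; note that the exact switch has changed between Polars versions (streaming=True in releases around the time of writing, an engine argument in newer ones):

```python
import polars as pl

result = (
    pl.scan_parquet("flights/*.parquet")        # path is a placeholder
    .group_by("AIRLINE")
    .agg(pl.col("DEPARTURE_DELAY").mean())
    .collect(streaming=True)                    # execute in streaming batches
)
print(result)
```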

You can extend the Polars API by writing your own plugin in Python or Rust. This allows you to improve the readability or performance of certain operations.
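As a small illustration on the Python side, Polars lets you register a custom expression namespace; this is a lighter-weight way to extend the API (performance-oriented plugins are typically written in Rust), and the namespace and method names below are made up for the example:

```python
import polars as pl

@pl.api.register_expr_namespace("airports")  # hypothetical namespace name
class AirportExpr:
    def __init__(self, expr: pl.Expr):
        self._expr = expr

    def clean_code(self) -> pl.Expr:
        # Normalise airport codes: trim whitespace and upper-case them.
        return self._expr.str.strip_chars().str.to_uppercase()

df = pl.DataFrame({"origin": [" jfk", "lax "]})
print(df.with_columns(pl.col("origin").airports.clean_code()))
```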

One of Data Minded's client teams has a lot of experience migrating Spark data pipelines that handle limited amounts of data to Polars. The transition has improved the performance of these pipelines, and the migration went smoothly thanks to the similarity between both APIs.

Besides providing a dataframe API, Polars is also starting to support SQL. At the moment the SQL API is still a work in progress, but it is interesting to see Polars add features to compete with DuckDB.
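A minimal sketch of that SQL interface, assuming the same hypothetical flights file; since the API was still a work in progress at the time of writing, details may differ between versions:

```python
import polars as pl

# Register a lazy frame under a table name and query it with SQL.
flights = pl.scan_parquet("flights.parquet")  # path is a placeholder
ctx = pl.SQLContext(flights=flights)

top = ctx.execute("""
    SELECT AIRLINE, count(*) AS nr_flights
    FROM flights
    WHERE ORIGIN_AIRPORT = 'JFK'
    GROUP BY AIRLINE
    ORDER BY nr_flights DESC
    LIMIT 10
""").collect()
print(top)
```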

The escape hatch Apache Arrow

Creating an ecosystem around a new processing engine takes time, as there are a lot of systems to integrate with: cloud providers, databases, data formats, and so on.

One of the reasons these tools were adopted so quickly is Apache Arrow. Arrow is a specification that defines an in-memory data format. When an engine implements this specification, data can be handed from one Arrow-compatible engine to another seamlessly, without duplicating it.

As an example, Jan Vanbuel encountered an issue with Polars' support for Excel files on Azure blob storage. As a workaround, he loaded the data using Pandas and then converted it to a Polars dataframe for subsequent operations.
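A sketch of that kind of workaround, with the file name as a placeholder (pandas needs an Excel engine such as openpyxl installed):

```python
import pandas as pd
import polars as pl

# Let pandas handle the Excel reading it already supports well.
pdf = pd.read_excel("report.xlsx")

# Convert to a Polars dataframe via its Arrow-backed representation
# and continue the pipeline in Polars.
df = pl.from_pandas(pdf)
print(df.head())
```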

Selecting the right engine

With the introduction of two new processing engines, you might wonder when to use which one. I have created a flowchart to help with the decision, based on three key questions.

Flow chart to help decide which processing engine to use for a given use case

How much data are you processing?

The volume of data to be processed is an important factor in selecting the appropriate processing engine. As a general guideline, I typically consider 100GB as the threshold. Beyond 100GB, a cluster of machines usually offers better performance, whereas below it, a single-node processing engine produces faster results. The bottleneck for processing large volumes of data is more often I/O than compute, which is why parallelizing across a cluster of machines is often more efficient.

Do you want to manage a cluster yourself?

When using a cluster of machines, the next question is whether you have the skills and want to spend time managing and maintaining your cluster. If the answer is negative, consider exploring vendors that offer managed solutions, such as Databricks or Snowflake. Otherwise, if you prefer a do-it-yourself approach, Spark on Kubernetes is a good option, as is AWS EMR.

Who is doing the data processing?

In order to decide which single-node processing engine is a good fit for your use case, you can look at which profiles are responsible for doing the data processing. Each engineering profile favors a specific tool as it aligns with their way of working. A commonly observed categorization is as follows:

  • Data engineers prefer Polars
  • Data scientists prefer Pandas
  • Data analysts prefer SQL so DuckDB is a great fit

Conclusion

In this blog post, I discussed two emerging data processing tools: DuckDB and Polars. I highlighted their shared characteristics but also their distinct approaches to single-node processing: DuckDB mainly focuses on SQL, while Polars focuses on the dataframe abstraction.

I concluded by offering guidance for choosing the appropriate data processing engine for a particular use case. To help with this decision, I used three questions:

  • How much data are you processing?
  • Do you want to manage your cluster yourself?
  • Who is doing the data processing?

I encourage you to try out both processing engines to gain a deeper understanding of their capabilities. If you use them for different use cases or have any insights to share, feel free to let me know in the comments section.
