Supercharge your data processing with DuckDB

Efficient & blazing fast SQL analytics in Pandas with DuckDB

Vivek Vaddina
Learning SQL
10 min read · Jun 5, 2022


Photo by Mohamed Elsayed on Unsplash

Do you have large datasets that you simply can’t load into memory to analyse with Pandas? Or do you feel more comfortable expressing operations in SQL instead of Python?

Fret not, for you have DuckDB now! ✨🦆✨

Introduction

Most of us data scientists are familiar with Pandas, the de facto data processing library in Python. It’s simple, powerful, able to handle a myriad of data formats & sizes and, for the most part, very intuitive. Those familiar with SQL know how powerful it is too, handling complex data transformations with a concise syntax. But sometimes you have to trade off execution speed in favor of intuitiveness/expressiveness, or vice versa. This becomes especially painful when you have large (out-of-memory) datasets or need to perform complex wrangling on your data. At times, it makes more sense to handle that in SQL rather than in Pandas.

DuckDB aims to bring these together with a fast & efficient SQL query execution engine that can run complex queries on large sets of data. It integrates tightly with Pandas DataFrames and allows us to run these queries directly on top of them without needing to copy data in and out of a database.

In this tutorial, we shall see how we can leverage DuckDB in various analytical workloads.

Setup

DuckDB is very lightweight, has no external dependencies, and runs within the host process itself. Simply install it with:
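```
pip install duckdb
```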

To initialize it, run:
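```python
import duckdb

# connect() with no arguments gives an in-memory database;
# pass a file path instead if you want the data persisted on disk
con = duckdb.connect()
```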

That’s it! Now you can test it by running:
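```python
# a trivial query, returned as a pandas DataFrame
con.execute("SELECT 42 AS answer").df()
```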

The next step is to run pip install pyarrow to add support for reading/writing parquet data.

Jupyter Notebook setup

If you wish to explore it in Jupyter Notebooks, install a few additional libraries for a better experience:
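For example, the ipython-sql magics together with the duckdb-engine SQLAlchemy driver work nicely:

```
pip install ipython-sql duckdb-engine
```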

And import them once installed:
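With the ipython-sql route above, the notebook setup looks roughly like this:

```python
import duckdb
import pandas as pd

# load the SQL magics and point them at an in-memory DuckDB database
%load_ext sql
%sql duckdb:///:memory:

# have query results come back as pandas DataFrames
%config SqlMagic.autopandas = True
```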

duckdb setup

From now on, you can run SQL directly by prefixing %sql (or %%sql for multiline statements) to your code cell and get your output returned as a pandas DataFrame 👯

DuckDB vs traditional Databases

With the pandas.read_sql command, one can already run SQL queries on an existing DB connection and load the data as pandas DataFrames. While this is fine for lightweight operations, it is not optimized for heavy data processing. Traditional RDBMSs such as Postgres, MySQL, etc. process each row sequentially, which, apart from taking a long time, also puts a lot of overhead on the CPU. DuckDB, on the other hand, is built with OLAP in mind and uses a column-vectorized execution engine. This helps massively parallelize disk I/O and query execution.
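For reference, the read_sql route looks something like this (the connection string and table name below are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string -- substitute your own database here
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# every matching row is transferred over the connection and materialized in the DataFrame
flights = pd.read_sql("SELECT * FROM flights WHERE taxi_out > 10", engine)
```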

DuckDB uses the Postgres SQL parser under the hood, and offers many of the same SQL features as Postgres.[1]

Accessing DataFrames with DuckDB

One of the major limitations of pandas is that it’s single threaded. Hence we can’t take advantage of the multi-core CPUs which every computer ships with these days. DuckDB supports multi-threading natively and hence can take advantage of all the cores of your CPU.

There exist a few external libraries that enable parallel processing, but only for specific operations.

While this is turned on by default, we can also explicitly set the number of threads:
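For example:

```python
# cap query execution at 4 threads -- adjust to your machine
con.execute("PRAGMA threads=4")
```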

Warning: When using the free tier of Google Colab, set this to no more than 2 to avoid crashing the kernel.

We can also write a short snippet to run an SQL query, and return the result as a Pandas DataFrame in a pythonic way:
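A minimal sketch (the helper name is arbitrary):

```python
def sql_to_df(query: str) -> pd.DataFrame:
    """Run a SQL query on our DuckDB connection and return a pandas DataFrame."""
    return con.execute(query).df()
```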

Now create a DataFrame & run a simple query:
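Something along these lines (the toy data is purely illustrative):

```python
df = pd.DataFrame({
    "name": ["mallard", "goose", "swan"],
    "weight": [3.2, 5.1, 9.8],
})

# DuckDB picks up `df` straight from the surrounding Python scope
con.execute("SELECT name, weight FROM df WHERE weight > 4").df()
```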

Alternatively, you can run the query with the line magic and store the result as a Python variable that you can access elsewhere:
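```python
# with autopandas turned on, the result comes back as a pandas DataFrame
heavy = %sql SELECT name, weight FROM df WHERE weight > 4
```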

Notice that we didn’t declare df as a view, nor did we create any table. If the table name doesn’t exist, DuckDB simply looks for a Python variable of that name that was declared earlier.

Tip: Sometimes you overwrite a DataFrame by performing an operation and assigning it to the same variable, e.g. df = df.query(...). If you see that DuckDB’s df is still not updated, you can always ‘register’ the DataFrame explicitly with dbcon.register('df', df).

Accessing files with DuckDB

Oftentimes we have data stored either in a database or in parquet/CSV files, and it can be huge, barely fitting in memory or not at all. Loading it with pandas incurs significant memory overhead and possibly creates multiple copies when performing certain operations. With DuckDB, such files can be queried directly without any copying, thanks to its tight integration with Apache Arrow.

To demonstrate this, we’ll download an airline delay & cancellation dataset from Kaggle. The compressed archive is about 2 GB and, when extracted, creates a set of CSVs totalling ~8 GB of data.

Let’s have a quick look at the contents by loading a single csv file:
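Assuming the extracted files are named by year (2009.csv, 2010.csv, and so on):

```python
%sql SELECT * FROM read_csv_auto('2009.csv') LIMIT 5
```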

quick preview of csv file

We can load a preview efficiently with pandas too by specifying the nrows parameter:
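```python
# read only the first handful of rows instead of the whole file
pd.read_csv("2009.csv", nrows=5)
```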

This takes much less time than with DuckDB, possibly because such operations are well optimized in pandas; for computing simple metrics across rows (such as mean, min, max, etc.), pandas can also rely on highly optimized numpy routines under the hood. But, as we’ll see later, that only goes so far.

Loading all these CSVs into one DataFrame with pandas is very inefficient and takes a lot of time. It’ll most likely fail if your computer doesn’t have a lot of RAM. If you still have to do it, you’d iterate through all the CSVs and concatenate them like so:
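```python
from glob import glob

# read every CSV and stack them into one big DataFrame -- slow and memory hungry
df = pd.concat((pd.read_csv(f) for f in glob("*.csv")), ignore_index=True)
```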

A better alternative is to merge all those CSVs into a single parquet file. Unlike CSVs, parquet files are structured and as such are unambiguous to read [2]. They are compressed columnar files and hence take much less space than the individual CSVs combined; they also play nicely with the efficient Arrow data format.
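With DuckDB, that merge is a single COPY statement (the file paths and output name here are assumptions):

```python
%sql COPY (SELECT * FROM read_csv_auto('*.csv')) TO 'airline.parquet' (FORMAT 'parquet')
```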

Notice that we copied all the CSVs, matched with a glob pattern, straight into a single parquet file without ever loading the data into pandas.

We went from 7.5 GB of combined CSV files to a 2.5 GB parquet file, which now easily fits into memory. So we can load it with pandas as:
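```python
# pyarrow does the parquet reading under the hood
df = pd.read_parquet("airline.parquet")
```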

Because of the efficient parquet format, we could load the entire 61 million rows of data in mere seconds.

Projection & Filter Pushdowns

Now let’s do a simple filter operation on our dataset. Let’s count the total number of rows that satisfy the condition TAXI_OUT > 10. We’ll try with both pandas & DuckDB.
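First with pandas, then the same count with DuckDB:

```python
# pandas: boolean mask over the full DataFrame
len(df[df["TAXI_OUT"] > 10])

# DuckDB: the same count straight off the DataFrame
%sql SELECT count(*) FROM df WHERE TAXI_OUT > 10
```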

The same operation is near instantaneous with DuckDB. There’s just no comparison.

This is because DuckDB automatically optimizes the query by selecting only the required column(s) (aka projection pushdown) and then applies the filtering to get a subset of data (aka filter pushdown). Pandas instead reads through all the columns. We can optimize this in pandas by doing these pushdowns ourselves.
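One way to mimic those pushdowns by hand:

```python
# projection pushdown: grab just the one column we need, then filter
taxi_out = df["TAXI_OUT"]
len(taxi_out[taxi_out > 10])
```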

We managed to bring this down from several seconds to about a second. But DuckDB is still ~70–90% faster than that.

Using Groupby

Now let’s calculate a few aggregates using groupby with projection & filter pushdowns combined.
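In pandas, that looks roughly like this (the carrier code 'UA' and the chosen metrics are just an example):

```python
carrier = df[df["OP_CARRIER"] == "UA"]
subset = carrier[["ORIGIN", "DEST", "DEP_DELAY", "ARR_DELAY"]]
(
    subset.groupby(["ORIGIN", "DEST"])
    .agg(
        mean_dep_delay=("DEP_DELAY", "mean"),
        max_arr_delay=("ARR_DELAY", "max"),
        num_flights=("ARR_DELAY", "count"),
    )
    .reset_index()
    .sort_values("ORIGIN")
)
```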

Here, we compute a few simple metrics for a given airline carrier, grouped by the origin & destination airports, and finally sort the results by the origin airport.

We can make it a bit more concise by using .query for the filter pushdown:
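```python
(
    df.query("OP_CARRIER == 'UA'")[["ORIGIN", "DEST", "DEP_DELAY", "ARR_DELAY"]]
    .groupby(["ORIGIN", "DEST"])
    .agg(
        mean_dep_delay=("DEP_DELAY", "mean"),
        max_arr_delay=("ARR_DELAY", "max"),
        num_flights=("ARR_DELAY", "count"),
    )
    .reset_index()
    .sort_values("ORIGIN")
)
```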

This approach takes only about half the time (~3 s) compared to our earlier one, because .query evaluates its filter expression with a more efficient engine (numexpr) rather than plain boolean indexing. We can now compare that to its SQL counterpart.
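The equivalent query, with the same example carrier and metrics:

```python
%%sql
SELECT
    ORIGIN,
    DEST,
    avg(DEP_DELAY) AS mean_dep_delay,
    max(ARR_DELAY) AS max_arr_delay,
    count(*)       AS num_flights
FROM df
WHERE OP_CARRIER = 'UA'
GROUP BY ORIGIN, DEST
ORDER BY ORIGIN
```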

The ~400 ms execution with DuckDB above is around an order of magnitude faster, and a lot cleaner too, I’d say. 😉

Notice that the data is already loaded under df and hence we don’t need to read from the source parquet file.

In the same way, we can also improve the performance of our queries drastically when using joins across multiple tables. I leave this as an exercise to the reader.

But why load the data into memory in the first place when we can process it efficiently straight from disk? Oftentimes, the data is too big to load into memory anyway.

Accessing data with Views

To do that, we just need to create a VIEW over our data, which lets us query it directly without loading it into memory, and point our queries at the newly created view instead of the DataFrame df.
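For instance (the view name is arbitrary):

```python
# the view is only a pointer to the parquet file -- nothing gets loaded into memory
%sql CREATE VIEW flights AS SELECT * FROM read_parquet('airline.parquet')

# queries now reference the view instead of the in-memory df
%sql SELECT count(*) FROM flights WHERE TAXI_OUT > 10
```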

This is such a powerful feature and allows us to explore lot more data than traditionally possible with pandas.

Using approximations

At times, it suffices just to get an estimate of certain data rather than a precise answer. Using approximations helps us do just that.
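For example, estimating distinct counts with approx_count_distinct:

```python
%sql SELECT approx_count_distinct(ORIGIN) AS origins, approx_count_distinct(DEST) AS dests FROM flights
```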

using approximations is faster

Let’s run the actual counts…
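```python
%sql SELECT count(DISTINCT ORIGIN) AS origins, count(DISTINCT DEST) AS dests FROM flights
```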

Our approximation query earlier ran about 3–4 times faster than the precise one in this case. This is crucial when responsiveness is more important than precision (esp. for larger datasets).

Using Window functions

Finally, let’s conclude our analysis by showing off a bit more of what DuckDB can do using some advanced SQL operations.
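Here’s a sketch of such a query (the columns, thresholds, and sample size are illustrative):

```python
%%sql
WITH filtered AS (           -- filter & projection pushdown
    SELECT FL_DATE, OP_CARRIER, ORIGIN, DEST, DEP_DELAY
    FROM flights
    WHERE DEP_DELAY > 0
),
features AS (                -- derived features
    SELECT
        *,
        -- if-else support via CASE
        CASE WHEN DEP_DELAY > 30 THEN 'long' ELSE 'short' END AS delay_type,
        -- last destination this carrier flew to, ordered by flight date; 'NA' if none
        coalesce(
            lag(DEST) OVER (PARTITION BY OP_CARRIER ORDER BY FL_DATE),
            'NA'
        ) AS last_dest
    FROM filtered
)
SELECT * FROM features USING SAMPLE 10 ROWS
```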

We create two CTEs (Common Table Expressions) to calculate a couple of features. We do the filter & projection pushdowns in one CTE & compute our desired features in the other. The first feature is a simple demo to showcase if-else support. The second is a bit more advanced: we find the last destination a given air carrier has flown to, ordered by flight date, and replace it with NA when it doesn’t exist. We then take a sample from the final result set.

Nice, isn’t it?! This same operation is unimaginably complex (for me, at least) in pandas. 🤯

With DuckDB, we can combine many such complex operations and execute them in one go without worrying much about manual optimization.

Conclusion

We have seen how performant DuckDB is and how it brings the whole SQL ecosystem into Pandas. Its simple installation and light footprint mean that we can integrate it into our existing workflows with minimal effort and get maximum gains in execution speed. We can also keep using pandas on larger datasets (to a reasonable extent) without loading them into memory or jumping to a full-blown distributed computing setup.

Thanks so much for reading this far ❤️. Now you know most of the important stuff about DuckDB to get yourself started. Bear in mind that we hardly scratched the surface. DuckDB offers so much more. You can use correlated subqueries, nested types, etc. apart from its many user friendly features such as column aliasing in group by/having, auto-incrementing duplicate columns, better string slicing and so on. Their documentation is very clean & beginner friendly and the blog has very informative posts as well. I encourage you to check those out.

Sidenote

  • Modin tries to parallelize pandas workflows by distributing the workload across multiple CPU cores. Have a look; I’m guessing we can also combine Modin & DuckDB for even faster runtimes.
  • Fugue is a new framework that provides a unified interface so that users can execute their Python, Pandas, and SQL code on various distributed computing platforms such as Spark and Dask without rewrites. Please check here for a nice introduction from its maintainer Kevin, and also check out this article from Khuyen to get a feel for using it with DuckDB.

The code for this notebook can be found here and you can launch it with binder/colab/deepnote.

P.S.: The exact execution times might vary a bit depending on the load & build of your computer. I also noticed that some SQL operations are cached: the first computation sometimes takes a while, but running the same query again (or after changing the values in the WHERE clause) can take only a couple of milliseconds. YMMV.

I hope you have enjoyed this post and learnt something from it. Please let me know your thoughts/suggestions (or any mistakes) in the comments below.

Happy Ducking! 💚🦆💚

