DuckDB for Data Science

Published in

Crux Intelligence

4 min readAug 5, 2022

DuckDB is an open-source relational database management system (DBMS) tool that supports structural query language (SQL).

In the industry, there is a wide range of database management systems (DBMS). But all have different trade-offs for specific use-cases, there is no one-size-fits-all database system. DuckDB is no different. It is an in-process SQL-based OLAP database management system that provides a simple interface for in-process databases.

Why DuckDB?

Fast Analytical Operation

DuckDB is designed to support analytical query workloads (a.k.a. OLAP). The queries are complex, long-running, and take significant portions of the stored dataset. For example, aggregation over entire tables or join operation on several large tables.

To efficiently operate these operations, duckdb contains a columnar-vectorized query execution engine where queries are interpreted, but instead of performing the row-level operation, it processes a large batch of values (a vector) in one operation.

Simple Operation

DuckDB adopts the ideas of simplicity and embedded operation from the world’s most widely deployed DBMS (SQLite).

DuckDB is a serverless database with no external dependencies.
Embedded completely within a host process.
High-speed data transfer to and from the database.
It can process datasets without copying.

Free and Open-source

DuckDB’s development started while the main developers were public servants in The Netherlands. DuckDB is free to use and the entire code is available on GitHub. It comes with an MIT license which means you can use it for commercial purposes.

Feature Rich

It provides serious data management features. Provide extensive support for complex SQL queries with a large function library etc.
It provides transactional guarantees (ACID properties) through MVCC and bulk optimization.
It supports secondary indexes for efficient query execution.
It is deeply integrated into Python and R for efficient data analysis. It also provides APIs for JAVA, C, C++, and others.

Implementation

System Setup

As DuckDB provides deep integrations with Python and R language. So it is pretty easy to set up in the local system.

In Python, you can use the pip package for DuckDB installation

pip install duckdb

Similarly, you can use different languages for installation.

Data Import

In Python, DuckDB supports multiple ways to connect to datasets in the process like

Connect to the in-memory temporary dataset

Creating a temporary database instance is straightforward with DuckDB

In-memory Database

Connect to dataset files

With DuckDB, it is really easy to connect to datasets in files.

It provides support file formats like CSV files and Parquet Files. You can use advanced parameters to support different versions of CSV and Parquet files.

Output for importing dataset directly to Python environment

Connect to Pandas Dataframe

Importing data through Dataframe

Python Like Function

For all operations, DuckDB provides python equivalent functions like

Aggregation Operation

DuckDB provides a wide range of aggregation operations. In the notebook, sharing a few. You can find the exhaustive list of the aggregation operations here

Other Operation

DuckDB provides an extensive list of functions for OLAP operations on datasets. Operations like

Conclusion

DuckDB is an open-source, in-process database management system that does not require any external dependencies, neither for compilation nor for execution.

It follows the same SQL structure as PostgreSQL.
It provides support to run complex SQL queries on a temporary dataset without copying the dataset.
It is deeply integrated with Python and R language and also provides APIs for JAVA, C++, and C.
Instead of performing row-level operations, DuckDB leverages a columnar-vectorized execution engine to process a large batch at a time.

If you are interesting to see more like these article buy me a coffee