Using DuckDB JupySQL and Pandas in a notebook

Deepa Vasanthkumar
3 min readApr 2, 2024

DuckDB is an open-source, lightweight, and embeddable analytical database management system (DBMS) designed for efficient querying and analysis of large datasets. It is built with a focus on providing high performance and low latency for analytical workloads, making it suitable for applications ranging from data exploration and business intelligence to embedded analytics in software applications.

DuckDB is commonly used in scenarios where fast and efficient querying of large datasets is required, such as data analysis, business intelligence, reporting, and embedded analytics in software applications. Its lightweight and embeddable nature make it particularly well-suited for use cases where resources are limited or where a standalone database server is not feasible.

This story is about using DuckDB, JupySQL and Pandas interoperably in a notebook

Using DuckDB with Pandas can be advantageous when you want to leverage the efficient analytical capabilities of DuckDB for processing large datasets while still benefiting from the data manipulation and analysis features provided by Pandas. Here’s how you can use DuckDB with Pandas in a notebook:

Installation

!pip install --quiet duckdb
!pip install --quiet jupysql
!pip install --quiet duckdb-engine
!pip install --quiet pandas
!pip install --quiet matplotlib

Import Required Libraries

Load Data

Here, input_df is your Pandas DataFrame, and output_df is the name of the table you want to create in DuckDB.

Query Data with DuckDB

Execute SQL queries on the data stored in DuckDB using the `duckdb.query` function. For example:

input_df is your Pandas DataFrame

Load file into pandas and plot using sqlplot

%sqlplot pie - table input_df - column j

Load parquet file in pandas and plot using sqlplot

pandasdf = pd.read_parquet("userdata1.parquet")
pandasdf

Using sqlplot we can directly query the pandas dataframe

%sqlplot bar  --table pandasdf --column gender

By using DuckDB with Pandas, you can take advantage of DuckDB’s efficient querying capabilities for large datasets while still being able to manipulate and analyze the data using the familiar Pandas API. This approach can be particularly useful when dealing with datasets that are too large to fit into memory or when you need to perform complex analytical operations that are more efficiently handled by DuckDB’s SQL engine.

Complete Notebook is pinned here and in first comment

https://github.com/deepavasanthkumar/deepavasanthkumar/blob/main/DuckDB_JupySQL_and_Pandas_in_Colab_Notebook_.ipynb

--

--