DuckDB Turned 1(.0.0)

Celebrating DuckDB’s Milestone: Version 1.0.0

Andy Sawyer
5 min read · Jun 9, 2024

DuckDB recently hit a significant milestone with its version 1.0.0 release earlier this month. This achievement is a testament to the growing popularity and maturity of DuckDB as a versatile tool in the data engineering landscape. As much as I love DuckDB, looking back over my articles I realise I’ve not given it much attention, having chosen instead to leverage Polars within Python. In this article, we’ll delve into what DuckDB is, explore its use cases, and compare it with Polars. We’ll also highlight some basic functionality to get you started.

What is DuckDB?

DuckDB is an in-process SQL database management system that is designed for analytical workloads. Unlike traditional databases that run as a separate server process, DuckDB is embedded directly into your application. This design choice makes it incredibly lightweight and easy to use, providing high-performance query capabilities with minimal overhead.

Key Features of DuckDB:

  • In-Process Architecture: Runs within your application process, eliminating the need for a separate database server.
  • Optimised for Analytics: Specifically designed to handle complex analytical queries efficiently.
  • Integration Capabilities: Seamlessly integrates with various programming environments and data processing tools.
  • Minimal Configuration: Requires little to no setup, making it accessible for quick and agile development.

Use Cases for DuckDB in Data Engineering

DuckDB’s unique architecture and capabilities make it suitable for several data engineering tasks. Here are a few use cases where DuckDB excels:

  1. Data Exploration and Analysis: Its SQL capabilities and efficient querying make it ideal for ad-hoc data exploration and analysis.
  2. ETL Processes: DuckDB can be used to transform and manipulate data in ETL pipelines, thanks to its powerful SQL engine.
  3. Data Integration: DuckDB’s ability to read and write various file formats, including CSV, Parquet, and Arrow, makes it a versatile tool for integrating different data sources.
  4. Embedded Analytics: Because it runs in-process, DuckDB can be embedded within applications to provide real-time analytics without the need for an external database server.

Comparing DuckDB and Polars

Polars is another powerful tool in the data engineering space, known for its high-performance DataFrame library in Rust, designed to handle large-scale data processing. It’s the tool I’ve used in many of my articles to date. While both DuckDB and Polars are excellent tools, they serve slightly different purposes and have different strengths.

DuckDB:

  • SQL-Based: Provides a full SQL interface, making it easy for those familiar with SQL to perform complex queries.
  • In-Process: Runs within the application process, offering low-latency data access.
  • Integration: Strong integration with various data formats and programming environments.

Polars:

  • DataFrame Operations: Similar to Pandas, it provides a DataFrame interface, which can be more intuitive for data manipulation tasks.
  • Performance: Written in Rust, Polars is optimised for high performance, especially with large datasets.
  • Lazy Evaluation: Supports lazy evaluation, which can optimise query execution by only computing the necessary operations.

Both tools are complementary rather than competitive. DuckDB is more suitable for SQL-centric workflows and tasks requiring tight integration with various data sources, while Polars excels in pure data manipulation and transformation tasks. Indeed, DuckDB has built-in support for reading from and writing to Polars DataFrames.

Getting Started with DuckDB

Now, let’s walk through some basic functionalities of DuckDB to get you started.

1. Installation:

DuckDB can be installed using various methods, including package managers, binaries, or directly from source. For Python users, it’s as simple as:

pip install duckdb

For other options, see their website.

2. Creating a Database and Table

You can create a database and table directly within your Python script or application:

import duckdb

# Create a new DuckDB database in memory
con = duckdb.connect(database=':memory:')

# Create a table
con.execute('''
    CREATE TABLE users (
        id INTEGER,
        name VARCHAR,
        age INTEGER
    )
''')

3. Inserting Data

Inserting data into DuckDB is straightforward using standard SQL commands:

con.execute('''
    INSERT INTO users VALUES
        (1, 'Alice', 30),
        (2, 'Bob', 25),
        (3, 'Charlie', 35)
''')

4. Querying Data

You can query the data using SQL:

# Query the data
result = con.execute('SELECT * FROM users').fetchall()
print(result)

5. Reading from CSV

DuckDB can easily read data from CSV files:

# Reading from a CSV file
con.execute("CREATE TABLE from_csv AS SELECT * FROM read_csv_auto('path/to/your/file.csv')")

6. Integration with Polars

DuckDB integrates well with Polars, allowing you to directly convert results to a DataFrame:

import polars as pl

# Execute a query and convert the result to a Polars DataFrame
df = con.execute('SELECT * FROM users').pl()
print(df)

7. CLI

If you don’t want to use Python, DuckDB also ships with a native CLI application that you can run directly from the command line. I can imagine this being huge for insights analysts who want to pull information from a CSV file that’s too big to open in Excel.

Advanced Features and Integrations

DuckDB also supports various advanced features and integrations that can enhance your data engineering workflows:

  • Parquet and Arrow Integration: Directly read and write Parquet and Arrow files, making it easy to work with large-scale data stored in these formats.
  • Extension Support: Extend DuckDB’s capabilities with various extensions, such as those for full-text search, geospatial data, and more.
  • Seamless Integration: Integrates with other data processing tools like Apache Spark, making it a versatile choice for various data engineering tasks.

Conclusion

DuckDB’s version 1.0.0 marks a significant milestone in its development, solidifying its position as a powerful tool for data engineers. Its unique in-process architecture, strong SQL capabilities, and seamless integration with various data sources make it an excellent choice for a wide range of data engineering tasks. Whether you’re exploring data, building ETL pipelines, or embedding analytics into your applications, DuckDB offers a robust and efficient solution.

With this milestone, DuckDB continues to demonstrate its potential and versatility, making it a valuable addition to any data engineer’s toolkit. As you explore its capabilities, you’ll find that DuckDB is not just a database, but a powerful ally in your data engineering endeavours.


Andy Sawyer

Bringing software engineering best practices and a product driven mindset to the world of data. Find me at https://www.linkedin.com/in/andrewdsawyer/