Dismantling BigData with DuckDB

Yoav Nordmann
Published in Israeli Tech Radar
May 23, 2023

Say the CEO of your company selects you to show off your data engineering skills. You have to build an ETL pipeline: some processing, some transformation, nothing too fancy, just a simple pipeline that will create a new revenue stream for the company. A LOT of money, they say. No pressure!

You want to impress, so you decide to write it in Apache Spark. You create a Spark cluster, maybe even pay for Databricks, or use K8s; you use Zeppelin or Jupyter to analyze the data, and you’re on a roll!

YOU ARE ON TOP OF THE WORLD!

Until… something in the data does not add up, the infrastructure has some funky timeouts, waits, and crashes, and your Spark job carries a big overhead, takes time to start running, and sometimes even fails with an OutOfMemory error.

You try to debug, but debugging takes a long time, running Spark on your laptop is just not the same as in the cloud, and slowly but surely you feel the whole schtick slipping out of your hands…

What if you could do all of this from the command line of your laptop?

What is BIG DATA?

Big data was originally associated with three key concepts. The three V’s:

Volume, Variety, Velocity

From reading and experience, I can safely say that the main reason we use many of the heavy-duty tools today is “velocity”, not the other two V’s: volume and variety.

Yet another interesting definition of Big Data can be found in the first sentence of Wikipedia’s entry:

Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data-processing application software.

So let’s talk about those data sets too large or complex to be dealt with by traditional data-processing software.

The Promise

When working with data, there is always the underlying premise that there will be a lot of it. A bucketload of it. It might start small, but soon enough it will explode exponentially. And for that, we need to be ready! We need to prepare!

The Reality

Much, if not most, of the time, we are not working with a lot of data. We might store a lot of data, but our daily work, and even our ETLs, operate on a small subset of that data.

One of my standard questions when interviewing a client who needs a new data architecture is: “How much data are we talking about?” To my surprise, most answered with: “Not too much!” They then went on to promise more data once they acquire more customers, but even that remains to be seen.

It seems as if they are trying to apologize for not having enough data :-)

The modern data stack

Data Sources, processing tools, data warehouses or data lakes, synchronizers, orchestrators, analytics, and more and more tools and paradigms make up the modern data stack. Endless tools with endless capabilities for endless options to deal with endless data…

But are all these necessary?

“Biggish” Data

I would like to introduce you to a new concept, Biggish Data: data sets that were too large or complex to be dealt with by traditional data-processing software.

Introducing DuckDB

DuckDB is an in-process SQL OLAP database management system. You might want to think of it as a non-traditional database for dealing with large or complex data sets.

Using DuckDB is as easy as downloading it from the internet, running it from your CLI, and Bob’s your uncle.

For those who cannot detach themselves from Python, here you go. After a quick “pip install duckdb”:

import duckdb
# to start an in-memory database
con = duckdb.connect(database=':memory:')

Of course, you can use it from many more supported languages, such as R, Java, and Node.js, just to name a few.

Why should I care?

Well, I think you should, especially if you are a Data Scientist or a Data Engineer.

As a Data Scientist, you can use DuckDB for:

  • Data Analysis on your Laptop
  • Data Format Translator (see the quick sketch right after these lists)

As a Data Engineer, you can use it for:

  • Declutter Data Stack
  • Biggish Data ETL
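
To give a quick taste of the “Data Format Translator” use case from the list above, a single COPY statement is enough to turn a CSV file into a Parquet file. The file names here are made up purely for illustration:

-- translate a CSV file into a Parquet file in one statement
-- (events.csv and events.parquet are placeholder names)
COPY (SELECT * FROM read_csv_auto('events.csv'))
TO 'events.parquet' (FORMAT PARQUET);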

Data Analysis on your Laptop

Data analysis on Big Data is done with tools such as Apache Zeppelin and Jupyter notebooks, or on the platforms themselves, such as Databricks notebooks or Snowflake worksheets.

But what about Biggish Data? Using DuckDB, you can access data from sources such as Local Disk, S3, and even HTTP/S in formats such as CSV, JSON, and Parquet. It is as easy as writing an SQL query, nothing more.

CREATE TABLE mytable AS
SELECT column1, sum(column2) AS column2, sum(column3) AS column3
FROM parquet_scan('s3://my-source-bucket/input/*.parquet')
GROUP BY column1;
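
One caveat worth mentioning: reading straight from S3 relies on DuckDB’s httpfs extension and on S3 credentials being configured. A minimal setup, with placeholder values, could look something like this:

-- one-time setup for S3 access (region and keys below are placeholders)
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'YOUR_ACCESS_KEY_ID';
SET s3_secret_access_key = 'YOUR_SECRET_ACCESS_KEY';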

At this point, you have a table in memory with which you can do as you please. For instance, you can use the ALTER TABLE statement to change the schema of the existing table: rename columns, add or drop columns, and anything else ALTER TABLE supports.
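
For example, a few illustrative tweaks on the table created above (the new names and the added column are arbitrary):

-- rename, add, and drop columns on the in-memory table
ALTER TABLE mytable RENAME COLUMN column2 TO total2;
ALTER TABLE mytable ADD COLUMN loaded_at TIMESTAMP;
ALTER TABLE mytable DROP COLUMN column3;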

Using DuckDB, analyzing data is as easy as 1, 2, 3.

Biggish Data ETL

As explained in the paragraphs above, when you have “only” Biggish Data, it does not make sense to maintain all these heavy-duty tools for your ETL. They do not come without faults:

  • Databricks, Snowflake, and such cost quite a buck.
  • Apache Spark and tools like it are complicated to set up, configure, and maintain.
  • A large data stack is overkill for “just” Biggish Data.

DuckDB ETL

When you use DuckDB for your ETL, it is embedded in your process or application; there is no other tool involved. The data DuckDB works on does not need to fit into memory, as long as it fits on the hard disk. Thanks to its resource-friendliness, when dealing with a lot of data DuckDB will degrade gracefully in performance, but it will not simply fail.
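
If you want to be explicit about it, you can cap DuckDB’s memory and point it at a scratch directory to spill to. The values below are arbitrary examples, not recommendations:

-- cap memory usage and allow spilling to disk (example values only)
SET memory_limit = '4GB';
SET temp_directory = '/tmp/duckdb_spill';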

For Parquet files, DuckDB can use a combination of the Parquet metadata and HTTP range requests to download only the parts of the file that are actually required by the query. This lowers the read and transfer volume drastically, on top of Parquet’s compression.
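
In practice, that means a selective query against a remote Parquet file only pulls down the columns and row groups it needs. Something like the following, with a made-up URL and filter, would not download the whole file:

-- with httpfs loaded, only the required columns and row groups are fetched
SELECT column1, sum(column2) AS column2
FROM parquet_scan('https://example.com/data/input.parquet')
WHERE column1 = 'some_value'
GROUP BY column1;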

So don’t copy the data to your hard drive; rather, use DuckDB’s capabilities and read the data from the remote location.

Example

COPY (
    SELECT column1, sum(column2) AS column2, sum(column3) AS column3
    FROM parquet_scan('s3://my-source-bucket/input/*.parquet')
    GROUP BY column1
) TO 's3://my-target-bucket/output' (
    FORMAT PARQUET,
    PARTITION_BY (column1, column2, column3),
    ALLOW_OVERWRITE TRUE
);

For those who are already enthusiastically invested in dbt, you can even try dbt-duckdb, but details on that will come in a later post!

Conclusion

I think the main point I am trying to make is this: just because you have Biggish Data does not mean you have to keep using heavy-duty tools. Open your mind and try to reimagine Big Data!


Yoav Nordmann
Israeli Tech Radar

I am a Backend Tech Lead and Architect for Distributed Systems and Data. I am passionate about new technologies, knowledge sharing and open source.