How can DuckDB be up to 1000x more efficient than BigQuery?

Christophe Oudar
9 min read · Feb 18, 2024


BigQuery & DuckDB racing (Bing Copilot)

Introduction

While the title may sound like clickbait, it can happen by capitalizing on DuckDB’s strengths while exploiting BigQuery’s weaknesses. In this article, we’ll dig into a practical example to understand how this can be achieved.

If you’ve stumbled upon this post, chances are you’re familiar with both platforms:

  • Google BigQuery is one of the state-of-the-art cloud data warehouses on the market. It can process very complex queries at very large scale (petabytes) and has been around since 2010. It can spread a workload across thousands of “slots” (VMs, to leverage horizontal scaling).
  • DuckDB is a fast in-process analytical database. It can process very complex queries at a fairly large scale (terabytes, with the appropriate hardware). Its first paper was published in 2019. It runs in a single process and therefore doesn’t support horizontal scaling, but it can leverage vertical scaling.

In the subsequent sections, we’ll focus on DuckDB SQL snippets, but all necessary SQL queries for benchmarking purposes can be found in this GitHub repository: https://github.com/Kayrnt/duckdb_nested_array_perf. This repository includes a dbt project specifically crafted to execute the transformations outlined in this article, although we won’t directly measure dbt commands’ runtime to avoid factoring in dbt overhead.

Benchmark

Method

To compare both systems, we’ll use 2 metrics:

  • Wall time: This represents the user-perceived time taken by the query.
    - On BigQuery, we’ll use the Elapsed time.
    - On DuckDB, we’ll use the hyperfine CLI tool and look at Total time.
  • CPU time: This is the sum of CPU time used to compute the query.
    - On BigQuery, we’ll use the Slot time.
    - On DuckDB, we’ll use the hyperfine CLI tool and look at usr+sys.

Wall time is indeed a valuable measure for assessing query performance. However, since we’ll be evaluating BigQuery performance, CPU time is also an important metric. It helps estimate the query’s cost if utilizing capacity compute pricing. While it can’t be taken at face value without comparing the same hardware, it provides insight into the scale of effort required to handle the workload.
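For example, under capacity (Editions) pricing billed per slot-hour, a query that consumes 4 hours of slot time would cost on the order of $0.16 at an illustrative $0.04 per slot-hour; the exact figure depends on your edition, region, and commitments.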

Hardware

For BigQuery, we’ll allow the platform to utilize the necessary resources. At this scale, I don’t anticipate reaching any quotas. On the DuckDB side, we’ll employ a MacBook Pro M1 Max with 32 GB of RAM.

Software

Considering BigQuery is a managed service, we’ll assume we’re using the “current” version as of February 2024, while DuckDB will use the latest 0.10.0 version. For BigQuery, we’ll execute a single run (as it’s not free). With DuckDB, we’ll use the command hyperfine --warmup 1 -r 3 ./query.sh, which performs one warmup run followed by three measured runs; we’ll take the median of those three to mitigate the “cold start” effect and outliers that wouldn’t accurately reflect performance.

Baseline query

To begin, let’s examine the simplest query:

SELECT 1

Here’s what we get:

SELECT 1 results

As you can observe, the platforms perform quite similarly. There’s a slight increase in wall time for BigQuery, likely due to accessing “on-demand” resources, whereas DuckDB operates on a dedicated machine. However, we can estimate this increase to be around 100 ms, which may become negligible for larger-scale queries.

A real life use case

Let’s envision a scenario where we’re working on an analytics use case. We have a fact table structured based on a star data modeling approach. This fact table includes an array of structures containing labels and scores. Our objective is to join these labels with a dimension table containing additional properties, such as a category (e.g., “bar” column, as depicted in the diagram). While ideally, we’d use proper identifiers instead of labels, in practice, the values may be custom, similar to how Google Analytics allows users to define their own event parameters.

Since the fact table holds most of the data, the rows are relatively “fat.” The transformation itself isn’t overly complex and can be outlined as follows:
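Since the diagram isn’t reproduced here, here is a rough sketch of the two tables’ shapes and of the goal (illustrative DuckDB DDL, not the exact definitions from the repository):

-- Illustrative shapes only (not the repo's DDL)
CREATE TABLE fact_table (
  hour  TIMESTAMP,
  id    BIGINT,
  name  VARCHAR,
  -- ...a dozen other "fat" payload columns (col1 .. col14)...
  col15 STRUCT(name VARCHAR, score BIGINT)[]  -- array of {label, score}
);

CREATE TABLE dimension_table (
  name VARCHAR,  -- the label we join on
  id   BIGINT,
  bar  BIGINT    -- the extra property (e.g. a category)
);

-- Goal: for each fact row, turn col15 into an array of the matching
-- dimension rows' {id, bar} structs.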

Preparing the data

To prepare the data, we’ll employ a deterministic query using GENERATE_ARRAY (for BigQuery) and GENERATE_SERIES (for DuckDB). For demonstration purposes, we’ll examine data at three different scales: 10K, 100K, and 1M rows for our fact table, while maintaining the dimension table at 1K rows. To add a touch of realism, only 500 out of the 1K dimension values will actually be present, reflecting the common scenario of a “sparse” join.

While DuckDB could potentially allow us to store the data in memory rather than on disk, such a comparison would be akin to comparing apples to oranges. Therefore, we’ll adhere to a disk-based benchmark since, in practical scenarios, data is typically persisted (especially when processing batches).

Let’s start with the fact table:

create table "dev"."main"."fact_table" as (
  WITH
  all_rows AS (
    SELECT
      GENERATE_SERIES(1, 100000) AS id
  ),
  array_id_int AS (
    SELECT
      id,
      CAST(CAST(id AS INTEGER) % CAST(500 AS INTEGER) AS INTEGER) AS array_size,
      GENERATE_SERIES(0, array_size, 1) AS array_sample
    FROM all_rows
  )
  SELECT
    CURRENT_TIMESTAMP AS hour,
    id,
    CONCAT('name ', id) AS name,
    id * 123 AS col1,
    id * 456 AS col2,
    CAST(id % 2 AS BOOLEAN) AS col3,
    id AS col4,
    CONCAT('col5 ', id) AS col5,
    id AS col6,
    CONCAT('col7 ', id) AS col7,
    CONCAT('col8 ', id) AS col8,
    CONCAT('col9 ', id) AS col9,
    CONCAT('col10 ', id) AS col10,
    CONCAT('col11 ', id) AS col11,
    CONCAT('col12 ', id) AS col12,
    array_sample AS col13,
    ARRAY(
      SELECT CONCAT('col14 ', x.array_sample)
      FROM UNNEST(a.array_sample) x
    ) AS col14,
    ARRAY(
      SELECT {'name': CONCAT('col15 ', x.array_sample), 'score': x.array_sample}
      FROM UNNEST(a.array_sample) x
    ) AS col15
  FROM array_id_int a
);
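For reference, the BigQuery version in the repository builds the same data with GENERATE_ARRAY and STRUCT; here is a minimal sketch of the equivalent array-of-structs construction (an approximation, not the repo’s exact query — note that BigQuery’s CONCAT needs an explicit CAST to STRING, while DuckDB casts implicitly):

-- BigQuery flavor (illustrative sketch)
SELECT
  id,
  ARRAY(
    SELECT AS STRUCT CONCAT('col15 ', CAST(x AS STRING)) AS name, x AS score
    FROM UNNEST(GENERATE_ARRAY(0, MOD(id, 500))) AS x
  ) AS col15
FROM UNNEST(GENERATE_ARRAY(1, 100000)) AS id;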

Next, the dimension table:

create table "dev"."main"."dimension_table" as (
  SELECT
    CONCAT('col15 ', GENERATE_SERIES) AS name,
    GENERATE_SERIES AS id,
    MOD(GENERATE_SERIES, 2) AS bar
  FROM GENERATE_SERIES(1, 10000, 1)
);

Data size

After generating the tables, we end up with roughly the following scale of data:

Fact table

Dimension table

As you can see, the dataset isn’t particularly large. The dimension table is quite small and can comfortably fit into memory on any executor. Depending on the executor’s size, we can anticipate that the majority of the fact table will also fit into memory.

“Naive” approach

Query

To merge both tables, let’s begin with a straightforward join on the name column, followed by a group by to eliminate duplicates. We build on the raw fact table by selecting all columns except the one used for joining, since it gets replaced by the joined result:

create or replace table "dev"."main"."naive_approach" as (
  SELECT
    * exclude(col15),
    ARRAY(
      SELECT {'id': st.id, 'bar': st.bar}
      FROM UNNEST(col15) AS c
      JOIN "dev"."main"."dimension_table" AS st
        ON c.col15.name = st.name
      GROUP BY st.id, st.bar
    ) AS joined
  FROM "dev"."main"."fact_table" AS bt
);

Benchmark

Let’s look at the actual numbers:

Insights

DuckDB

First impression: DuckDB is much faster than BigQuery as long as the workload fits your hardware resources, and it scales almost linearly, which makes sense considering the benchmarked query. However, it’s disappointing to see DuckDB fail on the 1M-row dataset, as no operation in the query appears to require keeping all rows in memory. Since the pattern works at 100K rows (and the next 900K are similar), a properly “streaming” execution should complete in around 95 seconds (or maybe more if the disk isn’t consistent between read and write performance). I also observed that CPU usage wasn’t maxed out during the queries (it peaked at 300%, i.e., about 3 of the M1 Max’s 10 cores), so there may be room for improvement if the query isn’t purely memory/disk bound: running the same query against an in-memory DuckDB database trims the runtime from 9.5s to 4s on 100K facts.
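For the curious, here is a rough way to reproduce that in-memory comparison (a sketch assuming the persistent dev database is already open; it is not part of the benchmark repository):

-- Copy the persisted tables into an in-memory database and re-run the join
-- there to isolate disk I/O effects.
ATTACH ':memory:' AS mem;
CREATE TABLE mem.fact_table AS SELECT * FROM "dev"."main"."fact_table";
CREATE TABLE mem.dimension_table AS SELECT * FROM "dev"."main"."dimension_table";
-- then point the "naive" query at mem.fact_table / mem.dimension_table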

BigQuery

On the other hand, BigQuery shows that it can scale horizontally: growing the scale 10x only takes about 4.5x more time… but it consumes 16x more CPU time. The horizontal-scaling overhead is definitely significant, and it grows even larger with a 17.5x increase from 100K to 1M rows. It feels like BigQuery is not as optimized as DuckDB here, but we have to consider at least 4 factors:

  • The overhead needed to support horizontal scalability. Part of it could be offset since the workload should be easy to parallelize across disk reads/writes, whereas DuckDB uses a single disk.
  • The BigQuery hardware could very likely be less efficient than an Apple M1 core, so comparing CPU time is not entirely fair (unless you expect Google to upgrade their hardware like they “upgraded” their prices last year 😏). I read that 100 slots = 64 cores + 256 GB, so we can assume a slot is about 0.64 core + 2.5 GB of RAM, but we have no details on the actual processor generation and speed.
  • There may be some delay before the job actually starts while slots are being assigned; it usually ranges from milliseconds to a few seconds.
  • All integers are… INT64! This holds even though our values are small and could fit in an INT32, or even an unsigned INT8 with room to spare in the case of our dimension ids (see the sketch below).
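As an illustration of that last point, DuckDB lets you pick much narrower integer types; a hypothetical example, not something the benchmark actually does:

-- DuckDB offers narrow integer types, unlike BigQuery's INT64-only integers
SELECT
  CAST(42 % 500 AS USMALLINT) AS array_size,  -- 2-byte unsigned integer
  CAST(1 AS UTINYINT)         AS bar;         -- 1-byte unsigned integer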

The fact that the slot time doesn’t scale linearly shows that the shuffle (which is happening) is eating a good chunk of the performance. The query plan visible in the execution graph also tends to confirm it’s not optimal, as the output records on the 1M run grow to 250M. That’s what I would expect with an average of around 250 rows in the array we’re joining, if it’s being unnested. It also suggests a more efficient approach should be possible, as BigQuery shouldn’t need to unnest the data but simply “map” it row by row.

Filtered

Now that we’ve seen a “basic” join, we might wonder whether we could help the platforms be more efficient since we know more about our data: only 500 of the 1,000 dimension values actually appear in the facts, so… what about filtering the dimension table before the join? That’s what we’re going to try:

create or replace table "dev"."main"."filtered_approach" as (
  WITH unnested AS (
    SELECT UNNEST(col15) st
    FROM "dev"."main"."fact_table"
  ),
  fact_table_col15_name_values AS (
    SELECT DISTINCT st.name
    FROM unnested
  ),
  dimension_table_filtered AS (
    SELECT st.*
    FROM "dev"."main"."dimension_table" st
    JOIN fact_table_col15_name_values v ON st.name = v.name
  )
  SELECT
    * exclude(col15),
    ARRAY(
      SELECT {'id': st.id, 'bar': st.bar}
      FROM UNNEST(col15) AS c
      JOIN dimension_table_filtered AS st
        ON c.col15.name = st.name
      GROUP BY st.id, st.bar
    ) AS joined
  FROM "dev"."main"."fact_table" AS bt
);

Benchmark

Let’s look at the actual numbers:

filtered benchmark numbers

Insights

DuckDB

Our approach is performing better on DuckDB. It’s faster in terms of wall time, although it’s slightly more CPU-intensive on a smaller scale due to an added step that takes some time. However, this is offset when dealing with a larger scale of 100K facts. Similar to our previous attempt, we encountered an out-of-memory (OOM) issue with a table of 1 million rows. There seems to be an issue with memory allocation that is preventing us from scaling, but this is hopefully fixable considering DuckDB is not yet at version 1.0.0.

BigQuery

The results from BigQuery are somewhat unexpected. Only the query with 1 million rows was faster. We were somewhat more efficient with 10K rows but significantly less so with 100K. At this point, I’m considering whether I need more runs to obtain reliable numbers. For instance, the second run on the 100K query finished in 4 minutes and 50 seconds, but it used 4 hours and 6 minutes of slot time, which makes it slower but more efficient. Hence, it’s challenging to draw a clear conclusion from these numbers. It seems that optimizing the SQL didn’t consistently improve performance.
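One way to sanity-check such run-to-run variations is to pull the elapsed and slot times of recent jobs from BigQuery’s INFORMATION_SCHEMA (a sketch; adjust the region and add filters for your own project and jobs):

SELECT
  job_id,
  TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS elapsed_ms,
  total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC;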

Conclusion

We could continue with further optimizations such as hashing, improved typing, or experimenting with a flattened form with clustering. However, it’s likely that the outcomes would be similar. DuckDB benefits from vertical scalability, but there’s a point where you’ll likely encounter memory or IO limitations, perhaps more frequently than CPU limitations. For handling real big data, solutions like BigQuery and similar distributed computing platforms become crucial. While they may not always be the most efficient, they can process almost any workload. The only limitation is your budget for slots.

What would you suggest to improve that query and/or the data structure? Feel free to suggest and follow up 🙌

🎁 If this article was of interest, you might want to have a look at BQ Booster, a platform I’m building to help BigQuery users improve their day-to-day.

Also, I’m building a dbt package, dbt-bigquery-monitoring, to help track compute & storage costs across your GCP projects and identify your biggest consumers and opportunities for cost reduction. Feel free to give it a go!



Christophe Oudar

Staff Software Engineer at @Teads, mostly working on Big Data-related topics