[part1] DuckDB vs Spark on Iceberg: 1 Billion NYC Taxi Rides

Vishnu Rao
5 min read · May 1, 2024


I ran a benchmark comparing Spark (local mode) on Iceberg and DuckDB on 10 years of NYC yellow taxi rides.

The heavyweight vs the new contender

Let's get started with the data

  • Covers 10 years, 2014–2023 (inclusive).
  • The dataset is in Parquet format and is approximately 16 GB in size.
  • Uncompressed, the Parquet data grows to roughly 32 GB.
  • It contains ~1 billion rows.
  • One file was corrupt and was removed from the dataset, so the number of rows
    for the benchmark dropped to 0.88 billion = 877,108,497.

Data setup: pulled the data to local disk, all ~16 GB of it

#!/bin/bash

STAGING_DIR=nyc_yellow_cab_data_staging
base_url="https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_"
years=("2023" "2022" "2021" "2020" "2019" "2018" "2017" "2016" "2015" "2014")
months=("01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12")

mkdir -p "$STAGING_DIR"
project_dir=$(pwd)

# Download one Parquet file per year/month into its own directory
for year in "${years[@]}"; do
  for month in "${months[@]}"; do
    mkdir -p "$STAGING_DIR/$year/$month"
    cd "$STAGING_DIR/$year/$month"
    url="${base_url}${year}-${month}.parquet"
    echo "Downloading $url"
    wget "$url"
    cd "$project_dir"
    sleep 1
  done
done

cd "$project_dir"
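Once the download finishes, a quick sanity check helps confirm the files are readable and the row count is in the right ballpark. A minimal Python sketch, assuming the staging layout created by the script above:

import duckdb

# Sanity check: count rows across all downloaded Parquet files.
# The path assumes the staging layout created by the script above.
rows = duckdb.sql(
    "select count(*) from parquet_scan('nyc_yellow_cab_data_staging/**/*.parquet')"
).fetchone()[0]
print(f"total rows: {rows:,}")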

Machine Setup

  • 1 MacBook Air (M1, 2020)
  • 16 GB RAM, 512 GB flash storage
  • macOS 12.6.4
  • 8 cores (4 performance and 4 efficiency)

Benchmark Environment:

  • DuckDB was run from Python 3.10.0 and read Parquet files stored on the local file system.
  • Spark was run from Java in local mode, with the Iceberg warehouse catalog pointing to the local file system.
  • Data was loaded (~301 s) into Iceberg warehouse tables before running the benchmark; loading time was excluded from the benchmark.
  • 9 GB of memory was allocated to both DuckDB and Spark.
  • All cores were used for both DuckDB and Spark (a minimal config sketch follows this list).
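For reference, here is a minimal sketch of the DuckDB side of this setup. It is an illustration of the settings above, not the author's exact harness:

import duckdb

# Mirror the benchmark settings: cap memory at 9 GB and use all 8 cores.
# This is an assumed illustration, not the author's exact harness.
con = duckdb.connect()
con.execute("SET memory_limit = '9GB'")
con.execute("SET threads = 8")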

The Queries (16 of them)

Some were taken from a ClickHouse vs Redshift blog post, some from a Towards Data Science blog post, and some were written by me.

Query 16 was the big one (taken from Towards Data Science):

select
    period,
    count(*) as num_rides,
    round(avg(trip_duration), 2) as avg_trip_duration,
    round(avg(trip_distance), 2) as avg_trip_distance,
    round(sum(trip_distance), 2) as total_trip_distance,
    round(avg(total_amount), 2) as avg_trip_price,
    round(sum(total_amount), 2) as total_trip_price,
    round(avg(tip_amount), 2) as avg_tip_amount
from (
    select
        date_part('year', tpep_pickup_datetime) as trip_year,
        strftime(tpep_pickup_datetime, '%Y-%m') as period,
        epoch(tpep_dropoff_datetime - tpep_pickup_datetime) as trip_duration,
        trip_distance,
        total_amount,
        tip_amount
    from parquet_scan('file://nyc_yellow_cab_data/**/*.parquet')
    where trip_year >= 2021 and trip_year <= 2024
)
group by period
order by period

Query 1 was simply :)

select count(*) from parquet_scan('file://nyc_yellow_cab_data/**/*.parquet')

The 2 runs (with a laptop restart before each run 😎)

# Restart laptop to clear the file system cache
# Run queries 1 to 16 serially; the file system cache will play a role.
bash loop_duck_queries.sh
# Restart laptop to clear the file system cache
# Run queries 1 to 16 serially; the file system cache will play a role.
bash loop_spark_iceberg_queries.sh
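Conceptually, each loop script runs the 16 queries back to back and records wall-clock time per query. A minimal Python sketch of the DuckDB side (an assumed harness for illustration; the actual scripts are in the repo linked below):

import time
import duckdb

# Assumed harness: run the benchmark queries serially and record
# wall-clock time per query in milliseconds.
con = duckdb.connect()
con.execute("SET memory_limit = '9GB'")

queries = [
    "select count(*) from parquet_scan('file://nyc_yellow_cab_data/**/*.parquet')",
    # ... queries 2 to 16 ...
]

for i, query in enumerate(queries, start=1):
    start = time.perf_counter()
    con.sql(query).fetchall()
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"query {i}: {elapsed_ms:.0f} ms")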

The Query times (query 1, the count(*), took 50 ms in DuckDB 😉)

[Chart: Query times in ms]

The % change from Spark to DuckDB: at least a 200x improvement 👀

Query 1, the count(*), improved ~5000x (DuckDB can answer a bare count(*) from Parquet metadata rather than scanning the data, which helps explain the gap).


Not surprisingly, DuckDB outperformed Spark SQL on Iceberg on all queries.

Note:

  • Data load time and program startup times were not considered.
  • DuckDB's loading time was 0 s, as it read the Parquet files directly.
  • The Apache Iceberg data load was a one-time job and took ~300 s.
  • Only the query execution times were considered.
  • If startup times were considered, DuckDB would outperform Spark SQL on Iceberg by even larger margins.

Thoughts for Technical Debate

  1. Why did all the data reside on local disk and not on S3/Azure Blob Storage?
  2. Isn't it unfair that Spark SQL was run in local mode and not in cluster mode, i.e., on more than one node?

My thoughts:

1.

The compressed Parquet dataset for 1 billion rows was 16 GB.
Any node on EC2/Azure/GCP can handle 16 GB of disk.

Downloading the data to disk (say, with aws s3 cp) and then reading from the
local file system is faster than reading directly from S3/blob storage.

Hence this pattern is feasible.
Feel free to read from S3/Azure Blob Storage instead.
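If you do want to query straight from S3, DuckDB's httpfs extension makes it roughly a one-line change. A minimal sketch (the bucket name and region here are hypothetical):

import duckdb

# Hypothetical S3 variant: bucket name and region are placeholders.
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region = 'us-east-1'")
rows = con.sql(
    "select count(*) from parquet_scan('s3://my-bucket/nyc_yellow_cab_data/**/*.parquet')"
).fetchone()[0]
print(f"total rows: {rows:,}")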

2.

The longest-running query was query 16. The peak memory usage for Spark was
about ~8.5 GB, so it was well within limits (I could have set Xmx=10g for safety).

When it can complete on one node, why run a cluster?
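For reference, a rough PySpark equivalent of that single-node setup (the author's harness was in Java; the app name, catalog name, warehouse path, and table name below are assumptions):

from pyspark.sql import SparkSession

# Rough single-node equivalent: local[*] uses all 8 cores, 9g mirrors the
# memory given to DuckDB, and the Iceberg catalog points at a local
# warehouse directory. Names and paths here are assumptions.
spark = (
    SparkSession.builder
    .appName("duckberg-bench")
    .master("local[*]")
    .config("spark.driver.memory", "9g")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "file:///tmp/iceberg_warehouse")
    .getOrCreate()
)

# Hypothetical table name, for illustration only.
spark.sql("select count(*) from local.nyc.yellow_taxi_rides").show()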

One might ask, 'So why use Spark for a small 16 GB dataset? Doesn't that invalidate the benchmark?'
Spark in local mode is a well-known processing pattern:
people move to a cluster only when limits are breached, with no code changes.

The dataset may be small on disk, i.e., as compressed Parquet.
Uncompressed, it would be 32 GB.

You might still think, 'Still small! Not a case for Spark?'

Either way (whether you would go for a cluster or treat it as a small dataset),
DuckDB edges out Spark on Iceberg on this 1-billion-row dataset :)

Perhaps Polars is a better competitor to DuckDB for this benchmark?

PS: In Spark local mode there is no network shuffle, unlike in a cluster.

If you want to compare Polars vs DuckDB, this blog post says DuckDB is 350x faster than Polars. 😎

The code for this benchmark:

github.com/jaihind213/duckberg

Hiccups while doing the benchmark

The column data types differed across data files (e.g., passenger_count was sometimes a long, sometimes an int). I had to standardize the data before running the benchmark.
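A minimal sketch of that standardization step, using DuckDB to rewrite a file with a consistent type (the file names here are placeholders; the repo below has the actual version):

import duckdb

# Illustrative fix: cast the inconsistent column to one type and rewrite
# the file. File names are placeholders.
duckdb.sql("""
    COPY (
        SELECT * REPLACE (CAST(passenger_count AS BIGINT) AS passenger_count)
        FROM parquet_scan('yellow_tripdata_2014-01.parquet')
    ) TO 'yellow_tripdata_2014-01_fixed.parquet' (FORMAT PARQUET)
""")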

So why would I go for DuckDB?

Cost is the hidden, overlooked dimension of data engineering. Cost-based engineering is the new OIL.

Story edit: see part 2 (trying DuckDB on Iceberg and Polars).

PS: I am the Chief Inspiration Officer at Bytespire.io,

happy to help you with your cost-based data engineering needs.

Feel free to reach out to me on [LinkedIn] or [Twitter] or [X].

Book a free consultation.
