Head-to-head comparison of 3 dbt SQL engines

Niels Claeys · Published in datamindedbe · 8 min read · Sep 8, 2023

A couple of months ago, I wrote a blogpost to illustrate how dbt and Duckdb can be useful for your data pipelines.

Tl;dr: when you have medium-sized data (up to 100 GB), and certainly if your data pipelines are managed by data analytics engineers, dbt + Duckdb is a good fit because of its simple execution model and its use of SQL.

After that blogpost, I received a lot of requests to extend my comparison and add benchmarks for several other execution platforms. This follow-up post therefore compares the user experience and the performance of 3 popular open-source SQL engines for dbt, namely Spark, Trino, and Duckdb.

Benchmarking setup

In order to compare the different engines, I will use the TPC-DS benchmark for my project. This is the same project that I used in my initial blogpost.

  • The TPC-DS input data is generated using the Databricks toolkit and the output is written to an S3 bucket. This bucket is used as input for testing all 3 SQL engines.
  • I want to compare the performance for medium-sized datasets, so I used a scale factor of 100 to generate the input data. This results in about 100 GB of input data spread out over 20 tables.
  • All benchmarks use the same m6.2xlarge instances from AWS, which have 8 vCPU and 32 GB of RAM. Additionally, I added 100 GB of disk storage on the nodes, which can be used for shuffling or storing temporary data, depending on the SQL engine.
  • The code used for every benchmark is available on my Github.
  • I run every dbt model sequentially for simplicity, so I do not have to take into account that multiple queries could be running simultaneously. Running queries concurrently could negatively impact Trino, because it uses central infrastructure, whereas Duckdb and Spark provision resources for every query.

dbt and Duckdb

As explained in my previous blogpost, I like the combination of dbt and Duckdb for the following reasons:

  • Duckdb is simple to install because it is self-contained, packaged in just a single binary.
  • Duckdb is an in-process OLAP database, which means that it runs within the same (e.g. python) process as our dbt transformations.
  • Duckdb integrates seamlessly with blob storage (e.g. S3, GCS, …), which allows you to run dbt queries against data residing in blob storage.
  • Duckdb provides a simple execution model. Your code is executed against Duckdb, which runs on a single node. This simplifies writing transformations as you do not need to think about how code or data gets distributed across nodes.

In order to use dbt with Duckdb, we use the dbt-duckdb adapter.

Installing Duckdb

All that is needed is to include the dbt-duckdb dependency, which will transitively include Duckdb. One way to achieve this is to specify the dependency in a requirements.txt file like:

dbt-duckdb==1.5.0
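
For completeness, a minimal profiles.yml for the dbt-duckdb adapter could look like the sketch below. The profile name, database path and region are illustrative rather than taken from the benchmark project; the httpfs extension is what enables reading from and writing to S3.

# profiles.yml -- minimal sketch, all names are illustrative
tpcds_duckdb:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /tmp/tpcds.duckdb   # local database file used during the run
      extensions:
        - httpfs                # enables reading from and writing to S3
        - parquet
      settings:
        s3_region: eu-west-1
      threads: 1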

User experience

Implementing the project with dbt and Duckdb is straightforward because of the built-in functionality of Duckdb:

  • You can refer to external data, in our case the parquet files residing on S3, by using the sources.yml (see the sketch after this list). This allows you to use the input data without first defining a query to load the data into your Duckdb database.
  • You can use the AWS credentials provider to authenticate your job and interact with AWS resources. The credential provider supports many authentication mechanisms depending on your execution environment. In our case we use a WebIdentityToken, which gets mounted into our kubernetes pods.
  • You can offload your model output to blob storage using the external materialization concept in dbt-duckdb.
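
To make the first bullet concrete, the sketch below shows how a source can point directly at parquet files on S3 via the external_location meta property of dbt-duckdb. The bucket name and the subset of tables are illustrative; {name} is substituted with the table name by the adapter.

# models/sources.yml -- sketch, bucket and table names are illustrative
version: 2

sources:
  - name: external_source
    meta:
      external_location: "s3://my-tpcds-bucket/input/{name}/*.parquet"
    tables:
      - name: store_sales
      - name: customer

Model output can be written back to S3 in a similar declarative way, by configuring the models with the external materialization mentioned in the last bullet.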

dbt and Trino

Trino is a popular open-source distributed SQL query engine, previously known as PrestoSQL. It is gaining traction because of the following features:

  • It allows for query federation: combine data from different data sources within one SQL query.
  • It supports large scale data processing and can thus easily handle larger datasets than used in this blogpost.
  • It is ANSI SQL compliant, which makes it compatible with many external tools (e.g. PowerBI, Tableau, …).

Trino uses a typical master-slave architecture; the nodes are called the coordinator and the workers respectively.

Installing Trino on kubernetes

In order to run our dbt queries against Trino, we first need to install it in our kubernetes cluster. This is fairly straightforward thanks to the existing Helm chart. If you do not know Helm, take a look at the Helm documentation.

In order to make the benchmarking results comparable, I use the same m6.2xlarge instances for both the coordinator and the worker node. Besides choosing the node instance size, you also need to specify resource constraints as well as JVM parameters (e.g. heap size) for both the coordinator and the worker process.
In my opinion it is a pity that all these settings need to be defined statically instead of several of them being derived from the node resources. I ended up with the following settings for configuring the resource requirements:

coordinator:
  jvm:
    maxHeapSize: 24G
  config:
    query:
      maxMemory: 16GB
      maxMemoryPerNode: 16GB
  resources:
    requests:
      cpu: 7
      memory: 28Gi
    limits:
      memory: 28Gi
worker:
  resources:
    requests:
      cpu: 7
      memory: 28Gi
    limits:
      memory: 28Gi
  jvm:
    maxHeapSize: 24G
  config:
    query:
      maxMemoryPerNode: 16GB

Starburst, the company behind Trino, publishes a dbt adapter (dbt-trino), which we use in our project.
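
Connecting dbt to the cluster then only requires a Trino profile. The sketch below shows what this could look like; the host points at the coordinator service inside the cluster, and the profile, catalog and schema names are illustrative rather than taken from the benchmark project.

# profiles.yml -- sketch for the dbt-trino adapter, names are illustrative
tpcds_trino:
  target: dev
  outputs:
    dev:
      type: trino
      method: none                          # no authentication inside the cluster
      host: trino.trino.svc.cluster.local   # the coordinator service
      port: 8080
      user: dbt
      database: hive                        # maps to the Trino catalog
      schema: tpcds
      threads: 1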

User experience

I started from the Duckdb dbt models and assumed (wrongly) that the SQL syntax in these models would be the same. I learned that Trino is stricter in its syntax than Duckdb, which is built to be Postgres compatible. The result is that I needed to update about 50% of the dbt models. The most common issues were:

Data type mismatch

Cannot check if date is BETWEEN varchar(10) and date

Trino detects mismatched data types but does not attempt to cast the value to the required data type. This means that you need to specify the cast manually:

-- Original
BETWEEN '2000-01-27' AND cast('2000-04-26' AS date)

-- Replacement
BETWEEN cast('2000-01-27' AS date) AND cast('2000-04-26' AS date)

Column name not defined

The second and more ‘annoying’ error in my opinion was:

Column name not specified at position X

Trino does not generate a default column name for aggregated columns but instead requires that you provide an explicit alias. This results in the following changes:

-- Original
SELECT SUM(ss_quantity)
FROM store_sales

-- Replacement
SELECT SUM(ss_quantity) AS sum_quantity
FROM store_sales

dbt and Spark

Apache Spark is at the moment one of the most popular distributed engines for large scale data analytics. Most data pipelines interact with Spark using its Python or Scala API. Additionally, Spark also includes a SQL engine, which makes it possible to run dbt models.

The main reasons for choosing dbt together with Spark are in my opinion:

  • You can reuse your existing Spark infrastructure to run dbt models. No need to introduce an extra technology for these data pipelines.
  • Spark shines when running transformations on large scale datasets. It can thus easily handle larger datasets than used in this benchmark.
  • Spark can run everywhere. It can be installed on a kubernetes cluster, but most cloud providers also offer dedicated services for running Spark (e.g. Databricks, EMR, …).

Installing Spark on k8s

In order to execute the TPC-DS queries, I use Conveyor, which runs Spark on top of kubernetes. This uses the Conveyor operator to create a Spark cluster, but an equivalent setup can be achieved by using the kubernetes Spark operator created by Google.

In order to run dbt models against Spark, we use the dbt-spark adapter.

Using Spark with dbt

First you need to decide how dbt will connect to your Spark cluster. There are 4 possible options: odbc, thrift, http and Spark session. dbt 1.5.0 introduced a new functional API, which greatly simplifies using the Spark session method with dbt. I prefer this option (a minimal profile sketch follows the list below) because:

  • It uses the same mechanism as is used for executing Pyspark/Scala Spark transformations.
  • It creates a Spark cluster on the fly for running our dbt models and is thus self contained. Many of the other setups require a Spark cluster to be created in advance, which introduces additional complexity of creating that cluster/shutting it down. The downside is that running your queries might be slower when new nodes need to be provisioned.
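
With the session method there is no external cluster endpoint to configure, so the profile stays very small. A sketch of what this could look like is shown below; the profile and schema names are illustrative, and note that the session method also needs the pyspark dependency (e.g. via the dbt-spark[session] extra).

# profiles.yml -- sketch for dbt-spark with the session method, names are illustrative
tpcds_spark:
  target: dev
  outputs:
    dev:
      type: spark
      method: session
      schema: tpcds
      host: NA   # not used by the session method, but required by the adapter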

In order to use the Spark session in dbt, we need to create a main function, which executes our dbt command, and pass it to spark-submit. An example of such a main function, defined in dbt_runner.py and running all dbt models in our project, looks as follows:

from dbt.cli.main import dbtRunner


if __name__ == '__main__':
    # Programmatically invoke 'dbt run' via the dbtRunner API introduced in dbt 1.5
    res = dbtRunner().invoke(
        ["run", "--profiles-dir", "<...>", "--project-dir", "<...>"]
    )
    if not res.success:
        raise Exception(res.exception)

Now we need to pass the main function to the spark-submit script. This way we can run dbt_runner.py against a local Spark (e.g. for testing purposes) or against a Spark cluster running in kubernetes:

# Locally
spark-submit --master='local[*]' --deploy-mode=client src/dbt_runner.py
# Executed by the submitter pod in a kubernetes cluster
spark-submit --master='k8s://https://kubernetes.default' --deploy-mode=cluster

Typically, these details are abstracted away from the user and all you have to do is fill in dbt_runner.py in the appropriate property of the kubernetes CRD (e.g. the application property in Conveyor). In our case the Spark job is triggered by Airflow, so the result looks as follows:

ConveyorSparkSubmitOperatorV2(
    dag=dag,
    mode="local",
    task_id=f"dbt-spark-tpcds-{modelName}",
    application="local:///opt/spark/work-dir/runner.py",
    application_args=["--model", f"{modelName}"],
    driver_instance_type="mx.2xlarge",
    instance_life_cycle="on-demand",
    executor_disk_size=100,
)

User experience

I started from the Duckdb dbt models in order to see whether the syntax between Duckdb and Spark would differ as much as between Duckdb and Trino. Luckily, there were just a few syntax issues, namely:

  • Instead of using double quotes around alias names, you need to use backticks.
  • The substring function needed to be replaced by the substr function.

Benchmarking results

The following table shows the time it took to run the dbt models for all TPC-DS queries.

[Table: execution time per TPC-DS query for Duckdb, Trino and Spark. The asterisk (*) indicates the fastest SQL engine for a given query.]

The following things stand out looking at the results:

  • In 75% of the cases, Duckdb is the fastest at executing these queries. This is understandable, since our setup (medium-sized datasets running on a single node) favors Duckdb over Spark and Trino.
  • Spark is never the fastest at executing a query. This is largely a consequence of my choice to execute every query on a single node, which means that Spark runs in local mode: both the driver and the executor run on the same node. To me this is the fairest comparison when looking at the cost of a dbt query, because that cost is almost exclusively determined by the core-hours used to execute the query.
  • All 3 SQL engines have a couple of queries that could not be executed. The reasons differ, but the most common errors are out-of-memory errors and unsupported SQL syntax. I need to investigate these queries in more depth to see whether I can make them run by changing the SQL syntax.

Conclusion

In this blogpost I explored the differences between 3 SQL execution engines for dbt, namely Duckdb, Trino and Spark.

From a user experience perspective the following conclusions can be drawn:

  • All 3 engines support SQL, but their SQL dialects differ, which results in small changes when migrating from one to the other.
  • Duckdb is the easiest to include in your dbt project, as the other 2 require additional infrastructure to be installed. However, in many companies this infrastructure will be managed by a data platform team and thus be hidden from the teams developing dbt projects.

The TPC-DS performance benchmarks show that for medium-sized (around 100 GB) input data processed on a single node, dbt with Duckdb is the fastest for 75% of the queries. For the other 25% of the queries, Trino is the fastest execution engine.
