Coding Symphony: Unleashing the Power of ClickHouse and dbt for Modern Data Analytics (part 1)

Brenda Varguez
6 min read · Apr 8, 2024

dbt (data build tool) is a popular data transformation and modeling tool. It allows data engineers to define and execute transformations on their data, enabling them to create reliable and reusable data pipelines. With dbt, users can build and maintain data models, apply business logic to their data, and transform raw data into a structured and analysis-ready format. dbt also provides features for testing and documenting data models, ensuring data quality and facilitating collaboration among team members.

Prerequisites

1. This tutorial assumes that you already have a ClickHouse instance up and running, and stable enough for dbt to connect to. If you don’t have one, feel free to review my previous post Deploying ClickHouse on a Kubernetes cluster using Terraform, where I walk through the first steps of deploying a CH instance.

2. Create a Python virtual environment and install both dbt-core and dbt-clickhouse. As of this writing, these are the latest versions.

python3 -m venv my_env
source my_env/bin/activate # activate the venv so the packages install into it
pip install dbt-core==1.7.7
pip install dbt-clickhouse==1.7.2

3. Create a new dbt project with dbt init and give it a name. This will generate a set of folders with some example scripts that can be used as templates.
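For example, to create a project with the same name used later in this post:

dbt init ch_dbt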

Configuring dbt

The files created in this tutorial can be found in this GitHub repository, under the ch_dbt folder.

dbt profile

Let’s begin by creating a profiles.yml file, where dbt will find the connection details it needs to connect to ClickHouse (or any other database or data platform). For further documentation about profiles, see the dbt docs on connection profiles; for ClickHouse specifics, see the dbt-clickhouse plugin documentation.

I’ll create two configs: one called dev for testing locally, and another one called prd to read and write to the production CH databases. As you can see, both outputs are pretty similar; the only configs that should change are:

  • host: host of your CH instance
  • schema: ClickHouse database where your models will be built. warehouse_dev for local tests, and prod_db for production. These are the default databases; however, if you want to build a model in a different database, you can specify it in the model’s {{ config(...) }} macro. We’ll see this later.
  • user & password: You should have at least two sets of credentials (users and passwords) to provide different access levels. For example, the user you use for testing shouldn’t have access to your production databases. For more info, review my last post Creating multiple users in ClickHouse from Terraform.
  • Note that credentials are not hardcoded; they are passed as environment variables, i.e., I export them by running export CLICKHOUSE_USER=username and export CLICKHOUSE_PASS=string in the terminal. dbt then reads them with the env_var function, which incorporates environment variables from the system and can be used almost anywhere in a dbt project.

My ClickHouse instance consists of 1 shard with 3 replicas. For that reason, when creating a new table I need to make sure that it’s created on all replicas. To do that, I create tables with the ReplicatedMergeTree engine, and additionally each query needs the ON CLUSTER {cluster_name} clause. So, instead of adding that setting to every model individually, it can be specified once in this file with the cluster option. I will further explain why this engine was chosen in the Seeds section.
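For context, this is roughly the shape of the DDL that ends up running on the ClickHouse side (the table and columns here are hypothetical, and the replication path macros are assumed to be configured on the cluster):

CREATE TABLE warehouse_dev.my_model ON CLUSTER default
(
    id UInt64,
    event_date Date
)
ENGINE = ReplicatedMergeTree
ORDER BY id;

With that in mind, here is the complete profiles.yml: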

analytics:
  target: dev # default output used when no --target is passed

  outputs:
    dev: # this one will be used for testing locally
      type: clickhouse
      schema: warehouse_dev # default ClickHouse database where your models will be built
      user: "{{ env_var('CLICKHOUSE_DEV_USER') }}"
      password: "{{ env_var('CLICKHOUSE_DEV_PASS') }}"
      port: 9000
      host: localhost
      cluster: default
      retries: 1
      threads: 4 # max number of models to run at the same time
      connect_timeout: 10
      send_receive_timeout: 3600 # 1 hr
      use_lw_deletes: True # `delete+insert` as the default incremental strategy

    prd:
      type: clickhouse
      schema: prod_db
      user: "{{ env_var('CLICKHOUSE_USER') }}"
      password: "{{ env_var('CLICKHOUSE_PASS') }}"
      port: 9000
      host: clickhouse.data.svc # service-name.namespace.svc (K8s)
      cluster: default
      retries: 1
      threads: 4
      connect_timeout: 10
      send_receive_timeout: 3600 # 1 hr
      use_lw_deletes: True # `delete+insert` as the default incremental strategy

config:
  send_anonymous_usage_stats: false
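With the profile in place, you can verify that dbt can actually reach ClickHouse. dbt uses the dev target by default, and --target lets you point it at production instead:

dbt debug
dbt debug --target prd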

dbt project

General configs

We will also need a dbt_project.yml file. It tells dbt how to operate your project: the paths to your models, seeds, macros, and so on. Unlike the profiles file, this one should already have been created when you ran dbt init. In my case, I’ll add some extra configs that are useful for my ClickHouse setup.

According to dbt itself, the project name should be your organization’s name or the purpose of the models. Also, make sure the profile config points at the profile declared in profiles.yml. You will also need to specify where dbt should look for different types of files; e.g., the model-paths config states that models in this project can be found in the models/ directory. That said, it’s highly unlikely that you’ll need to change any of those defaults.

Models

There are two ways of adding extra settings to your models:

  • By declaring them individually within specific models, using the {{ config(...) }} macro.
  • By including them in this file.

I strongly advise you to add the global configs here, meaning the configs that you want to apply to all models. On the other hand, if there is a setting that should affect only one or a few models, then you’d want to use {{ config(...) }} in that model. We’ll see an example of this shortly.

As previously mentioned, I’m using ReplicatedMergeTree as the engine for my dbt models. I make that specification here, adding + as a prefix to indicate that this config applies to all my models. Moreover, I need some models to be built as tables, and others as views. So, as you can see, I’m also declaring some extra configs for the two sub-folders inside models/: everything inside the prd/ folder will be built as a table, while everything inside google_analytics/ will be built as a view.

Just take into account that whatever you define here can be overridden by the {{ config(...) }} macro.
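For instance, a model-level override could look like the following sketch (the model body and the order_by column are hypothetical; engine and order_by are settings supported by the dbt-clickhouse adapter):

{{ config(
    materialized='table',
    engine='ReplicatedMergeTree',
    order_by='event_date'
) }}

select
    toDate(now()) as event_date,
    1 as value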

Seeds

Seeds are CSV files that dbt can load into your warehouse. Using seeds is recommended when the information in those files is not sensitive, and when the data is static or changes infrequently.
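As a quick illustration, a seed is just a CSV dropped into the seeds/ directory (the file below is a hypothetical example), which dbt then loads with the dbt seed command:

# seeds/country_codes.csv
code,name
MX,Mexico
US,United States

dbt seed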

Within the seed CSV files you cannot add extra settings the way you can in models, so you will need to declare them in this file. By default, dbt will build your seed files as tables, and will use the MergeTree engine to create them. However, with that engine the tables would not be present on all nodes of my cluster, so I also need to specify that my desired engine is ReplicatedMergeTree, using the +engine option as a global setting under the seeds config.

You can continue customizing both models and seeds individually in a properties file. I’ll show you how in future posts.

# project name
name: 'ch_dbt'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project (profiles.yml).
profile: 'analytics'

# These configurations specify where dbt should look for different types of files.
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

clean-targets: # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"


# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

# dbt will build all models in the prd/ directory as tables.
# These settings can be overridden in the individual model
# files using the `{{ config(...) }}` macro.
models:
  +engine: ReplicatedMergeTree
  ch_dbt: # name of your project
    # Config indicated by + and applies to all files under
    # the models/prd/ folder
    prd:
      +materialized: table

    # Config indicated by + and applies to all files under
    # the models/google_analytics/ folder
    google_analytics:
      +materialized: view

seeds:
  +engine: ReplicatedMergeTree
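With both files in place, a typical workflow once you have models and seeds (covered in part 2) would look like this; --select models/prd is just an illustrative path selector:

dbt run --select models/prd   # build only the models under models/prd/ against dev
dbt run --target prd          # build everything against production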

Congrats! You’ve just successfully created your first dbt project and connected dbt with ClickHouse. Check out the second part of this post to learn how to build your first dbt models and seeds, and to continue adding a few more configs.


Brenda Varguez

Data Engineer @ JSR Life Sciences, based in Mexico. Overall, I’m a data-driven person and a ClickHouse enthusiast.