DBT Guide (share version)

JF · Published in dbt-local-taipei · Mar 9, 2023

This guide was created when our team decided to trial DBT, and it doubles as a set of usage conventions meant to address problems the organization had already run into. The main objective is to provide clear and consistent instructions and standards, so that new users can get started easily and the data engineers' lives get a little easier (wry smile).
Some practices are based on articles from the DBT community and the Gitlab Data Team. The relevant links are provided within the guide.

Thanks to Karen, the organizer of DBT Local Taipei, for encouraging me to publish this.
I hope we can all keep learning from and teaching each other on this data journey, and never feel alone along the way :)


What is DBT?

What, exactly, is DBT?

DBT is the T in ELT.
It doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse. This “transform after load” architecture is becoming known as ELT (extract, load, transform).

DBT is a compiler and a runner
At the most basic level, DBT has two components: a compiler and a runner. Users write DBT code in their text editor of choice and then invoke DBT from the command line. DBT compiles all code into raw SQL and executes that code against the configured data warehouse.
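
For example, a model file is just a select statement plus some Jinja. Here is a minimal sketch (the model, schema, and column names below are made up for illustration): dbt swaps the ref() for the fully qualified relation at compile time, then runs the resulting SQL in the warehouse.

-- models/marts/core/dim_customer_v.sql (hypothetical)
select
    customer_id,
    customer_name,
    created_timestamp_utc
from {{ ref('stg_app__customer') }}

-- Compiled output (written under target/): plain SQL with the ref() replaced
-- by something like analytics.dbt_staging.stg_app__customer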

How we structure our DBT projects (and best practices)

✨ Adding data descriptions to the YAML files is strongly encouraged!!!

Data stage

  • Sources:
    Schemas and tables in a source-conformed structure (i.e. tables and columns in a structure based on what an API returns), loaded by a third-party tool. https://docs.getdbt.com/docs/building-a-dbt-project/using-sources
    src_{source}__{object} (Table)
  • Staging models:
    The atomic unit of data modeling. Each model bears a one-to-one relationship with the source data table it represents. It has the same granularity, but the columns have been renamed, recast, or usefully reconsidered into a consistent format.
    stg_{source}__{object} (View)

Datetime-related columns should be transformed to epoch seconds, timestamps without a time zone, or timestamps with a time zone, and the chosen representation should be shown explicitly in the column name:
{whatever}_{date/timestamp}_{epoch/utc/tz}

epoch → int
utc → timestamp (with no time zone)
tz → timestamp (with time zone)
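
A minimal staging-model sketch of this naming rule, assuming a Postgres-style warehouse and a hypothetical source column created_at that is a timestamp with time zone (all names here are illustrative):

-- models/staging/app/stg_app__orders.sql (hypothetical)
select
    order_id,
    extract(epoch from created_at)::bigint as created_timestamp_epoch, -- int, epoch seconds
    created_at at time zone 'UTC'          as created_timestamp_utc,   -- timestamp, no time zone
    created_at                             as created_timestamp_tz,    -- timestamp with time zone
    (created_at at time zone 'UTC')::date  as created_date_utc         -- date, in UTC
from {{ source('app', 'orders') }}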

When using this type of column in a filter condition, please specify an explicit time zone.
Otherwise, you may get unexpected results on different platforms.

Good way

where stored_to_backend_timestamp_tz >= '2020-03-14'::timestamp at time zone 'Asia/Jakarta'

Bad way

where stored_to_backend_timestamp_tz >= '2020-03-14'

Each staging model has a primary key that is both unique and not null (and tested).
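
These checks are normally declared as tests on the model; as a rough sketch of the same idea written as a dbt singular test in SQL (the model and column names are assumptions), the query below returns rows, and therefore fails, only when the key is null or duplicated:

-- tests/assert_stg_app__orders_has_valid_pk.sql (hypothetical)
-- dbt treats any returned row as a test failure
select order_id
from {{ ref('stg_app__orders') }}
group by order_id
having order_id is null or count(*) > 1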

  • Marts models:
    Models that represent business processes and entities, abstracted from the data sources that they are based on.
    {dim/fact}_{object}_{t/v} (Table or View; the _t/_v suffix indicates which)

fact_{object}_{t/v}: A tall, narrow table representing real-world processes that have occurred or are occurring. The heart of these models is usually an immutable event stream: sessions, transactions, orders, stories, and votes.

dim_{object}_{t/v}: A wide, short table where each row is a person, place, or thing; the ultimate source of truth when identifying and describing entities of the organization. They are mutable, though slowly changing: customers, products, candidates, buildings, and employees.

summary_{aggregated time window}_{object}_{t/v}: Aggregation based on time window or other dimensions.
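
As a sketch of the summary pattern (the model, column, and window names are assumptions), a daily roll-up of a fact model might look like:

-- models/marts/core/summary_daily_orders_t.sql (hypothetical)
select
    ordered_timestamp_utc::date as order_date_utc,
    count(*)                    as order_count,
    sum(order_amount)           as total_order_amount
from {{ ref('fact_orders_t') }}
group by 1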

Folder structure

marts

  • core: maintained by engineers
  • de: maintained by engineers; a special schema for intermediate transforms
  • dim/fact_{}_t/v/cte/it.sql
  • cte
  • it → incremental table (see the sketch below)
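
A minimal sketch of an incremental-table (it) model, with hypothetical model and column names; the is_incremental() block limits each run to rows newer than what the table already holds:

-- models/marts/de/fact_orders_it.sql (hypothetical)
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    order_amount,
    stored_to_backend_timestamp_tz
from {{ ref('stg_app__orders') }}
{% if is_incremental() %}
where stored_to_backend_timestamp_tz >
    (select max(stored_to_backend_timestamp_tz) from {{ this }})
{% endif %}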

staging

  • maintained by engineers
  • organized by service group

Data warehouse structure

Which data can we use?

As an analyst in a business group

Please only use models from dbt_mart_core or dbt_mart_{your business group}. Models in other schemas may change quickly, and engineers won't keep the old versions.
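
In practice (the schema and model names below are only illustrative), analyst queries should read from the mart schemas and stay away from staging:

-- OK: models in the core mart are kept stable by engineers
select * from dbt_mart_core.dim_customer_v

-- Avoid: staging schemas may be renamed or rebuilt without notice
-- select * from dbt_staging.stg_app__customer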

DBT Style Guide

linter

formatter

Tag Strategy

Currently, we don't have strict limits on tags, but some tags are used as Airflow triggers. Please be careful not to misuse them (a short sketch at the end of this section shows how a tag is applied and selected).

Major ETL pipeline

  • daily-run
  • daily-test

DS ML model

  • prefix: ds-{*}
  • {*}_full-run: refreshes all data
  • {*}_batch-run: fetches only recent data

Anomaly detection

  • anomaly

Unit tests for app event v3 and for incremental results

  • unit-test

QA test for new app version

  • qa

Retire flag (TBD): lets data users know when a model will be removed.

  • retire-{date}
  • retire-20230201

Backfilling for app event v1; it will be removed once v1 is retired.

  • dev-backfill
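
As a sketch of how a tag ends up driving a pipeline (the daily-run and daily-test tags come from the list above; the model name is an assumption): a model declares the tag in its config, and the scheduled job selects by that tag.

-- models/marts/core/fact_orders_t.sql (hypothetical)
{{ config(tags=['daily-run']) }}

-- The Airflow-triggered jobs then run only the tagged models, e.g.:
-- dbt run --select tag:daily-run
-- dbt test --select tag:daily-test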

Workflow

Potential issues

  • There may be conflicts when multiple developers deploy at the same time.
  • dbt_mart_core may become bloated in the future.

Ref

Gitlab

dbt (Data Build Tool) Tutorial

Custom schema

Debugging

The target/compiled directory contains select statements that you can run in any query editor.

The target/run directory contains the SQL dbt executes to build your models.

The logs/dbt.log file contains all the queries that dbt runs and additional logging. Recent errors will be at the bottom of the file.
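
As a rough illustration of the difference (paths, names, and the exact DDL depend on your project and adapter): target/compiled holds the bare select, while target/run wraps it in the statement dbt actually executes.

-- target/compiled/my_project/models/marts/core/dim_customer_t.sql
select customer_id, customer_name
from analytics.dbt_staging.stg_app__customer

-- target/run/my_project/models/marts/core/dim_customer_t.sql
create or replace table analytics.dbt_mart_core.dim_customer_t as (
    select customer_id, customer_name
    from analytics.dbt_staging.stg_app__customer
)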
