DBT Guide (share version)
This guide was created to introduce DBT to our team.
The main objective is to provide clear and consistent instructions and standards, to make it easier for users to get started, and to avoid some of the issues we encountered before.
Some practices are based on articles from the DBT community and Gitlab Data Team. The relevant links are provided within the guide.
這個 Guide 是先前 team 內決定要嘗試導入 DBT 時,順帶為了解決組織遇到的問題所訂出的使用規範,主要目的是希望透過清楚統一的說明和規則,讓後續使用者容易上手,也讓 DE 的日子好過一點(苦笑
部分結構是參考 DBT 社群及 Gitlab Data Team 的文章,在 Guide 裡面可以找到相關連結
感謝 DBT local taipei 主辦 Karen 鼓勵我發文
希望大家在資料這條路上持續教學相長不孤單 :)
Notion version
What is DBT?
DBT is the T in ELT.
It doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse. This “transform after load” architecture is becoming known as ELT (extract, load, transform).DBT is a compiler and a runner
At the most basic level, DBT has two components: a compiler and a runner. Users write DBT code in their text editor of choice and then invoke DBT from the command line. DBT compiles all code into raw SQL and executes that code against the configured data warehouse.
How we structure our DBT projects?
and Best practices
✨ Encourage to add data descriptions to YAML!!!
Data stage
- Sources:
Schemas and tables in a source-conformed structure (i.e. tables and columns in a structure based on what an API returns), loaded by a third-party tool. https://docs.getdbt.com/docs/building-a-dbt-project/using-sourcessrc_{source}__{object}
Table - Staging models:
The atomic unit of data modeling. Each model bears a one-to-one relationship with the source data table it represents. It has the same granularity, but the columns have been renamed, recast, or usefully reconsidered into a consistent format.stg_{source}__{object}
View
DateTime relate columns should be transformed to epoch sec or timestamp with no timezone and specific show on the column name.
{what ever}_{date/timestamp}_{epoch/utc/tz}
epoch → int
utc → timestamp (with no time zone)
tz → timestamp (with time zone)Please add a specific time zone when using this type as the condition.
Otherwise, it may get unexpected results on different platforms.
Good way
where stored_to_backend_timestamp_tz >= '2020-03-14'::timestamp at time zone 'Asia/Jakarta'
Bad way
where stored_to_backend_timestamp_tz >= '2020-03-14'
There is a primary key that is both unique and not null (and tested).
- Marts models:
Models that represent business processes and entities, abstracted from the data sources that they are based on.{dim/fact}_{object}_{t/v}
Table or View
Table or View. Name with a suffix type.
fact_{object}_{t/v}
: A tall, narrow table representing real-world processes that have occurred or are occurring. The heart of these models is usually an immutable event stream: sessions, transactions, orders, stories, and votes.
dim_{object}_{t/v}
: A wide, short table where each row is a person, place, or thing; the ultimate source of truth when identifying and describing entities of the organization. They are mutable, though slowly changing: customers, products, candidates, buildings, and employees.
summary_{aggregated time window}_{object}_{t/v}
: Aggregation based on time window or other dimensions.
Folder structure
marts
- core:
engineer maintenance
- de:
engineer maintenance
Special schema for intermediate transform. - dim/fact_{}_t/v/cte/it.sql
- cte
- it → incremental table
staging
engineer maintenance
- Service group.
Data warehouse structure
Which data can we use?
As an analyst in the business group
Please only use models from dbt_mart_core
or dbt_mart_{your business group}
. Any model from other schemas may change fast and engineer won’t keep the old version.
DBT Style Guide
linter
formatter
Tag Strategy
Currently, we don’t have strict limits on tags. But some tags are used for airflow triggers. Please be careful and avoid using them in the wrong way.
Major ETL pipeline
daily-run
daily-test
DS ML model
- prefix:
ds-{*}
{*}_full-run
refresh all data{*}_batch-run
only fetch recent data
Anomaly detection
anomaly
Unit test for testing the app event v3 and incremental result
unit-test
QA test for new app version
qa
Retire flag (TBD) Let data users know when it will be removed.
retire-{date}
retire-20230201
Backfilling for app event v1 will be removed once the v1 version is retired.
dev-backfill
Workflow
Potential issues
- There may be conflicts when multiple developers deploy at the same time.
dbt_mart_core
would be fat in the future.
Ref
Gitlab
dbt(Data Build Tool) Tutorial
Custom schema
- https://docs.getdbt.com/docs/building-a-dbt-project/building-models/using-custom-schemas
- https://docs.getdbt.com/docs/guides/debugging-schema-names
Debugging
The
target/compiled
directory containsselect
statements that you can run in any query editor.The
target/run
directory contains the SQL dbt executes to build your models.The
logs/dbt.log
file contains all the queries that dbt runs and additional logging. Recent errors will be at the bottom of the file.