How I do dbt (data build tool)
This article shares tips I learned from setting up a dbt project for my team, which I think can be valuable for other data engineering teams trying out this great tool.
For those who are new to dbt, it's basically a data transformation tool that follows software engineering best practices. dbt lives in the T part of ELT (extract, load, transform) processes.
Of course, dbt is an easy-to-use data transformation tool that requires minimal effort to set up, especially for those on the cloud version (dbt Cloud), which already includes basic data engineering functionality like CI/CD, Slack alerting, a scheduler, and documentation. But in engineering hands, there is always room for improvement.
Here, improvement means increasing ease of use and convenience for our data team (BI and DA), who use dbt on a daily basis.
Organize everything with dbt_project.yml
First things first: organize everything, and I do it with dbt_project.yml. This is the main configuration file that defines the dbt project's structure, such as how files are organized and which variables are defined.
- Call the pre-hooks and post-hooks you need here. As an example, adding the start_log() and end_log() macros as hooks in the model config logs each SQL model execution to a log table. I also add grant_table() as a post-hook that grants a table to the default or a schema-specific user after each model is successfully created.
- Define global variables and settings. Besides hooks, I also define global variables and default settings here. In this example, I define a global variable for schema aliasing, set the default materialization to table, and enable persist_docs for both columns and relations.
- Structure the data layers. The last one is to organize data layers by directory structure. In this example, I organize SQL models by directory name as a data layer (the data_mart directory for the data_mart layer), so when there is a request to create a table in the data_mart layer, you just place the SQL file in the data_mart directory. You can organize it in other ways, such as by business unit or service.
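Putting the three points above together, a dbt_project.yml might look like the following. This is a minimal sketch: the project name, variable names, and hook macros (start_log, end_log, grant_table) follow the examples above and are assumptions, not a definitive layout.

```yaml
# dbt_project.yml — minimal sketch; names are placeholders
name: my_dbt_project
profile: my_profile

vars:
  schema_alias: analytics            # example global variable for schema aliasing

models:
  my_dbt_project:
    +materialized: table             # default materialization
    +persist_docs:
      relation: true
      columns: true
    +pre-hook: "{{ start_log() }}"   # log model start to a log table
    +post-hook:
      - "{{ end_log() }}"            # log model end
      - "{{ grant_table() }}"        # grant the new table to users
    data_mart:
      +schema: data_mart             # models/data_mart/*.sql land in the data_mart layer
```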
Maximize macro usage
Macros are a great way to add functionality to your dbt project. Notice that in the previous point I use the grant_table() macro as a hook that automatically grants a newly created table to users.
Other examples of macros I use are source(), generate_schema_name(), primary_key(), and timezone_default(). Apart from defining macros as functions, they can also be used as templates to replace repetitive queries (e.g., index(), partition()).
SQL model that uses macros
Macro usage depends on the data warehouse you use (Postgres, Snowflake, BigQuery, etc.). In my case, I use a partition() macro to partition tables in Postgres, but in BigQuery this macro isn't needed since dbt has built-in support for table partitioning (partition_by and cluster_by).
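As a sketch, the grant_table() post-hook mentioned earlier might look like the following. The default user list and the grant syntax (Postgres-style) are assumptions; `{{ this }}` resolves to the model's relation when the hook runs.

```sql
-- macros/grant_table.sql — hypothetical sketch of the grant_table() hook;
-- user names are placeholders, syntax assumes Postgres
{% macro grant_table(users=['reporting_user']) %}
  {% for user in users %}
    grant select on {{ this }} to {{ user }};
  {% endfor %}
{% endmacro %}
```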
Define model templates
Model templates allow you to start developing SQL models from predefined per-team templates and settings. This can be useful to unify SQL models and tables across the users/teams that use dbt. In our case, I define template.sql and template_incremental.sql (for incremental models) so users can copy and change the template models as needed.
Template models can also be used as guidelines on how to use macros or any dbt configuration properties.
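For illustration, an incremental template along the lines of template_incremental.sql might look like this. The source, column, and unique-key names are placeholders; the `is_incremental()` guard is standard dbt idiom for only processing new rows on incremental runs.

```sql
-- models/template_incremental.sql — sketch of a starter incremental model;
-- source and column names are placeholders
{{ config(materialized='incremental', unique_key='id') }}

select *
from {{ source('raw', 'my_table') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what's already loaded
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```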
Make use of schema.yml
“Documentation is a love letter that you write to your future self”
~ Damian Conway
Well, it turns out it's not just for your future self, but also for your future scripts. The main use of schema.yml is to provide documentation for your dbt project; its other use is to store model metadata. When a SQL model is executed, this metadata is aggregated with invocation logs/metrics and produced as dbt artifact files (target/run_results.json, target/manifest.json), which can later be used by your data monitoring and alerting platform. In our case, model ownership metadata is used in Slack alerting/notifications, and other metadata feeds pipeline reporting/logging.
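A schema.yml carrying that kind of metadata might look like the following sketch. The model name, owner, and channel values are placeholders; the `meta` block is what later surfaces in manifest.json for alerting.

```yaml
# models/schema.yml — sketch; model/owner/channel names are placeholders
version: 2

models:
  - name: fct_orders
    description: "Daily order fact table"
    meta:
      owner: data-team           # picked up from manifest.json for Slack alerting
      slack_channel: "#de-alerts"
    columns:
      - name: order_id
        description: "Primary key of the order"
        tests:
          - unique
          - not_null
```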
dbt Cloud as a collaboration tool
dbt Cloud is great for collaborating on and developing SQL models, but not as good as a production environment, so we don't use it for production. And… we, the data team, already have a centralized orchestration platform (like Airflow, Rundeck, etc.) to organize our data pipelines across multiple data platforms. In our case, we host the dbt environments separately: the development environment on dbt Cloud, and the production environment on-premise with dbt-cli. dbt makes it easy to maintain separate production and development environments through the use of targets within profiles.yml.
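A profiles.yml with separate dev and prod targets might be sketched like this. Hosts, schemas, and credentials are placeholders (shown for Postgres, since that's the warehouse used above); only the `target` key and the named `outputs` are the essential mechanism.

```yaml
# profiles.yml — sketch of dev/prod targets; all values are placeholders
my_profile:
  target: dev                    # default target; override with --target prod
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dev_user
      password: "{{ env_var('DBT_PASSWORD') }}"
      dbname: analytics_dev
      schema: dbt_dev
      threads: 4
    prod:
      type: postgres
      host: prod-db.internal
      port: 5432
      user: dbt_prod
      password: "{{ env_var('DBT_PASSWORD') }}"
      dbname: analytics
      schema: analytics
      threads: 8
```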
Those who use dbt on a single BigQuery project (or similar) just need to specify targets and override two built-in macros (source() and generate_schema_name()) to automatically switch environments and schemas based on target properties. Under the hood, dbt uses the source() macro to query data from source tables, and generate_schema_name() to resolve the target schema when creating a table.
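An override of generate_schema_name() can key off the active target, for example using the custom-schema pattern from the dbt documentation: in prod, use the custom schema as-is; in dev, fall back to the developer's own target schema so runs don't collide.

```sql
-- macros/generate_schema_name.sql — overriding the built-in macro so the
-- resolved schema depends on the active target ('prod' is an assumed target name)
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if target.name == 'prod' and custom_schema_name is not none -%}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {{ default_schema }}
    {%- endif -%}
{%- endmacro %}
```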
I think that's it for now…
You might have different opinions or preferences on how to set up a dbt project for your team, so feel free to jot down some ideas/advice on how you set up yours.
Did you find this blog post helpful? If so, bring this article to your next DE team weekly meeting, or… you can just share it so more people can find it. Thank you.