Transforming your data with ‘Data Build Tool’ (DBT)

Chidumga Izuzu
Published in cisco-fpie
5 min read · Jan 21, 2022

ETL (Extract, Transform, and Load) processes have been the subject of growing conversation in recent years. Several tools support the process, and one of them is dbt — software originally developed as an in-house tool by Fishtown Analytics (now dbt Labs), then a consulting firm, to meet the needs of its clients. I recently spent a couple of days experimenting with dbt and decided to put together an article that captures what dbt offers and how it fits into the overall data landscape.

dbt is a transformation tool in the ELT pipeline that lets teams transform data following software engineering best practices such as modularity, portability, CI/CD, and documentation. Through native support for Snowflake, Amazon Redshift, Google BigQuery, and PostgreSQL, dbt runs SQL queries against your database, warehouse, platform, or query engine to materialize your data as tables and views. It’s important to note that dbt does not extract or load your data into a warehouse; it focuses on transforming it after ingestion. It covers only the T of the ELT process.

dbt adopts the ELT (Extract, Load, Transform) model over ETL (Extract, Transform, Load). Instead of the ETL approach, in which data moves from the source to a staging area for transformation before loading, the ELT approach loads raw data first and transforms it directly in the warehouse. There is no need for a separate staging environment.

What Does DBT Offer?

  1. Reusable code

The exciting thing about dbt is that it allows SQL code to be reused, which reduces the need for complex, duplicated scripts. In dbt, every model is a SELECT statement, and models can be configured and reused through YAML files and the Jinja templating language. Instead of building intermediary tables just to reuse logic, dbt uses macros (snippets of SQL) to avoid repeating code across multiple models. Macros are defined in .sql files in your macros directory and can be reused any number of times by referencing them. Because dbt lets you combine SQL with Jinja, you can bring programming constructs such as if statements and for loops into your SQL queries.
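As a rough sketch of the macro pattern described above (the macro, column, and source names here are hypothetical, and the ::numeric cast assumes a Postgres-style warehouse), a macro defined in the macros directory can be called from any model:

```sql
-- macros/cents_to_dollars.sql (hypothetical example macro)
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100)::numeric(16, 2)
{% endmacro %}

-- models/payments.sql — the macro is reused instead of repeating the cast
select
    payment_id,
    {{ cents_to_dollars('amount_cents') }} as amount_dollars
from {{ source('stripe', 'payment') }}
```

At compile time, dbt expands the macro call into the underlying SQL expression, so the logic lives in one place even if dozens of models use it.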

Essentially, with dbt, you build SQL models on top of source tables, and other models can reference them using the ref function. Using ref ensures that the current model selects from the correct upstream tables and views, and it lets dbt infer the order in which to run your models, so there’s no need to define dependencies explicitly.
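A minimal sketch of the ref function (model names here are hypothetical): the downstream model references its upstream models, and dbt works out the run order from those references:

```sql
-- models/customer_orders.sql (hypothetical downstream model)
-- ref() tells dbt this model depends on stg_customers and stg_orders,
-- so dbt builds those first and resolves them to the right schema.
select
    c.customer_id,
    count(o.order_id) as order_count
from {{ ref('stg_customers') }} as c
left join {{ ref('stg_orders') }} as o
    on o.customer_id = c.customer_id
group by c.customer_id
```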

SQL without Jinja: the repeated payment_method logic is verbose and can be challenging to maintain.
Same SQL with Jinja: the repeated code is replaced with a for loop.
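The pattern those two captions describe can be sketched roughly like this (the column and model names are assumptions, following the common dbt tutorial example): the Jinja for loop expands into one sum(case …) column per payment method at compile time:

```sql
-- models/order_payments.sql (hypothetical)
{%- set payment_methods = ['bank_transfer', 'credit_card', 'gift_card'] -%}

select
    order_id,
    {% for payment_method in payment_methods %}
    -- compiles to one aggregate column per payment method
    sum(case when payment_method = '{{ payment_method }}' then amount end)
        as {{ payment_method }}_amount,
    {% endfor %}
    sum(amount) as total_amount
from {{ ref('stg_payments') }}
group by order_id
```

Adding a new payment method then means editing one list, not copy-pasting another case expression.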

dbt comes in two flavors: dbt Core and dbt Cloud. dbt Core is open source and includes a command-line interface (CLI), so you can run dbt commands from a terminal alongside your own text editor. dbt Cloud is a web-based IDE that does everything the CLI does, with additional features such as lineage, orchestration, documentation, and an automated Git workflow.

Automated Documentation and the Directed Acyclic Graph (DAG):

dbt can generate documentation for your project. It walks through the project and the warehouse to compile information including your actual code, a DAG, tests, column data types, and tables. You can add further model details, such as tests and descriptions, in YAML for more comprehensive documentation. Running ‘dbt docs generate’ produces a catalog.json file in your target directory.

dbt goes through your project and your warehouse to generate documentation like this about your project

The DAG offers an overview of the data sources involved in a particular transformation and of how data flows from source to target. dbt parses the ref and source functions in your models to infer the relationships between them, and it automatically updates the DAG as those references change. This visualization can be helpful for less technical people who are not interested in the technological details but want an overview of how data flows throughout an organization. You can also include pre- and post-hooks in your models for effective database administration.
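A sketch of a post-hook (the grant statement and role name are assumptions, and the exact grant syntax depends on your warehouse): hooks are set in a model's config block and run before or after the model is built:

```sql
-- models/orders.sql (hypothetical)
{{ config(
    materialized = 'table',
    post_hook = "grant select on {{ this }} to reporter"
) }}

select * from {{ ref('stg_orders') }}
```

Here {{ this }} resolves to the model's own table, so the grant is reapplied automatically every time the table is rebuilt.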

A DAG for a simple dbt project

Automated Testing

dbt comes with built-in tests such as:

  • unique: checks that every value in a column is unique.
  • not_null: checks that a column contains no null values.
  • accepted_values: compares the contents of a column against a list of accepted values.
  • relationships: checks that every value in a column exists in a column of another table (referential integrity).

You can also write custom test cases for specific business logic by combining Jinja and SQL. Built-in tests are configured by referencing them in the same YAML file used to document a given model. If the data fails to meet the expectations set in the YAML file, dbt flags the error.
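One way to sketch a custom test (file and model names are hypothetical) is a SQL file in the tests directory: dbt runs the query and fails the test if it returns any rows:

```sql
-- tests/assert_no_negative_order_totals.sql (hypothetical singular test)
-- Fails if any rows are returned, i.e. if any order total is negative.
select
    order_id,
    sum(amount) as total_amount
from {{ ref('payments') }}
group by order_id
having sum(amount) < 0
```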

An example of a schema file with descriptions and tests
An example of passing tests
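A schema file of the kind captioned above might look roughly like this (the model and column names are assumptions), combining column descriptions with the four built-in tests:

```yaml
# models/schema.yml (hypothetical)
version: 2

models:
  - name: orders
    description: "One row per order."
    columns:
      - name: order_id
        description: "Primary key of the orders table."
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id
```

Running ‘dbt test’ then executes every test declared here, and the same descriptions feed the generated documentation.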

Deployment

Users can publish packages to public and private repositories using dbt’s built-in package manager, and other users can pull these into their own projects. dbt Cloud also offers an automated Git workflow: it connects with GitHub to automate your continuous integration pipelines, so you don’t have to worry about orchestration.
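Pulling in a published package is a one-line entry in packages.yml, installed by running ‘dbt deps’ (the version pin shown is an assumption; check the package's releases for one compatible with your dbt version):

```yaml
# packages.yml (hypothetical pin)
packages:
  - package: dbt-labs/dbt_utils
    version: 0.8.0
```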

Jobs are dbt commands (such as dbt run and dbt test) set to run on a schedule.
You can specify when and how dbt Cloud should trigger a job

Flexibility:

dbt also offers a great deal of flexibility. If the default project structure doesn’t fit your needs for some reason, you can customize it by editing the project configuration file (dbt_project.yml) to rearrange folders as you prefer.
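A sketch of that kind of customization (the folder names are assumptions; the path keys shown are those used by dbt 1.0+):

```yaml
# dbt_project.yml (hypothetical reorganization of the default layout)
name: my_project
version: '1.0.0'
profile: my_profile

model-paths: ["transformations"]  # instead of the default "models"
test-paths: ["quality_checks"]    # instead of the default "tests"
```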

dbt has comprehensive documentation offering step-by-step guidance on installing and initializing a project, creating a profile, and running models. I tried out dbt Cloud with BigQuery as the adapter, but you can choose any of the supported adapters.

I would recommend that anyone working with the ELT approach try out dbt. It is important to note that users must know how to write efficient SQL to get the most out of dbt. You can join the dbt community or contribute to the project on its GitHub page.

Chidumga Izuzu
Software Engineer @Cisco Emerging Tech & Incubation