DBT Core Introduction for Data Engineering

DBT Core Fundamentals

Wikasitha S Herath
ADL AI & Analytics Corner
Feb 29, 2024

--

Data Build Tool, known as DBT, is a command-line utility that helps data engineers transform data.

The field of data engineering is always changing, and DBT (Data Build Tool) is one of the most important tools to come out in the last few years. But what does DBT mean, and how can it help data engineers? The goal of this blog post is to answer these questions by giving a full overview of DBT Core.

In this article, I will discuss the principles of DBT Core and how it enables data engineers to build robust data pipelines efficiently.

1: What is DBT?

  • DBT allows us to write SQL queries without having to worry about dependencies.
  • Transformations are written in SQL.
  • It uses the Jinja templating engine on top of SQL.
  • We can retrieve, rearrange, and organize our data using additional logic and SQL.
  • DBT saves us time.
  • It simplifies and optimizes our data workflow.

Below are the common requirements that data engineers expect from an analytics tool:

  • Explore dependencies between models
  • Data quality tests
  • Error reporting
  • Modeling changes are easy to follow and revert
  • Incremental load of tables
  • Track the history of dimension tables
  • Easy to access documentation

DBT fulfills all of these requirements well and manages them nicely.

2: DBT overview

Fig 1: DBT overview

3: Different DBT options / DBT products

We can access DBT using DBT Core or DBT Cloud. DBT Cloud is built around DBT Core.

  • DBT Core (CLI)
  • DBT Cloud

Table 1: DBT Core vs DBT Cloud

In this article, we will discuss only DBT Core: an introduction and its key facts.

4: What is DBT Core?

DBT Core is free and open source. Individuals and companies wanting a cost-effective data transformation solution might consider DBT Core.

What to expect from DBT Core:

  • Command-Line Interface (CLI): DBT Core operates primarily through a command-line interface, offering a familiar and efficient way to manage data transformations.
  • Data Source Flexibility: DBT Core supports a wide range of data sources, including popular cloud data warehouses like Amazon Redshift, Snowflake, and Google BigQuery.
  • Modular SQL: DBT Core leverages SQL as its primary language for data transformation, allowing data engineers to leverage their existing SQL expertise.
  • Testing and Documentation: Similar to the full DBT offering, DBT Core empowers data engineers to implement automated data testing and document their data models comprehensively.

However, it’s important to note that DBT Core requires a bit more technical setup and configuration compared to the cloud-based solution. This might involve managing dependencies and setting up connections to data warehouses manually.

5: DBT Core major prerequisites

We can run DBT on any OS: Windows, macOS, or Linux.

To transform data in our project, we must have a data warehouse (Snowflake, AWS Redshift, BigQuery, etc.). DBT creates a connection to the data warehouse and runs SQL code against it. The image below shows the data warehouse solutions DBT supports.

Fig 2: DBT supported Data warehouse solutions

Version control system — to manage our code (Git, GitLab, Azure DevOps)

  • A version control system gives a DBT project several benefits: rolling back to previous versions, tracking changes, maintaining history, integration with CI/CD pipelines, and collaboration with conflict resolution.

6: Special DBT tips you must know

Here are some tips we should be aware of before starting a DBT project.

  • DBT is compatible with Python 3.7, 3.8, 3.9, 3.10, and 3.11.
  • It is better to install Python 3.11 so that it works with the latest dbt-snowflake adapter release. [Here I used Snowflake as the data warehouse.]
  • Don't use Python 3.12, since DBT doesn't support it at the moment.
  • Use and activate a Python virtual environment.
  • Ensure you have adequate permissions to create and manage database objects within your data warehouse.
  • If we experience any issue with DBT, we can usually figure it out from the files under the target/run directory.

7: DBT Core project main steps

  • Data warehouse creation & registration — [ex: Snowflake registration]
  • Dataset import
  • DBT core installation
  • DBT setup and data warehouse connection establishment (Snowflake connection)
  • DBT model implementation and testing
  • Setting up DBT documentation server

By following the above main steps, you’ll be able to set up a DBT Core project, import your datasets, define data transformations, test your models, and generate documentation to support your data engineering and analytics workflows.

8: What DBT can do for our data pipelines

  • Quickly and easily provide clean, transformed data ready for analysis.
  • Enable data analysts to write custom transformations as SQL SELECT statements.
  • Apply software engineering practices such as version control, testing, and continuous integration/continuous deployment (CI/CD).
  • Build reusable and modular code using Jinja.
  • Maintain data documentation and definitions within DBT while building and developing lineage graphs.
  • Perform simplified data refreshes within DBT Cloud.
  • Perform automated testing.

9: DBT Core installation

Here are the commands we need to execute to install DBT core on Linux (Ubuntu 20.04).

First, we should go to the project directory and activate the Python virtual environment.

Hint: We should have a recent, DBT-supported Python version such as Python 3.11.7.

Note: At this moment DBT doesn't support Python 3.12, so here I used Python 3.11.7.

python --version  

Python 3.11.7
  • DBT can be installed with a simple pip command.

Hint: If we are using Snowflake, we should install the most recent dbt-snowflake adapter version.

pip install dbt-snowflake==1.7.1
  • Next, we need to create a DBT configuration (DBT profile) folder under our home directory.
mkdir ~/.dbt
  • Then we can create a DBT project with the command below.
dbt init test_dbt
  • After executing the above command, it will ask for your data warehouse details. Here I just tried a test example, and "test_dbt" is our DBT project name.
Fig 3: DBT project initialization

Hint: We should provide the correct data warehouse type, database name, and the other connection details. The database here is the one we created in the data warehouse (Snowflake) when importing our data; in other words, a database must already exist in the data warehouse. Once you create the DBT project, verify whether it is working by running the dbt debug command.

dbt debug
  • If everything is fine (including the DBT-to-data-warehouse connection), then you should see output like in Fig 4.
Fig 4: DBT connection tests
  • As per the above image (Fig 4), if all the tests pass, then we can start working with DBT. Before we start, there are a few key concepts we should be clearly aware of. Let's look at the DBT project structure in Fig 5. YAML files are commonly used here for configuration settings (dbt_project.yml).
Fig 5: DBT project structure

Let's have a look at dbt_project.yml now (Fig 6). It holds all the global configurations of our DBT project.

Note: Here our DBT project name is DBTlearn2

Fig 6: dbt_project.yml file

The dbt_project.yml file contains our DBT project name, DBT version, internal config version, DBT profile name, and a set of default folders named analyses, dbt_packages, logs, macros, models, seeds, snapshots, target, and tests. Before starting a DBT project for our own use cases, we should have a clear understanding of these major default folders created by DBT.

10: DBT Core key facts

10.1: DBT Model

  • Models are the basic building blocks of your business logic.
  • They are materialized as tables, views, etc.
  • They live as SQL files in the 'models' folder.
  • Models can reference each other and use templates and macros.
  • DBT models are conventionally built using CTEs (Common Table Expressions).

10.1.1: CTE (Common Table Expressions)

--Syntax

WITH <name_of_the_result_set> ([column_names]) AS
(
    <cte_query>
)
<reference_the_CTE>

The WITH clause tells the database we are creating a common table expression. The column_names are completely optional and can be used to set up aliases for the columns that come out of the common table expression. The cte_query is just a SELECT statement, and its result acts as a temporary table that can be referenced in the FROM clause. The example below should give a clear idea of a CTE.

WITH raw_reviews AS
(
    SELECT * FROM airbnb.raw.raw_reviews
)
SELECT
    listing_id,
    date AS review_date,
    reviewer_name,
    comments AS review_text,
    sentiment AS review_sentiment
FROM raw_reviews
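In a DBT project, a query like the one above would itself be saved as a model (for example models/src_reviews.sql), and downstream models would reference it with the ref() function instead of a hard-coded table name; this is how DBT learns the dependency graph. A minimal, hypothetical sketch (the model and column names are assumptions for illustration, not from the article's project):

-- models/fct_reviews.sql (hypothetical downstream model)
WITH src_reviews AS
(
    -- ref() resolves to the table/view built by the src_reviews model
    -- and registers the dependency in DBT's lineage graph
    SELECT * FROM {{ ref('src_reviews') }}
)
SELECT
    listing_id,
    review_date,
    review_sentiment
FROM src_reviews
WHERE review_text IS NOT NULL

Because the dependency is declared through ref() rather than a hard-coded table name, DBT can work out the correct build order on its own when we run dbt run.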

10.2: Materialization

Materializations are the different ways a DBT model can be stored and managed in the data warehouse. There are four built-in materializations in DBT: view, table, incremental, and ephemeral, and the default materialization is view. We can configure these in dbt_project.yml, as shown on the bottom lines of Fig 6.

The materialization you pick determines how your DBT model is represented in the warehouse. For example, if you choose table as the materialization, your DBT model is created as a table; this is clearly visible on the DBT documentation home page (Fig 14), which is discussed in a later section.

Let’s see a nice comparison among those options (Fig 7).

Fig 7: DBT materialization types
  • When we don't need an actual materialization in the warehouse for our model, we can use ephemeral materialization; the model's SQL is then inlined into the models that depend on it. A per-model materialization config sketch follows below.
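Materialization can also be set per model with a config() block at the top of the model's SQL file, which overrides the defaults in dbt_project.yml. Below is a hedged sketch of an incremental model; the file, table, and column names are assumptions, not taken from the article's project:

-- models/fct_reviews.sql (hypothetical)
{{
    config(
        materialized = 'incremental',
        on_schema_change = 'fail'
    )
}}
WITH src_reviews AS
(
    SELECT * FROM {{ ref('src_reviews') }}
)
SELECT * FROM src_reviews
WHERE review_text IS NOT NULL
{% if is_incremental() %}
  -- on incremental runs, only process rows newer than what is already in the target table;
  -- {{ this }} refers to the model's own table in the warehouse
  AND review_date > (SELECT MAX(review_date) FROM {{ this }})
{% endif %}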

10.3: Seeds & Sources

When we do data modeling, input data can come in two ways.

1: The data has already been loaded by integration / ETL tools / an ETL process.

2: The dataset is available locally on our PC/laptop (only practical for small datasets).

  • Seeds are local files (from our laptop) that we upload to the data warehouse through DBT.
  • A source is an abstraction layer on top of your input tables.
  • If data is already in the data warehouse, it can be declared as a source in DBT and queried with the source() function, as sketched after this list.
  • To make sure our sources are up to date, and to get alerts when they are not, we can use source freshness checks.
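As a small sketch of how a source is used, a staging model selects from it with the source() function instead of a hard-coded table name. The source name 'airbnb' and table name 'reviews' are assumptions for illustration and would first have to be declared in a sources YAML file:

-- models/src/src_reviews.sql (hypothetical)
WITH raw_reviews AS
(
    -- source('airbnb', 'reviews') resolves to the physical table declared
    -- in the sources YAML file, e.g. AIRBNB.RAW.RAW_REVIEWS
    SELECT * FROM {{ source('airbnb', 'reviews') }}
)
SELECT * FROM raw_reviews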

10.4: Snapshots

Here we need to know about type-2 slowly changing dimensions.

Let's assume we must update a record, but we don't just want to overwrite it; we also need to keep the history of the changes. This is a well-known best practice in the data warehousing world, called slowly changing dimensions. The images below (Fig 8, 9, and 10) clearly describe what happens.

Fig 8: Old record

We need to update it like this:

Fig 9: New updated record

Then what DBT does in the data warehouse looks like the image below. This is the DBT snapshot.

Fig 10 : DBT snapshot

Here, dbt_valid_from is the date and time when the change happened or when the record was initially loaded, and dbt_valid_to records when the next change happened. We can identify the current version of a record from the dbt_valid_to column: if its value is NULL, the record is the current (latest) version; if it has a value, the record is an old version that has been replaced by a newer one. The result in Fig 10 is called a DBT snapshot.
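In DBT (as of the 1.7 version used here), a snapshot is a SQL file in the snapshots folder wrapped in a snapshot block. Below is a minimal sketch using the timestamp strategy; the file name, target schema, unique key, and updated_at column are assumptions for illustration:

-- snapshots/scd_raw_listings.sql (hypothetical)
{% snapshot scd_raw_listings %}

{{
    config(
        target_schema = 'dev',
        unique_key = 'id',
        strategy = 'timestamp',
        updated_at = 'updated_at'
    )
}}

SELECT * FROM {{ source('airbnb', 'listings') }}

{% endsnapshot %}

Running dbt snapshot builds and maintains the history table, adding the dbt_valid_from and dbt_valid_to columns shown in Fig 10.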

10.5 Tests

Two types of tests are available in DBT. Singular & Generic.

Singular tests are SQL queries stored in the tests folder.

DBT has four built-in generic tests:

  • unique : for a given column we can define a uniqueness constraint.
  • not_null : for a given column we can define a not-null constraint.
  • accepted_values : we can pass a list of values, and those will be the only values the column is allowed to contain.
  • relationships : ensures that every value in a column is a valid reference to a value in a column of another table (referential integrity).

We can also define our own custom tests in DBT.
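A singular test is just a SELECT statement saved in the tests folder; the test passes when the query returns zero rows. A small, hypothetical sketch (model and column names assumed) that checks that no review carries a future date:

-- tests/no_future_review_dates.sql (hypothetical)
SELECT *
FROM {{ ref('fct_reviews') }}
WHERE review_date > CURRENT_DATE
-- the test fails if this query returns any rows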

10.6 Macro & DBT test packages

  • Macros are Jinja templates created under the macros folder.
  • There are also many built-in macros in DBT.
  • We can use macros in model definitions and tests.
  • A special macro called test can be used to implement our own generic tests.
  • Macros can be used for custom tests and for implementing reusable SQL code; a small sketch follows after this list.
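As a hedged sketch of a macro used as a reusable test (the macro name and file paths are assumptions), a macro can build SQL dynamically with Jinja and then be called from a singular test file:

-- macros/no_nulls_in_columns.sql (hypothetical)
{% macro no_nulls_in_columns(model) %}
    SELECT * FROM {{ model }}
    WHERE
    -- loop over the model's columns and fail on any NULL value
    {% for col in adapter.get_columns_in_relation(model) %}
        {{ col.column }} IS NULL OR
    {% endfor %}
    FALSE
{% endmacro %}

-- tests/no_nulls_in_dim_listings.sql (hypothetical singular test that calls the macro)
{{ no_nulls_in_columns(ref('dim_listings_cleansed')) }}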

10.7 Documentation

  • DBT lets us keep documentation as close as possible to the actual source code.
  • It can be done in two ways (1 & 2).
  • 1> In YAML files
  • 2> In standalone markdown files
  • DBT ships with a lightweight documentation web server.
  • We can add our own assets (images) to a special project folder.
  • We can generate the DBT documentation via the dbt docs generate command.
  • After executing this command, DBT reads all our YAML files, picks up the documentation sections, and compiles them into the home page of the documentation server.
Fig 11: DBT documentation catalog JSON file
  • After the catalog JSON file is created, we can start the documentation server with it.
  • We can run dbt docs serve to bring up the documentation server portal.
  • dbt docs serve > output is below
Fig 12: DBT docs serve output

Here we need to make sure that port 8080 is not already in use by another process, since the DBT documentation server needs port 8080 to be free. Also, you can only bring the documentation server up once your DBT Core implementation (models, tests, macros, snapshots, etc.) has completed successfully. Below is a sample view (Fig 13) of the DBT documentation home page.

Fig 13: DBT documentation server home page

Via the documentation server we can see everything we previously did with DBT Core in the shell: all the project details, database details, tables, SQL code, compiled code, and data lineage (Fig 15). From Fig 14 we can see that our created DBT model is named "dim_listings_cleansed" and that its materialization type is table; this is clearly visible there.

Fig 14: DBT doc server home page full view
Fig 15: DBT data lineage

Furthermore, the overview tab in the DBT documentation server gives a detailed chart of the relationships among the data warehouse tables (Fig 16).

Fig 16: DBT doc overview tab

10.8: Extending SQL with Jinja

  • DBT uses Jinja, a Python-based template engine, to write code and leverage open-source innovations.
  • Jinja can be used in combination with SQL to create dynamic and flexible SQL queries.
  • You can see the Jinja syntax in the image below; a small generated-SQL sketch follows after it.
Fig 17: Jinja syntax
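For instance, a Jinja for-loop can generate repetitive SQL instead of writing it out by hand. A small assumed example (the model name and sentiment values are illustrative) that pivots review sentiments into one column per value:

SELECT
    listing_id,
    {% for sentiment in ['positive', 'neutral', 'negative'] %}
    SUM(CASE WHEN review_sentiment = '{{ sentiment }}' THEN 1 ELSE 0 END)
        AS {{ sentiment }}_reviews{% if not loop.last %},{% endif %}
    {% endfor %}
FROM {{ ref('fct_reviews') }}
GROUP BY listing_id

When DBT compiles this model, the loop expands into three SUM(CASE ...) columns, and the compiled SQL can be inspected under the target/compiled directory.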

11: Useful DBT commands

  • dbt init : Initializes a new DBT project directory
  • dbt run : Executes all SQL files within your project, effectively running your data transformations.
  • dbt test : Runs any tests defined within your project to ensure data quality and consistency.
  • dbt build : Runs your seeds, models, snapshots, and tests together in dependency order, mimicking a production deployment workflow.
  • dbt snapshot : Executes the snapshot files in your project to capture changes in mutable source tables over time.
  • dbt show : Displays information about specific resources like models, tests, or seeds.
  • dbt source : Provides subcommands for working with sources, for example dbt source freshness to check whether source data is up to date.
  • dbt docs generate : Generates the DBT documentation catalog JSON file.
  • dbt docs serve : Starts the DBT documentation server.
dbt show

09:29:21  Running with dbt=1.7.7
09:29:21  Registered adapter: snowflake=1.7.1

12: Troubleshooting

If we experience any issue in a DBT Core project, we can identify it by executing the dbt run command; the error messages are highlighted in red. We may then check under the /target directory, where we can see two subdirectories called run and compiled. Under the /run directory we can open our project and inspect the compiled models, seeds, snapshots, etc. From there we can find the root cause of the issue and troubleshoot it.

Fig 18: DBT run error handling

Conclusion

By utilizing the data warehouse's built-in capabilities, DBT can significantly improve data workflows. Data engineers can optimize and improve the efficiency of their data workflows with DBT Cloud and DBT Core. Before we start using DBT, we should understand DBT Core, its key concepts, and how to install and debug it, so I've done my best to explain that in this first article about DBT.


Disclaimer: ADL is not responsible for any damage caused by any of the articles to any internal or external parties.
