Practical Guide to DBT: Unraveling Architecture and Initial Configuration.

A 3-step journey to mastering the data transformation tool — DBT — Part 1 of 3.

Alice Thomaz
10 min readJun 25, 2023

This will be the first in a series of articles aimed at guiding and explaining more about DBT, a widely used data transformation tool in the current market. The purpose of this article is to introduce the tool, provide an overview of the motivations and architecture implemented at Pebmed, a company within the Afya group, and present a step-by-step guide to getting started with the tool locally.

>>But what exactly is DBT?

DBT, or Data Build Tool, is a data transformation tool that focuses on SQL queries, enabling data engineers and analysts to manipulate their Data Warehouse more efficiently.

Below, we can visualize the transformation flow in the tool. DBT has native integration with leading cloud Data Warehouse platforms. By reading raw data from these providers, we can go through the following steps:

  1. Development: Here, it is possible to write transformation code in .sql and .py files. It is important to note that Python code availability may vary across platforms, so it is necessary to consult the official website for confirmation.
  2. Testing and documentation: It is possible to perform local tests on all models before submitting them to the production repository. Additionally, it is possible to document everything that is created, such as tables and columns, add custom information, and visualize the lineage of the models. This allows for quickly understanding dependencies and conducting impact analysis for potential changes.
  3. Deploy: It is possible to safely deploy to the development environment. Version control enabled by Git allows for multiple collaborations without causing major damage to the final production environment.

>>Implementing the Tool at Afya

The tool was initially being used in one of the companies acquired by the group since 2020, iClinic, and in 2022 we implemented it in the other pillars. In this article, I will focus on the process carried out in the Pebmed BU.

The DBT implementation project aimed to:

  • Empower the analysis and data science chapters to perform business data transformations through SQL queries.
  • Create an environment that serves as a single source of truth, with well-organized and documented business metrics, insights, and definitions.
  • Provide resources for defining tests, impact analysis through table lineage, and creating alerts to identify issues.
  • Provide freedom to the engineering team regarding the orchestration platform by replacing Snowplow BDP with DBT Cloud.

Architecture

Currently, there are two versions of DBT:

  • DBT Core: It is an open-source command-line tool that allows for local data transformation following best practices.
  • DBT Cloud: It is a web interface that enables fast and reliable implementation of DBT. Through this interface, it is possible to develop, test, schedule, and visualize models.

At Afya, we chose to work locally with the development and testing of our environment. With the support of an Ubuntu image, we performed the configuration processes of a Docker container that is the same for all users. A container is an isolated unit that packages an application along with all its dependencies, allowing for consistent and portable execution in different environments. With this, we were able to adapt and isolate a unique environment for local development.

We follow the flow shown in the image below, where users can collaborate with our production repository on GitLab, which is run daily with the help of DBT Cloud as the orchestrator. Since our architecture is designed on AWS, we chose to continue using Redshift as our Data Warehouse tool. Thus, DBT Cloud reads the models in the repository and divides them into execution stages to generate processed and updated data in the DW. These data are subsequently used in dashboards presented to the business areas through Power BI, which establishes an ODBC connection with Redshift.

>>Setting up the computer

The process of configuring your computer will be divided into two parts: first, we will configure Docker, and then VSCode.

Docker

Docker is a virtualization platform that allows you to create and run applications in isolated containers, providing a secure and customizable environment for dbt model development. To configure it, follow the steps below:

1. First, let’s install the tool. You can find the installer at this [link]. If you are using Windows, Docker may ask you to install WSL.

2. Next, choose a folder of your preference, preferably within your user directory, and create a folder named “dbt-config” with the following listed files:

2.1. dbt-config > Dockerfile

FROM ubuntu:20.04
COPY ./.dbt /root/.dbt

RUN apt update && apt upgrade -y
RUN apt install telnet -y
RUN apt install vim -y
RUN apt install git -y
RUN apt install python3 -y
RUN apt install python3-pip -y
RUN pip install dbt-core==1.2.2 dbt-redshift==1.2.1

This file will be responsible for configuring our container. There are two points to pay attention to in the last line:

  • The dbt-core version is constantly updated, so it’s important to keep up with the official dbt pages to stay informed about updates. However, be cautious about version changes to avoid conflicts, especially if you are using DBT Cloud as your orchestrator. It’s ideal to keep both in the same version.
  • The version issue also applies to the chosen adapter. In our case, since we work with Redshift, we install the corresponding adapter for it. Make sure to use the adapter that matches the Data Warehouse defined by your company in this part.

2.2- dbt-config > .dbt > profiles.yml

When invoking dbt via the command line to execute models and processes, it reads your profile to establish a connection with the database and identify some environment customizations.

Before proceeding to the next steps, you should create the profiles.yml file and fill in the information with your own data. Below is an example used here at Afya:

config:
partial_parse: false

pebmed:
target: dev
outputs:
dev:
type: redshift
host: localhost
user: your_dw_username
pass: your_dw_password
port: 5439
dbname: database_name
schema: dbt_eng
threads: 4
prod:
type: redshift
host: dw_host
user: your_dw_username
pass: your_dw_password
port: 5439
dbname: database_name
schema: dbt_prod
threads: 4

You can have as many profiles as necessary, with 1 profile per Data Warehouse. However, typically companies work with a single DW. Let’s explain the main topics separately:

  • Profile name: Replace it with the name of your dbt project. Usually, the company name is used as a reference. Here, we use “pebmed”. The profile name should be the same as in the “dbt_project.yml” file, which is the document with the main configurations of your project.
  • Target: You can create multiple targets, but in this initial field, you should define the default one. Whenever you run a model and don’t specify a specific target, the default one defined here will be used. Here, we use “dev” as the default.
  • Outputs: Here, we will separate them by target. At Afya, we use the “dev” target for all users, allowing us to include some restrictions in the environment. Additionally, for the engineering team that needs to perform constant maintenance in production, we also include a “prod” target.
    - Type: The type of Data Warehouse you are connecting to.
    - Credentials: In this part, you should include the credentials of your Data Warehouse, such as the local host, username, password, port, and database name.
    - Schema: The schema prefix in which dbt will generate the objects. Here, we use 3: “dbt_prod” for the production environment, “dbt_analysis” for the testing environment of the science and analysis team, and “dbt_eng” for the testing environment of the engineering team.
    - Threads: The maximum number of models that DBT can execute simultaneously, the default is 4, but you can customize it in this field.

Note: On Mac machines, files that start with “.” are usually hidden. To view them, you may need to use the following command: Cmd + Shift + “.”.

3. If you are using a Windows computer, open “Windows PowerShell”. If you are using a Mac, open the Terminal.

4. With the application open, navigate to the folder where you saved the “dbt-config” file. To do this, you can go back one folder using the command “cd ..” or go into a folder using the command “cd folder_name”.

5. Once inside the “dbt-config” folder, run the following command. This will initiate a series of installations. Wait until all the steps are completed.

docker build -t ubuntu:20.04 .

6. Now execute the following command. It will create a new container with the Ubuntu image, and you will enter directly into it due to the “-it” command.

docker run -it --name dbt_pebmed ubuntu:20.04

Note: You can choose a name of your preference for the container. Here, we use the default “dbt_” + company_name.

7. Execute the command “cd root”.

8. From this point, we will have two different paths. If you haven’t created a project in dbt yet, create a folder with the chosen name using the command “mkdir project_name” (at Afya, we use the name “dbt-dev”) and proceed to the configuration of your VSCode. More details on which folders to create and their functions will be explained in Part 2 of this guide.

If you already have a dbt project created in an online repository, clone it locally. To do this, access your repository, go to the clone option, and copy the HTTPS link. Then, execute the following command:

git clone link_https

Note: It will prompt for your username (email) and password (token). If you haven’t created a token yet, in the case of Gitlab, go to “edit profile > access tokens” and create a token with “read_repository” and “write_repository” permissions enabled.

An important tip is that when entering your password in the terminal, you won’t see anything happening, but the password will be hidden. Just paste or type the password once and press Enter for the clone command to be executed normally.

9. With the repository cloned, access the destination folder using the command “cd folder_name”.

10. If you have packages installed in the project, install them by running the command “dbt deps”.

Now your Ubuntu container is set up and ready to use. You can close the terminal now. In the next part, we will provide some tips on how to configure VSCode to make it easier and customize its use in your daily workflow.

VSCode

Visual Studio Code (VSCode) is a flexible and highly customizable source code editor that is widely used in software development and will assist us during the creation and reading of models in DBT. To configure it, follow the steps below:

1. First, install the tool through this [link].

2. Open the tool and install the following extensions:

Note: If you are using a Mac, there is no need to install the WSL extension.

3. After installing the extensions, connect to the container: Click on the Docker icon > Right-click on the selected container > Attach Visual Studio Code.

Note: This will open a new tab in Visual Studio Code.

4. Now, open a new terminal inside VSCode and execute the following commands in two parts:

  • Part 1:
code --install-extension bastienboutonnet.vscode-dbt && 
code --install-extension donjayamanne.githistory &&
code --install-extension dorzey.vscode-dbt-language &&
code --install-extension eamodio.gitlens &&
code --install-extension foldager.dbt-shortcuts
  • Part 2 :
code --install-extension innoverio.vscode-dbt-power-user && 
code --install-extension samuelcolvin.jinjahtml &&
code --install-extension mhutchie.git-graph &&
code --install-extension PKief.material-icon-theme

5. After installing the extensions, follow the steps below:

  • Press the buttons: Ctrl + Shift + P
  • This will open a search bar at the top of VSCode. Search for the option: “Preferences: Open User Settings (JSON)

A new tab called “settings.json” will open. Paste the code below:

{ 
"workbench.iconTheme": "material-icon-theme",
"workbench.colorTheme": "Bluloco Dark",
"editor.minimap.enabled": false,
"launch": {

"configurations": [],
"compounds": []
}, "files.associations": {
"*.sql": "jinja-sql"
},
"editor.rulers": [
110
],
"editor.tabSize": 2
}

Note: In the “icon” and “color” topics, you can set your preference, as some people prefer a black background while others prefer a white background.

6. If you want to keep your project in focus so that when you open VSCode, it goes directly to it, go to “File > Open Folder” and select the folder of your project, whether it’s the new folder or the cloned folder. Example: /root/dbt-dev

In day-to-day usage

With the configurations completed, to work with the tool, simply open VSCode, start the container using the Docker icon, and perform the attach.

It is recommended to close the environment whenever it is not in use. To do this, close the extra tab, follow the same steps mentioned above to access the container, and then right-click to stop it. Besides freeing up processing space on your machine, this helps prevent unexpected computer issues from affecting the progress of your project.

Note: Some VSCode instances directly open in the container tab, in which case you can start and stop directly from the Docker tool in the Containers tab.

The first part of this article aimed to provide an overview of the tool’s possibilities, give an initial introduction, demonstrate how we adapted it to our current architecture, and assist in the initial environment setup. In the next steps, we will explore the important folders and files for the environment’s functioning, as well as introduce DBT Cloud, our project orchestrator.

I hope this guide is helpful for those interested in a governed data transformation solution or for those looking to enhance their knowledge in this tool. DBT is an incredible tool that has transformed the daily interaction within our data team and brought greater governance to the datasets we create. If you have any questions or suggestions, please feel free to reach out to me on Linkedin.

Versão em Português: https://medium.com/@alice_thomaz/d7315d21ad34

Part 2: Organizing the project folders.

--

--