Practical Guide to DBT: Orchestrating the Project.

A 3-step journey to mastering the data transformation tool — DBT — Part 3 of 3.

Alice Thomaz
Oct 30, 2023

This is the third in a series of articles meant to guide you through DBT, a tool widely used in the market today for data transformation. In this article, we will discuss how to orchestrate your environment using DBT Cloud. If you want to start the guide from the beginning, simply access the link: Unraveling Architecture and Initial Configuration. And if you want to dig deeper or are considering adopting the solution, it is worth taking a look at the official product website: Product — DBT Cloud.

>>> DBT Cloud

DBT Cloud is a cloud-based platform developed to simplify the development, execution, and scheduling of data transformation models using DBT. It is a subscription product offered by DBT Labs, the company behind DBT.

With DBT Cloud, you can efficiently orchestrate, develop, control, and monitor your production environment. Currently, there are three available plans:

  • Developer: This is the free version offered by DBT Labs, ideal for beginners looking to explore and take their first steps in the development environment. It can be used by a single developer.
  • Team: The intermediate plan of the solution is widely adopted in the market. It currently costs $100 per developer per month and offers additional features such as API access and the ability to create users with read-only permissions.
  • Enterprise: This is the customized plan that allows you to scale the environment according to your company’s specific needs. For pricing and negotiations, you need to contact DBT Labs sales directly. This plan offers advanced and personalized support but comes with higher costs that vary based on your business usage.

The plans are subject to constant changes in pricing and features, which has led us to reevaluate the best way to orchestrate our project. Currently, we are using the Team plan, which suits our needs, especially in terms of scheduling and monitoring the environment. However, we are considering alternatives in the market, such as migrating to Airflow or exploring AWS tools with Fargate support.

For more details and to stay updated on plan changes, I recommend visiting the official tool website: Pricing.

>>> Project

Initially, you need to create a project where the subsequent steps will be developed. The setup process, divided into four parts, is responsible for establishing the connection between the repository and the Data Warehouse. In this step, you will provide the main connection credentials.

If this is your first access, the system will automatically guide you through this process. However, after creating the project, you can return to these settings by following the path: Settings > Account Settings.

>>> Environment

With the project created, let’s now explore the environment. To do this, follow the path: Deploy > Environments > Create Environment. In it, you will define some basic settings divided into three parts:

  • General settings: In this first part, you will choose a name for the environment, specify whether it is a development or deployment environment, and select the default version of DBT to be used.
    It’s important to stay updated on version changes to take advantage of improvements and ensure you aren’t using an unsupported version. To keep track of release dates, we recommend referring to the official DBT documentation: About dbt Core versions.
  • Deployment credentials: In the second part, you will configure the necessary credentials to connect to the database and execute scheduled tasks in this environment. Additionally, you will define the schema in which the models will be generated. In our case, we use the “dbt_prod” key for the deployment environment.
  • Extended attributes: The third part is optional and allows you to add additional attributes, typically used in the “profiles.yml” file, to define different types of connections. This includes settings such as connection timeout, region, node usage, among others:
connect_timeout: 600
ra3_node: true
region: us-east-1
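
These attributes mirror what would normally live in “profiles.yml” for a local setup. As a minimal sketch of a Redshift connection, with hypothetical profile, host, and database names:

my_profile:
  target: prod
  outputs:
    prod:
      type: redshift
      host: example-cluster.abc123.us-east-1.redshift.amazonaws.com
      port: 5439
      user: dbt_user
      password: "{{ env_var('REDSHIFT_PASSWORD') }}"
      dbname: analytics
      schema: dbt_prod
      threads: 4
      connect_timeout: 600
      ra3_node: true
      region: us-east-1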

>>> Job

In DBT Cloud, a Job represents a task in which you can schedule a sequence of commands to be executed whenever it is triggered, whether manually, through scheduling, or in response to a pull request.

All information related to the processes executed is logged on the platform, allowing you to track existing Jobs, update intervals, time spent on each step/modeling, and the commands executed by the platform in your Data Warehouse, among other details.

The Jobs are divided into two types:

Deploy jobs

You can configure a series of commands to update your DBT project’s models and processes, and invoke them manually or on a schedule. Below, I describe the 4 steps to set up a new job. To access this configuration, follow this path: Deploy > Jobs > Create job > Deploy Job.

  • Job settings: Here, you will define a name for your job, a description, and the environment. By default, the environment in which the job is being created will already be selected.
  • Execution settings: In the “Commands” section, you will list the commands to be executed. Each command line functions as a step, and if one of them fails, the subsequent steps will not be executed. When running a folder of models, you don’t need to worry about dependencies, as the platform already resolves the lineage flow (a sample command list is sketched right after this list).
    Additionally, you can organize the commands with the help of the “selector.yml” file presented in part 2 of this guide. Instead of running a model or folder directly, you can execute a selection group defined in the project.
    Among the selection options, you will find “Generate docs on run”, which, when enabled, generates documentation files based on “schema.yml” every time the job is executed. There is also the “Run source freshness” option, which, when enabled, lets you track the last update date of the sources according to the rules defined in “sources.yml”.
  • Schedule: If enabled, you can set the job execution schedule. You can choose from three types: Cron schedule, Hours of the day, and Exact intervals.
  • Advanced settings: In the final step, you can customize some optional features. The “target” option allows you to customize essential processes and filters within the project, as exemplified in the second part of the guide. As for “threads”, they determine how many models can be executed simultaneously. By default, there are 4 threads, but you can customize them according to your data warehouse’s capacity.
    Increasing the number of threads can speed up the job processing, but if the database doesn’t have enough capacity to support the volume, it can create execution queues that hinder the desired performance. Therefore, it’s important to balance this configuration according to the available capacity.
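
For illustration, a deploy job’s command list might look like the sketch below (the folder name “marts.core” is a hypothetical example, not a value from our project):

dbt source freshness
dbt seed
dbt run --select marts.core
dbt test --select marts.core

Alternatively, a single line like “dbt run --selector daily_models” would execute a selection group defined in the project (the selector name is also hypothetical). For the Cron schedule type, an expression such as “0 6 * * *” triggers the job every day at 6:00 UTC.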

Continuous integration jobs

You can use this tool to execute a job whenever a new pull request (PR) is created in your Git repository. However, to enable this function, it is necessary to establish a native connection between Cloud and the repository. If your repository is self-hosted and you are using the Team plan on Cloud, this function will not be available.

Once connected, the system will execute and test only the models that have been modified or affected by the change. To access this configuration, follow the path: Deploy > Jobs > Create job > Continuous integration job.

The configuration is divided into three parts: Job settings, Execution settings and Advanced settings. The main differences compared to a Deploy-type job are as follows:

  • In step 1, it’s important to enable the “Triggered by pull requests” option.
  • In step 2, there are two significant differences.
    The commands used here are: “dbt seed --select state:modified+” and “dbt run --select state:modified+”.
    In “Compare changes against an environment”, it’s essential to reference a baseline job. In this case, we select our daily job for analytical models.
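
For context, this comparison relies on dbt’s state selection: the CI run is compared against the artifacts of the baseline job. A roughly equivalent dbt Core invocation, assuming the baseline manifest was downloaded to a hypothetical “prod-artifacts/” folder, would be:

dbt run --select state:modified+ --defer --state prod-artifacts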

Once the configuration is completed and enabled, you can make this process a mandatory approval step before the merge. In this case, both DBT Cloud and a reviewer must approve the changes before the merge is allowed.

Furthermore, for those working with large databases, it’s advisable to include a row limit for the CI target when building models. This prevents the process from getting stuck running a full load during testing. For example:

{% if target.name == 'ci' %}
-- In CI runs, return a single row, just enough to validate that the model compiles and runs
LIMIT 1
{% elif is_incremental() %}
-- In regular incremental runs, process only the most recent days
WHERE tstamp_column::DATE >= DATEADD(DAY, -{{ var('filter_days') }}, CURRENT_DATE)
{% endif %}

>>> Documentation

With the models’ schema.yml properly filled in and the “Generate docs on run” function enabled in the job, you can set this job as the project’s reference point and then view the documentation through a web application.

To do this, follow these steps: Settings > Account Settings > Projects > Artifacts > Documentation and select the job with the function enabled.

Once this process is completed, you will be able to access detailed and consolidated information about all the models in your project, including code, documentation, lineage, and more, through the main tabs in Documentation > Open Docs.
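
As a reminder, everything displayed there comes from the “schema.yml” files. A minimal sketch of one entry, with hypothetical model and column names:

version: 2

models:
  - name: fct_orders
    description: One row per completed order.
    columns:
      - name: order_id
        description: Primary key of the model.
        tests:
          - unique
          - not_null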

>>> Users

The types and quantity of licenses depend on the plan you’ve subscribed to, but in general, there are three types of licenses:

  • Developer: Users with this license have full access to the platform, including development and deployment features. They can manage the account, create jobs, access the API, and more. In the Team plan we use, we are entitled to only one developer account. However, since we only use it to manage job scheduling, it doesn’t interfere with our daily activities.
  • Read-Only: Users with read access have a more restricted view in the platform, including a shallower overview of execution history with limited access to details. Nevertheless, they can still view all documentation and receive job notifications. Currently, we have access to five accounts with this type of license in the Team plan.
  • IT: A new IT license was recently introduced, although we don’t use it. With this type of license, users have access solely for managing users, groups, and licenses, and they can receive job notifications. It is available only in the Team and Enterprise plans.

To manage users, go to: Settings > Account Settings > Users.

>>> Notifications

You can configure DBT Cloud to receive notifications via email or Slack channel whenever a job is successfully completed, fails, or is canceled.

  • Email: To set up email notifications, go to Settings > Notification Settings > Email Notifications. Choose the user whose email should receive notifications, and then set the desired alert type for each job.
  • Slack: If you prefer notifications via Slack, follow these steps: Settings > Profile Settings > Personal Profile > Linked Accounts > Slack. Please note that you need to have an admin account in Slack to perform this initial setup.
    After linking your account, go to Notification Settings > Slack Notifications and select the channel and alerts you wish to receive. It’s important to note that you can use only one Slack channel for all notifications.

>>> API

For Team and Enterprise plans, DBT Cloud provides access to an API that allows you to extract historical execution data, documentation, and more. Additionally, you can manage your environment, configure jobs, start and cancel executions, among other functions. To explore the possibilities of use in more detail, I recommend checking their official website: dbt Cloud Administrative API.

We used the API to create a connection between different dependent pipelines. Currently, however, we primarily use it to enhance governance in our environment. By collecting the “manifest.json” file, we are able to build a catalog and lineage tracking of our DBT models and link them to the catalog tables available in Redshift. Soon, I will share materials related to our progress on this project.

But before you can start working with the administrative API, you will need to create a token within the platform by following this path: Settings > Account Settings > Service Tokens > New Token.
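
As a minimal sketch, assuming a hypothetical account ID (12345), run ID (67890), and a token stored in the DBT_CLOUD_TOKEN environment variable, you could list recent runs and download a run’s “manifest.json” with the v2 API:

curl -H "Authorization: Token $DBT_CLOUD_TOKEN" \
  "https://cloud.getdbt.com/api/v2/accounts/12345/runs/?order_by=-finished_at&limit=10"

curl -H "Authorization: Token $DBT_CLOUD_TOKEN" \
  "https://cloud.getdbt.com/api/v2/accounts/12345/runs/67890/artifacts/manifest.json" \
  -o manifest.json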

>>> IDE

The Integrated Development Environment (IDE) functionality is a web interface that allows the construction, testing, and versioning of DBT projects within the DBT Cloud platform. It is a comprehensive and highly effective tool for modeling development, although it’s important to note that it requires Developer-type licenses. Depending on the number of users who need this license, the tool’s costs can increase significantly.

As demonstrated in the first step of this guide, we develop models locally and primarily use DBT Cloud for project orchestration. However, for those who wish to explore the use of the IDE more deeply, I recommend getting better acquainted with this feature.

As a starting point, I suggest consulting the official DBT documentation, which is quite comprehensive: About the dbt Cloud IDE.

Source: IDE user interface

In the first part of this guide, I provided an overview of the tool and explained how to set it up, as detailed in “Unraveling Architecture and Initial Configuration”.

In the second part, I delved deeper into the folders and key files necessary for building a project in DBT, as presented in “Organizing the project folders”.

In this third and final part, I cover DBT Cloud, the current orchestrator of our project. It is a remarkable and comprehensive tool that has been rapidly evolving in recent months. While we are considering the possibility of migrating to another tool to unify our environments, I recommend that everyone get to know a bit more about the features of DBT Cloud. During these initial years of using DBT, it has played a crucial role for our team.

Finally, I hope it has been helpful for those interested in a governed data transformation solution or for those looking to enhance their knowledge of this tool. DBT is an incredible tool that has transformed the daily interaction within our data team and brought greater governance to the datasets we create. If you have any questions or suggestions, please feel free to reach out to me on Linkedin.

Portuguese version: https://medium.com/@alice_thomaz/e32f97e5ca79
