dbt Orchestration in Azure DevOps Pipelines

by Yu Fai Wong

Data build tool (dbt) makes it easy to transform data in cloud data warehouses like Snowflake. There are two main ways to run it: dbt Cloud, a cloud-hosted service provided by the makers of dbt, Fishtown Analytics, or the dbt CLI, a command-line interface that runs in Python.

Why run dbt on ADO?

Many organizations already use ADO (Azure DevOps) or GitHub to host their dbt project code. Because ADO integrates easily with CI/CD tooling, running dbt there is simple and cost-effective. In this blog post, I’ll walk you through dbt orchestration in ADO Pipelines.

Let’s get started

The first step is to host the code somewhere in a git repository, whether in Azure or elsewhere.

Pipeline YAML

From there, we need to create a pipeline YAML file and add it to the project. There are many options you can set in this pipeline file, but I’ll cover some common ones here with an example.

Here’s the Pipeline Reference Documentation.
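Here’s a minimal sketch of what such a file might look like. The run name format, branch, schedule, and template path are illustrative assumptions rather than values from the original project; each option it uses is explained below.

```yaml
# azure-pipelines.yml -- illustrative sketch; the name format, branches,
# schedule, and template path are assumptions
name: dbt_run_$(Date:yyyyMMdd)$(Rev:.r)

trigger:
  batch: true                  # group commits into a single run
  branches:
    include:
      - main

schedules:
  - cron: '0 6 * * 1-5'        # 06:00 UTC on weekdays
    displayName: Weekday morning run
    branches:
      include:
        - main
    always: true               # run even if nothing has changed

pool:
  vmImage: 'ubuntu-latest'     # Microsoft-hosted agent

stages:
  - template: dbt-deploy-template.yml   # hypothetical template file, shown later
    parameters:
      environment: dev
      variableGroup: dbt-dev
```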

Name: Defines the name (run number format) given to each run of the pipeline.

Trigger: An optional way to trigger the pipeline based on commits to certain branches. Setting batch to true groups commits that arrive while a run is in progress into a single follow-up run, instead of starting a run for every commit.

Schedules: A way to run the pipeline on a regular schedule, specified in cron format. Very flexible; it can be set to run on weekdays, every hour, and so on.

Pool: Which VM image to use. Generally speaking, a Microsoft-hosted (Azure-provisioned) image is fine.

Stages: This is the main way to group individual jobs and steps together. Note: by default, stages run in the order they appear in the file, but once dependsOn is set, they run in parallel in whatever order satisfies the dependencies.

Template: Allows you to create another YAML file that can be parameterized to deploy against multiple environments. In this case, I use it to specify the environment along with the environment-specific connection variables.

Template YAML

Here’s an excerpt from the template:

In this example, the first stage to run is PreCheck. If it fails, Manual_Approval runs next (explained further below), followed by the actual deployment stage, which runs dbt.
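A minimal sketch of a template along those lines might look like the following; only the stage names come from the description above, while the parameter names, dbt commands, and job layout are assumptions.

```yaml
# dbt-deploy-template.yml -- illustrative sketch; parameters, dbt commands,
# and job layout are assumptions
parameters:
  - name: environment
    type: string
  - name: variableGroup
    type: string

stages:
  - stage: PreCheck
    variables:
      - group: ${{ parameters.variableGroup }}
    jobs:
      - job: Compile
        steps:
          - script: |
              pip install dbt-snowflake
              dbt deps
              dbt compile
            displayName: 'dbt compile'

  - stage: Manual_Approval
    dependsOn: PreCheck
    condition: failed('PreCheck')   # only ask for approval when PreCheck fails
    jobs:
      - deployment: WaitForApproval
        environment: ${{ parameters.environment }}   # approvals are configured on this environment
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo "Failure acknowledged and approved"

  - stage: Deploy
    dependsOn:
      - PreCheck
      - Manual_Approval
    # run when PreCheck passed (approval skipped) or when the failure was approved
    condition: in(dependencies.Manual_Approval.result, 'Succeeded', 'Skipped')
    variables:
      - group: ${{ parameters.variableGroup }}
    jobs:
      - job: RunDbt
        steps:
          - script: |
              pip install dbt-snowflake
              dbt deps
              dbt run
            displayName: 'dbt run'
```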

Condition: Sets the condition (or an expression that resolves to a condition) for this stage to run.

Environment: (Optional) Sets the environment to deploy to and requests approval if any is required.

In a previous version of Azure Pipelines, there was a simpler way to create a Manual Intervention step, but that feature isn’t supported in the current one, so the workaround uses environments. In ADO, creating an empty environment and adding an approver is enough. Setting a group as the approver means only one person in the group needs to approve; setting multiple individual users means every listed user must approve.

Variables: Assigns a variable group to the scope. Used to pass in connection variables to dbt.

Env: Another way to assign environment variables. If a password is stored in a variable group and marked as secret, it will not be mapped in automatically and must be mapped explicitly, as shown below.
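For instance, a sketch of that explicit mapping on the dbt step might look like this, assuming a secret variable named SNOWFLAKE_PASSWORD in the variable group (the names are placeholders):

```yaml
variables:
  - group: ${{ parameters.variableGroup }}   # holds account, user, role, etc.

steps:
  - script: dbt run
    displayName: 'dbt run'
    env:
      SNOWFLAKE_PASSWORD: $(SNOWFLAKE_PASSWORD)   # secret values must be mapped explicitly
```

profiles.yml can then read the value with dbt’s env_var() function.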

Now we can create the pipeline in ADO

Create a Variable Group

Creating a variable group makes reuse and changes easier; you won’t have to remap variables in each pipeline if anything changes.

Head over to Pipelines -> Library -> Variable Group

Set any variables and set the password to secret.

Create a Pipeline

Select Pipelines -> Pipelines -> Create Pipeline

Select the location of the code repository.

Select the repository.

Select an existing pipeline file.

Select the pipeline file we created.

And run the pipeline.

And you’re done!

Logs and Debugging

Clicking on a running pipeline will show the console output.

Generally, this is enough to fix most dbt errors, but if you need the detailed dbt logs, you’ll also need to publish them as a pipeline artifact so that you can inspect the dbt log file after the run.
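As a rough sketch, a publish step along these lines (assuming dbt’s default logs directory) could be added to the job so the log file is attached to the run:

```yaml
- task: PublishPipelineArtifact@1
  condition: always()        # publish even when dbt fails
  inputs:
    targetPath: 'logs'       # dbt writes dbt.log here by default
    artifact: 'dbt-logs'
```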

Closing Thoughts

Because the heavy lifting of the transforms is done in the data warehouse itself, where dbt runs doesn’t matter in terms of processing power; running it as part of a CI/CD pipeline is efficient from both an automated testing perspective and a cost perspective. A pipeline costs next to nothing (free in many cases) and is quite easy to set up. If, however, your transforms take a long time, you will want to consider running dbt in a container or on a VM instead.

Hopefully, what I’ve shared through my experience helps you with dbt orchestration in Azure Pipelines.

Be sure and check out Hashmap’s dbt focus page at hashmapinc.com and also subscribe to Hashmap on Tap (Spotify, Apple, Google, etc.) and listen to a conversation with dbt co-founder Drew Banin.


Feel free to share on other channels and be sure and keep up with all new content from Hashmap here.

Yu Fai Wong is a Cloud and Data Engineer with Hashmap providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
