Run DBT in Azure Data Factory: A Clean Solution

Guang X
6 min read · Jan 15, 2022

--

Background

In modern data platforms, there is always an enormous number of SQL transformations in use, which require proper management and governance. DBT (getdbt.com) has recently become the most popular tool for managing SQL-based data transformation pipelines. As part of an Extract-Load-Transform (ELT) pipeline, DBT transformations usually have to be embedded in a data orchestration tool (e.g. Airflow), so that transformation runs right after E (extract) and L (load).

At New Aim Pty Ltd, Azure is one of the main cloud services we use to build our data platform. On Azure, the easiest-to-use data orchestration tool is Azure Data Factory (ADF), which natively supports integration with many data sources. ADF is a great tool for data ELT, but we have not seen a good solution for integrating DBT into ADF for data transformation.

This article shares New Aim’s experience enabling DBT steps in Azure Data Factory. Before continuing, please be aware that this article assumes the reader knows:

- how to use DBT, with a DBT project already set up and ready to use;

- how to use Docker or other container tools;

- how to use Azure Container Registries or other Container Registries;

- the basics of Azure Data Factory (ADF), with an ADF instance already created and ready to use.

Overview

This solution is inspired by this blog, with some improvements and simplifications.

There are some key components in this solution:

1. The DBT project is containerized as an image, ready to run the “dbt build” command;

2. The container image is deployed as an Azure Container Instance (ACI), with a STOPPED status by default;

3. Azure Data Factory’s managed identity has the CONTRIBUTOR role to manage the deployed ACI resource;

4. In Azure Data Factory, a WEB ACTIVITY is used to start the ACI via the Azure REST API.

Some highlights of this solution are:

  • No Azure Function required: some solutions suggest using an Azure Function, which is unnecessary since we can call the Azure REST API from an ADF Web activity
  • No service principal required: by using a system-assigned managed identity, ADF can access the Azure REST API directly without specifying a token or password
  • Cost effective: the additional cost of running DBT in ACI is low, though users still have to pay their data warehouse costs

This article will now walk through the design above, one component at a time.

Containerize the DBT project

Containerizing an existing DBT project could not be easier. Just make sure the database credentials (user name, password, etc.) are consumed as environment variables.
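As a minimal sketch, a Dockerfile for such a project might look like the following. The base image, tag, and paths are illustrative (they depend on your adapter and project layout); the key point is that profiles.yml reads credentials via dbt’s env_var() function, so no secrets are baked into the image:

```dockerfile
# Illustrative base image for a Postgres-backed project; pick the image
# matching your dbt adapter and version.
FROM ghcr.io/dbt-labs/dbt-postgres:1.3.latest

WORKDIR /usr/app
COPY . /usr/app

# profiles.yml lives in the project and uses env_var('DBT_USER') /
# env_var('DBT_PASSWORD'), so credentials are injected at runtime.
ENV DBT_PROFILES_DIR=/usr/app

# Install dbt package dependencies at build time.
RUN dbt deps

# Default command; ACI's command override can replace this.
CMD ["dbt", "build"]
```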

To deploy the built container image, it should be pushed to a container registry. When working on Azure, Azure Container Registry (ACR) is the easiest option.
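Assuming an ACR named myregistry and an image named dbt-project (both names are illustrative), the build-and-push flow looks roughly like:

```shell
# Authenticate Docker against the registry (names are illustrative)
az acr login --name myregistry

# Build the image from the DBT project directory and tag it for ACR
docker build -t myregistry.azurecr.io/dbt-project:latest .

# Push the image to Azure Container Registry
docker push myregistry.azurecr.io/dbt-project:latest
```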

The screenshot below shows our DBT image pushed to an existing Azure Container Registry.

Deploy DBT image as Azure Container Instance

Once the image is pushed, we can create a corresponding Azure Container Instance (ACI), as shown in the following screenshot:

Please pay attention to the Advanced tab:

  • Restart Policy: if we do not wish the container to restart automatically, we should choose Never as the restart policy.
  • Environment Variables: provide the database user name, password, and any other secret environment variables required to execute DBT.
  • Command Override: use ["/bin/bash", "-c", "dbt build"] or whatever command is required to build the DBT data.
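The same settings can also be expressed with the Azure CLI instead of the portal. This is a sketch only; the resource group, instance name, image, and variable names are illustrative, and registry credentials are omitted:

```shell
az container create \
  --resource-group my-rg \
  --name dbt-aci \
  --image myregistry.azurecr.io/dbt-project:latest \
  --registry-login-server myregistry.azurecr.io \
  --restart-policy Never \
  --secure-environment-variables DBT_USER=dbtuser DBT_PASSWORD=change-me \
  --command-line "/bin/bash -c 'dbt build'"
```

Using --secure-environment-variables (rather than --environment-variables) keeps the credential values out of the container group's properties when it is queried later.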

An interesting fact: once an Azure Container Instance (ACI) is created, we can start the container group and execute the specified command whenever required. The ACI returns to the STOPPED status once the specified command finishes, and billing stops.

Start Container Group from Azure Data Factory

To start a container group from Azure Data Factory, we can use the ACI REST API as documented on this page:

POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.ContainerInstance/containerGroups/{containerGroupName}/start?api-version=2021-09-01

In Azure Data Factory, we can use a Web activity to invoke the above API:

As shown in the screenshot below, there are several key items on the Web activity’s Settings page:

  • URL: the URL should be the one provided in the Azure documentation: https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.ContainerInstance/containerGroups/{containerGroupName}/start?api-version=2021-09-01. Remember to replace subscriptionId, resourceGroupName, and containerGroupName; these values can also be provided as pipeline parameters.
  • Method: should be POST, as prescribed in the Azure documentation.
  • Body: can simply be an empty string “”.
  • Authentication: can be any of the supported authentication methods; the neatest is Managed Identity.
  • Resource: should be https://management.azure.com/. This is a URL-encoded identifier URI specified by the REST API being called. More details can be found on this page.
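To make the settings above concrete, the request the Web activity issues can be sketched in Python. The subscription ID, resource group, and container group names here are placeholders; in ADF, the managed identity token for https://management.azure.com/ is attached automatically:

```python
def aci_start_url(subscription_id: str, resource_group: str, container_group: str,
                  api_version: str = "2021-09-01") -> str:
    """Build the ACI containerGroups/start URL that the Web activity POSTs to."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.ContainerInstance"
        f"/containerGroups/{container_group}"
        f"/start?api-version={api_version}"
    )

# The Web activity performs: POST <url> with an empty body and a managed
# identity bearer token for the resource https://management.azure.com/.
url = aci_start_url("00000000-0000-0000-0000-000000000000", "my-rg", "dbt-aci")
```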

However, without the proper permissions, this pipeline will fail with the following error message:

Now we need to set up proper access for Azure Data Factory.

Grant ADF permission to Start ACI

In the Azure Container Instance resource, we need to add a new role assignment.

We need a role that has permission to start the container group. Among all the built-in roles, Contributor is the least privileged one that includes this permission. We can also create a custom role with only the permissions we need.

Then move to the Members tab and select Managed identity.

When selecting a managed identity, look for the system-assigned managed identity of the data factory:

Once the permission is set up, the pipeline should be able to execute successfully:

Further Improvement

There is still one limitation: ADF cannot receive the execution result after starting an ACI. We can further expand the pipeline like this:

We can use the same ACI REST API to poll the container group’s status until it reaches the Succeeded state, and stop the ACI if it times out or fails.
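The polling loop can be sketched as plain logic (Python). Here `get_instance_state` is a hypothetical caller-supplied helper standing in for the GET containerGroups REST call; the terminal state names follow ACI's reported instance states:

```python
import time

def wait_for_aci(get_instance_state, timeout_s: float = 3600.0, poll_s: float = 30.0) -> str:
    """Poll an ACI container group's state until it reaches a terminal value.

    get_instance_state: zero-argument callable (e.g. wrapping the GET
    containerGroups REST call) returning strings such as "Running",
    "Succeeded", "Failed", or "Stopped".
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = get_instance_state()
        if state in ("Succeeded", "Failed", "Stopped"):
            return state
        time.sleep(poll_s)
    # Caller can react to a timeout by invoking the containerGroups/stop API.
    raise TimeoutError("container group did not finish in time")
```

In an ADF pipeline the same logic is expressed with an Until activity wrapping a Web activity plus a Wait activity, rather than code.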

We can even extend this approach to any other ELT step that is based on a container image. This solution also shows the potential of ADF to invoke any Azure REST API and achieve much more functionality than we expected.

Please feel free to leave your thoughts here and discuss interesting cloud and data technologies.

New Aim Pty Ltd is a leading-edge supplier of products for the Australian market, serving your business with the latest in technology and expertise.

Interested in working on data engineering technologies? New Aim is hiring Data Engineers! Please check out our job post.
