Integrating dbt with Azure Data Factory (ADF)

Venkat Sekar
Hashmap, an NTT DATA Company
6 min read · Jan 18, 2021

Azure Data Factory is a popular solution among Azure customers for data acquisition, ingestion, and processing pipelines. When it comes to triggering a dbt process hosted in an Azure Container Group (ACG), a typical solution is to use Azure Logic Apps or Azure Functions.

It would be much easier if dbt could be integrated with ADF so that all the code for ingestion, processing, and transformation lives in one place. In this article, I’ll demonstrate how this can be achieved using ADF.

This article is a follow-up to my previous blog post, Deploying and Running dbt on Azure Container Instances. In that article, I demonstrated how to execute dbt data transformation models using Docker with Azure Container Instances/Groups (ACG).

Throughout this article, I refer to the Azure container group that hosts the dbt project as the ACG.

Reference Implementation Code

The code used in this article is available in the GitLab repo: https://gitlab.com/hashmapinc/oss/dbtonazure

Simulations

For the sake of a demo and a workable prototype, I simulate a wait time in the ACG by introducing a sleep period of 5 minutes.

Once everything is instantiated, I highlight:

  • The service principal secrets that are stored in the key vault
  • The key vault linked service in the ADF
  • The pre-configured pipeline which is used to trigger the ACG
  • The stopped state the ACG was in
  • How to manually trigger the pipeline
  • How the pipeline triggers the ACG and waits
  • How ACG starts and performs its actions
  • The log output of ACG and also the status in ADF, once completed

You can watch my demo of running dbt on an Azure container instance in this Hashmap Megabyte.

Design

Service principal

We can start an instance of the ACG using a REST API, as mentioned in the Doc: Azure Container Instances REST API reference. To access this service, we need to obtain an access token.

As of this writing, we cannot get a token for the ADF managed identity or the previously created user-managed identity (hmuse1dbtazuai01). However, we can use the service principal to acquire the access token, following the flow described in Doc: How to use managed identities for Azure resources on an Azure VM to acquire an access token.

To aid in this, the service principal id and its secrets are stored in the key vault during deployment.
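
As a rough illustration, here is the kind of token request the pipeline performs, sketched in plain Python. The tenant id, client id, and secret values are placeholders; in the actual pipeline they come from the az_tenant_id parameter and the key vault secrets:

```python
import requests

# Placeholder values: in the pipeline these come from the az_tenant_id
# parameter and the two secrets stored in the key vault.
tenant_id = "<az_tenant_id>"
client_id = "<service-principal-app-id>"
client_secret = "<service-principal-secret>"

# Client-credentials grant against Azure AD; the token is scoped to the
# Azure Resource Manager endpoint, which fronts the ACI REST API.
resp = requests.post(
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": "https://management.azure.com/",
    },
)
resp.raise_for_status()
access_token = resp.json()["access_token"]
```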

Azure Data Factory

Linked service

For ADF to read the key vault’s secrets, the deployment pre-configures a linked service along with the access policies that allow ADF to reach the key vault.

Pipeline: acg_start_and_wait_pipe

A parameterized pipeline, acg_start_and_wait_pipe, is also preconfigured in the ADF instance. I chose to parameterize the pipeline so that it can be reused across multiple ACG instances.

The following are its parameters:

  • acg_name: The Azure container group that hosts the dbt container.
  • sleep_time_potential_completion: A rough estimate, in seconds, of how long the data pipeline is expected to take.
  • az_tenant_id: The Azure tenant id.
  • az_sub_id: The Azure subscription id.
  • acg_rg: The resource group that hosts the ACG.

Activities: Get the access token

As mentioned above, we need to fetch the service principal id and its secret from the key vault and use them to obtain an access token. These steps are handled by the following activities:

  • kv_get_acr-sp
  • kv_get_acr-sp-scrt
  • get_access_token
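
For illustration, here is roughly what the kv_get_* Web activities do against Key Vault, again sketched in Python. The vault URL and secret name are placeholders; note that in ADF the call is authenticated through the key vault linked service with the factory’s managed identity (resource https://vault.azure.net), not with an explicit bearer token as shown here:

```python
import requests

# Placeholder values; the real names are set during deployment.
vault_url = "https://<your-key-vault>.vault.azure.net"
secret_name = "<service-principal-secret-name>"
vault_token = "<token scoped to https://vault.azure.net>"

# Key Vault's REST GET for a secret; ADF's Web activity issues the
# same request, authenticated through the linked service.
resp = requests.get(
    f"{vault_url}/secrets/{secret_name}",
    params={"api-version": "7.1"},
    headers={"Authorization": f"Bearer {vault_token}"},
)
resp.raise_for_status()
secret_value = resp.json()["value"]
```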

Activities: ACG start and wait

As mentioned above, we can start the ACG using the REST API. Once the ACG starts, the call returns immediately. Since the transformation could run for some time, I have introduced a wait activity. The appropriate wait time depends on the amount of data being processed, so set the sleep_time_potential_completion parameter to a duration that suits your workload; a rough sketch of the equivalent REST calls follows the activity list below.

These steps map to the following activities:

  • start_acg
  • Wait_for_acg_finish
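
Here is a minimal Python sketch of what start_acg and Wait_for_acg_finish amount to, assuming access_token was obtained as in the earlier sketch; the subscription, resource group, and container group names are placeholders standing in for the pipeline parameters:

```python
import time
import requests

# Placeholders mirroring the pipeline parameters.
az_sub_id = "<az_sub_id>"
acg_rg = "<acg_rg>"
acg_name = "<acg_name>"
access_token = "<token from the earlier sketch>"
sleep_time_potential_completion = 300  # seconds; 5 minutes in the demo

acg_base = (
    f"https://management.azure.com/subscriptions/{az_sub_id}"
    f"/resourceGroups/{acg_rg}/providers/Microsoft.ContainerInstance"
    f"/containerGroups/{acg_name}"
)

# Start the container group. The call returns as soon as the start
# request is accepted, not when the dbt run finishes.
resp = requests.post(
    f"{acg_base}/start",
    params={"api-version": "2019-12-01"},
    headers={"Authorization": f"Bearer {access_token}"},
)
resp.raise_for_status()

# Equivalent of the Wait_for_acg_finish activity.
time.sleep(sleep_time_potential_completion)
```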

Activities: Get status of ACG

To get the status of the ACG, you can use the Doc: GET operation through the following activity:

  • get_acg_status

If the returned state is “Succeeded,” it means that the ACG has finished processing.

If the state is not “Succeeded,” we would need to wait a little longer. I have left the implementation of this loopback as an exercise for you.
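
As one possible shape for that loopback, here is a hedged Python sketch that polls the same GET endpoint, reusing the acg_base URL and access_token from the previous sketch. The field path properties.instanceView.state is what the get_acg_status activity inspects in the response:

```python
import time
import requests

def wait_for_acg_success(acg_base: str, access_token: str,
                         poll_seconds: int = 60,
                         max_polls: int = 30) -> str:
    """Poll the ACG until its state is 'Succeeded' or we give up."""
    headers = {"Authorization": f"Bearer {access_token}"}
    state = "Unknown"
    for _ in range(max_polls):
        resp = requests.get(
            acg_base,
            params={"api-version": "2019-12-01"},
            headers=headers,
        )
        resp.raise_for_status()
        state = resp.json()["properties"]["instanceView"]["state"]
        if state == "Succeeded":
            break
        time.sleep(poll_seconds)
    return state
```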

Limitations

Chaining the ACG invocation

It is only natural to ask whether we can chain multiple ACG calls using the Execute Pipeline activity. Unfortunately, the current implementation does not return any output other than the RunId. Azure has reported that they are working on this feature; until then, the approach would not be straightforward and would add complexity.

Get Started and Provide Feedback

I hope this helped advance your understanding of using dbt and ADF together, and I’d be very interested in your thoughts, comments, and feedback as you go down this path.

Here’s the previous post if you want to check that out as well!

Ready to Accelerate Your Digital Transformation?

At Hashmap, we work with our clients to build better together.

If you’d like additional assistance in this area, Hashmap, an NTT DATA Company, offers a range of enablement workshops and consulting service packages as part of our consulting service offerings, and I would be glad to work through your specifics. Reach out to us here.

Are you looking to gain a better understanding of what approaches, solutions, and tools are available in the data integration space and how to best address your specific integration requirements?

Hashmap’s Data Integration Workshop is an interactive, two-hour experience for you and your team where we will provide you with a high-value, vendor-neutral sounding board to help you accelerate your data integration decision-making and selection process. Based on our experience, we’ll talk through best-fit options for both on-premises and cloud-based data sources and approaches to address a wide range of requirements. Sign up today for our complimentary workshop.

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast Hashmap on Tap on Spotify, Apple, Google, and other popular streaming apps.

Venkat Sekar is a Senior Architect for Hashmap, an NTT DATA Company, and provides Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
