Integrating dbt with Azure Data Factory (ADF)

Venkat Sekar
Hashmap, an NTT DATA Company
6 min read · Jan 18, 2021

Azure Data Factory is a popular solution among Azure customers for data acquisition, ingestion, and processing pipelines. When it comes to triggering a dbt process hosted in an Azure Container Group (ACG), a typical solution is to use Azure Logic Apps or Azure Functions.

It would be much easier if dbt could be integrated with ADF so that all the code for ingestion, processing, and transformation lives in one place. In this article, I’ll demonstrate how this can be achieved using ADF.

This article is a follow-up to my previous blog post, Deploying and Running dbt on Azure Container Instances. In that article, I demonstrated how to execute dbt data transformation models using Docker with Azure Container Instances/Groups (ACG).

Throughout this article, I refer to the Azure container group that hosts the dbt project as the ACG.

Reference Implementation Code

The code used in this article is available in the GitLab repo: https://gitlab.com/hashmapinc/oss/dbtonazure

Simulations

For the sake of a demo and a workable prototype, I simulate a wait time in the ACG by introducing a sleep period of 5 minutes.

Once everything is instantiated, I highlight:

  • The service principal secrets that are stored in the key vault
  • The key vault linked service in the ADF
  • The pre-configured pipeline which is used to trigger the ACG
  • The stopped state the ACG was in
  • How to manually trigger the pipeline
  • How the pipeline triggers the ACG and waits
  • How ACG starts and performs its actions
  • The log output of ACG and also the status in ADF, once completed

You can watch my demo of running dbt on an Azure container instance in this Hashmap Megabyte.

Design

Service principal

We can start an instance of the ACG using a REST API, as mentioned in the Doc: Azure Container Instances REST API reference. To access this service, we need to obtain an access token.

As of this writing, we cannot get a token for the ADF managed identity or the previously created user-managed identity (hmuse1dbtazuai01). However, we can use the service principal to acquire the access token, following the flow described in Doc: How to use managed identities for Azure resources on an Azure VM to acquire an access token.

To aid in this, the service principal id and its secrets are stored in the key vault during deployment.
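
As a rough illustration, here is the kind of token request the pipeline performs, sketched in plain Python. The tenant id, client id, and secret values are placeholders; in the actual pipeline they come from the az_tenant_id parameter and the key vault secrets:

```python
import requests

# Placeholder values: in the pipeline these come from the az_tenant_id
# parameter and the two secrets stored in the key vault.
tenant_id = "<az_tenant_id>"
client_id = "<service-principal-app-id>"
client_secret = "<service-principal-secret>"

# Client-credentials grant against Azure AD; the token is scoped to the
# Azure Resource Manager endpoint, which fronts the ACI REST API.
resp = requests.post(
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": "https://management.azure.com/",
    },
)
resp.raise_for_status()
access_token = resp.json()["access_token"]
```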

Azure Data Factory

Linked service

For ADF to read the key vault’s secrets, the deployment pre-configures a linked service along with the access policies that allow ADF to reach the key vault.

Pipeline: acg_start_and_wait_pipe

A parameterized pipeline, acg_start_and_wait_pipe, is also preconfigured in the ADF instance. I chose to parameterize the pipeline so that it can be reused across multiple ACG instances.

The following are its parameters:

  • acg_name: The Azure container group that hosts the dbt container.
  • sleep_time_potential_completion: A rough estimate, in seconds, of how long the data pipeline is expected to take.
  • az_tenant_id: The Azure tenant id.
  • az_sub_id: The Azure subscription id.
  • acg_rg: The resource group that hosts the ACG.

Activities: Get the access token

As mentioned above, we need to fetch the service principal id and its secret from the key vault and use them to obtain an access token. These steps are handled by the following activities:

  • kv_get_acr-sp
  • kv_get_acr-sp-scrt
  • get_access_token
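
For illustration, here is roughly what the kv_get_* Web activities do against Key Vault, again sketched in Python. The vault URL and secret name are placeholders; note that in ADF the call is authenticated through the key vault linked service with the factory’s managed identity (resource https://vault.azure.net), not with an explicit bearer token as shown here:

```python
import requests

# Placeholder values; the real names are set during deployment.
vault_url = "https://<your-key-vault>.vault.azure.net"
secret_name = "<service-principal-secret-name>"
vault_token = "<token scoped to https://vault.azure.net>"

# Key Vault's REST GET for a secret; ADF's Web activity issues the
# same request, authenticated through the linked service.
resp = requests.get(
    f"{vault_url}/secrets/{secret_name}",
    params={"api-version": "7.1"},
    headers={"Authorization": f"Bearer {vault_token}"},
)
resp.raise_for_status()
secret_value = resp.json()["value"]
```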

Activities: ACG start and wait

As mentioned above, we can start the ACG using the REST API. Once the ACG starts, the call returns immediately. Since the transformation could run for some time, I have introduced a wait activity. The appropriate wait time depends on the amount of data being processed, so set the sleep_time_potential_completion parameter to a duration that suits your workload; a rough sketch of the equivalent REST calls follows the activity list below.

These steps map to the following activities:

  • start_acg
  • Wait_for_acg_finish
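
Here is a minimal Python sketch of what start_acg and Wait_for_acg_finish amount to, assuming access_token was obtained as in the earlier sketch; the subscription, resource group, and container group names are placeholders standing in for the pipeline parameters:

```python
import time
import requests

# Placeholders mirroring the pipeline parameters.
az_sub_id = "<az_sub_id>"
acg_rg = "<acg_rg>"
acg_name = "<acg_name>"
access_token = "<token from the earlier sketch>"
sleep_time_potential_completion = 300  # seconds; 5 minutes in the demo

acg_base = (
    f"https://management.azure.com/subscriptions/{az_sub_id}"
    f"/resourceGroups/{acg_rg}/providers/Microsoft.ContainerInstance"
    f"/containerGroups/{acg_name}"
)

# Start the container group. The call returns as soon as the start
# request is accepted, not when the dbt run finishes.
resp = requests.post(
    f"{acg_base}/start",
    params={"api-version": "2019-12-01"},
    headers={"Authorization": f"Bearer {access_token}"},
)
resp.raise_for_status()

# Equivalent of the Wait_for_acg_finish activity.
time.sleep(sleep_time_potential_completion)
```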

Activities: Get status of ACG

To get the status of the ACG, you can use the Doc: GET operation through the following activity:

  • get_acg_status

If the returned state is “Succeeded,” it means that the ACG has finished processing.

If the state is not “Succeeded,” we would need to wait a little longer. I have left the implementation of this loopback as an exercise for you.
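
As one possible shape for that loopback, here is a hedged Python sketch that polls the same GET endpoint, reusing the acg_base URL and access_token from the previous sketch. The field path properties.instanceView.state is what the get_acg_status activity inspects in the response:

```python
import time
import requests

def wait_for_acg_success(acg_base: str, access_token: str,
                         poll_seconds: int = 60,
                         max_polls: int = 30) -> str:
    """Poll the ACG until its state is 'Succeeded' or we give up."""
    headers = {"Authorization": f"Bearer {access_token}"}
    state = "Unknown"
    for _ in range(max_polls):
        resp = requests.get(
            acg_base,
            params={"api-version": "2019-12-01"},
            headers=headers,
        )
        resp.raise_for_status()
        state = resp.json()["properties"]["instanceView"]["state"]
        if state == "Succeeded":
            break
        time.sleep(poll_seconds)
    return state
```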

Limitations

Chaining the ACG invocation

It is only natural to ask whether we can chain multiple ACG calls using the Execute Pipeline activity. Unfortunately, the current implementation does not return any output other than the RunId. Azure has reported that they are working on this feature; until then, the approach would not be straightforward and would add complexity.

Get Started and Provide Feedback

I hope this helped advance your understanding of using dbt and ADF together, and I’d be very interested in your thoughts, comments, and feedback as you go down this path.

Here’s the previous post if you want to check that out as well!

Ready to Accelerate Your Digital Transformation?

At Hashmap, we work with our clients to build better together.

If you’d like additional assistance in this area, Hashmap, an NTT DATA Company, offers a range of enablement workshops and consulting service packages as part of our consulting service offerings, and I would be glad to work through your specifics. Reach out to us here.

Are you looking to gain a better understanding of what approaches, solutions, and tools are available in the data integration space and how to best address your specific integration requirements?

Hashmap’s Data Integration Workshop is an interactive, two-hour experience for you and your team where we will provide you with a high-value, vendor-neutral sounding board to help you accelerate your data integration decision-making and selection process. Based on our experience, we’ll talk through best-fit options for both on-premises and cloud-based data sources and approaches to address a wide range of requirements. Sign up today for our complimentary workshop.

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast Hashmap on Tap on Spotify, Apple, Google, and other popular streaming apps.

Venkat Sekar is a Senior Architect for Hashmap, an NTT DATA Company, and provides Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
