Deploying and Running dbt on Azure Container Instances

A cost-effective approach to executing data transformations using dbt in Azure.

Venkat Sekar
Hashmap, an NTT DATA Company
Jan 7, 2021 · 10 min read


Overview

dbt continues to be a widely adopted tool for data transformation pipelines, especially when moving to an ELT-based approach. As a dbt implementation services partner, we have recommended and deployed dbt on many of our Snowflake client engagements.

While dbt itself is a Python-based framework, you run it through dbt CLI commands. This means it requires an execution environment, typically a Python virtual environment running on a VM.

Several approaches to achieve this have been discussed in the links below:

And of course, there is the option of dbt Cloud itself.

But what if you have different criteria?

  • Offer a multi-tenant dbt cluster: an environment where multiple independent dbt pipelines execute without affecting each other.
  • Scale horizontally without the need to worry about sizing. As more dbt pipelines are added, there is little concern about how they will affect existing pipeline executions.
  • Be billed only for the duration of the pipeline execution.
  • Execute multiple versions of a dbt data pipeline.
  • Tag dbt data pipeline executions so that billing and cost can be analyzed.
  • Allow each dbt data pipeline to use a different service account.
  • Share the same cluster across multiple targets or multiple data sources.
  • Capture logs in a cloud-native way.
  • Trigger notifications and alerts based on the logs.

Thus, an implementation closer to dbt Cloud, minus the UI, running within your own environment would be a better fit for these criteria.

This approach also brings the following benefits:

  • The execution cluster can be maintained by an admin group, separate from the data pipeline development teams.
  • The dbt data pipelines are developed and deployed by separate projects or development teams.
  • You can run in a containerized environment similar to Kubernetes, even if your team has yet to learn and adopt Kubernetes.

Most of the capabilities mentioned above can be achieved using existing cloud services. Follow along as I walk through a prototype implementation using Azure Container Instances. I will present the various choices taken into consideration and other aspects of the code and functionality.

If you are interested in implementing this on AWS, refer to my previous blog post: Deploying and Running dbt on AWS Fargate. There are differences between the Azure and AWS adoptions, primarily because Azure Container Instances is significantly different from AWS Fargate.

Solution Overview

For this demo, the data warehouse of choice is Snowflake. The code is available in my GitLab.

Highlights

  • A custom dbt Docker image is hosted in Azure Container Registry (ACR).
  • The dbt projects are packaged and hosted in an Azure file share.
  • The service account for logging into the data warehouse is stored in Azure Key Vault.
  • Each dbt model execution is wrapped in its own script file, for example: [dbtdataops/dbtoncloud/dbt_datapipeline.sh].

Here are the execution steps:

  1. A scheduled trigger, such as an Azure Function, starts a specific Azure Container Group (ACG).
  2. This causes the underlying Docker image to be pulled from the ACR.
  3. Once pulled, an ACI is instantiated.
  4. The instantiation invokes the default configured script, ‘entrypoint.sh’, which sets up a working directory for the application.
  5. The ‘entrypoint.sh’ script then invokes ‘run_pipeline.sh’.
  6. The ‘run_pipeline.sh’ script copies the dbt project, hosted in a file share that is mounted to the container, into a local temporary directory. The dbt project, which is a zip archive, is then unarchived.
  7. The ‘run_pipeline.sh’ script then invokes a data pipeline-specific script, ‘dbt_datapipeline.sh’, as configured in the environment variable ENV_DBT_RUN_CMD.
  8. The data pipeline script invokes ‘dbt_init_run.sh’.
  9. The ‘dbt_init_run.sh’ script uses a configured identity to access the key vault and retrieve the datastore connection information, such as the URL, user ID, and password.
  10. A dbt model run is then performed, executing the necessary transformation pipeline in the datastore (Snowflake or BigQuery).

Once the transformation is complete, the ACI gets terminated.
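As a rough illustration of step 1, the scheduled trigger only needs to start a container group that has already been created. The commands below are a minimal sketch; the resource group and container group names are placeholders, not the names used in the demo.

    # Sketch of the trigger: start a pre-created container group.
    az container start \
      --resource-group rg-dbt-demo \
      --name acg-dbtoncloud-datapipeline

    # Optionally, tail the console output of the run.
    az container logs \
      --resource-group rg-dbt-demo \
      --name acg-dbtoncloud-datapipeline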

You can watch my demo of running dbt on an Azure container instance in this Hashmap Megabyte.

Design Considerations

Choice of the Execution Environment

Kubernetes (AKS, EKS, or GKE) would be ideal for a multi-tenant, horizontally scalable, and fully cloud-native functionality. However, in my various client engagements, it was pretty evident that not all client teams were ready to take on Kubernetes. Some reasons were:

  • The client was early in their cloud journey and not yet knowledgeable about Kubernetes.
  • Most of the data acquisition, ingestion, processing, etc., was accomplished by various SaaS offerings; hence standing up a k8s cluster was unnecessary.
  • Hosting and running a k8s worker node 24/7 while performing only 1 hour of data processing tasks does not make sense.

Azure Container Instances provides a straightforward approach to hosting and executing containerized applications. Adoption and implementation are much simpler, and the team can still migrate the containers to Kubernetes later if needed.

The Azure Container Instance (ACI) pricing was also preferred, as you are billed only for container execution duration. You don’t need to have the container running 24/7. You can instantiate the container using APIs, and once the container finishes its task, it can be shut down.

Scalability can be achieved by instantiating different instances of the ACI/ACG as needed. The instances differ primarily based on parameters passed to the ACI.

Security is provided because the underlying image can only be pulled from the ACR and instantiated into an ACI by authorized identities.

All logs written to the console can be routed to Azure Monitor logs by the container application.

Why not Azure Functions?

Well, long story short, Azure Functions imposes an execution time limit (at the time of writing, a maximum of 10 minutes on the Consumption plan). While the data pipelines I had developed typically ran in less than 5 minutes, that timing would not hold in all conditions.

The SQL complexity, DAG depth, and data volume might result in a longer execution time. So, to be on the safe side, Azure Functions was not chosen.

Azure Container Instances

Azure Container Instances (ACI) are packaged and deployed as Azure Container Groups (ACGs). An ACG hosts only one instance of the dbt Docker image, which means we do not have to worry about sizing. Each ACG instance is specific to a dbt project, which provides isolation between multiple dbt projects.

dbt itself is a lightweight process; the model transformations are delegated to the data warehouse, such as Snowflake or BigQuery. For that reason, the memory and CPU requirements are minimal, so the choice here is 1 vCPU with 0.5 GB of memory.
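To make this concrete, here is a minimal sketch of creating such an ACG with the Azure CLI. The resource group, registry, storage account, and identity names are placeholders; only ENV_DBT_RUN_CMD comes from the setup described in this post.

    # Sketch: create a single-container ACG for one dbt project (names are placeholders).
    az container create \
      --resource-group rg-dbt-demo \
      --name acg-dbtoncloud-datapipeline \
      --image myacrregistry.azurecr.io/dbt-on-aci:latest \
      --registry-login-server myacrregistry.azurecr.io \
      --registry-username <acr-username> \
      --registry-password <acr-password> \
      --cpu 1 \
      --memory 0.5 \
      --restart-policy Never \
      --assign-identity <managed-identity-resource-id> \
      --azure-file-volume-account-name mystorageaccount \
      --azure-file-volume-account-key <storage-account-key> \
      --azure-file-volume-share-name dbt-projects \
      --azure-file-volume-mount-path /mnt/dbtprojects \
      --environment-variables ENV_DBT_RUN_CMD=dbtdataops/dbtoncloud/dbt_datapipeline.sh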

Docker Image

We expect the following functionalities to be present in the Docker image:

  • dbt command/libraries and its dependencies
  • Azure CLI, with which we connect to resources like Azure file share, key vaults, etc.
  • Bash for shell script execution.

Azure provides the Azure CLI as a Docker image, mcr.microsoft.com/azure-cli. This image also contains the bash shell, so we use it as the base image, install dbt on top of it, and push the result to the Azure Container Registry (ACR).
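As a sketch, building and pushing such an image could look like the following. The dbt version, registry name, and script locations are assumptions, not the exact image used in the demo.

    # Dockerfile (sketch)
    FROM mcr.microsoft.com/azure-cli
    # Install dbt (pin the version you actually use; on this Alpine-based image
    # you may also need build dependencies for some dbt requirements).
    RUN pip install dbt==0.18.1
    # Bake in the wrapper script that the container runs on startup.
    COPY entrypoint.sh /usr/app/scripts/entrypoint.sh
    RUN chmod +x /usr/app/scripts/entrypoint.sh
    ENTRYPOINT ["/bin/bash", "/usr/app/scripts/entrypoint.sh"]

The image can then be built directly inside ACR, without a local Docker daemon:

    # Sketch: build in ACR and store the image there (registry/image names are placeholders).
    az acr build --registry myacrregistry --image dbt-on-aci:latest .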

What about the dbt project?

A single dbt project can typically be executed with different model selections. For example, a model run for data preparation would be separate from a run of the materialization models.

Also, an ACG instance is specific to a dbt model execution. Azure currently does not allow re-running the same ACG instance with a different choice of parameters.

You might ask: what about updating a container group using environment variables?

Doc: Update containers in Azure Container Instances

The ACI gets auto-terminated after a dbt model run, which reduces cost and keeps the solution on-demand. However, given the update limitations called out in that doc, the “update” approach would not work for us.

I tried and verified it! Plus, I was not convinced by the update command, as I would have to pass in the ACR credentials, CPU, memory, and other parameters even though they are not meant to change.

So, to keep the Docker image as independent as possible, the dbt project is not baked into the image.

So how do we get the dbt project?

There are multiple ways to achieve this. One option is described in the doc: Mount a gitRepo volume in Azure Container Instances. The approach I took is:

  • Package the dbt project as a zip archive in an Azure file share.
  • On ACG startup, download the project from the file share and unarchive the content.

I chose this approach because different clients have different code repos, and it allows us to adapt to any situation. Below is a screenshot of the archived content of the dbt project.
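Packaging and uploading the project can be as simple as the sketch below; the storage account, share, and path names are placeholders, and the target directory must already exist in the share.

    # Sketch: zip the dbt project and push it to the Azure file share that the ACG mounts.
    zip -r dbtoncloud.zip dbtoncloud/

    az storage file upload \
      --account-name mystorageaccount \
      --account-key <storage-account-key> \
      --share-name dbt-projects \
      --source dbtoncloud.zip \
      --path dbtdataops/dbtoncloud.zip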

Why is the dbt data pipeline project not a part of the Docker image?

We wanted our solution to serve multiple pipelines. Baking the dbt data pipeline projects into the Docker image would result in multiple images in the ACR, and costs could potentially go up.

By keeping the dbt data pipeline project outside of the image, we need only a single Docker image in the ACR, which means fewer images to maintain and scan for vulnerabilities. Technically, since the Docker image does not contain any client code, we could even host it on a public Docker Hub.

How does the container instance look?

What are the scripts that you keep mentioning?

When the container gets instantiated, a set of pre-developed scripts orchestrates the dbt model execution. They are as follows:

  • entrypoint.sh: This is present in the Docker image. Its main purpose is to create an application directory to host the dbt project. It will then proceed to invoke the run_pipeline.sh script.
  • run_pipeline.sh: This is used to unarchive the dbt project and invoke the specific dbt wrapper script, for example, dbtdataops/dbtoncloud/dbt_datapipeline.sh.
  • dbt_run_init.sh: This is used to read the Snowflake connection information from the key vault (see the sketch after this list).
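To give an idea of what the key vault step looks like, here is a minimal sketch of the body of a dbt_run_init.sh-style script. The vault name, secret names, and environment variable names are assumptions; your profiles.yml would read these values via env_var().

    # Sketch of dbt_run_init.sh: log in with the container group's managed identity,
    # pull Snowflake connection details from Key Vault, then run the models.
    # Vault, secret, and variable names below are placeholders.
    set -euo pipefail

    az login --identity --output none

    export DBT_SNOWFLAKE_ACCOUNT=$(az keyvault secret show \
      --vault-name kv-dbt-demo --name snowflake-account --query value -o tsv)
    export DBT_SNOWFLAKE_USER=$(az keyvault secret show \
      --vault-name kv-dbt-demo --name snowflake-user --query value -o tsv)
    export DBT_SNOWFLAKE_PASSWORD=$(az keyvault secret show \
      --vault-name kv-dbt-demo --name snowflake-password --query value -o tsv)

    # profiles.yml is expected to reference these values via env_var().
    dbt run --profiles-dir . --project-dir ./dbtoncloud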

Why not dbt RPC?

Using dbt RPC would require hosting dbt in a container that might need to be up and available 24/7. You would also need to increase the container's size, as it might have to process multiple dbt data pipelines in parallel.

Observations

Cost

Our billing came to less than $2 per day during our multiple execution runs of sample projects. Your mileage will vary when adopting this approach for larger projects.

Execution time

As mentioned earlier, the execution time varies across scenarios. For example, based on the data volume, a typical initial data load could run longer while incremental updates could be shorter.

Logging

All logging can be sent to Azure Monitor Logs. This will be an item to be implemented based on your individual use case and requirements.
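One way to do this, sketched below, is to attach a Log Analytics workspace when the container group is created so that everything written to the console lands in Azure Monitor Logs. The workspace values and names are placeholders.

    # Sketch: send container console output to Azure Monitor Logs via Log Analytics.
    az container create \
      --resource-group rg-dbt-demo \
      --name acg-dbtoncloud-datapipeline \
      --image myacrregistry.azurecr.io/dbt-on-aci:latest \
      --log-analytics-workspace <workspace-id> \
      --log-analytics-workspace-key <workspace-key>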

AWS Fargate vs. Azure Container Instance

If given a choice between AWS Fargate and ACI, I would choose the AWS Fargate approach. It allowed me to create a single task template, use it widely across multiple projects, and scale horizontally, in large part because of the flexibility to change environment variables on each run. In ACI, changing the environment variables of the same instance is not feasible, especially for short-lived ACI instances.

If you want dbt to run inside your network, it is just a small configuration change in the case of Fargate. In Azure, you have to work through the typical network hosting considerations, as you would with other Azure resources.

Final Thoughts

This pattern and reference implementation are based on well-architected principles. They might not address all the points, but they will get you started. Check out my follow-up blog post on integrating dbt with Azure Data Factory (ADF):

Ready to Accelerate Your Digital Transformation?

At Hashmap, we work with our clients to build better together.

If you’d like additional assistance in this area, Hashmap offers a range of enablement workshops and consulting service packages as part of our consulting service offerings, and I would be glad to work through your specifics. Reach out to us here.

Are you looking to gain a better understanding of what approaches, solutions, and tools are available in the data integration space and how to best address your specific integration requirements?

Hashmap’s Data Integration Workshop is an interactive, two-hour experience for you and your team where we provide a high-value, vendor-neutral sounding board to help you accelerate your data integration decision-making and selection process. Based on our experience, we’ll talk through best-fit options for both on-premises and cloud-based data sources and approaches to address a wide range of requirements. Sign up today for our complimentary workshop.

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast Hashmap on Tap, available on Spotify, Apple, Google, and other popular streaming apps.


Venkat Sekar is a Senior Architect for Hashmap, an NTT DATA Company, and provides Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers.
