Best Practices for Provisioning Databricks Infrastructure on Azure via Terraform

Gergo Szekely-Toth
Hiflylabs

The Project

In this guide, you will learn to quickly and easily set up a basic infrastructure proof of concept.

My original goal was to design a blueprint for an ‘ideal’ Databricks infrastructure, one that is:

  1. Supportive of the setups most of our current and future clients use.
  2. Usable to define the Hiflylabs standards and best practices for Databricks infrastructures.
  3. Documented well enough that a junior colleague can set up a new infrastructure using my guidelines.
  4. Extensible, so the infrastructure can be expanded and modified to fit your exact needs.

Need for Automation

The proposed architecture (see below) is complex enough to require automation for the setup and ongoing development of the Delta Lakehouse.

We implement this project as ‘Infrastructure as Code’ for the following reasons:

  1. Consistency of repeatable deployments across environments.
  2. Testability throughout infrastructure configurations.
  3. Rapid recovery capabilities.
  4. Self-documentation of the system’s architecture and configuration.
  5. Version control and easier rollbacks (if needed).
  6. Collaboration through team code review.

The two main areas where automation comes into the picture are setting up cloud resources and the release cycle of the Delta Lakehouse.

Proposed Architecture

The first step was to define what an ideal basic infrastructure is.

For this phase, I leveraged an architecture we recently proposed to a client for their data platform design. Rather than synthesizing various possible architectures to find a common denominator, I determined that this recent design could effectively address our first requirement (Consistency).

Additionally, selecting this architecture provides us with a strategic advantage should the client proceed with implementation in the future. 😉

For the purposes of this blog post, I’ve adapted the architecture into a more generalized design to maintain client confidentiality:

Let’s zoom in on the main elements of this proposal:

Cloud Storage

You might be tempted to separate the elements of the medallion architecture for a given environment into different Storage Accounts. For this project, I didn't deem it necessary: separation is possible, but the benefit of such granular division is not apparent, and the design can easily be adapted if business requirements change.

Should all data related to an environment be stored in a single Storage Account? No, but let's think it through:

  • Unity Metastore requires its own Storage Account.
  • Other layers could potentially use Unity to manage all catalogs in its own Storage Account.
  • Separating environments provides flexibility for differing needs.
  • A middle ground is achieved by separating only the Raw Source and Delta Lake for each environment.
  • This separation allows for independent security layers for both the Source Landing space and Delta Lakehouse space.

More info about physical separation can be found in the Unity Catalog best practices document.
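
To make the middle-ground option above concrete, here is a minimal, illustrative sketch of two ADLS Gen2 Storage Accounts for one environment, one for the Raw Source landing area and one for the Delta Lakehouse (all names and settings are assumptions, not the repo's exact code):

# Illustrative only: separate accounts let the landing area and the lakehouse
# carry independent security settings.
resource "azurerm_storage_account" "raw_source_dev" {
  name                     = "rawsourcedevdbxtf" # must be globally unique
  resource_group_name      = "dbx-terraform-bootstrap"
  location                 = "westeurope"
  account_tier             = "Standard"
  account_replication_type = "LRS"
  is_hns_enabled           = true # hierarchical namespace for ADLS Gen2
}

resource "azurerm_storage_account" "delta_lake_dev" {
  name                     = "deltalakedevdbxtf"
  resource_group_name      = "dbx-terraform-bootstrap"
  location                 = "westeurope"
  account_tier             = "Standard"
  account_replication_type = "LRS"
  is_hns_enabled           = true
}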

Unity Catalog

Unity Catalog is solely responsible for the catalog functionality and access control.

The Delta Lakehouse layer should only know about (and use) the data via the three levels of the Unity Catalog namespace. No physical locations should be used in the Delta Lakehouse layer.
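
The three levels are catalog, schema, and table (e.g. dev.silver.customers). A minimal sketch of defining such namespace objects with the Databricks Terraform provider (names are illustrative, not from the repo):

# Illustrative only: downstream code references data as <catalog>.<schema>.<table>,
# never as a physical storage path.
resource "databricks_catalog" "dev" {
  name    = "dev"
  comment = "Dev environment catalog governed by Unity Catalog"
}

resource "databricks_schema" "silver" {
  catalog_name = databricks_catalog.dev.name
  name         = "silver"
}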

Delta Lakehouse

In this design, the Delta Lakehouse uses a Databricks platform and it’s built via dbt.

Hiflylabs has extensive experience using dbt to build and manage lakehouses on Databricks platforms, so the dbt component is included in the design to support lakehouse development with out-of-the-box functionality. However, dbt is not a must for building a lakehouse: it could be built entirely on Databricks itself if that's a requirement.

Terraform

Terraform emerged as the optimal solution for automating cloud resource setup. Given Hiflylabs’ prior experience with Terraform for Snowflake infrastructure provisioning, we were confident in its applicability to our current needs. Although the internal project team had limited Terraform experience, we embraced this as an opportunity for skill development.

CI/CD

  • For the Continuous Integration (CI) part, we will use GitHub Actions as a base to build on.
  • For the Continuous Deployment (CD) part, we will include a few different types of test examples that can be used to develop project-specific requirements.

Installation Guide

Requirements

  • Terraform
  • Azure CLI
  • A Microsoft Azure subscription (Azure is the cloud provider used in this guide)
  • Rights to create Azure resources

Note: The following installation steps have worked on my macOS Sonoma (Version 14.4.1). For other operating systems, please refer to the official documentation linked below each software.

Terraform

I installed Terraform using Homebrew. I opened a terminal and ran the following commands:

$ brew update
$ brew tap hashicorp/tap
$ brew install hashicorp/tap/terraform

To verify the installation I ran:

$ terraform -help

See the official Terraform installation instructions for other operating systems and for troubleshooting.

Azure CLI

I also used Homebrew to install the Azure CLI. I ran the following in the terminal:

$ brew install azure-cli

To verify the installation I ran:

$ az -h

Follow the official Azure CLI installation instructions if you get stuck.

Authentication

For Azure authentication I used the Azure CLI.

I ran the following in the terminal:

$ az login

Note: It is enough to authenticate through 'az login' once; Terraform can then use the OAuth 2.0 token for subsequent authentications.
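
For context, this is roughly what the provider configuration can look like when relying on Azure CLI authentication (a sketch, assuming the subscription ID variable from 'terraform.tfvars' is passed to the provider; not necessarily the repo's exact code):

# With Azure CLI authentication the azurerm provider picks up credentials from
# the `az login` token cache, so no client ID/secret has to live in the code.
provider "azurerm" {
  features {}
  subscription_id = var.azure-subscription-id
}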

Implementation

While Terraform code can be consolidated into a single file for basic functionality, we prioritized testing modular code structures to ensure scalability for production-grade implementations.

Note: The code is available on Github — use v_1.0 release for this post.

Structure of the code base

The code base is broken up into two main folders:

  • modules/ — Contains reusable code packaged into separate modules.
      • azure/ — Manages the Azure resource group.
      • dbx-cluster/ — Creates either auto-scaling or single-user clusters.
      • dbx-repos/ — Imports the desired repositories into the DBX workspace.
      • dbx-workspace/ — Creates a DBX workspace.
      • unity-catalog-azure/ — Creates the Azure resources necessary for the DBX workspace.
      • unity-catalog-metastore/ — Creates the DBX account resources for the Unity metastore.
      • unity-catalog-workspace-assignment/ — Assigns a workspace to the Unity metastore.
  • projects/ — Defines an Azure environment from beginning to end. It calls the modules with different parameters to avoid code repetition (see the sketch after this list).
      • adb-lakehouse/ — Builds all the resources required for this Proof of Concept.
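
A project composes these modules by calling them with different parameters. A minimal, illustrative sketch of such a module call (the module path and input names are assumptions, not the repo's exact code):

module "dbx-workspace" {
  source = "../../modules/dbx-workspace"

  workspace_name      = "dbx-terraform-bootstrap"
  resource_group_name = "dbx-terraform-bootstrap"
  location            = "westeurope"
}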

Modules

The structure of the modules is the following:

Projects

The structure of the projects is the following:

(*) In Terraform, there are two principal elements when building scripts: resources and data sources. A resource is something that will be created and managed by the script; a data source is something that Terraform expects to already exist.
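
To illustrate the difference (the resource and data source types are real azurerm examples, but the values are made up):

# A resource: created and managed by the script.
resource "azurerm_databricks_access_connector" "this" {
  name                = "access-connector"
  resource_group_name = "dbx-terraform-bootstrap"
  location            = "westeurope"

  identity {
    type = "SystemAssigned"
  }
}

# A data source: only read by Terraform, expected to already exist.
data "azurerm_subscription" "current" {}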

Variables

The ‘variables.tf’ file contains the variable declarations together with the specification of the default values. The ‘terraform.tfvars’ file contains the value assignments to the declared variables. Note: entries in ‘terraform.tfvars’ overwrite the default values specified in ‘variables.tf’.

All the non-sensitive variables contain default values. When there is no entry for a variable in ‘terraform.tfvars’, the default value is taken.

The variables that are declared as sensitive do not have default values, so they must be set in 'terraform.tfvars' in order to run the code.
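
As an illustration (the variable name below is hypothetical, not necessarily present in the repo):

# variables.tf: declaration with a default value
variable "dbx-cluster-node-type" {
  type    = string
  default = "Standard_DS3_v2"
}

# terraform.tfvars: this assignment overrides the default above
dbx-cluster-node-type = "Standard_DS4_v2"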

Running the code

Check out the repository

$ git clone https://github.com/Hiflylabs/databricks-terraform-bootstrap

Create and configure ‘terraform.tfvars’

Use a text editor and ‘terraform.tfvars.template’ as a basis.

Specify at least the sensitive values (a placeholder example follows the list):

  • azure-subscription-id
  • dbx-account-id
  • git-username
  • git-personal-access-token
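
A filled-in 'terraform.tfvars' might look like this (the values are placeholders; substitute your own identifiers and secrets, and never commit them to version control):

azure-subscription-id     = "00000000-0000-0000-0000-000000000000"
dbx-account-id            = "00000000-0000-0000-0000-000000000000"
git-username              = "your-github-username"
git-personal-access-token = "<your-personal-access-token>"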

Initialize Terraform

$ terraform init

Run the script

$ terraform apply

Results

Terraform Output

Azure Resource Group

module.azure[0].azurerm_resource_group.this: Creation complete after 2s [id=/subscriptions/9869e986-3f70-4d81-9f1d-7a7b29328568/resourceGroups/dbx-terraform-bootstrap]

Azure Databricks Access Connector

module.unity-catalog-azure.azurerm_databricks_access_connector.acces_connector: Creation complete after 17s [id=/subscriptions/9869e986-3f70-4d81-9f1d-7a7b29328568/resourceGroups/dbx-terraform-bootstrap/providers/Microsoft.Databricks/accessConnectors/access-connector]

Azure Storage Container for Unity Catalog

module.unity-catalog-azure.azurerm_storage_container.unity_catalog: Creation complete after 1s [id=https://unitistoragedbxterraform.blob.core.windows.net/unitymetastore]

Azure Blob Data Contributor Role Assignment

module.unity-catalog-azure.azurerm_role_assignment.unity_blob_data_contributor: Creation complete after 26s [id=/subscriptions/9869e986-3f70-4d81-9f1d-7a7b29328568/resourceGroups/dbx-terraform-bootstrap/providers/Microsoft.Storage/storageAccounts/unitistoragedbxterraform/providers/Microsoft.Authorization/roleAssignments/35f9a12e-0dd4-0282-25e4-24d223-7c3e4f]

Databricks Workspace

module.dbx-workspace.azurerm_databricks_workspace.this: Creation complete after 2m25s [id=/subscriptions/9869e986-3f70-4d81-9f1d-7a7b29328568/resourceGroups/dbx-terraform-bootstrap/providers/Microsoft.Databricks/workspaces/dbx-terraform-bootstrap]

Databricks Unity Catalog Workspace Assignment

module.unity-catalog-metastore.module.unity-catalog-workspace-assignment.databricks_metastore_assignment.prod: Creation complete after 0s [id=3257333208545686|9d430cc4-e7dd-4b97-b095-9eb24226ac99]

Unity Metastore Data Access

module.unity-catalog-metastore.databricks_metastore_data_access.access-connector-data-access: Creation complete after 2s [id=9d430cc4-e7dd-4b97-b095-9eb24226ac99|access-connector]

Databricks Repos

module.dbx-repos.databricks_git_credential.ado: Creation complete after 2s [id=764573043456860]
module.dbx-repos.databricks_repo.all["repo_1"]: Creation complete after 7s [id=106725931470002]

Databricks Cluster

module.dbx-auto-scaling-clusters[0].databricks_cluster.shared_autoscaling["cluster_1"]: Creation complete after 7m16s [id=0507-085727-ym3yf34v]

Terraform Codebase

After running 'terraform apply', the following files and folders were created by Terraform:

Azure Resources

The following Azure resources were created:

  • Access Connector for Azure Databricks
  • Azure Databricks Service
  • Managed Identity
  • Network security group
  • Storage account
  • Virtual network

Databricks Resources

Workspace with Repo

The specified GitHub repository was imported.

Cluster

The specified auto-scaling cluster was created.

Unity Catalog

A Unity Catalog was created and assigned to the Workspace.

Findings

General experiences

  • I was glad to see how well the product works; it handles both the initial infrastructure setup and subsequent changes gracefully.
  • There is a slight learning curve, but it’s reasonably easy to get a feel for the inner workings of the product.
  • It gives me peace of mind when setting up environments — just like unit testing makes me sleep better.

Conditional creation of cloud resources

I was looking for a way to decide whether the Azure resource group (or any other resource) should be created or not. I only found a workaround that provides this functionality.

In Terraform we can use conditional statements in the following syntax:

<boolean expression> ? <return value for true> : <return value for false>

I solved the conditional creation of the Azure resource group this way:

count = var.create-azure-resource-group ? 1 : 0

(source)

When we set count on a Terraform resource, Terraform creates that many instances of it. In the case above, the conditional expression sets the value to either 0 or 1, based on the configuration variable. When count is set to 0, the creation of that resource is skipped entirely.
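
Putting it together, a minimal sketch of the pattern (the variable name and conditional mirror the ones above, but the resource values are illustrative):

variable "create-azure-resource-group" {
  type    = bool
  default = true
}

# Created only when the variable is true; skipped entirely when it is false.
resource "azurerm_resource_group" "this" {
  count    = var.create-azure-resource-group ? 1 : 0
  name     = "dbx-terraform-bootstrap"
  location = "westeurope"
}

# Note: with count set, references need an index, e.g. azurerm_resource_group.this[0].name.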

Single vs. Multiple Creation of Resources

Running the Terraform code handled the creation of resources elegantly: resources were created once and simply updated on subsequent runs.

This held true until I started creating Databricks clusters via Terraform.

For the clusters, however, the behavior changed: every run of terraform apply created a new Databricks cluster with the same name as the previous one.

My solution to this issue was to set count for the resource (a sketch follows the sources below).

(source 1) (source 2)
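
For reference, a minimal sketch of a cluster resource guarded by count, in the spirit of that fix (the variable name and every attribute value here are illustrative assumptions, not the repo's exact code):

variable "create-dbx-cluster" {
  type    = bool
  default = true
}

resource "databricks_cluster" "shared_autoscaling" {
  # Only one cluster is kept at this state address; set the variable to false to skip it.
  count                   = var.create-dbx-cluster ? 1 : 0
  cluster_name            = "shared-autoscaling"
  spark_version           = "14.3.x-scala2.12"
  node_type_id            = "Standard_DS3_v2"
  autotermination_minutes = 20

  autoscale {
    min_workers = 1
    max_workers = 4
  }
}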

Unable to destroy an existing Unity metastore

Even though I set force_destroy = true for the Unity Metastore resource, I kept getting the following error message when running 'terraform destroy'. I could not figure out a solution to this problem in time, so I ended up deleting the metastore manually when I needed to.

(source)

The error I get:

│ Error: cannot delete metastore data access: Storage credential ‘access-connector’ cannot be deleted because it is configured as this metastore’s root credential. Please update the metastore’s root credential before attempting deletion.│

Extended Infrastructure

There are a few more areas we need to consider when developing production code:

  1. Handling user and group permissions.
  2. Setting up dbt.
  3. Coming up with efficient ways of creating multiple environments. As of now, it is not clear whether it makes more sense to:
  • create separate projects for every environment,
  • use separate .tfvars files and specify separate destinations for the Terraform files for every environment,
  • or have a single monolithic project that creates everything and specify the differences in configuration.
  4. Handling deployment of Databricks code via Terraform.

Future Development Roadmap

  1. Extend the Proof of Concept:
  • Implement CI/CD automation in a subsequent iteration.
  • Scale to encompass all environments, Unity catalogs, and storage accounts (leveraging existing expertise).
  2. Expand the Extended Infrastructure implementation.
  3. Address Unity Metastore management:
  • Develop a Terraform-based destruction process for the Unity Metastore.
  • Evaluate the appropriateness of Terraform for Unity Metastore management.

Conclusion

Terraform proves to be an effective tool for automating cloud resource creation (Infrastructure as Code) in Databricks environments.

While there’s a learning curve to writing manageable Terraform code, its support for modular design enhances efficiency and maintainability.

As data engineers, we are expected to set up and manage non-production environments (e.g., development, testing), so proficiency in this area is valuable.

The investment in automating infrastructure setup through code consistently yields benefits, reducing manual effort and potential errors.

All in all, this approach not only saves time but also improves consistency across environments, making it easier to scale and modify the infrastructure as project requirements evolve.
