Deploy Azure Databricks using Terraform

Alonso Medina Donayre
7 min read · Oct 18, 2023


We can provision both Azure and Databricks infrastructure with Terraform as infrastructure as code (IaC). In this tutorial, I’m going to show you how.

Requirements

To follow along you will need an Azure subscription, the Azure CLI, and Terraform installed on your machine.

Terraform Code

To get the latest version of this code, please visit the repository below.

Resources to be deployed

In this tutorial we are going to do a simple deployment of Azure Databricks following the medallion architecture. We won’t use storage mounts, since mounting is a deprecated pattern; instead, the cluster will access the storage account directly through a service principal.

The main resources to be deployed are the following:

  • Azure Resource Group
  • Azure Storage Account
  • Azure Service Principal
  • Azure Key Vault
  • Azure Databricks Workspace
  • Databricks Cluster

Steps to Deploy Azure Databricks

Azure CLI Login

Before running any Terraform code, you need to authenticate to Azure. In your preferred terminal (PowerShell, zsh, bash, Git Bash), run the command below:

az login

Terraform Project Structure

Create a folder called “databricks” in your preferred location with the following 3 files:

  • main.tf
  • providers.tf
  • variables.tf

All the commands in this tutorial are run from the databricks folder.

Working directory structure

Terraform Providers

Terraform providers are plugins that let Terraform, an open-source infrastructure as code (IaC) tool, interact with cloud platforms, infrastructure, and other services. Each provider defines and manages the resources offered by its platform within a Terraform configuration.

In this tutorial we use the following four providers: azurerm, databricks, azuread, and time.

Copy the code below into the providers.tf file.

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "3.72.0"
    }
    databricks = {
      source  = "databricks/databricks"
      version = "1.27.0"
    }
    azuread = {
      source  = "hashicorp/azuread"
      version = "2.43.0"
    }
    time = {
      source  = "hashicorp/time"
      version = "0.9.1"
    }
  }
}

provider "azurerm" {
  features {}
  skip_provider_registration = true
}

provider "azuread" {
  # Configuration options
}

provider "time" {
  # Configuration options
}

provider "databricks" {
  host                        = azurerm_databricks_workspace.dbwdata01.workspace_url
  azure_workspace_resource_id = azurerm_databricks_workspace.dbwdata01.id
}

Terraform Initialization

Terraform initialization is a critical step in the Terraform workflow. It prepares your working directory and downloads and installs the components a Terraform project needs, such as providers.

Follow the steps below to initialize Terraform:

  1. Open your terminal
  2. Go to the databricks folder path
  3. Run the command below
terraform init

Terraform Variables

Terraform variables serve several essential purposes in Terraform configurations, allowing you to make your infrastructure code more dynamic, reusable, and maintainable.

Copy the code below into the variables.tf file.

Note: Make sure to set a unique value for the “company” variable; otherwise, you will get an error when deploying, because resources such as the storage account and the key vault need globally unique names.

variable "env" {
type = string
default = "prod"
}

variable "dbwscope" {
type = string
default = "azkvdbwscope"
}

variable "stgaccname" {
type = string
default = "stacdata"
}

variable "default_location" {
default = "East US 2"
type = string
}

# Change the default value for a unique name
variable "company" {
default = ""
type = string
}


variable "secretsname" {
type = map
default = {
"databricksappsecret" = "databricksappsecret"
"databricksappclientid" = "databricksappclientid"
"tenantid" = "tenantid"
}
}
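
The value for “company” can be set directly in variables.tf, or, if you prefer not to edit the defaults, through a terraform.tfvars file in the same folder. The sketch below is only illustrative; the values (including “contoso”) are assumptions to be replaced with your own:

# terraform.tfvars (optional) -- values placed here override the defaults in variables.tf
# "contoso" is just an example; choose a short string that is unique to you
company          = "contoso"
env              = "prod"
default_location = "East US 2"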

Terraform Main Code

Terraform allows you to define your infrastructure as code, meaning you can describe your entire cloud and infrastructure setup in human-readable and version-controlled configuration files. This approach provides numerous benefits, including version control, code review, and reproducibility of your infrastructure.

The code below lists all the resources for our architecture; the relationships and dependencies between them are handled by Terraform.

Copy the code below into the main.tf file.

data "azurerm_client_config" "current" {}
data "azuread_client_config" "current" {}
data "azurerm_subscription" "primary" {}

data "azuread_service_principal" "azuredatabricks" {
display_name = "AzureDatabricks"
}

locals {
stgaccname = "${var.stgaccname}${var.company}${var.env}01"
}

# Create main resource group
resource "azurerm_resource_group" "rgdata01" {
name = "rgdata${var.company}${var.env}01"
location = "${var.default_location}"
}

# Create storage account
resource "azurerm_storage_account" "stacdata01" {
  name                     = local.stgaccname
  resource_group_name      = azurerm_resource_group.rgdata01.name
  location                 = azurerm_resource_group.rgdata01.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  is_hns_enabled           = true

  blob_properties {
    delete_retention_policy {
      days = 1
    }
    container_delete_retention_policy {
      days = 1
    }
  }

  tags = {
    environment = var.env
  }
}

# Create containers for bronze, silver and gold layers
resource "azurerm_storage_container" "ctdatabronze" {
  name                  = "ctdatabronze"
  storage_account_name  = azurerm_storage_account.stacdata01.name
  container_access_type = "private"
}

resource "azurerm_storage_container" "ctdatasilver" {
  name                  = "ctdatasilver"
  storage_account_name  = azurerm_storage_account.stacdata01.name
  container_access_type = "private"
}

resource "azurerm_storage_container" "ctdatagold" {
  name                  = "ctdatagold"
  storage_account_name  = azurerm_storage_account.stacdata01.name
  container_access_type = "private"
}

# Create databricks workspace
resource "azurerm_databricks_workspace" "dbwdata01" {
  name                = "dbwdata${var.company}${var.env}01"
  resource_group_name = azurerm_resource_group.rgdata01.name
  location            = azurerm_resource_group.rgdata01.location
  sku                 = "standard"

  tags = {
    environment = var.env
  }
}

# Create Key vault
resource "azurerm_key_vault" "kvdatabricks" {
  name                        = "kv${var.company}${var.env}"
  location                    = azurerm_resource_group.rgdata01.location
  resource_group_name         = azurerm_resource_group.rgdata01.name
  enable_rbac_authorization   = false
  enabled_for_disk_encryption = true
  tenant_id                   = data.azurerm_client_config.current.tenant_id
  soft_delete_retention_days  = 7
  purge_protection_enabled    = false

  # Access policy for the principal running Terraform
  access_policy {
    tenant_id = data.azurerm_client_config.current.tenant_id
    object_id = data.azurerm_client_config.current.object_id

    key_permissions     = ["Get", "Create", "Delete", "List", "Restore", "Recover", "UnwrapKey", "WrapKey", "Purge", "Encrypt", "Decrypt", "Sign", "Verify", "Release", "Rotate", "GetRotationPolicy", "SetRotationPolicy"]
    secret_permissions  = ["Backup", "Delete", "Get", "List", "Purge", "Recover", "Restore", "Set"]
    storage_permissions = ["Backup", "Delete", "DeleteSAS", "Get", "GetSAS", "List", "ListSAS", "Purge", "Recover", "RegenerateKey", "Restore", "Set", "SetSAS", "Update"]
  }

  # Access policy for the AzureDatabricks account
  access_policy {
    tenant_id = data.azurerm_client_config.current.tenant_id
    object_id = data.azuread_service_principal.azuredatabricks.object_id

    secret_permissions = ["Get", "List"]
  }

  sku_name = "standard"
}

# Create Application
resource "azuread_application" "databricksapp" {
  display_name     = "svcprdatabricks${var.company}${var.env}"
  owners           = [data.azuread_client_config.current.object_id]
  sign_in_audience = "AzureADMyOrg"
}

# Create Service Principal
resource "azuread_service_principal" "databricksapp" {
  application_id               = azuread_application.databricksapp.application_id
  app_role_assignment_required = false
  owners                       = [data.azuread_client_config.current.object_id]

  feature_tags {
    enterprise = true
    gallery    = true
  }
}

resource "time_rotating" "two_years" {
  rotation_days = 720
}

# Create secret for App
resource "azuread_application_password" "databricksapp" {
  depends_on            = [azurerm_key_vault.kvdatabricks]
  display_name          = "databricksapp App Password"
  application_object_id = azuread_application.databricksapp.object_id

  rotate_when_changed = {
    rotation = time_rotating.two_years.id
  }
}

# Assign role to service principal
resource "azurerm_role_assignment" "databricksapp" {
  scope                = azurerm_storage_account.stacdata01.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = azuread_service_principal.databricksapp.id
}


# Store the app secret, client id and tenant id in the key vault
resource "azurerm_key_vault_secret" "databricksappsecret" {
  name         = var.secretsname["databricksappsecret"]
  value        = azuread_application_password.databricksapp.value
  key_vault_id = azurerm_key_vault.kvdatabricks.id
}

resource "azurerm_key_vault_secret" "databricksappclientid" {
  name         = var.secretsname["databricksappclientid"]
  value        = azuread_application.databricksapp.application_id
  key_vault_id = azurerm_key_vault.kvdatabricks.id
}

resource "azurerm_key_vault_secret" "tenantid" {
  name         = var.secretsname["tenantid"]
  value        = data.azurerm_client_config.current.tenant_id
  key_vault_id = azurerm_key_vault.kvdatabricks.id
}


# Create Databricks Cluster
data "databricks_node_type" "smallest" {
  depends_on = [azurerm_databricks_workspace.dbwdata01]
  local_disk = true
  category   = "General Purpose"
}

data "databricks_spark_version" "latest" {
  depends_on        = [azurerm_databricks_workspace.dbwdata01]
  latest            = true
  long_term_support = true
}

# Grab secrets from azure key vault
data "azurerm_key_vault_secret" "databricksappclientid" {
  depends_on   = [azurerm_key_vault_secret.databricksappclientid]
  name         = var.secretsname["databricksappclientid"]
  key_vault_id = azurerm_key_vault.kvdatabricks.id
}

data "azurerm_key_vault_secret" "databricksappsecret" {
  depends_on   = [azurerm_key_vault_secret.databricksappsecret]
  name         = var.secretsname["databricksappsecret"]
  key_vault_id = azurerm_key_vault.kvdatabricks.id
}

data "azurerm_key_vault_secret" "tenantid" {
  depends_on   = [azurerm_key_vault_secret.tenantid]
  name         = var.secretsname["tenantid"]
  key_vault_id = azurerm_key_vault.kvdatabricks.id
}

# Create Databricks Scope
resource "databricks_secret_scope" "dbwscope" {
  depends_on               = [azurerm_databricks_workspace.dbwdata01, azurerm_key_vault.kvdatabricks]
  name                     = var.dbwscope
  initial_manage_principal = "users"

  keyvault_metadata {
    resource_id = azurerm_key_vault.kvdatabricks.id
    dns_name    = azurerm_key_vault.kvdatabricks.vault_uri
  }
}

# Create Single Node Cluster
resource "databricks_cluster" "dbcluster01" {
  depends_on              = [databricks_secret_scope.dbwscope, data.azurerm_key_vault_secret.databricksappsecret]
  cluster_name            = "dbcluster${var.env}01"
  num_workers             = 0
  spark_version           = data.databricks_spark_version.latest.id # Other possible values ("13.3.x-scala2.12", "11.2.x-cpu-ml-scala2.12", "7.0.x-scala2.12")
  node_type_id            = data.databricks_node_type.smallest.id   # Other possible values ("Standard_F4", "Standard_DS3_v2")
  autotermination_minutes = 20

  spark_conf = {
    "spark.databricks.cluster.profile" = "singleNode"
    "spark.master"                     = "local[*]"

    "fs.azure.account.auth.type.${local.stgaccname}.dfs.core.windows.net"              = "OAuth"
    "fs.azure.account.oauth.provider.type.${local.stgaccname}.dfs.core.windows.net"    = "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
    "fs.azure.account.oauth2.client.id.${local.stgaccname}.dfs.core.windows.net"       = data.azurerm_key_vault_secret.databricksappclientid.value
    "fs.azure.account.oauth2.client.secret.${local.stgaccname}.dfs.core.windows.net"   = "{{secrets/${var.dbwscope}/${var.secretsname["databricksappsecret"]}}}"
    "fs.azure.account.oauth2.client.endpoint.${local.stgaccname}.dfs.core.windows.net" = "https://login.microsoftonline.com/${data.azurerm_key_vault_secret.tenantid.value}/oauth2/token"
  }

  custom_tags = {
    "ResourceClass" = "SingleNode"
  }
}
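
Optionally, and not part of the original configuration above, you can add an outputs.tf so that Terraform prints a few handy values, such as the workspace URL, once the apply finishes. A minimal sketch:

# outputs.tf (optional) -- expose useful values after terraform apply
output "databricks_workspace_url" {
  value = azurerm_databricks_workspace.dbwdata01.workspace_url
}

output "storage_account_name" {
  value = azurerm_storage_account.stacdata01.name
}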

Terraform Plan

The terraform plan command is used to preview the changes that Terraform will make to your infrastructure before actually applying them. It is a crucial step in the Terraform workflow.

Run the below command:

terraform plan

You will get a plan summary that tells you how many resources are going to be added, changed, or destroyed.

Terraform Apply

terraform apply is a Terraform command used to apply the changes specified in your Terraform configuration to your infrastructure. It is a critical step in the Terraform workflow, following the terraform plan command. When you run terraform apply, Terraform will make the necessary changes to your infrastructure to bring it in line with the desired state defined in your configuration.

Run the below command:

terraform apply

You will be asked whether you want to go forward with the deployment. If you are, type yes, press Enter, and wait a bit for Terraform to do its job and provision all of the resources.

Validation

Upload a file to one of the containers (bronze, silver, gold) and read it from your Databricks cluster using Spark.

csv file uploaded to bronze container
# spark.read.format("<format>").load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")
df = spark.read.format("csv").load("abfss://ctdatabronze@stacdataamdprod01.dfs.core.windows.net/customer.csv")
df.display()

Terraform Destroy

terraform destroy is a Terraform command used to tear down or destroy the infrastructure resources created and managed by Terraform.

To destroy all the Azure resources, run the command below.

terraform destroy
