How we’d set up our Databricks deployment pipelines with Terraform repositories in 2024

JD Braun
Databricks Platform SME
7 min read · Feb 7, 2024

This blog post and its accompanying video were co-authored by Tony Bo. A video covering the same content can be found here.

There are countless ways an organization can architect its Terraform repositories to deploy cloud infrastructure, from a single 400-line main.tf file all the way to everything wrapped in tidy modules called downstream.

Some of these architectures stand the test of time, while others become brittle over the years and introduce rate limits, circular dependencies, or the accidental coupling of unrelated resources.

In this post, Tony and I outline how in 2024 we’d architect our Databricks Terraform repository for long-term scalability and usability, driven by two basic principles: isolation and reusability.

These principles are not unique to a Terraform repository with Databricks, but we wanted to walk through practical examples of how they’re applied.

You can find the example code base used to deploy these resources here. No warranties or guarantees are associated with this repository, but please open GitHub issues as needed.

NOTE: This blog post assumes a base level of familiarity with Terraform and Databricks. Before proceeding, you should be familiar with the basics of Terraform (e.g., plan, apply, etc.) and Databricks (Unity Catalog, workspaces, etc.).

Isolating Your Environments:

The first key point we architect for is the isolation of Terraform resources that have no relationship to each other. The atomic unit we’ll be working with, from a Databricks perspective, is a workspace.

This could mean separation by designation (e.g., development, QA, and production) or by business unit (finance, human resources, analytics, etc.). We aim to ensure that the fate of one workspace is not tied to another, providing downstream flexibility to iterate without fear.

Example of Breaking out State Files

As illustrated in the picture above, our state files break out as follows (a sketch of a per-pipeline backend configuration follows the list):

  • Unity Catalog — State File: A state dedicated to a single resource, the Unity Catalog metastore. This is a linchpin in the architecture that we don’t want to risk by integrating it into other parts of the repository.
  • Logging — State File: A state file containing the Databricks and cloud resources for audit and billing logs. Again, this is something that should be entirely unconnected to the rest of the infrastructure.
  • N+1 Workspace — State File(s): A state file for each workspace environment containing the Databricks and cloud resources needed for the underlying workspace. Examples include Databricks account resources (networking, credential, and storage config), cloud resources (VPC, subnets, security groups), and Databricks workspace resources (Workspace isolated catalog, cluster policies, etc.).
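
To make this isolation concrete, here is a minimal sketch of how each pipeline could point at its own remote state. The bucket, key, and lock-table names are placeholders, and the example repository may wire its backends differently:

terraform {
  backend "s3" {
    # Placeholder names -- each pipeline gets its own key so state files never overlap
    bucket         = "example-terraform-state"
    key            = "databricks/development-workspace.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-terraform-locks"
  }
}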

In the Databricks account console, you’ll see a neatly organized list of workspaces like:

List of workspace examples

However, each of them is its own individual pipeline, completely independent of the others. Components added to the development workspace are eventually promoted into a higher environment. Independent pipelines for each environment mean faster iteration and less time spent narrowing down Terraform errors or miscellaneous issues.

Within each workspace, there are resources that can be controlled and tuned to that specific environment, such as cluster policies, workspace isolated catalogs, and more.
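
As an illustration, a per-environment cluster policy might look like the sketch below. The policy name and rules here are hypothetical and not taken from the example repository:

resource "databricks_cluster_policy" "development_policy" {
  name = "${var.resource_prefix}-development-policy" # hypothetical naming

  definition = jsonencode({
    "autotermination_minutes" = {
      "type"  = "fixed"
      "value" = 30
    }
    "aws_attributes.instance_profile_arn" = {
      "type" = "forbidden"
    }
  })
}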

Catalog Isolation — Across Three Workspaces

For example, despite having full privileges on all three catalogs, I can only see each catalog in the workspace it’s bound to.

Enforcing isolation in both Terraform and user behavior — a win-win situation.
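
For reference, binding a catalog to a single workspace can be expressed in Terraform roughly as follows. This is a sketch with hypothetical names, not the exact code from the databricks_workspace module:

resource "databricks_catalog" "workspace_catalog" {
  name           = var.uc_catalog_name # e.g. "<prefix>-catalog-<workspace id>"
  isolation_mode = "ISOLATED"          # only visible in workspaces it is explicitly bound to
  comment        = "Workspace-isolated catalog"
}

resource "databricks_workspace_binding" "workspace_catalog_binding" {
  securable_name = databricks_catalog.workspace_catalog.name
  workspace_id   = var.workspace_id
}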

Reusability for Writing Once and Using Everywhere:

The second principle focuses on reducing the amount of work needed to spin up new resources. Let’s illustrate this by creating a new Databricks environment as an example.

Happy Path — Creating a New Databricks Workspace:

  1. A team requests the creation of a Databricks workspace.
  2. The DevOps team creates a new Terraform repository and clones relevant content from previous environments.
  3. The DevOps team updates the .tfvars file with the appropriate variables for the new environment (see the sketch after this list) and runs terraform init to initialize the state file.
  4. Deploy!
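
As referenced in step 3, the per-environment .tfvars file is where the new workspace gets its identity. The values below are illustrative placeholders; the full variable set lives in the example repository:

# development.tfvars -- placeholder values only
resource_prefix       = "acme-dev"
region                = "us-east-1"
aws_account_id        = "111111111111"
databricks_account_id = "00000000-0000-0000-0000-000000000000"
vpc_cidr_range        = "10.0.0.0/18"
availability_zones    = ["us-east-1a", "us-east-1b"]
public_subnets_cidr   = ["10.0.0.0/22", "10.0.4.0/22"]
private_subnets_cidr  = ["10.0.32.0/22", "10.0.36.0/22"]
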
Example of a new_environment.tf calling its various modules.

In each module, there can be sub-modules that are called for specific purposes. For example, our first set of module calls, found in development, QA, or production, invokes three different modules. We call our cloud_provider module, which, as mentioned above, handles the creation of networking, credential, and storage resources for the workspace. Then, using outputs from the cloud provider, we call databricks_account, which creates the workspace, assigns the metastore, and adds any users. Finally, we interact with databricks_workspace, where we create cluster policies, workspace configurations, and a workspace catalog.

module "cloud_provider" {
source = "./cloud_provider"
providers = {
aws = aws
databricks = databricks.mws
}
aws_account_id = var.aws_account_id
databricks_account_id = var.databricks_account_id
resource_prefix = var.resource_prefix
region = var.region
vpc_cidr_range = var.vpc_cidr_range
availability_zones = var.availability_zones
public_subnets_cidr = var.public_subnets_cidr
private_subnets_cidr = var.private_subnets_cidr
sg_ingress_protocol = var.sg_ingress_protocol
sg_egress_ports = var.sg_egress_ports
sg_egress_protocol = var.sg_egress_protocol
dbfsname = var.dbfsname

}

module "databricks_account" {
source = "./databricks_account"
providers = {
databricks = databricks.mws
}

databricks_account_id = var.databricks_account_id
region = var.region
resource_prefix = var.resource_prefix
cross_account_role_arn = module.cloud_provider.cloud_provider_credential
bucket_name = module.cloud_provider.cloud_provider_storage
vpc_id = module.cloud_provider.cloud_provider_network_vpc
subnet_ids = module.cloud_provider.cloud_provider_network_subnets
security_group_ids = module.cloud_provider.cloud_provider_network_security_groups
metastore_id = var.metastore_id
user_name = var.user_name

}

module "databricks_workspace" {
source = "./databricks_workspace"
providers = {
aws = aws
databricks = databricks.workspace
}

aws_account_id = var.aws_account_id
databricks_account_id = var.databricks_account_id
resource_prefix = var.resource_prefix
workspace_id = module.databricks_account.workspace_id
uc_catalog_name = "${var.resource_prefix}-catalog-${module.databricks_account.workspace_id}"
workspace_catalog_admin = var.user_name
team = var.team

}

Going deeper into one of these modules, cloud_provider, you can see that it is further broken down into sub-modules. We break it out by network, credential, and storage; in other words, the workspace’s VPC, subnets, and security groups, the cross-account role for cluster spin-up, and the root S3 bucket for the workspace.

module "cloud_provider_network" {
source = "../../common_modules_cloud_provider/cloud_provider_network"

vpc_cidr_range = var.vpc_cidr_range
availability_zones = var.availability_zones
resource_prefix = var.resource_prefix
public_subnets_cidr = var.public_subnets_cidr
private_subnets_cidr = var.private_subnets_cidr
sg_ingress_protocol = var.sg_ingress_protocol
sg_egress_ports = var.sg_egress_ports
sg_egress_protocol = var.sg_egress_protocol

}

module "cloud_provider_credential" {
source = "../../common_modules_cloud_provider/cloud_provider_credential"

aws_account_id = var.aws_account_id
databricks_account_id = var.databricks_account_id
resource_prefix = var.resource_prefix
region = var.region
vpc_id = module.cloud_provider_network.cloud_provider_network_vpc
security_group_ids = module.cloud_provider_network.cloud_provider_network_security_groups

depends_on = [ module.cloud_provider_network ]
}

module "cloud_provider_storage" {
source = "../../common_modules_cloud_provider/cloud_provider_storage"

dbfsname = var.dbfsname
}
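
For the wiring above to work, the cloud_provider module re-exports its sub-modules’ values as outputs, which is what new_environment.tf references as module.cloud_provider.cloud_provider_credential and friends. A sketch of that outputs.tf is below; the output names on the credential and storage sub-modules are assumptions:

output "cloud_provider_network_vpc" {
  value = module.cloud_provider_network.cloud_provider_network_vpc
}

output "cloud_provider_network_subnets" {
  value = module.cloud_provider_network.cloud_provider_network_subnets
}

output "cloud_provider_network_security_groups" {
  value = module.cloud_provider_network.cloud_provider_network_security_groups
}

output "cloud_provider_credential" {
  value = module.cloud_provider_credential.cross_account_role_arn # assumed output name
}

output "cloud_provider_storage" {
  value = module.cloud_provider_storage.root_bucket_name # assumed output name
}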

Then, if we open up the cloud_provider_network module, we can see the reusable code that is going to be called by each environment. This gives us flexibility to parameterize any of the variables needed for these resources.

module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "5.1.1"

name = "${var.resource_prefix}-data-plane-VPC"
cidr = var.vpc_cidr_range
azs = var.availability_zones

enable_dns_hostnames = true
enable_nat_gateway = true
single_nat_gateway = false
one_nat_gateway_per_az = true
create_igw = true

public_subnet_names = [for az in var.availability_zones : format("%s-public-%s", var.resource_prefix, az)]
public_subnets = var.public_subnets_cidr

private_subnet_names = [for az in var.availability_zones : format("%s-private-%s", var.resource_prefix, az)]
private_subnets = var.private_subnets_cidr
}

// SG
resource "aws_security_group" "sg" {
vpc_id = module.vpc.vpc_id
depends_on = [module.vpc]

dynamic "ingress" {
for_each = var.sg_ingress_protocol
content {
from_port = 0
to_port = 65535
protocol = ingress.value
self = true
}
}

dynamic "egress" {
for_each = var.sg_egress_protocol
content {
from_port = 0
to_port = 65535
protocol = egress.value
self = true
}
}

dynamic "egress" {
for_each = var.sg_egress_ports
content {
from_port = egress.value
to_port = egress.value
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
}
tags = {
Name = "${var.resource_prefix}-data-plane-sg"
}
}

module "vpc_endpoints" {
source = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"
version = "3.11.0"

vpc_id = module.vpc.vpc_id
security_group_ids = [aws_security_group.sg.id]

endpoints = {
s3 = {
service = "s3"
service_type = "Gateway"
route_table_ids = module.vpc.private_route_table_ids
tags = {
Name = "${var.resource_prefix}-s3-vpc-endpoint"
}
},
sts = {
service = "sts"
private_dns_enabled = true
subnet_ids = length(module.vpc.private_subnets) > 0 ? slice(module.vpc.private_subnets, 0, min(2, length(module.vpc.private_subnets))) : []
tags = {
Name = "${var.resource_prefix}-sts-vpc-endpoint"
}
},
kinesis-streams = {
service = "kinesis-streams"
private_dns_enabled = true
subnet_ids = length(module.vpc.private_subnets) > 0 ? slice(module.vpc.private_subnets, 0, min(2, length(module.vpc.private_subnets))) : []
tags = {
Name = "${var.resource_prefix}-kinesis-vpc-endpoint"
}
}
}
depends_on = [
module.vpc
]
}
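
The parameterization comes from a plain variables.tf in the same module. The declarations below are a sketch; the types and descriptions are assumptions rather than copies from the repository:

variable "resource_prefix" {
  type        = string
  description = "Prefix applied to resource names"
}

variable "vpc_cidr_range" {
  type        = string
  description = "CIDR range for the data plane VPC"
}

variable "availability_zones" {
  type        = list(string)
  description = "Availability zones used for the subnets"
}

variable "public_subnets_cidr" {
  type        = list(string)
  description = "CIDR ranges for the public subnets"
}

variable "private_subnets_cidr" {
  type        = list(string)
  description = "CIDR ranges for the private cluster subnets"
}

# The sg_ingress_protocol, sg_egress_protocol, and sg_egress_ports inputs are declared the same way.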

There are a number of ways to architect this. However, we’ve chosen this path to minimize the amount of novel Terraform code we have to write.

Wrapping It All Up:

As we mentioned at the beginning of this blog post, there are numerous ways to structure Terraform repositories for deploying Databricks environments, and this is just one of many examples.

However, regardless of your structure, we recommend adhering to the two principles of isolation and reusability in 2024.

Isolation: Isolate your environments so they cannot collide with one another. This helps avoid rate limits and circular dependencies. Most importantly, it ensures that an update in a development environment does not inadvertently force a refresh of the production state.

Reusability: Reduce the workload on your DevOps team over time by writing common, reusable modules that can be leveraged across all environments. This not only ensures consistency across your Databricks workspaces but also makes it easy to deploy new resources and update existing ones.

As noted above, we’ve published a working example repository for this blog post here. No warranties or guarantees, but GitHub issues are appreciated.

So, what are your thoughts? Do you think we missed anything? Are there any other key components essential for deploying scalable infrastructure on Databricks that we should include?
