How we’d set up our Databricks deployment pipelines with Terraform repositories in 2024

JD Braun
Databricks Platform SME
7 min read · Feb 7, 2024

This blog post and its accompanying video were co-authored by Tony Bo. A video covering the same content can be found here.

There are countless ways an organization can architect its Terraform repositories to deploy cloud infrastructure, from a single 400-line main.tf file all the way to everything wrapped in tidy modules called downstream.

Some of these architectures stand the test of time, while others become brittle over the years and introduce rate limits, circular dependencies, or the accidental coupling of unrelated resources.

In this post, Tony and I outline how in 2024 we’d architect our Databricks Terraform repository for long-term scalability and usability, driven by two basic principles: isolation and reusability.

These principles are not unique to a Terraform repository with Databricks, but we wanted to walk through practical examples of how they’re applied.

You can find the example code base used to deploy these resources here. No warranties or guarantees are associated with this repository, but please open GitHub issues as needed.

NOTE: This blog post assumes a base level of familiarity with Terraform and Databricks. Before proceeding, you should be familiar with the basics of Terraform (e.g., plan, apply, etc.) and Databricks (Unity Catalog, workspaces, etc.).

Isolating Your Environments:

The first key point we architect for is the isolation of Terraform resources that have no relationship to each other. The atomic unit we’ll be working with, from a Databricks perspective, is a workspace.

This could mean separation by designation (e.g., development, QA, and production) or by business unit (finance, human resources, analytics, etc.). We aim to ensure that the fate of one workspace is not tied to another, providing downstream flexibility to iterate without fear.

Example of Breaking out State Files

As illustrated in the picture above, our state files break out as follows (a sketch of a per-pipeline backend configuration follows the list):

  • Unity Catalog — State File: A state dedicated to a single resource, the Unity Catalog metastore. This is a linchpin in the architecture that we don’t want to risk by integrating it into other parts of the repository.
  • Logging — State File: A state file containing the Databricks and cloud resources for audit and billing logs. Again, this is something that should be entirely unconnected to the rest of the infrastructure.
  • N+1 Workspace — State File(s): A state file for each workspace environment containing the Databricks and cloud resources needed for the underlying workspace. Examples include Databricks account resources (networking, credential, and storage config), cloud resources (VPC, subnets, security groups), and Databricks workspace resources (Workspace isolated catalog, cluster policies, etc.).
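
To make this isolation concrete, here is a minimal sketch of how each pipeline could point at its own remote state. The bucket, key, and lock-table names are placeholders, and the example repository may wire its backends differently:

terraform {
  backend "s3" {
    # Placeholder names -- each pipeline gets its own key so state files never overlap
    bucket         = "example-terraform-state"
    key            = "databricks/development-workspace.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-terraform-locks"
  }
}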

In the Databricks account console, you’ll see a neatly organized list of workspaces like:

List of workspace examples

However, each of them is its own individual pipeline, completely independent of the others. Components added to the development workspace are eventually promoted into a higher environment. Independent pipelines for each environment mean faster iteration and less time spent narrowing down Terraform errors or miscellaneous issues.

Within each workspace, there are resources that can be controlled and tuned to that specific environment, such as cluster policies, workspace isolated catalogs, and more.
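
As an illustration, a per-environment cluster policy might look like the sketch below. The policy name and rules here are hypothetical and not taken from the example repository:

resource "databricks_cluster_policy" "development_policy" {
  name = "${var.resource_prefix}-development-policy" # hypothetical naming

  definition = jsonencode({
    "autotermination_minutes" = {
      "type"  = "fixed"
      "value" = 30
    }
    "aws_attributes.instance_profile_arn" = {
      "type" = "forbidden"
    }
  })
}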

Catalog Isolation — Across Three Workspaces

For example, despite having full privileges on all three catalogs, I can only see each catalog in the workspace it’s bound to.

Enforcing isolation in both Terraform and user behavior — a win-win situation.
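
For reference, binding a catalog to a single workspace can be expressed in Terraform roughly as follows. This is a sketch with hypothetical names, not the exact code from the databricks_workspace module:

resource "databricks_catalog" "workspace_catalog" {
  name           = var.uc_catalog_name # e.g. "<prefix>-catalog-<workspace id>"
  isolation_mode = "ISOLATED"          # only visible in workspaces it is explicitly bound to
  comment        = "Workspace-isolated catalog"
}

resource "databricks_workspace_binding" "workspace_catalog_binding" {
  securable_name = databricks_catalog.workspace_catalog.name
  workspace_id   = var.workspace_id
}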

Reusability for Writing Once and Using Everywhere:

The second principle focuses on reducing the amount of work needed to spin up new resources. Let’s illustrate this by creating a new Databricks environment as an example.

Happy Path — Creating a New Databricks Workspace:

  1. A team requests the creation of a Databricks workspace.
  2. The DevOps team creates a new Terraform repository and clones relevant content from previous environments.
  3. The DevOps team updates the .tfvars file with the appropriate variables for the new environment (see the sketch after this list) and runs terraform init to initialize the state file.
  4. Deploy!
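
As referenced in step 3, the per-environment .tfvars file is where the new workspace gets its identity. The values below are illustrative placeholders; the full variable set lives in the example repository:

# development.tfvars -- placeholder values only
resource_prefix       = "acme-dev"
region                = "us-east-1"
aws_account_id        = "111111111111"
databricks_account_id = "00000000-0000-0000-0000-000000000000"
vpc_cidr_range        = "10.0.0.0/18"
availability_zones    = ["us-east-1a", "us-east-1b"]
public_subnets_cidr   = ["10.0.0.0/22", "10.0.4.0/22"]
private_subnets_cidr  = ["10.0.32.0/22", "10.0.36.0/22"]
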
Example of a new_environment.tf calling its various modules.

In each module, there can be sub-modules that are called for specific purposes. For example, our first set of module calls, found in development, QA, or production, invokes three different modules. We call our cloud_provider module, which, as mentioned above, handles the creation of networking, credential, and storage resources for the workspace. Then, using outputs from the cloud provider, we call databricks_account, which creates the workspace, assigns the metastore, and adds any users. Finally, we interact with databricks_workspace, where we create cluster policies, workspace configurations, and a workspace catalog.

module "cloud_provider" {
source = "./cloud_provider"
providers = {
aws = aws
databricks = databricks.mws
}
aws_account_id = var.aws_account_id
databricks_account_id = var.databricks_account_id
resource_prefix = var.resource_prefix
region = var.region
vpc_cidr_range = var.vpc_cidr_range
availability_zones = var.availability_zones
public_subnets_cidr = var.public_subnets_cidr
private_subnets_cidr = var.private_subnets_cidr
sg_ingress_protocol = var.sg_ingress_protocol
sg_egress_ports = var.sg_egress_ports
sg_egress_protocol = var.sg_egress_protocol
dbfsname = var.dbfsname

}

module "databricks_account" {
source = "./databricks_account"
providers = {
databricks = databricks.mws
}

databricks_account_id = var.databricks_account_id
region = var.region
resource_prefix = var.resource_prefix
cross_account_role_arn = module.cloud_provider.cloud_provider_credential
bucket_name = module.cloud_provider.cloud_provider_storage
vpc_id = module.cloud_provider.cloud_provider_network_vpc
subnet_ids = module.cloud_provider.cloud_provider_network_subnets
security_group_ids = module.cloud_provider.cloud_provider_network_security_groups
metastore_id = var.metastore_id
user_name = var.user_name

}

module "databricks_workspace" {
source = "./databricks_workspace"
providers = {
aws = aws
databricks = databricks.workspace
}

aws_account_id = var.aws_account_id
databricks_account_id = var.databricks_account_id
resource_prefix = var.resource_prefix
workspace_id = module.databricks_account.workspace_id
uc_catalog_name = "${var.resource_prefix}-catalog-${module.databricks_account.workspace_id}"
workspace_catalog_admin = var.user_name
team = var.team

}

Going deeper into one of these modules, cloud_provider, you can see that it is further broken down into sub-modules. We break it out by network, credential, and storage; in other words, the workspace’s VPC, subnets, and security groups, the cross-account role for cluster spin-up, and the root S3 bucket for the workspace.

module "cloud_provider_network" {
source = "../../common_modules_cloud_provider/cloud_provider_network"

vpc_cidr_range = var.vpc_cidr_range
availability_zones = var.availability_zones
resource_prefix = var.resource_prefix
public_subnets_cidr = var.public_subnets_cidr
private_subnets_cidr = var.private_subnets_cidr
sg_ingress_protocol = var.sg_ingress_protocol
sg_egress_ports = var.sg_egress_ports
sg_egress_protocol = var.sg_egress_protocol

}

module "cloud_provider_credential" {
source = "../../common_modules_cloud_provider/cloud_provider_credential"

aws_account_id = var.aws_account_id
databricks_account_id = var.databricks_account_id
resource_prefix = var.resource_prefix
region = var.region
vpc_id = module.cloud_provider_network.cloud_provider_network_vpc
security_group_ids = module.cloud_provider_network.cloud_provider_network_security_groups

depends_on = [ module.cloud_provider_network ]
}

module "cloud_provider_storage" {
source = "../../common_modules_cloud_provider/cloud_provider_storage"

dbfsname = var.dbfsname
}
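
For the wiring above to work, the cloud_provider module re-exports its sub-modules’ values as outputs, which is what new_environment.tf references as module.cloud_provider.cloud_provider_credential and friends. A sketch of that outputs.tf is below; the output names on the credential and storage sub-modules are assumptions:

output "cloud_provider_network_vpc" {
  value = module.cloud_provider_network.cloud_provider_network_vpc
}

output "cloud_provider_network_subnets" {
  value = module.cloud_provider_network.cloud_provider_network_subnets
}

output "cloud_provider_network_security_groups" {
  value = module.cloud_provider_network.cloud_provider_network_security_groups
}

output "cloud_provider_credential" {
  value = module.cloud_provider_credential.cross_account_role_arn # assumed output name
}

output "cloud_provider_storage" {
  value = module.cloud_provider_storage.root_bucket_name # assumed output name
}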

Then, if we open up the cloud_provider_network module, we can see the reusable code that is going to be called by each environment. This gives us flexibility to parameterize any of the variables needed for these resources.

module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "5.1.1"

name = "${var.resource_prefix}-data-plane-VPC"
cidr = var.vpc_cidr_range
azs = var.availability_zones

enable_dns_hostnames = true
enable_nat_gateway = true
single_nat_gateway = false
one_nat_gateway_per_az = true
create_igw = true

public_subnet_names = [for az in var.availability_zones : format("%s-public-%s", var.resource_prefix, az)]
public_subnets = var.public_subnets_cidr

private_subnet_names = [for az in var.availability_zones : format("%s-private-%s", var.resource_prefix, az)]
private_subnets = var.private_subnets_cidr
}

// SG
resource "aws_security_group" "sg" {
vpc_id = module.vpc.vpc_id
depends_on = [module.vpc]

dynamic "ingress" {
for_each = var.sg_ingress_protocol
content {
from_port = 0
to_port = 65535
protocol = ingress.value
self = true
}
}

dynamic "egress" {
for_each = var.sg_egress_protocol
content {
from_port = 0
to_port = 65535
protocol = egress.value
self = true
}
}

dynamic "egress" {
for_each = var.sg_egress_ports
content {
from_port = egress.value
to_port = egress.value
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
}
tags = {
Name = "${var.resource_prefix}-data-plane-sg"
}
}

module "vpc_endpoints" {
source = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"
version = "3.11.0"

vpc_id = module.vpc.vpc_id
security_group_ids = [aws_security_group.sg.id]

endpoints = {
s3 = {
service = "s3"
service_type = "Gateway"
route_table_ids = module.vpc.private_route_table_ids
tags = {
Name = "${var.resource_prefix}-s3-vpc-endpoint"
}
},
sts = {
service = "sts"
private_dns_enabled = true
subnet_ids = length(module.vpc.private_subnets) > 0 ? slice(module.vpc.private_subnets, 0, min(2, length(module.vpc.private_subnets))) : []
tags = {
Name = "${var.resource_prefix}-sts-vpc-endpoint"
}
},
kinesis-streams = {
service = "kinesis-streams"
private_dns_enabled = true
subnet_ids = length(module.vpc.private_subnets) > 0 ? slice(module.vpc.private_subnets, 0, min(2, length(module.vpc.private_subnets))) : []
tags = {
Name = "${var.resource_prefix}-kinesis-vpc-endpoint"
}
}
}
depends_on = [
module.vpc
]
}
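
The parameterization comes from a plain variables.tf in the same module. The declarations below are a sketch; the types and descriptions are assumptions rather than copies from the repository:

variable "resource_prefix" {
  type        = string
  description = "Prefix applied to resource names"
}

variable "vpc_cidr_range" {
  type        = string
  description = "CIDR range for the data plane VPC"
}

variable "availability_zones" {
  type        = list(string)
  description = "Availability zones used for the subnets"
}

variable "public_subnets_cidr" {
  type        = list(string)
  description = "CIDR ranges for the public subnets"
}

variable "private_subnets_cidr" {
  type        = list(string)
  description = "CIDR ranges for the private cluster subnets"
}

# The sg_ingress_protocol, sg_egress_protocol, and sg_egress_ports inputs are declared the same way.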

There are a number of ways to architect this. However, we’ve chosen this path to minimize the amount of novel Terraform code we have to write.

Wrapping It All Up:

As we mentioned at the beginning of this blog post, there are numerous ways to structure Terraform repositories for deploying Databricks environments, and this is just one of many examples.

However, regardless of your structure, we recommend adhering to the two principles of isolation and reusability in 2024.

Isolation: Isolate your environments so they cannot collide with one another. This helps avoid rate limits and circular dependencies. Most importantly, it ensures that an update in a development environment does not inadvertently force a refresh of the production state.

Reusability: Reduce the workload on your DevOps team over time by writing common, reusable modules that can be leveraged across all environments. This not only ensures consistency across your Databricks workspaces but also makes it easy to deploy new resources and update existing ones.

As noted above, we’ve published a working example repository for this blog post here. No warranties or guarantees, but GitHub issues are appreciated.

So, what are your thoughts? Do you think we missed anything? Are there any other key components essential for deploying scalable infrastructure on Databricks that we should include?
