Deploy Databricks on AWS with Terraform

Alberto Jaen · Ostinato Rigore
Mar 11, 2022


In this article we will discuss how to deploy a fully functioning and secure Databricks environment on top of AWS using Terraform as our IaC (Infrastructure as Code) tool.

If you want a detailed explanation from an architecture point of view, we have also written two separate articles for that purpose.

We will be using the following platform/tools:

  • AWS
  • Databricks
  • Terraform
  • GitHub
  • IDE of your choice

AWS will be our infrastructure provider, giving us access to compute and storage resources.

Databricks will be our orchestration layer, commanding AWS to create clusters and execute data processing code.

Terraform will be the tool we use to deploy the resources needed on AWS and to register their details with Databricks so it can manage them after the initial deployment.

GitHub will be used as our VCS (Version Control System).

Getting started

First of all, we will go through all the accounts needed in order to be able to deploy our infrastructure:

  • AWS account: you can follow the instructions described here to create your own account. It is also necessary to create an access key and a secret access key, which we will be using in the following chapters. See this documentation for the steps to create them.
  • Databricks account: you can create a 2-week trial Databricks account following the instructions described here. Choose an Enterprise Account since some of the resources we will be deploying need this subscription plan. Check out this article for more information about subscriptions.
  • GitHub: you can create a free account here.

Once we have all the necessary accounts we can proceed to get into more interesting matters 😜

Quick introduction to Terraform

Terraform is an open-source, cloud-agnostic infrastructure as code tool. The biggest advantage it provides is that it can manage infrastructure from different cloud providers (GCP, AWS, Azure…), even in the same project. By describing the resources as code we can parametrize them and deploy them to any account in minutes.
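To give a feel for what this looks like, below is a minimal, hypothetical snippet (the resource and variable names are illustrative and not part of this project's code) that parametrizes an S3 bucket so the same configuration can be deployed to any account or environment:

# Hypothetical example: a parametrized S3 bucket, deployable to any account
variable "prefix" {
  type        = string
  description = "Prefix used to name all resources, e.g. 'dev' or 'prod'"
}

provider "aws" {
  region = "eu-west-1"
}

resource "aws_s3_bucket" "example" {
  bucket = "${var.prefix}-example-bucket"

  tags = {
    ManagedBy = "Terraform"
  }
}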

The alternative would be to create all the resources through the graphical interfaces that AWS and Databricks provide, a process that is error prone and takes a lot more time.

Get to know Terraform more in depth by visiting their web page.

Download Terraform code

The next step is to download the Terraform code from this repository. If you have just created your GitHub account you will need to set up SSH keys (explained here) to authenticate correctly. To download the repository, open a terminal and execute the following command:

 git clone git@github.com:ajaen4/terraform-databricks-aws.git

Deep dive on code

We have divided the code into the following modules: aws-databricks-roles, aws-kms, aws-s3-bucket, aws-vpc, databricks-cluster, databricks-management, databricks-provisioning and vpc-endpoints.
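As a rough orientation (the module paths, inputs and outputs below are illustrative rather than the repository's exact code), the root configuration wires these modules together along these lines:

# Illustrative wiring of the modules listed above; names are not exact
module "vpc" {
  source = "./modules/aws-vpc"
  prefix = var.prefix
}

module "databricks_roles" {
  source = "./modules/aws-databricks-roles"
  prefix = var.prefix
}

module "databricks_provisioning" {
  source                 = "./modules/databricks-provisioning"
  cross_account_role_arn = module.databricks_roles.cross_account_role_arn
  vpc_id                 = module.vpc.vpc_id
  subnet_ids             = module.vpc.private_subnet_ids
}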

aws-databricks-roles

This module contains the implementation of all the roles needed for Databricks to operate on AWS. The most important ones are the following:

  • cross account role: this is the main role assumed by Databricks and used to deploy the clusters needed to run the data operations executed by the users (a sketch is shown after this list).
  • meta instance profile: container role used to assume the data-read role or the data-write role. In a nutshell, it wraps the roles that access the data located in our Data Lake, which is formed by S3 buckets.
  • glue role: role used to access the AWS Glue Metastore. See more details here.
  • roles to access log buckets: roles that allow the different types of logs to be written to our log buckets.
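As an illustration of the cross account role, here is a minimal sketch using the helper data sources the Databricks provider offers for this purpose; the variable and resource names are assumptions and not necessarily those used in the repository:

# Sketch of a cross-account role; variable names are illustrative
data "databricks_aws_assume_role_policy" "this" {
  # trust policy that lets Databricks assume the role, scoped by external ID
  external_id = var.databricks_account_id
}

data "databricks_aws_crossaccount_policy" "this" {}

resource "aws_iam_role" "cross_account" {
  name               = "${var.prefix}-databricks-cross-account"
  assume_role_policy = data.databricks_aws_assume_role_policy.this.json
}

resource "aws_iam_role_policy" "cross_account" {
  name   = "${var.prefix}-databricks-cross-account-policy"
  role   = aws_iam_role.cross_account.id
  policy = data.databricks_aws_crossaccount_policy.this.json
}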

aws-kms

Module used to create the encryption keys that encrypt data written to the S3 buckets in order to implement a secure environment.
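A minimal sketch of what such a key could look like (resource names are illustrative, not the module's exact code):

# Illustrative: a KMS key with rotation enabled plus a friendly alias
resource "aws_kms_key" "data" {
  description         = "Key used to encrypt data written to the S3 buckets"
  enable_key_rotation = true
}

resource "aws_kms_alias" "data" {
  name          = "alias/${var.prefix}-data"
  target_key_id = aws_kms_key.data.key_id
}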

aws-s3-bucket

Module used to create the buckets needed to store our raw, prepared and trusted data and also the logs generated by the instances, VPCs, etc…

The raw data zone is the data in its native format, without any transformation.

The prepared data zone is produced by applying transformations to the raw data to obtain a common format.

The trusted data zone serves as a universal source of truth to the whole organization and is used widely for reporting.
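As a hedged sketch of what one of these buckets might look like with default KMS encryption (resource names are illustrative and assume the key created by the aws-kms module):

# Illustrative: the raw data bucket encrypted with the KMS key from aws-kms
resource "aws_s3_bucket" "raw" {
  bucket = "${var.prefix}-raw-data"
}

resource "aws_s3_bucket_server_side_encryption_configuration" "raw" {
  bucket = aws_s3_bucket.raw.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.data.arn
    }
  }
}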

aws-vpc

Module used to deploy a fully functioning VPC to host the clusters that are going to be run by Databricks.

Here we define how many public and private subnets we want, NAT Gateways, Security Groups (firewalls), etc…
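To make this concrete, here is a trimmed-down sketch of the kind of networking pieces involved (CIDRs, counts and names are illustrative, not the module's exact values):

# Illustrative: VPC, private subnets and a security group for the clusters
data "aws_availability_zones" "available" {}

resource "aws_vpc" "this" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.this.id
  cidr_block        = cidrsubnet(aws_vpc.this.cidr_block, 4, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
}

resource "aws_security_group" "databricks" {
  vpc_id = aws_vpc.this.id

  # Databricks requires traffic to flow freely between nodes in the same group
  ingress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    self      = true
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}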

databricks-cluster

Module to describe the type of cluster we want to create. An example of a configuration that defines a cluster whose instances have at least 15 GB of RAM and 2 cores, with a minimum of 1 worker and a maximum of 2, is the following:

cluster_config = {
  cluster_name            = "High-Concurrency-Terraform"
  autotermination_minutes = 30

  node_config = {
    local_disk = false
    min_gb     = 15
    min_cores  = 2
  }

  autoscale = {
    min_workers = 1
    max_workers = 2
  }

  aws_attributes = {
    availability           = "SPOT_WITH_FALLBACK"
    first_on_demand        = 1
    spot_bid_price_percent = 60
    ebs_volume_count       = 1
    ebs_volume_size        = 32
  }
}
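A hedged sketch of how a configuration like this can map onto the Databricks provider: the data sources pick a node type and Spark version satisfying the constraints, and the resource consumes the rest of the map (the exact variable wiring in the repository may differ):

# Sketch: translating the cluster_config map into a databricks_cluster resource
data "databricks_node_type" "this" {
  min_memory_gb = var.cluster_config.node_config.min_gb
  min_cores     = var.cluster_config.node_config.min_cores
  local_disk    = var.cluster_config.node_config.local_disk
}

data "databricks_spark_version" "latest" {}

resource "databricks_cluster" "this" {
  cluster_name            = var.cluster_config.cluster_name
  spark_version           = data.databricks_spark_version.latest.id
  node_type_id            = data.databricks_node_type.this.id
  autotermination_minutes = var.cluster_config.autotermination_minutes

  autoscale {
    min_workers = var.cluster_config.autoscale.min_workers
    max_workers = var.cluster_config.autoscale.max_workers
  }

  aws_attributes {
    availability           = var.cluster_config.aws_attributes.availability
    first_on_demand        = var.cluster_config.aws_attributes.first_on_demand
    spot_bid_price_percent = var.cluster_config.aws_attributes.spot_bid_price_percent
    ebs_volume_count       = var.cluster_config.aws_attributes.ebs_volume_count
    ebs_volume_size        = var.cluster_config.aws_attributes.ebs_volume_size
  }
}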

databricks-provisioning

Module to register the infrastructure created on AWS in Databricks: authentication tokens, credentials (cross-account role), networks (VPC configuration), storage (root bucket used for internal uses) and workspaces to be created.
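For orientation, these registrations correspond to the account-level (E2) resources of the Databricks provider. The following is a hedged sketch rather than the module's exact code; the variable names are assumptions:

# Sketch: account-level resources created with the "mws" provider alias
# (the alias is explained in the providers section below)
resource "databricks_mws_credentials" "this" {
  provider         = databricks.mws
  account_id       = var.databricks_account_id
  credentials_name = "${var.prefix}-credentials"
  role_arn         = var.cross_account_role_arn
}

resource "databricks_mws_storage_configurations" "this" {
  provider                   = databricks.mws
  account_id                 = var.databricks_account_id
  storage_configuration_name = "${var.prefix}-storage"
  bucket_name                = var.root_bucket_name
}

resource "databricks_mws_networks" "this" {
  provider           = databricks.mws
  account_id         = var.databricks_account_id
  network_name       = "${var.prefix}-network"
  vpc_id             = var.vpc_id
  subnet_ids         = var.private_subnet_ids
  security_group_ids = var.security_group_ids
}

resource "databricks_mws_workspaces" "this" {
  provider                 = databricks.mws
  account_id               = var.databricks_account_id
  workspace_name           = var.prefix
  aws_region               = var.region
  credentials_id           = databricks_mws_credentials.this.credentials_id
  storage_configuration_id = databricks_mws_storage_configurations.this.storage_configuration_id
  network_id               = databricks_mws_networks.this.network_id
}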

databricks-management

Module to create the group of users and to register the permissions of each group. Also used to define the logs we want to generate and the encryption keys to be used for each use case.
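A minimal sketch of what the group and user management could look like (the group and user names are made up; the provider aliases are described in the next section):

# Illustrative: a group, a user and its membership, created via the service principal
resource "databricks_group" "data_engineers" {
  provider     = databricks.service_ppal
  display_name = "data-engineers"
}

resource "databricks_user" "example" {
  provider  = databricks.service_ppal
  user_name = "first.last@example.com"
}

resource "databricks_group_member" "example" {
  provider  = databricks.service_ppal
  group_id  = databricks_group.data_engineers.id
  member_id = databricks_user.example.id
}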

Databricks Terraform providers

Regarding the ‘providers.tf’ file, it is important to clarify why we need the same Databricks provider several times but with different authentication parameters. The aliases are the following:

  • mws: used to authenticate with Databricks and register the AWS resources when no workspace has been created yet. That is why the “host” parameter does not point to a specific workspace.
  • created_workspace: used to create all the resources needed inside a Databricks workspace. As you can see, the host now points to a specific workspace created with the previous provider.
  • pat_token: used to create resources that a Service Principal can’t create due to Databricks restrictions, the most important one being clusters. We will use a Personal Access Token (PAT), a token created for a specific user with admin privileges, to create these types of resources.
  • service_ppal: used to create all the resources that a non-nominal administration account should create, for example user groups, user permissions, etc…

In the file ‘providers.tf’ you will find the following code:

// initialize provider in "MWS" mode to provision new workspace
provider "databricks" {
  alias    = "mws"
  host     = "https://accounts.cloud.databricks.com"
  username = local.databricks_username
  password = local.databricks_pss
}

provider "databricks" {
  alias    = "created_workspace"
  host     = module.databricks_provisioning.databricks_host
  username = local.databricks_username
  password = local.databricks_pss
}

// provider with user's token auth, necessary because the service ppal
// can't create clusters
provider "databricks" {
  alias = "pat_token"
  host  = module.databricks_provisioning.databricks_host
  token = module.databricks_provisioning.pat_token
}

// provider with service ppal auth
provider "databricks" {
  alias = "service_ppal"
  host  = module.databricks_provisioning.databricks_host
  token = module.databricks_provisioning.service_ppal_token
}

We will use these providers in the modules databricks-management and databricks-provisioning depending on the principal needed to create these resources.
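As an illustration of how that selection happens (the module and alias names here are a sketch, not necessarily the repository's exact code), a child module declares the aliases it expects and the root configuration passes them in explicitly:

# Inside the child module: declare the provider aliases it expects
terraform {
  required_providers {
    databricks = {
      source                = "databricks/databricks"
      configuration_aliases = [databricks.pat_token, databricks.service_ppal]
    }
  }
}

# In the root configuration: hand the aliased providers to the module
module "databricks_management" {
  source = "./modules/databricks-management"

  providers = {
    databricks.pat_token    = databricks.pat_token
    databricks.service_ppal = databricks.service_ppal
  }
}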

Deployment

Install CLI components

Follow the instructions here to install Terraform.

Follow the instructions here to install the AWS CLI.

Follow the instructions here to install jq.

Prepare remote state infrastructure

In order to store Terraform’s state remotely rather than locally on our computer, we will use the ‘bootstraper-terraform’ module. This module deploys an S3 bucket to hold the state in the cloud and a DynamoDB table to register the state locks.

This is a key piece of functionality for working with a distributed team. You can find more information in the following documentation.
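For context, the root configuration later consumes this bucket and lock table through an S3 backend block along these lines (a hedged sketch; the key, region and table name are illustrative):

# Sketch of the S3 backend that uses the bootstrapped bucket and lock table
terraform {
  backend "s3" {
    # the bucket name is injected at init time with -backend-config (see below)
    key            = "databricks/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}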

To execute this code run this in your terminal from the root folder of the GitHub project:

export AWS_ACCESS_KEY_ID=XXXX
export AWS_SECRET_ACCESS_KEY=XXXX
export AWS_DEFAULT_REGION=XXXX

cd bootstraper-terraform
terraform init
terraform apply -var-file=vars/bootstraper.tfvars

You must fill in the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY of the AWS account we created in the previous chapters. Choose the AWS_DEFAULT_REGION where you want to deploy this environment.

It is important that you choose the variables declared in the ‘bootstraper-terraform/vars/bootstraper.tfvars’ file wisely because the bucket name is formed from them.

Give these variables personalized values since bucket names in AWS must be globally unique; in other words, you can’t use a name that has already been taken by another bucket anywhere in AWS.

After deploying the infrastructure an output will be printed on screen, for example:

state_bucket_name = "eu-west-1-bluetab-cm-vpc-tfstate"

Copy this value; we will use it in the next chapter.

Environment deployment

Before we deploy any of the infrastructure that hosts our Databricks environment, we must first create two parameters in AWS Systems Manager Parameter Store:

  • Account and username array: parameter holding the account and username of the Databricks account, in that order. The variable that contains the path to this parameter is ‘databricks_acc_username_param_path’ in the ‘vars/databricks.tfvars’ file.
  • Password: SecureString parameter holding the Databricks account’s password. The variable that contains the path to this parameter is ‘databricks_pss_param_path’ in the same file as the previous parameter.

This is necessary to avoid embedding credentials in our variables file. If you don’t know how to create these parameters, see this link to do so through the CLI or this video to do it through the AWS console.
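For reference, this is roughly how Terraform can read those parameters at plan time; the data sources are standard, but the local names and the split logic are assumptions on our part:

# Sketch: reading the Databricks credentials from SSM Parameter Store
data "aws_ssm_parameter" "databricks_acc_username" {
  name = var.databricks_acc_username_param_path
}

data "aws_ssm_parameter" "databricks_pss" {
  name            = var.databricks_pss_param_path
  with_decryption = true
}

locals {
  # the first parameter holds "account,username" in that order
  databricks_account_id = split(",", data.aws_ssm_parameter.databricks_acc_username.value)[0]
  databricks_username   = split(",", data.aws_ssm_parameter.databricks_acc_username.value)[1]
  databricks_pss        = data.aws_ssm_parameter.databricks_pss.value
}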

To deploy, the following commands must be run from the root folder:

export AWS_ACCESS_KEY_ID=XXXX
export AWS_SECRET_ACCESS_KEY=XXXX
export AWS_DEFAULT_REGION=XXXX
export BACKEND_S3=<VALUE_COPIED_PREVIOUS_CHAPTER>
terraform init -backend-config="bucket=${BACKEND_S3}"
terraform apply -var-file=vars/databricks.tfvars

We will use the value copied in the previous chapter, the state bucket name, to give the variable BACKEND_S3 its value. As before, you must fill in AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION. Once the backend has been initialized, the backend-config flag doesn’t have to be passed again in subsequent deployments.

It is important that you choose the prefix variable in ‘vars/databricks.tfvars’ wisely because it will be used to name all the infrastructure, including the buckets. As mentioned before, these buckets must have globally unique names in AWS, so if you get an error while deploying it is most likely because these names are already in use.

Scripts

The following scripts are day-to-day helpers that let us refresh our tokens and reduce our networking costs (by deleting the NAT Gateway and VPC Endpoints) when the environment is not being used. These scripts must be run from the root folder.

  • reset_tokens: when the tokens used to authenticate expire, the script ‘reset_tokens.sh’ must be run to reset them. This script deploys only the ‘databricks_provisioning’ module in order to obtain new valid tokens. This is necessary because Terraform uses these tokens to authenticate when deploying/updating the resources in the ‘databricks_management’ module: even though it knows it must update the tokens, it would first have to refresh the state of that module with the expired ones. Running this script skips that refresh of ‘databricks_management’ and lets us update the tokens.
  • sleep_network_infrastructure: an optional script to avoid incurring infrastructure costs when the deployment is not being used. It destroys the NAT Gateway and the VPC Endpoints. To be able to use the deployment again just run:
terraform apply -var-file=vars/databricks.tfvars

Troubleshooting

  • It is important to personalize the variables well; there can be name conflicts because AWS requires the names of some resources to be globally unique. If you get an error when deploying, check that it is not due to a name conflict of the type “bucket already exists”.
  • When destroying the infrastructure, the file resource we use to hold an init script for the cluster can fail to be deleted. Run the following command to remove it from state management and try to destroy the infrastructure again:
terraform state rm module.databricks_management.databricks_dbfs_file.core_site_config

Happy coding! 😄
