Databricks Asset Bundles (DABs): Deploying with GitHub Actions for CI/CD

Automated Deployment of DABs using GitHub Actions and CI/CD

Yasser Mushtaq
Towards Data Engineering
Jun 4, 2024



What are Databricks Asset Bundles?

In short, Databricks Asset Bundles (DABs) are an Infrastructure-as-Code (IaC) approach to deploying Databricks resources. These could be Databricks Jobs, Delta Live Tables (DLT) pipelines, and so on. They are a recent development by Databricks, aimed at simplifying and consolidating resource development and deployment.

Although not the aim of this discussion, an IaC approach has numerous benefits, such as reproducibility and incorporation into Continuous Integration and Delivery (CI/CD) workflows. Ultimately, using DABs will allow you to develop templated projects, and utilise software engineering best practices.

Aim

Having recently worked through setting up a DAB-based project on Azure Databricks, here I go through the process of getting set up, particularly when incorporating GitHub Actions for CI/CD workflows.

Although there is a great deal of documentation available, certain elements were trickier to work out, particularly with regard to the GitHub Actions setup. In addition, there was a lot of back and forth across different documents to gather the required information. Therefore I thought it useful to provide a step-by-step guide!

Initial Steps

The aim of my project itself was to deploy a DLT pipeline for an ELT process. This was going to produce some end-user tables/views in Databricks. I wanted to set this up in a manner that was reproducible and utilised development best practices.

To get started with a basic setup of a DAB, and to develop an understanding of how they work, I found it useful to do a local deployment first. The first step locally is to install the Databricks CLI. This will allow you to test/deploy bundles from the command line. If you haven’t got this installed, you can find instructions for your OS here.
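
For reference, on macOS/Linux the installation can be done roughly as follows. This is a sketch based on the databricks/setup-cli instructions; check the docs for your OS, as the exact commands may differ.

# Option 1: Homebrew
brew tap databricks/tap
brew install databricks

# Option 2: the install script from the databricks/setup-cli repository
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

# Confirm the newer CLI (v0.2xx+) is on your PATH; bundles are not supported by the legacy pip CLI
databricks --version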

Authentication — Local User

Following local installation of the Databricks CLI, you need to authenticate to your Databricks workspace. When working from your local machine, the docs refer to this as OAuth user-to-machine (U2M) authentication. This approach wouldn’t be suited to an automated CI/CD system like GitHub Actions; we will cover that later, under OAuth machine-to-machine (M2M) authentication.

The steps for local authentication are straightforward. The first step is to run the following on the command line:

databricks auth login --host <workspace-url>

Replace <workspace-url> with your Azure Databricks URL. There are some small subsequent steps here to assign an authentication profile, which you can find here.
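
As a quick sanity check (a sketch, assuming the newer CLI), you can list the profiles the CLI now knows about; the login step writes a profile to ~/.databrickscfg along these lines:

databricks auth profiles

# ~/.databrickscfg will contain something roughly like this (profile name and host are placeholders):
# [DEFAULT]
# host      = https://xxxxxx.x.azuredatabricks.net
# auth_type = databricks-cli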

Creating a Bundle

Next, complete the steps outlined in “Step 2” here. The result of following these steps is a templated bundle for a DLT pipeline resource. As previously stated, my project involved deploying a DLT pipeline. You may have other use cases, in which case check the documentation for existing templates; there are some ‘common tasks’ at the bottom of this page, with examples including a template for an MLOps Stack.
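
For illustration, the command behind those steps is databricks bundle init; a rough sketch (template names may vary with your CLI version):

# Initialise a bundle from a built-in template; the CLI prompts for a project name
# and which resources to include (e.g. a DLT pipeline)
databricks bundle init default-python

# Other templates exist, e.g. for an MLOps Stack:
# databricks bundle init mlops-stacks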

Explore the Bundle

In the directory you created the bundle in, you can now explore the files the above process created. A key one is databricks.yml, which includes configuration relating to the target workspace and deployment environments. Also pertinent to DLT is a job configuration YAML, which specifies a schedule, a task to run, and a cluster configuration. For my project, I kept the default settings.

resources:
  jobs:
    ae_sim_job:
      name: ae_sim_job

      schedule:
        # Run every day at 8:37 AM
        quartz_cron_expression: '44 37 8 * * ?'
        timezone_id: Europe/Amsterdam

      # can send an email on pipeline failure, although GitHub will also do this
      # email_notifications:
      #   on_failure:
      #     - user@email.com

      tasks:
        - task_key: refresh_pipeline
          pipeline_task:
            pipeline_id: ${resources.pipelines.ae_sim_pipeline.id}

      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: Standard_D3_v2
            autoscale:
              min_workers: 1
              max_workers: 4

The name of the job is prefixed with my project name ae_sim.

Referenced in the job is the pipeline task to run. This is detailed in the ae_sim_pipeline YAML file, as follows.

resources:
  pipelines:
    ae_sim_pipeline:
      name: ae_sim_pipeline
      target: ae_sim_${bundle.environment}
      allow_duplicate_names: true
      libraries:
        - notebook:
            path: ../src/ae-sim-elt-build.ipynb

      configuration:
        bundle.sourcePath: /Workspace/${workspace.file_path}/src

Here we define a target, which is the target schema (you will see the result of this below; essentially ae_sim_dev or ae_sim_prod), as well as a reference to the notebook we want to run for the DLT pipeline.

These two code chunks above are kept in separate YAML files, and as such provide a modular approach to building our DAB.
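
For context, the generated databricks.yml typically pulls these resource files in via an include section; a minimal sketch (the bundle name and glob path below are illustrative, not taken from my actual file):

bundle:
  name: ae_sim

include:
  - resources/*.yml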

We won’t go into the intricacies of DLT here, but you can read more about it here. In short, it is a declarative framework for data pipelines, where you can include features like quality control checks.

Manual Validate, Deploy and Run

Follow steps 4, 5 and 6 from the same link as above to validate, deploy and run the bundle we have created. This deploys to a dev environment, and the result is a materialised view you can explore! Any changes you now make to the DLT notebook will be reflected in the workspace after you deploy and run the bundle again!
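
For reference, the commands behind those steps look roughly like this (the -t flag selects a target defined in databricks.yml; the resource key matches the job shown above):

# Check the bundle configuration is valid
databricks bundle validate

# Deploy to the default 'dev' target
databricks bundle deploy -t dev

# Run a deployed resource by its key, e.g. the job defined earlier
databricks bundle run -t dev ae_sim_job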

Things to Note

When you look through the bundle files, there are some key things to note and be aware of:

  • In the ‘targets’ for dev and prod, the workspace is defined as a host name in the format (if you use Azure Databricks): host: https://xxxxxx.x.azuredatabricks.net
  • When deploying in the real world using GitHub, you wouldn’t explicitly declare this in the YAML here; it can instead be declared as a GitHub secret in the deployment environment.
  • Also under workspace for the prod environment is root_path, which is where all the files are deployed to. By default this will include your login username; however, we can change this to a service principal ID.
  • Under run_as, which indicates the user the deployment runs as, by default you will see your login username. For a production environment, it is much more robust to use a service principal. This prevents run failures if the original developer leaves your organisation. In addition, you probably don’t want your username/host name visible in a repo, particularly if it is public!

Service Principal

The use of service principals relates to the aforementioned OAuth machine-to-machine (M2M) authentication method. This essentially uses the credentials of an automated entity to authenticate with the desired target environment, and this is how we set up our GitHub Actions for CI/CD.

There are two ways you can set up a service principal that will access your workspace and perform the required actions: via your cloud provider, for instance using Microsoft Entra ID, or alternatively by creating a service principal within your Databricks workspace itself. I opted for the latter.

The steps required to set up via Microsoft Entra ID start at step 1 here, whereas I started from step 2. The steps outlined there essentially require you to go to the manage account settings (top right of the Databricks workspace), then User management > Service principals > Add service principal. Select the type of service principal (Azure or Databricks managed), type the display name and then add it.

Note: you need to be an Azure Databricks account admin or global admin to do this!

User management under manage account.

Finish steps 4 and 5 from the docs to set permissions and create a secret for the service principal. The principal will now have a unique ID and secret. Together with your workspace hostname, we now have the details we need to set up our deployment environment in GitHub Actions.
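
These details need to be stored as GitHub Actions secrets (the names below are the ones referenced in the workflow files later on). You can add them in the repo under Settings > Secrets and variables > Actions, or, as a sketch, with the GitHub CLI:

gh secret set DB_HOST --body "https://xxxxxx.x.azuredatabricks.net"
gh secret set SP_ID --body "<service-principal-application-id>"
gh secret set SP_SECRET --body "<service-principal-oauth-secret>"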

Prepare Bundle for Upload to GitHub Actions

We now have the elements we need to deploy an automated solution. However, there are some changes we can make to the databricks.yml file to reflect that we won’t be running the bundle from a local environment and will be using a service principal, not our own user ID.

These adjustments are under the targets section of the YAML file. Read the comments in the code below for detail.

targets:
  # The 'dev' target, for development purposes. This target is the default.
  dev:
    # We use 'mode: development' to indicate this is a personal development copy:
    # - Deployed resources get prefixed with '[dev my_user_name]'
    mode: development
    default: true
    # Below we would normally define a workspace host, but here we declare it in the GitHub Actions env,
    # so the workspace block can be commented out/removed.
    # workspace:
    #   host: https://xxxxxx.x.azuredatabricks.net

  ## Optionally, there could be a 'staging' target here.
  ## (See Databricks docs on CI/CD at https://docs.databricks.com/dev-tools/bundles/ci-cd.html.)
  #
  # staging:
  #   workspace:
  #     host: https://xxxxxx.x.azuredatabricks.net

  # The 'prod' target, used for production deployment.
  prod:
    # We use 'mode: production' to indicate this is a production deployment.
    # Doing so enables strict verification of the settings below.
    mode: production
    workspace:
      # Again, we don't need to specify the host here; we can do that in GitHub.
      # host: https://xxxxxx.x.azuredatabricks.net
      # The template default always uses /Users/user@email.com for all resources to make sure we only have a single copy.
      # If this path results in an error, please make sure you have a recent version of the CLI installed.
      # I amended the path below to use the SP ID, not the user email (fine for my purpose, but you can leave this as user@email.com).
      # Note this uses a variable, sp_name, declared in the actual file.
      root_path: /Users/${var.sp_name}/.bundle/${bundle.name}/${bundle.target}
    run_as:
      # By default this runs with the user email in production. We could also use an SP here;
      # see https://docs.databricks.com/dev-tools/bundles/permissions.html.
      # So here we use an SP, not a user_name.
      # user_name: user@email.com
      service_principal_name: "dad6aa45-6871-xxxx-xxxx-xxxxxxxxxxxx"
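
For completeness, the sp_name variable referenced in root_path can be declared at the top level of databricks.yml; a minimal sketch (the description and default below are illustrative):

variables:
  sp_name:
    description: Application ID of the service principal used for prod deployments
    default: "dad6aa45-6871-xxxx-xxxx-xxxxxxxxxxxx"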

In addition, we need to create our .github/workflows directory for our CI/CD setup. These are the files GitHub Actions uses for CI/CD workflows.

GitHub Actions allows you to automate software deployment workflows, including CI/CD.
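
To give a sense of where everything sits, here is a rough project layout (the workflow file names and the resources folder are illustrative; yours may differ):

.github/
  workflows/
    dev-deploy.yml     # dev deployment workflow (shown below)
    prod-deploy.yml    # production deployment workflow (shown below)
databricks.yml
resources/
  ae_sim_job.yml
  ae_sim_pipeline.yml
src/
  ae-sim-elt-build.ipynb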

Workflow for CI/CD with GitHub Actions

The base GitHub workflow files were taken from this link, which gives you a basic setup for CI/CD with GitHub Actions. However, we make some changes to the default configuration to utilise our service principal ID and secret. We also use GitHub secrets to store the details we need.

Here is another reference I found useful for determining which environment variables to use in GitHub Actions for authentication, particularly under the heading ‘Environment variables and fields for client unified authentication’.

These variables are used when the client machine is attempting to authenticate with the workspace, so they are important to get right!
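
If you want to sanity-check the service principal credentials locally before wiring them into GitHub, one option (a sketch, assuming the newer CLI) is to export the same variables and make any authenticated call:

export DATABRICKS_HOST="https://xxxxxx.x.azuredatabricks.net"
export DATABRICKS_CLIENT_ID="<service-principal-application-id>"
export DATABRICKS_CLIENT_SECRET="<service-principal-oauth-secret>"

# Any authenticated command will do; this one simply returns the identity the CLI resolved
databricks current-user me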

To begin with, here is our GitHub Actions YAML file for our dev environment. This validates, deploys, and runs the specified job in our bundle within our dev workspace environment.

# This workflow validates, deploys, and runs the specified bundle
# within a pre-production target named "dev".
name: "Dev deployment"

# Ensure that only a single job or workflow using the same concurrency group
# runs at a time.
concurrency: 1

# Trigger this workflow whenever a pull request is opened against the repo's
# main branch or an existing pull request's head branch is updated.
on:
  pull_request:
    types:
      - opened
      - synchronize
    branches:
      - main

jobs:
  # Used by the "pipeline_update" job to deploy the bundle.
  # Bundle validation is automatically performed as part of this deployment.
  # If validation fails, this workflow fails.
  deploy:
    name: "Deploy bundle"
    runs-on: ubuntu-latest

    steps:
      # Check out this repo, so that this workflow can access it.
      - uses: actions/checkout@v3

      # Download the Databricks CLI.
      # See https://github.com/databricks/setup-cli
      - uses: databricks/setup-cli@main

      # Deploy the bundle to the "dev" target as defined
      # in the bundle's settings file.
      - run: databricks bundle deploy
        working-directory: .
        env:
          # As described here: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/auth/oauth-m2m
          # Below we use parameters for auth based on the account ID and a token.
          # You can also use a service principal, which is better than user auth for prod.
          DATABRICKS_BUNDLE_ENV: dev
          DATABRICKS_TOKEN: ${{ secrets.SP_TOKEN }}
          DATABRICKS_ACCOUNT_ID: ${{ secrets.DB_USER }}
          DATABRICKS_HOST: ${{ secrets.DB_HOST }}
          # DATABRICKS_CLIENT_ID: ${{ secrets.SP_ID }}
          # DATABRICKS_CLIENT_SECRET: ${{ secrets.SP_SECRET }}

  # Validate, deploy, and then run the bundle.
  pipeline_update:
    name: "Run pipeline update"
    runs-on: ubuntu-latest

    # Run the "deploy" job first.
    needs:
      - deploy

    steps:
      # Check out this repo, so that this workflow can access it.
      - uses: actions/checkout@v3

      # Use the downloaded Databricks CLI.
      - uses: databricks/setup-cli@main

      # Run the Databricks workflow named "ae_sim_pipeline" as defined in the
      # bundle that was just deployed.
      - run: databricks bundle run ae_sim_pipeline --refresh-all
        working-directory: .
        env:
          DATABRICKS_BUNDLE_ENV: dev
          DATABRICKS_TOKEN: ${{ secrets.SP_TOKEN }}
          DATABRICKS_ACCOUNT_ID: ${{ secrets.DB_USER }}
          DATABRICKS_HOST: ${{ secrets.DB_HOST }}
          # DATABRICKS_CLIENT_ID: ${{ secrets.SP_ID }}
          # DATABRICKS_CLIENT_SECRET: ${{ secrets.SP_SECRET }}

Note that here we define the workflow to trigger whenever we open a pull request against the main branch or whenever an existing pull request's head branch is updated. Also, we utilise some environment variables, which are defined in GitHub secrets. For the dev environment, you use the host name, account ID (your login user email) and a token. You could also use a service principal here if you wanted; those are the commented-out variables (we will use these in the production environment).

The token can be generated locally after you have logged in using the Databricks CLI. Run the following command from your command line to generate the token:

databricks tokens create --comment <comment> --lifetime-seconds <lifetime-seconds> -p <profile-name>

Here, the profile name is the authentication profile created when you authenticated to your workspace with the Databricks CLI. Find out more here.
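
The token value (and the account ID/user email) then needs to go into GitHub secrets under the names used in the dev workflow above; for example, with the GitHub CLI (DB_HOST was set earlier):

gh secret set SP_TOKEN --body "<token-value-returned-above>"
gh secret set DB_USER --body "user@email.com"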

Now whenever I open a pull request against the main branch, my development pipeline will begin. Remember that ae_sim_pipeline in the run command above is a reference to my project's resource that is being executed.

Below is our GitHub Actions configuration for our production environment.

# This workflow validates, deploys, and runs the specified bundle
# within a production target named "prod".
name: "Production deployment"

# Ensure that only a single job or workflow using the same concurrency group
# runs at a time.
concurrency: 1

# Trigger this workflow whenever changes are pushed to the repo's
# main branch (e.g. when a pull request is merged).
on:
  push:
    branches:
      - main

jobs:
  deploy:
    name: "Deploy bundle"
    runs-on: ubuntu-latest

    steps:
      # Check out this repo, so that this workflow can access it.
      - uses: actions/checkout@v3

      # Download the Databricks CLI.
      # See https://github.com/databricks/setup-cli
      - uses: databricks/setup-cli@main

      # Deploy the bundle to the "prod" target as defined
      # in the bundle's settings file.
      - run: databricks bundle deploy --force-lock
        working-directory: .
        env:
          # DATABRICKS_TOKEN: ${{ secrets.SP_TOKEN }}
          # As described here: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/auth/oauth-m2m
          # Below we use parameters for machine-to-machine auth based on a service principal,
          # which is better than user auth for prod.
          DATABRICKS_BUNDLE_ENV: prod
          DATABRICKS_HOST: ${{ secrets.DB_HOST }}
          DATABRICKS_CLIENT_ID: ${{ secrets.SP_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.SP_SECRET }}

  # Validate, deploy, and then run the bundle.
  pipeline_update:
    name: "Run pipeline update"
    runs-on: ubuntu-latest

    # Run the "deploy" job first.
    needs:
      - deploy

    steps:
      # Check out this repo, so that this workflow can access it.
      - uses: actions/checkout@v3

      # Use the downloaded Databricks CLI.
      - uses: databricks/setup-cli@main

      # Run the Databricks workflow named "ae_sim_pipeline" as defined in the
      # bundle that was just deployed.
      - run: databricks bundle run ae_sim_pipeline --refresh-all
        working-directory: .
        env:
          # DATABRICKS_TOKEN: ${{ secrets.SP_TOKEN }}
          DATABRICKS_BUNDLE_ENV: prod
          DATABRICKS_HOST: ${{ secrets.DB_HOST }}
          DATABRICKS_CLIENT_ID: ${{ secrets.SP_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.SP_SECRET }}

The workflow here triggers whenever changes are pushed to the main branch, for example when a pull request is merged.

Note the main difference here is the use of the service principal ID and secret (DATABRICKS_CLIENT_ID, DATABRICKS_CLIENT_SECRET). This means when the production job is run, it is run under these credentials. This protects against dependencies on specific individuals.

With these files in your .github/workflows directory, you are good to go!

You now need to create a GitHub repository with the bundle files and the above GitHub workflow files included. You can view my own repository for an example if needed. Don’t forget to define your GitHub secrets correctly! Otherwise your run will fail very quickly!

Outcome

Once you hopefully have some nice green ticks in your GitHub Actions runs…

Successful deploy and pipeline run of the DAB!

…you will see the outcomes in your Databricks workspace. Firstly, you’ll notice a folder in the Users directory for the service principal, as well as one for your own personal user email. In addition, you will see a .bundle folder in each of these directories; this is where all the relevant deployment files and folders are kept. For instance, you will see your Python notebooks in the files/src/ subfolder. You will also find both dev and prod deployment folders.

In my use case, where I deployed a DLT pipeline, I also had both development and production pipelines visible in the DLT tab under data engineering.

Here is a basic DLT pipeline that my DAB deployment produced and ran. You will see the prefix dev in the heading, indicating this was a dev deployment by a certain user.

In comparison, you will notice the production version of this pipeline runs as the service principal ID, which will also be visible in the DLT tab.

In a real-world scenario, you would deploy your development and production environments in separate Databricks workspaces. However, in my personal project, given I only had access to one workspace, I deployed both to the same workspace, with the target schemas/tables defined in the bundle files in a manner that identifies the environment type, as shown below.

Both dev and prod schemas within the same DB workspace.

Summary

In summary, here we demonstrated how to develop and deploy a Databricks Asset Bundle (DAB), an IaC feature for easily deploying resources to your Databricks workspace. In addition, we demonstrated the steps required to deploy using GitHub Actions for CI/CD, while utilising best practices such as service principals for production environments.

This was developed while working on a personal project to deploy a DLT pipeline for an ELT pattern in an automated fashion. You can view my project repository here.

For any feedback, comments, or to connect, you can reach me on LinkedIn.

Thanks for reading!
