Integrate Databricks Asset Bundles with CI/CD Pipelines on AWS

Wenxin.L
Databricks Platform SME
6 min read · Jun 6, 2024

Check out the sample repository on GitHub: here

This blog is part of a three-post series, each focusing on a different cloud provider. For the post focused on Azure, see here; the GCP post is coming soon.


Introduction

This blog will demonstrate how to use DABs to build a multi-workspace workflow using AWS native services.

Databricks Asset Bundles (DABs) are a great option for streamlining the development of complex data, analytics, and machine learning projects on the Databricks platform. DABs allow a complete project to be expressed as a collection of source files called a bundle. These files provide an end-to-end definition of the project, including details on testing and deployment.

The primary value of DABs is a structured approach to managing and deploying data engineering and ML projects. Compared to the Databricks Terraform Provider, which is also a great infrastructure-as-code (IaC) solution for managing cloud infrastructure, DABs offer a more accessible workflow for non-infrastructure personas (such as data engineers and data scientists) by integrating the entire project lifecycle (code, tests, and deployment configurations) into a single, cohesive bundle. For example, you can use DABs to deploy the Databricks MLOps Stacks, providing an efficient and streamlined way to handle complex project workflows.
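
To make this concrete, below is a minimal sketch of what a bundle definition can look like, written here as a shell heredoc that creates a databricks.yml. The bundle name, job, notebook path, and workspace URLs are placeholders for illustration, not the sample repository's actual configuration.

```bash
# A minimal, illustrative databricks.yml with one job and three targets.
# Every name, path, and URL below is a placeholder.
cat > databricks.yml <<'EOF'
bundle:
  name: dabs_demo

resources:
  jobs:
    demo_job:
      name: demo_job
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./src/demo_notebook.ipynb
          # compute configuration (job cluster / existing cluster) omitted for brevity

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://<dev-workspace>.cloud.databricks.com
  qa:
    workspace:
      host: https://<qa-workspace>.cloud.databricks.com
  prod:
    mode: production
    workspace:
      host: https://<prod-workspace>.cloud.databricks.com
EOF
```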

The Need for DABs in Remote CI/CD Workflows

Databricks Asset Bundles (DABs), as an important feature and command set of the Databricks CLI, inherit all the authentication options supported by the Databricks CLI, including personal access tokens (PATs), basic username and password authentication, OAuth machine-to-machine (M2M) authentication, and OAuth user-to-machine (U2M) authentication. These methods allow users to authenticate with locally configured credentials or profiles, which is both common and practical. The same mechanisms can also be extended to remote CI/CD pipelines, enabling seamless integration and deployment across environments.
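
In a CI runner, the most common pattern is to supply these credentials through environment variables that the Databricks CLI picks up automatically. A minimal sketch; the host and token values are placeholders, and in this solution the token is pulled from AWS Secrets Manager at build time:

```bash
# Non-interactive authentication inside a CI runner (placeholder values).
export DATABRICKS_HOST="https://<your-workspace>.cloud.databricks.com"
export DATABRICKS_TOKEN="<pat-retrieved-from-secrets-manager>"

# Alternatively, for OAuth M2M with a service principal:
# export DATABRICKS_CLIENT_ID="<service-principal-application-id>"
# export DATABRICKS_CLIENT_SECRET="<oauth-secret>"

# Subsequent CLI and bundle commands use these credentials automatically.
databricks current-user me
```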

There are situations where authentication needs to occur within a remote CI/CD pipeline:

  1. Orchestrating Workloads Across Multiple Workspaces: We need a CI/CD pipeline to orchestrate development, testing, and promotion workloads across multiple workspaces, with a single repository serving as the sole source of truth.
  2. Consistent Identity Usage: We want to run jobs or pipelines using specific identities, such as a user or service principal.
  3. Production and Restricted Environments: In production and restricted Databricks environments, we may prefer not to grant individual users write permissions to most resources. Instead, users need to promote code to make changes in these restricted workspaces or environments.

High-level Architecture

*An AWS CloudFormation template is used to deploy and manage all the required AWS resources, such as CodeCommit, Secrets Manager, CodePipeline, and the CodeBuild build jobs. It does not include any Databricks resources.

Workflow Overview

This blog skips over some Databricks fundamentals. If you are new to Databricks or DABs, we recommend starting with the official documentation.

1. The user can use the DABs 'databricks bundle <options>' commands to deploy the code to a dev workspace, and use this workspace to debug the notebook code and the DABs YAML template code (see the command sketch below).

  • (Optional) Users can also directly associate the dev workspace with the remote repository using the Databricks Git integration to push code or create pull requests.
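
A typical inner loop against the dev workspace might look like the following; the target and job key are assumptions that line up with the databricks.yml sketch above:

```bash
# Validate the bundle configuration before deploying.
databricks bundle validate -t dev

# Upload the code and create/update the defined resources in the dev workspace.
databricks bundle deploy -t dev

# Trigger the deployed job to test it end to end.
databricks bundle run demo_job -t dev
```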

2. After development and debugging, users can push the local DABs folder/code to the AWS CodeCommit remote repository using Git, and then merge the code into the target branch (e.g., “main”) via a Pull Request in AWS CodeCommit.

3. Once the PR is merged into the target branch, it triggers the AWS CodePipeline.

4. In the QA build stage of the AWS CodePipeline, the pipeline retrieves the QA workspace PAT token from AWS Secrets Manager for authentication and deploys the resources defined in the DABs template to the QA workspace. In this sample solution, the QA stage runner also runs nutter tests to perform a sample test on one of the notebook functions.
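
Roughly, the QA build stage boils down to steps like the following. The exact buildspec lives in the sample repository; the host URL, secret key name, and the use of jq here are assumptions for illustration:

```bash
# Authenticate against the QA workspace using the PAT stored in Secrets Manager.
export DATABRICKS_HOST="https://<qa-workspace>.cloud.databricks.com"
export DATABRICKS_TOKEN=$(aws secretsmanager get-secret-value \
  --secret-id DABS_DEMO \
  --query SecretString --output text | jq -r '.qa_token')   # key name assumed

# Deploy the bundle to the QA workspace.
databricks bundle validate -t qa
databricks bundle deploy -t qa

# The nutter-based notebook tests run here; see the sample repository's
# buildspec for the exact test invocation.
```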

5. If the QA stage completes successfully, it moves to the manual review stage. This stage requires manual approval from a reviewer with the appropriate permissions to promote the code to production.

6. Upon approval, the Prod stage runner retrieves the Prod workspace PAT token and deploys the resources defined in the DABs template to the Prod workspace, such as jobs or DLT pipelines.

Deployment Process Overview

  1. Create a CloudFormation stack to deploy the AWS resources required for CI/CD:
  • Use the sample ‘codepipeline-stack.template’ provided in the sample repository to create a stack in CloudFormation in your chosen AWS region. Review and fill in the CloudFormation stack parameters.
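
If you prefer the CLI to the console, stack creation can look roughly like this; the stack name and parameter values are placeholders, and the capability flag is only needed if the template creates IAM resources (which a CodePipeline stack typically does):

```bash
# Create the CI/CD stack from the sample template (placeholder values).
aws cloudformation create-stack \
  --stack-name dabs-cicd-demo \
  --template-body file://codepipeline-stack.template \
  --parameters ParameterKey=<ParameterName>,ParameterValue=<value> \
  --capabilities CAPABILITY_NAMED_IAM
```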

2. Upload Databricks PAT Tokens to AWS Secrets Manager:

  • After the CloudFormation stack is successfully created, you will find the newly created repository on the CodeCommit service page and a ‘DABS_DEMO’ secret in AWS Secrets Manager.
  • Use an AWS user or role with sufficient permissions to store the Databricks PAT tokens for your QA and Prod workspaces in the corresponding secret values.
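
For example, with the AWS CLI (the JSON key names are placeholders; use whichever keys the build stages expect):

```bash
# Store the QA and Prod workspace PATs in the DABS_DEMO secret.
aws secretsmanager put-secret-value \
  --secret-id DABS_DEMO \
  --secret-string '{"qa_token":"<qa-workspace-pat>","prod_token":"<prod-workspace-pat>"}'
```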

3. Connect to the CodeCommit Repository:

  • Connect to the CodeCommit repository created by the CloudFormation stack. You can directly commit and push to the branch that triggers the pipeline, or you can commit to a development branch first and then create a Pull Request to merge your changes into the trigger branch (in this example, we use the “main” branch).
  • Note: Before pushing the code to the remote repository, ensure that the databricks.yml and the notebooks are correctly configured.
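
A minimal sketch of the Git workflow, assuming HTTPS access to CodeCommit (with Git credentials or the AWS credential helper already configured); the region, repository name, and branch names are placeholders:

```bash
# Clone the repository created by the CloudFormation stack.
git clone https://git-codecommit.<region>.amazonaws.com/v1/repos/<repo-name>
cd <repo-name>

# Copy in your bundle (databricks.yml, notebooks, etc.), then push a branch.
git checkout -b feature/dabs-demo
git add .
git commit -m "Add Databricks Asset Bundle"
git push origin feature/dabs-demo

# Open a Pull Request in the CodeCommit console and merge it into main,
# or push directly to main to trigger the pipeline.
```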

4. Trigger the CI/CD Pipeline run:

  • After pushing the code to the trigger branch, the CodePipeline run will be automatically triggered. Click “View details” on each stage to review the runner execution log.

5. Manual Approval:

  • Once the QA stage is completed, click “Approve” in the manual review stage to promote the pipeline to the production stage.

6. Production Deployment:

  • After the final “Production” stage is completed, you can view the runner execution output and logs by clicking “View details.”

7. Review Databricks Resources:

  • You can also review the deployed resources by logging into the QA and Prod workspaces in Databricks.
(Screenshot: outputs in the Prod workspace)
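
As an optional spot check from the command line (the profile names are placeholders for CLI profiles configured against the QA and Prod workspaces):

```bash
# List the jobs deployed by the bundle in each workspace.
databricks jobs list --profile QA
databricks jobs list --profile PROD
```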

Cleanup (optional)

  • Delete the resources created in the Databricks workspace, either through the console or using the Databricks CLI.
  • Delete the CloudFormation stack from your AWS account.
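
A rough cleanup sequence, assuming the target names from the earlier sketch and a placeholder stack name:

```bash
# Remove the bundle-managed resources from each workspace
# (destroy may prompt for confirmation before deleting resources).
databricks bundle destroy -t qa
databricks bundle destroy -t prod

# Delete the AWS CI/CD stack.
aws cloudformation delete-stack --stack-name dabs-cicd-demo
```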

Conclusion

This solution aims to demonstrate how to deploy and integrate Databricks Asset Bundles within an AWS CI/CD pipeline. The Python notebooks and pipelines used in this sample solution are basic examples that can be generated directly with the databricks bundle init command. For more complex ML solutions, please refer to other Databricks ML resources, such as the Databricks MLOps Stacks.

Reference

Databricks Asset Bundles

Databricks Asset Bundles for MLOps Stacks

AWS CloudFormation

AWS Secrets Manager

AWS CodePipeline with AWS CodeBuild to test code and run builds
