CI/CD on Databricks using Azure Devops

Yatin Kumar
6 min read · Jun 14, 2023


Introduction

This blog post explains how to configure and build an end-to-end CI/CD pipeline for Databricks using Azure DevOps, along with best practices for deploying libraries to the workspace. For the security aspects, the CI/CD pipeline authenticates with an Azure service principal.

A typical Azure Databricks pipeline includes the following steps.

Continuous integration

  1. Develop code using Databricks notebooks or an external IDE.
  2. Build libraries.
  3. Release: generate a release artifact.

Continuous deployment

  1. Deploy libraries or notebooks.
  2. Run automated tests.
  3. Programmatically schedule data engineering and analytics workflows.

Suppose you have developed your code using an IDE or notebooks and committed it to an Azure Repos Git repository, and you now want to build a library (a Python wheel or JAR file) from it using DevOps principles.

Consider the following screenshot as your committed code from which you want to build a library, with pipeline.py as the main Python notebook that you want to schedule for running analytics workflows.

Define the build pipeline

In this step you will define a build pipeline that builds the deployment artifacts, copies them to a staging directory, and publishes them as a build artifact in Azure Pipelines, to be consumed later by the release pipeline.

Step 1:

The first task in the build pipeline is to set the Python version to match the one on your remote Databricks cluster.

Step 2:

Use a command-line task to install the required Python modules, e.g., pytest and requests.
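A minimal sketch of such a command-line step is shown below; pytest and requests come from the step description, while wheel and setuptools are assumptions added here because the setup.py bdist_wheel step later in the build needs them.

pip install --upgrade pip
# pytest and requests for tests; wheel and setuptools are assumed for the
# "python3 setup.py sdist bdist_wheel" step later in the build.
pip install pytest requests wheel setuptools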

Step 3:

Use a Bash script task to run the following commands to build the wheel file (the library), then copy the wheel file and the notebooks that need to be deployed into a specific directory.

cd $(Build.Repository.LocalPath)
# Copy the main notebook next to the build output.
cp $(Build.Repository.LocalPath)/pipeline.py $(Build.BinariesDirectory)/
# Build source and wheel distributions into ./dist.
python3 setup.py sdist bdist_wheel
# Collect the built libraries under a predictable folder structure.
mkdir -p $(Build.BinariesDirectory)/libraries/python/libs
cp $(Build.Repository.LocalPath)/dist/*.* $(Build.BinariesDirectory)/libraries/python/libs

Step 4:

Next, use the Copy Files task to copy the files to the artifact staging directory, from where they will be published as a build artifact.
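For reference, a roughly equivalent shell sketch, assuming the Copy Files task is configured with $(Build.BinariesDirectory) as the source folder and $(Build.ArtifactStagingDirectory) as the target folder:

# Mirror the staged build output into the artifact staging directory.
mkdir -p "$(Build.ArtifactStagingDirectory)"
cp -R "$(Build.BinariesDirectory)/." "$(Build.ArtifactStagingDirectory)/"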

Step 5:

Publish the build artifact from the staging directory to Azure Pipelines so it can be consumed later by the release pipeline.
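This is normally done with a publish artifact task. If you ever need to do it from a script instead, Azure DevOps exposes an artifact.upload logging command; the sketch below is an assumption about your setup, with the artifact name chosen to match the folder the release pipeline reads from later in this post.

# Publish the staging directory as a build artifact whose name matches the
# application_$(Build.BuildId)_$(Build.BuildNumber) folder used by the release pipeline.
echo "##vso[artifact.upload containerfolder=application;artifactname=application_$(Build.BuildId)_$(Build.BuildNumber)]$(Build.ArtifactStagingDirectory)"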

So far we have created the build pipeline and published the artifact to Azure Pipelines. You can also publish the artifact to a file share instead.

Define the release pipeline

Note: this example assumes that you have created a service principal in Azure, added it to the Databricks workspace, and granted it the required privileges. The service principal's secret is stored in Azure Key Vault.

Manage service principals — Azure Databricks

Step 1:

Configure the environment variables that the release pipeline references by clicking the Variables button.

Set the following variables (a sketch of how a task consumes them follows the list).

  • DATABRICKS_HOST: the workspace URL of your Azure Databricks workspace.
  • DATABRICKS_TOKEN: an Azure Databricks personal access token or Azure Active Directory token. In this example we use the service principal's AAD token.
  • DATABRICKS_CLUSTER_ID: the ID of the Azure Databricks cluster in your workspace.
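A minimal sketch of how a later Bash task consumes these values, assuming the Databricks CLI has already been installed (Step 8 below); the CLI reads the host and token from the environment, while the cluster ID is passed explicitly:

# DATABRICKS_HOST and DATABRICKS_TOKEN are picked up automatically by the
# Databricks CLI when exported as environment variables.
export DATABRICKS_HOST="$(DATABRICKS_HOST)"
export DATABRICKS_TOKEN="$(DATABRICKS_TOKEN)"

# Sanity check: confirm the CLI can reach the workspace and see the cluster.
databricks clusters get --cluster-id "$(DATABRICKS_CLUSTER_ID)"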

Step 2:

Configure the release agent for the release pipeline.

Step 3:

Add an artifact for the release. This is the build artifact that we published in the build pipeline.

Step 4:

Now add the tasks. In the first task, set the Python version for the release agent. Make sure the Python version is compatible with the build and with the subsequent tasks.

Step 5:

Retrieve the service principal's secret from Azure Key Vault; it will be used in the next step to obtain the AAD token.
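This step typically uses the built-in Azure Key Vault task, which downloads the secrets and exposes them as pipeline variables. If you would rather do it from an Azure CLI task (which handles the login for you), a sketch along these lines works; demo-keyvault is a hypothetical vault name, and demo-test-pipeline-cicd is the secret name used in Step 6.

# Read the service principal secret from Key Vault. The vault name is a
# placeholder; the secret name matches the one referenced in Step 6.
SECRET=$(az keyvault secret show \
  --vault-name demo-keyvault \
  --name demo-test-pipeline-cicd \
  --query value -o tsv)

# Expose it to later tasks as a secret pipeline variable, mirroring what the
# Azure Key Vault task does.
echo "##vso[task.setvariable variable=demo-test-pipeline-cicd;issecret=true]$SECRET"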

Step 6:

Obtain the token with the following Bash script. This token gives you access to the Databricks workspace; we will update the environment variable DATABRICKS_TOKEN with it in a subsequent step.

echo "##vso[task.setvariable variable=TOKEN_VARIABLE;isoutput=true]$(curl -X POST -H 'Content-Type: application/x-www-form-urlencoded' \
https://login.microsoftonline.com/<tenant-id>/oauth2/v2.0/token \
-d 'client_id=<client-id>' \
-d 'grant_type=client_credentials' \
-d 'scope=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d%2F.default' \
-d 'client_secret=$(demo-test-pipeline-cicd)' | jq --raw-output .access_token)"

Change the client secret variable demo-test-pipeline-cicd to match the secret name defined in your Azure Key Vault.

Replace:

  • <tenant-id> with the registered application’s tenant ID.
  • <client-id> with the registered application’s client ID.
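The scope value 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d is the fixed resource ID of the Azure Databricks service and should be left as-is. As an optional sanity check, you can call a simple Databricks REST endpoint with the new token before wiring it into later steps; the sketch below assumes the Bash task above has the reference name Bash1, so its output variable is available as $(Bash1.TOKEN_VARIABLE).

# Verify the AAD token works against the workspace before updating DATABRICKS_TOKEN.
curl -s -H "Authorization: Bearer $(Bash1.TOKEN_VARIABLE)" \
  "$(DATABRICKS_HOST)/api/2.0/clusters/list"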

Step 7:

Use a PowerShell script task to update the environment variable DATABRICKS_TOKEN with the following code. Make sure you have the edit release permission in Azure DevOps and that the agent job is allowed to access the OAuth token, otherwise SYSTEM_ACCESSTOKEN will not be available to the script. The BASH1_TOKEN_VARIABLE environment variable assumes the Bash task from Step 6 has the reference name Bash1.

# Get the current release definition via the Azure DevOps REST API.
$url = "$($env:SYSTEM_TEAMFOUNDATIONSERVERURI)$env:SYSTEM_TEAMPROJECTID/_apis/Release/definitions/$($env:RELEASE_DEFINITIONID)?api-version=5.0-preview.3"
Write-Host "URL: $url"

$pipeline = Invoke-RestMethod -Uri $url -Headers @{
    Authorization = "Bearer $env:SYSTEM_ACCESSTOKEN"
}
Write-Host "Pipeline = $($pipeline | ConvertTo-Json -Depth 100)"

# Update the existing variable named DATABRICKS_TOKEN with the token generated in Step 6.
$pipeline.variables.DATABRICKS_TOKEN.value = $env:BASH1_TOKEN_VARIABLE

# Push the modified definition back to Azure DevOps.
$json = @($pipeline) | ConvertTo-Json -Depth 99
$updatedef = Invoke-RestMethod -Uri $url -Method Put -Body $json -ContentType "application/json" -Headers @{ Authorization = "Bearer $env:SYSTEM_ACCESSTOKEN" }

Write-Host "=========================================================="
Write-Host "The value of variable DATABRICKS_TOKEN is updated to" $updatedef.variables.DATABRICKS_TOKEN.value
Write-Host "=========================================================="

Step 8:

Use a Bash script task to install the Databricks CLI on the agent.
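The fs cp and workspace import commands in the next steps use the legacy Databricks CLI, which ships as a pip package; a minimal sketch of the install task:

# Install the legacy Databricks CLI on the release agent and confirm the version.
python -m pip install --upgrade databricks-cli
databricks --version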

Step 9:

The script below copies the wheel from the artifact directory to DBFS, from where it can be installed on clusters. Change the paths accordingly.

databricks fs cp --overwrite $(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/application_$(Build.BuildId)_$(Build.BuildNumber)/libraries/python/libs/TestProject-1.0-py3-none-any.whl dbfs:/tmp/application/libraries/python/libs/TestProject-1.0-py3-none-any.whl
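With the wheel on DBFS, one way to attach it to the cluster referenced by DATABRICKS_CLUSTER_ID is the CLI's libraries install command; a sketch, assuming the legacy CLI installed in Step 8:

# Attach the uploaded wheel to the target cluster so notebooks can import it.
databricks libraries install \
  --cluster-id "$(DATABRICKS_CLUSTER_ID)" \
  --whl dbfs:/tmp/application/libraries/python/libs/TestProject-1.0-py3-none-any.whl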

Step 10:

Deploy the notebook from the build artifact to the workspace with the following command; this is the notebook that will run the analytics workloads.

databricks workspace import --language=PYTHON --format=SOURCE --overwrite $(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/application_$(Build.BuildId)_$(Build.BuildNumber)/pipeline.py /Shared/democicd/pipeline.py
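To cover the "programmatically schedule workflows" step from the introduction, you could also create a scheduled job for the imported notebook from the same release. The sketch below is an assumption: it uses a hypothetical job name, a daily 02:00 UTC schedule, and the Jobs API 2.0 payload format that the legacy CLI targets by default; adjust it (or use the 2.1 tasks format) to match your setup.

# Hypothetical job definition that runs the deployed notebook on the existing cluster.
cat > job.json <<EOF
{
  "name": "democicd-pipeline",
  "existing_cluster_id": "$(DATABRICKS_CLUSTER_ID)",
  "notebook_task": { "notebook_path": "/Shared/democicd/pipeline.py" },
  "schedule": { "quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC" }
}
EOF

databricks jobs create --json-file job.json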

The final pipeline will look like the following.

Summary:

In this blog post we presented an end-to-end approach for implementing CI/CD pipelines on Azure Databricks for IDE- or notebook-based projects. We used a service principal to run the pipeline and generated the AAD token at runtime to keep things secure.

References:

Manage service principals — Azure Databricks | Microsoft Learn

Introducing Azure DevOps | Azure Blog and Updates
