Databricks CI/CD using Azure DevOps — part II — CD

Szymon Zaczek, PhD
EcoVadis Engineering
9 min read · Sep 22, 2022

The second part of a series about CI/CD for multiple Databricks environments, covering tests, packages, notebooks, and init scripts, using Azure DevOps.

Outline for the Databricks CI/CD process using Azure DevOps

It’s 2022 and you still don’t have your CI/CD pipelines for Databricks code ready? Don’t worry, I’ve got you covered. In the second part of this miniseries, we’ll dive into Continuous Deployment, so you’ll finally be able to reuse your Python libraries, notebooks, and init scripts across different environments in a consistent manner.

Recap: Part I — CI

In the first part of the series, I shared an outline of the whole CI/CD process and its goals. I also described how to set up the required Azure services, how to structure the code in the repo, and most importantly, how to obtain a packaged artifact with tested Databricks-related code that features everything you’ll ever need: init scripts, packages (both dependencies from pip and those built as part of the pipeline), and notebooks.

Sidenote — CI refactoring
Since Part I was published, I have done some refactoring of the code. The functionality of the CI pipeline remains the same, but some improvements were merged to the main branch, such as a single CLI file for all Python scripts, an improved folder structure within the artifact, the use of templates within the pipelines, etc.

Code

All of the code used for this CI/CD process can be found at:

https://github.com/szymonzaczek/databricks-ci-cd

Continuous Deployment (CD)

Specific goals of CD

The CD process described herein is supposed to do the following:

  • Install all of the Python dependencies on the specified Databricks clusters
  • Upload the packages built in the CI step (wheel files)
  • Upload init scripts to the Databricks workspace
  • Upload notebooks to the Databricks workspace
  • Integrate seamlessly with multiple environments/Databricks workspaces

CI artifact

As a result of the CI part of the process, we’re left with an artifact that is ready to be deployed. Here is an example of the contents of the built artifact:

Environments on Azure DevOps

The Environments tab in an Azure DevOps project enables controlling deployments to separate environments. An environment is not a physical resource, though; it can be considered a wrapper around physical resources that facilitates deployments done via Azure DevOps, so it’s up to you how you treat each environment. It also allows you to define access control and the approvals required prior to a deployment. For more info about environments, check this link.

Herein, each environment is treated as a different Databricks workspace with a separate key vault. The different physical resources are specified in a JSON config file in the repo.
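To give a rough idea of the shape of such a config (the exact keys and values live in the repo; the hosts, vault names, and cluster IDs below are made up for this illustration), a per-environment mapping like the following is enough for the CD steps described later:

```python
import json

# Illustrative sketch only: the real config file in the repo may use different keys and values.
EXAMPLE_CONFIG = """
{
  "dev":  {"databricks_host": "https://adb-111.azuredatabricks.net", "key_vault": "kv-dbx-dev",  "clusters": ["1111-dev-cluster"]},
  "test": {"databricks_host": "https://adb-222.azuredatabricks.net", "key_vault": "kv-dbx-test", "clusters": ["2222-test-cluster"]},
  "prod": {"databricks_host": "https://adb-333.azuredatabricks.net", "key_vault": "kv-dbx-prod", "clusters": ["3333-prod-cluster"]}
}
"""

def read_config(environment: str) -> dict:
    """Return the settings of a single environment, e.g. 'dev'."""
    return json.loads(EXAMPLE_CONFIG)[environment]
```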

CD template

The CD pipeline has one essential difference compared to the CI pipeline: its steps are executed as many times as there are environments specified. To handle this properly, an environment parameter of the object type is defined. This syntax on ADO allows using an array as a parameter.

Artifacts are downloaded directly from the CI pipeline once it has successfully finished, so we need to specify the pipeline_id and project parameters. service_connection is required for connecting to Azure resources; it is described further in Part I.

By setting trigger: none, automatic builds after commits are disabled. This is my preferred way for CD pipelines, since CD handles the artifacts, not the raw code itself. Instead, CD is automatically triggered when the CI pipeline has successfully finished. This is achieved by specifying the CI pipeline as an external resource in the YAML pipeline file (lines 16–21). Herein, I use ubuntu-latest as the agent’s operating system.

By iterating over the contents of the environment parameter, we’ll deploy the same artifact onto multiple resources. Pipelines on Azure DevOps offer several levels for organizing logical operations: stages, jobs, and steps. The actual work is done in steps, and this is where the individual tasks reside. Herein, I used a template file to keep the individual steps separate from the rest of the CD setup.

Deployment steps

Most of the pipeline parameters are pretty self-explanatory. environment should not really have a default value here, since it will always be passed from the file with the stages anyway. artifact_databricks is the name of the artifact that you’ve specified in the CI pipeline. Since the CD pipeline runs automatically once the artifact is successfully built, we can use the latest version of the artifact here. pipeline_id is the ID of the CI pipeline that you’ve created in your Azure DevOps project; this ID can be established by going into the Pipelines tab and checking the URL of the CI pipeline, as shown on the screenshot.

Pipeline ID

Now we proceed to the actual tasks that run on the agent’s VM. First, we need to download the artifact that was built and extract the files from it. Then, we install the requirements for all of the scripts used in the CD process. Since we parametrized the relevant paths earlier on, we can simply leverage those parameters here.

Next, the JSON configuration file is read. This was introduced in Part I, but the call is now made via the ci_cd_cli.py file, which is currently the only CLI file in the repo. In Part I the configuration was read only for a single environment (since CI is built on the dev environment), but now a different config is read depending on the environment that we’re currently deploying to. This is achieved by passing ${{ parameters.environment }} into the function call.
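To illustrate the idea (this is not the exact interface of ci_cd_cli.py, which may differ; the subcommand and argument names below are assumptions), such a CLI can be a thin argparse wrapper that takes the environment name and exposes the selected values to later pipeline steps:

```python
import argparse
import json

def read_config_command(config_path: str, environment: str) -> None:
    # Read the environment-specific section of the JSON config and publish it to later pipeline steps.
    with open(config_path) as f:
        env_config = json.load(f)[environment]
    for key, value in env_config.items():
        print(f"##vso[task.setvariable variable={key}]{json.dumps(value)}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="CI/CD helper CLI (sketch)")
    subparsers = parser.add_subparsers(dest="command", required=True)
    read_config_parser = subparsers.add_parser("read-config")
    read_config_parser.add_argument("--config-path", required=True)
    read_config_parser.add_argument("--environment", required=True)  # receives ${{ parameters.environment }}
    args = parser.parse_args()
    if args.command == "read-config":
        read_config_command(args.config_path, args.environment)
```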

Then, we need to read secrets from the Azure key vault. If your key vault lives inside a virtual network, you need to whitelist (and subsequently delist) the Azure DevOps agent’s IP (the VM that is actually executing the code) on this key vault; this is also discussed in Part I. Next, we output the databricks_token to a file, which streamlines the process of accessing the Databricks workspace.
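A minimal sketch of that step using the azure-identity and azure-keyvault-secrets packages could look as follows (the secret name, vault name, and output file name are assumptions; the repo’s pipeline may read the secrets differently, e.g. via an AzureKeyVault task or the Azure CLI):

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

def fetch_secret(vault_name: str, secret_name: str) -> str:
    # On the agent, DefaultAzureCredential can pick up the service connection's credentials.
    client = SecretClient(
        vault_url=f"https://{vault_name}.vault.azure.net",
        credential=DefaultAzureCredential(),
    )
    return client.get_secret(secret_name).value

if __name__ == "__main__":
    # "databricks-token" and "kv-dbx-dev" are hypothetical names used only for this example.
    token = fetch_secret("kv-dbx-dev", "databricks-token")
    # Write the token to a local file so later steps can use it without querying the vault again.
    with open("secrets.txt", "w") as f:
        f.write(token)
```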

This piece of code runs the find_files_job function from find_files.py with different arguments.

Running this code establishes the paths of the wheel, requirements, secrets, config, and init script files. Although hard-coding paths may feel like a tempting solution at first, since it does not require writing any actual code, whenever anything changes in how you build your artifact, those paths would need to be updated accordingly. Therefore, I believe it’s better to leverage some code for this. I provide here a simple script that searches a provided path for files that contain a given pattern in their name. Once the code establishes where the files are located, those paths are exposed as environment variables on the agent itself by calling:

print(f"##vso[task.setvariable variable={variable_name}]{string_output}")

directly from the Python code. The code for those functions is in the gist above and in the /databricks-ci-cd/ci_cd_scripts/find_files.py file. I’ve written this code in such a way that the paths to the wheel files are written to the whl_files environment variable (it can be accessed using the $() syntax, so they should be referenced via $(whl_files)); the file extension is used as the prefix of this variable. When looking for the secrets.txt and requirements.txt files, I’ve provided an additional argument to the function call (secret and requirements, respectively) that is used as the variable prefix instead of the file extension.
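A minimal sketch of such a helper is shown below; the real find_files.py in the repo may differ in naming and details, but the core trick is the ##vso logging command that turns a Python value into a pipeline variable:

```python
import glob
import os
from typing import Optional

def find_files_job(search_path: str, pattern: str, variable_prefix: Optional[str] = None) -> None:
    """Find files whose names contain *pattern* and expose their paths as a pipeline variable."""
    matches = glob.glob(os.path.join(search_path, "**", f"*{pattern}*"), recursive=True)
    # By default the variable prefix is the file extension, e.g. '.whl' -> $(whl_files).
    prefix = variable_prefix or pattern.lstrip(".")
    string_output = " ".join(matches)
    print(f"##vso[task.setvariable variable={prefix}_files]{string_output}")

# Examples: find_files_job("artifact", ".whl")                  -> $(whl_files)
#           find_files_job("artifact", "secrets.txt", "secret") -> $(secret_files)
```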

The next step finally involves communication with the Databricks API. In order to have some reusability of the components, I’ve created a class called DatabricksRequest that contains the actual interactions with the Databricks API. It resides in the /databricks-ci-cd/ci_cd_scripts/databricks_api_class_internal.py file and it contains some basic interactions with the Databricks workspace, such as uploading and deleting files, interacting with clusters, etc. Note that in order to actually install a wheel file as a library on a cluster, you first need to upload it to storage available to Databricks (either storage managed by Databricks or storage mounted on the workspace) and then make another API call to install the given library on a chosen cluster. Also, if a library with the same name is already installed there (which may be the case if you don’t change the version of the package that you’re building), you need to schedule its uninstallation via an API call, schedule the installation of the updated package via another API call, and make yet another call to restart the cluster. Only after a successful restart will the library be available on the cluster. Due to the inherent complexity of such operations, I’ve created another file that aggregates the distinct workflows in the repo (/databricks-ci-cd/ci_cd_scripts/databricks_api_workflows_internal.py). Herein, I will describe the functions from this file one by one.
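To make the later workflow snippets concrete, here is a stripped-down sketch of what such a wrapper around the Databricks REST API 2.0 can look like, built on the requests package. It is not the DatabricksRequest class from the repo, which has more methods and error handling:

```python
import base64
import requests

class DatabricksRequest:
    """Minimal sketch of a Databricks REST API 2.0 client; not the full class from the repo."""

    def __init__(self, host: str, token: str):
        self.base_url = f"{host}/api/2.0"
        self.headers = {"Authorization": f"Bearer {token}"}

    def _post(self, endpoint: str, payload: dict) -> dict:
        response = requests.post(f"{self.base_url}/{endpoint}", headers=self.headers, json=payload)
        response.raise_for_status()
        return response.json()

    def _get(self, endpoint: str, params: dict) -> dict:
        response = requests.get(f"{self.base_url}/{endpoint}", headers=self.headers, params=params)
        response.raise_for_status()
        return response.json()

    def get_cluster_state(self, cluster_id: str) -> str:
        # e.g. PENDING, RUNNING, TERMINATED
        return self._get("clusters/get", {"cluster_id": cluster_id})["state"]

    def start_cluster(self, cluster_id: str) -> dict:
        return self._post("clusters/start", {"cluster_id": cluster_id})

    def restart_cluster(self, cluster_id: str) -> dict:
        return self._post("clusters/restart", {"cluster_id": cluster_id})

    def upload_to_dbfs(self, local_path: str, dbfs_path: str) -> dict:
        # Single-shot upload; files larger than ~1 MB need the streaming create/add-block/close endpoints.
        with open(local_path, "rb") as f:
            contents = base64.b64encode(f.read()).decode()
        return self._post("dbfs/put", {"path": dbfs_path, "contents": contents, "overwrite": True})

    def get_library_statuses(self, cluster_id: str) -> list:
        return self._get("libraries/cluster-status", {"cluster_id": cluster_id}).get("library_statuses", [])

    def install_libraries(self, cluster_id: str, libraries: list) -> dict:
        # 'libraries' entries look like {"whl": "dbfs:/..."} or {"pypi": {"package": "pandas==1.5.0"}}.
        return self._post("libraries/install", {"cluster_id": cluster_id, "libraries": libraries})

    def uninstall_libraries(self, cluster_id: str, libraries: list) -> dict:
        # The uninstallation only takes effect after the cluster is restarted.
        return self._post("libraries/uninstall", {"cluster_id": cluster_id, "libraries": libraries})
```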

The first actual call to the Databricks API involves installing the requirements of the package that you’re building. Thanks to the script that outputs the paths of the config, secret, and requirements files as environment variables, you can safely pass their values to the Python function process_dependencies without exposing any sensitive data. This function iterates over the clusters specified in the config, starts a cluster if it is currently in a TERMINATED state, and installs the libraries specified in the /databricks-ci-cd/package1/requirements/common.txt file via pip.
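Reusing the DatabricksRequest sketch from above, a simplified version of such a function could look like this (the real process_dependencies also waits for the cluster to reach the RUNNING state and handles errors):

```python
def process_dependencies(config: dict, token: str, requirements_path: str) -> None:
    """Sketch: install the pip requirements on every cluster listed in the environment config."""
    with open(requirements_path) as f:
        packages = [line.strip() for line in f if line.strip() and not line.startswith("#")]
    api = DatabricksRequest(config["databricks_host"], token)
    for cluster_id in config["clusters"]:
        if api.get_cluster_state(cluster_id) == "TERMINATED":
            api.start_cluster(cluster_id)  # the real workflow waits until the cluster is RUNNING
        api.install_libraries(cluster_id, [{"pypi": {"package": package}} for package in packages])
```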

This piece of code runs the process_all_packages function that lives inside the databricks_api_workflows_internal.py file.

Once the requirements for the library that was built during CI are installed on the cluster, nothing stops us from installing the actual library. This is done by calling the process_all_packages function with the paths to the config file, the secret file, and the wheel files. Optionally, you can also provide a custom path for dbfs (the Databricks file system), though the default path hardcoded in the code will work as well, since it’s just the directory to which the wheel files are uploaded. process_all_packages does two things: it parses the paths of the wheel files from environment variables and calls the process_single_package function on the individual wheel files. process_single_package works quite similarly to the process_dependencies function, but it processes wheel files instead of libraries installable via pip. Therein, the cluster is queried via the API to check whether a library with the same name is already installed. If there is one, that library is uninstalled and the cluster is restarted. Then, the wheel file is uploaded to dbfs and installed on the cluster.
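Again building on the DatabricksRequest sketch, the two functions could be approximated as follows; the package-name matching, the default dbfs directory, and the waiting logic are simplifying assumptions, not the repo’s exact implementation:

```python
import os

def process_single_package(config: dict, token: str, wheel_path: str,
                           dbfs_dir: str = "dbfs:/FileStore/wheels") -> None:
    """Sketch: upload one wheel to DBFS and (re)install it on every configured cluster."""
    api = DatabricksRequest(config["databricks_host"], token)
    wheel_name = os.path.basename(wheel_path)
    dbfs_path = f"{dbfs_dir}/{wheel_name}"
    api.upload_to_dbfs(wheel_path, dbfs_path)
    package_name = wheel_name.split("-")[0]  # crude match on the distribution name
    for cluster_id in config["clusters"]:
        # If an older build of the same package is already installed, uninstall it and restart first.
        stale = [status["library"] for status in api.get_library_statuses(cluster_id)
                 if package_name in status["library"].get("whl", "")]
        if stale:
            api.uninstall_libraries(cluster_id, stale)
            api.restart_cluster(cluster_id)  # the real workflow waits for the restart to complete
        api.install_libraries(cluster_id, [{"whl": dbfs_path}])

def process_all_packages(config: dict, token: str, whl_files: str) -> None:
    """Sketch: $(whl_files) arrives as a space-separated list produced by find_files.py."""
    for wheel_path in whl_files.split():
        process_single_package(config, token, wheel_path)
```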

After this step, you can finally use your custom library within your Databricks workspace; just use standard Python syntax for importing objects from any library (import x or from x import y).

Another quite useful option for interacting with custom-made libraries on the Databricks workspace is init scripts. These are basically shell scripts run every time a given cluster starts (you can also use global init scripts that run on all of the clusters in your workspace). Using them requires uploading them to dbfs and either setting the init script path for a given cluster in the UI or leveraging the API for this purpose. This is done here with the upload_init_script_workflow function that resides in databricks_api_workflows_internal.py:
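A stripped-down version of that workflow, again reusing the DatabricksRequest sketch, only needs to push the scripts found by find_files.py to DBFS (the target directory below is an assumption, not the repo’s default):

```python
import os

def upload_init_script_workflow(config: dict, token: str, init_script_paths: str,
                                dbfs_dir: str = "dbfs:/init_scripts") -> None:
    """Sketch: upload every init script found in the artifact to DBFS."""
    api = DatabricksRequest(config["databricks_host"], token)
    for local_path in init_script_paths.split():
        api.upload_to_dbfs(local_path, f"{dbfs_dir}/{os.path.basename(local_path)}")
```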

My example shows only uploading the files to dbfs, so I assumed that the init script was set using the UI. Nonetheless, you can leverage the code that I’ve provided herein for interactions with the cluster and make your own call to the Databricks 2.0/clusters/edit endpoint with the init_scripts param as a part of the request. For further documentation, please refer to this link.
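As a rough sketch of that extra call (not part of the repo’s workflows), you could read the current cluster definition and resend it with the init_scripts field set; note that clusters/edit expects the full editable cluster specification and that Databricks restarts a running cluster to apply the change:

```python
def attach_init_script(api: DatabricksRequest, cluster_id: str, dbfs_path: str) -> None:
    """Sketch: point a cluster at a DBFS init script via the 2.0/clusters/edit endpoint."""
    current = api._get("clusters/get", {"cluster_id": cluster_id})
    payload = {
        # Resend only editable fields; clusters/get also returns read-only ones that edit rejects.
        "cluster_id": cluster_id,
        "cluster_name": current["cluster_name"],
        "spark_version": current["spark_version"],
        "node_type_id": current["node_type_id"],
        "num_workers": current.get("num_workers", 0),  # autoscaling clusters use 'autoscale' instead
        "init_scripts": [{"dbfs": {"destination": dbfs_path}}],
    }
    api._post("clusters/edit", payload)
```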

The final step in the pipeline is the upload of notebooks into the Databricks Workspace. This is done via upload_notebooks_workflow from databricks_api_workflows_internal.py:

In this code sample I’ve included code for parsing notebooks with both .py and .sql extensions. Parsing different file types requires some additional work, since when an individual notebook is uploaded to the Databricks workspace, the file extension is not sufficient; the language param must also be passed with the API call (line 196 in the gist databricks_ci_cd_35). As an argument to the upload_notebooks_workflow function, you need to provide the directory in the artifact where those notebooks are located. You can also specify a path on the Databricks workspace to which those notebooks will be uploaded; I’ve set the default path to /deployed/notebooks/.
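A simplified sketch of that workflow, built on the same DatabricksRequest class, maps file extensions to the language parameter and imports each notebook via the 2.0/workspace/import endpoint (the real function handles more cases, and the parent folder may need to be created first via 2.0/workspace/mkdirs):

```python
import base64
import os

LANGUAGE_BY_EXTENSION = {".py": "PYTHON", ".sql": "SQL"}  # the import API needs an explicit language

def upload_notebooks_workflow(config: dict, token: str, notebooks_dir: str,
                              target_dir: str = "/deployed/notebooks") -> None:
    """Sketch: import every .py/.sql file found under notebooks_dir into the workspace."""
    api = DatabricksRequest(config["databricks_host"], token)
    for root, _, files in os.walk(notebooks_dir):
        for file_name in files:
            extension = os.path.splitext(file_name)[1]
            if extension not in LANGUAGE_BY_EXTENSION:
                continue
            with open(os.path.join(root, file_name), "rb") as f:
                content = base64.b64encode(f.read()).decode()
            api._post("workspace/import", {
                "path": f"{target_dir}/{os.path.splitext(file_name)[0]}",
                "format": "SOURCE",
                "language": LANGUAGE_BY_EXTENSION[extension],
                "content": content,
                "overwrite": True,
            })
```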

Reminder

All of the code for CI/CD can be found here: https://github.com/szymonzaczek/databricks-ci-cd

Summary

Databricks provides a remarkable platform for pretty much any data-related need. Their disruptive ideas, such as the data lakehouse concept, and their constant delivery of new technologies, such as Delta Lake, Delta Sharing, and Unity Catalog, justifiably put them at the forefront of the industry.

However, the platform itself is not enough to ensure the productivity of engineers. I guess no one will argue that an engineer spending their time copying files between environments is the best use of their resources. That’s one of the reasons why CI/CD processes are pretty much considered a necessity in a commercial environment.

By implementing the CI/CD process on Azure DevOps that I’ve shared here and in the previous article, you’ll get a great coding experience with the Databricks platform. Obviously, your needs may differ from what I shared here; in that case, you can use this code as a template and extend it any way you want.

Thank you very much for reading my work. Happy coding!
