Optimizing Databricks Workflows with CI/CD

Poojalakshmi Ramakrishnan
BI3 Technologies
Jul 18, 2024

Introduction

Keeping data pipelines and analytics workflows efficient and up to date is crucial in today’s data-driven environment. Continuous Integration and Continuous Deployment (CI/CD) practices, long established in software development, are just as beneficial when applied to data engineering and analytics workflows. This blog shows how to automate data updates by integrating Databricks with DevOps through CI/CD, focusing on a practical implementation using Azure DevOps.

Let’s start by exploring the basics of CI/CD and why it is needed.

What is CI/CD?

CI/CD is a set of practices aimed at improving software delivery through automation. CI involves automatically testing and integrating code changes, while CD focuses on automating the deployment of these changes. When applied to Databricks, CI/CD can help maintain the quality and reliability of the data workflows, reduce manual intervention, and speed up deployment cycles.

Problem Statement and the Need for a CI/CD Pipeline

The primary problem we are facing is that a service principal is unable to generate an Azure DevOps personal access token (PAT).

We can still connect the repository directly in the Databricks workflow by manually creating a DevOps PAT under a user ID, but the maximum validity period of a DevOps PAT is limited to one year. Regenerating a user-based PAT every year is not a recommended way to work around this.

So, the solution is to execute the project code directly from the Databricks workspace without configuring the Azure DevOps repository in the Databricks workflow. For that, the latest code must be kept in sync in the Databricks workspace, which is exactly what a CI/CD pipeline automates.

Prerequisites to get Started

We’ll use Azure DevOps to set up CI/CD for Databricks, but the same principles apply with other tools such as GitHub Actions, GitLab CI, or Jenkins.

  1. Azure DevOps Account: Ensure you have an Azure DevOps account and a project set up.
  2. Databricks Workspace: Ensure you have access to a Databricks workspace.
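Optionally, if you want to try the deployment commands locally before automating them, the legacy Databricks CLI (the same package the pipeline below installs on the build agent) can be installed with pip. A minimal sketch for local testing only:

# Install the legacy Databricks CLI and confirm the installation
python -m pip install --upgrade pip
pip install databricks-cli
databricks --version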

Procedure

Step 1: Connecting Azure DevOps to Databricks

First, we need to establish a connection between Azure DevOps and the Databricks workspace.

A. Generating a Personal Access Token in Databricks:

  • Go to the Databricks workspace.
  • Select your profile in the top-right corner to open the settings.
Databricks Settings
  • Click on “Access tokens” and select “Generate new token”.
Generate Access tokens
  • Copy this token; you will need it for Azure DevOps (a CLI alternative is sketched below).
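If the Databricks CLI is already authenticated against the workspace (for example with a token created once through the UI), a token can also be generated from the command line. A minimal sketch, assuming the legacy CLI; the lifetime and comment values below are placeholders:

# Create a workspace personal access token via the legacy Databricks CLI
# (lifetime in seconds and comment are example values)
databricks tokens create --lifetime-seconds 7776000 --comment "azure-devops-ci-cd"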

B. Creating a Service Connection in Azure DevOps:

  • Go to the Azure DevOps project.
  • Navigate to Project Settings -> Service Connections.
  • Click on “New service connection”.
Create New Service Connection
  • Select “Generic” and enter the necessary details, including the Databricks instance URL and the token generated earlier (a quick way to verify this pair is shown below).
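Before saving the service connection, it can be worth confirming that the instance URL and token actually authenticate. A quick check against the Databricks REST API, with the host and token as placeholders:

# Sanity-check the host/token pair against the workspace API
curl -s -H "Authorization: Bearer <your-databricks-token>" \
  "https://<your-databricks-instance>/api/2.0/workspace/list?path=/"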

Step 2: Setting Up a Databricks Repo

Next, we need to set up a repository in Databricks that will be synced with the Azure DevOps repository.

A. Creating a Repository in Azure DevOps:

  • Create a new Git repository in the Azure DevOps project.

B. Connecting the Repository to Databricks:

  • In Databricks, go to Repos and click on “Add Repo”.
Add Repo in Databricks
  • Enter the URL of the Azure DevOps repository (a CLI alternative is sketched after this step).
Connect Repository to Databricks
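The same Repo can also be created programmatically through the Repos API. A minimal sketch using the legacy Databricks CLI, with the repository URL and workspace path as placeholders:

# Create the Databricks Repo from the command line instead of the UI
# (URL and path are placeholders; the provider value follows the Repos API naming)
databricks repos create \
  --url "https://dev.azure.com/<org>/<project>/_git/<repo>" \
  --provider azureDevOpsServices \
  --path "/Repos/<user>/<repo>"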

Step 3: Creating a CI/CD Pipeline

A. Create a New Pipeline in Azure DevOps:

  • In the Azure DevOps project, go to Pipelines and create a new pipeline.
  • Choose the repository set up earlier and select the starter pipeline.
Create CICD Pipeline
Configure Pipeline

Here is a sample YAML configuration for a CI/CD pipeline that deploys changes to Databricks:

trigger:
  branches:
    include:
      - main

pool:
  vmImage: 'ubuntu-latest'

variables:
  DATABRICKS_HOST: 'https://<your-databricks-instance>'
  DATABRICKS_TOKEN: '<your-databricks-token>'

steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: '3.x'
      addToPath: true

  - script: |
      python -m pip install --upgrade pip
      pip install databricks-cli
    displayName: 'Install Databricks CLI'

  - script: |
      databricks configure --token <<EOF
      $(DATABRICKS_HOST)
      $(DATABRICKS_TOKEN)
      EOF
    displayName: 'Configure Databricks CLI'

  - script: |
      databricks workspace import_dir . /Shared/ci_cd_demo --overwrite
    displayName: 'Deploy Files to Databricks'
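Note that the token is shown inline only for readability; in a real pipeline it would normally live in a secret pipeline variable or a variable group rather than in plain text. After a successful run, you can confirm the files landed in the workspace, for example:

# List the deployed folder to confirm the import succeeded
databricks workspace ls /Shared/ci_cd_demo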

Step 4: Run the Pipeline

  • Whenever you add new files or change existing ones, commit them by selecting the files and push the changes to the Azure DevOps repository (an equivalent git sequence is shown after this list).
Commit Option
Final commit to reflect the changes in the selected branch
  • The pipeline will automatically trigger, running the defined steps: installing dependencies, configuring the Databricks CLI, and deploying files to Databricks.
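From a local clone, the same commit-and-push flow looks roughly like this (the branch name main is assumed, matching the pipeline trigger):

# Stage, commit, and push changes; the push to main triggers the pipeline
git add .
git commit -m "Update notebooks"
git push origin main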

Conclusion

Implementing CI/CD for Databricks workflows can automate data updates and significantly enhance the efficiency, reliability, and scalability of data engineering and analytics processes.

Benefits of CI/CD for Databricks Workflows

  1. Automation: Automate deployment processes, reducing manual intervention and errors.
  2. Consistency: Ensure consistent deployments and streamline version control for notebooks and workflows.
  3. Efficiency: Speed up development cycles and improve collaboration among data engineers and data scientists.
  4. Reliability: Catch errors early through automated processes and ensure the reliability of data workflows.

About Us:

Bi3 has been recognized for being one of the fastest-growing companies in Australia. Our team has delivered substantial and complex projects for some of the largest organizations around the globe, and we’re quickly building a brand that is well-known for superior delivery.

Website: https://bi3technologies.com/

Follow us on,
LinkedIn:
https://www.linkedin.com/company/bi3technologies
Instagram:
https://www.instagram.com/bi3technologies/
Twitter:
https://twitter.com/Bi3Technologies
