Databricks CI/CD

Matt Weingarten
3 min read · Jan 13, 2023


Brick By Boring Brick (I’m really starting to run out of song titles now)

Introduction

While my team has been a heavy user of Databricks for close to a year now, one area where we’ve been lacking is proper CI/CD practices for our jobs. Sure, we store the notebooks that our jobs run in version control, but we’re still creating the jobs themselves in the UI.

For starters, you could certainly argue whether jobs, especially jobs meant for production, should be spun up in the UI at all. Considering we store Airflow pipelines and EMR configurations in version control so that they go through a proper code review process, we should definitely be doing the same for Databricks jobs. How can we do that?

I want to clearly state that all credit for the design I’m going to lay out in this post goes to my colleague Phil and his team.

Databricks API

The Databricks REST API was naturally a good place to start. The API is very powerful, and it of course has the ability to interface with jobs. All that’s necessary to use the API is a personal access token (PAT), which is easy enough to create in the UI.
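As a rough sketch of what calling the Jobs API with a PAT looks like, the snippet below builds an authenticated request using only the standard library. The workspace URL is a placeholder, and the helper name is my own; the PAT simply goes in the `Authorization` header as a Bearer token.

```python
import json
import os
import urllib.request

# Hypothetical workspace URL; substitute your own Databricks host.
DATABRICKS_HOST = "https://example-workspace.cloud.databricks.com"

def jobs_api_request(path, payload=None, token=None):
    """Build an authenticated request against the Databricks Jobs API.

    The PAT is passed as a Bearer token in the Authorization header;
    a JSON payload turns the request into a POST, otherwise it is a GET.
    """
    token = token or os.environ["DATABRICKS_TOKEN"]
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    data = json.dumps(payload).encode() if payload is not None else None
    return urllib.request.Request(
        DATABRICKS_HOST + path,
        data=data,
        headers=headers,
        method="POST" if data else "GET",
    )

# Example usage (not executed here): list the jobs in the workspace.
# with urllib.request.urlopen(jobs_api_request("/api/2.1/jobs/list")) as resp:
#     print(json.load(resp))
```

The same helper would cover job creation by POSTing a payload to the create endpoint instead.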

For any proper CI/CD process, you need a service user with all of the necessary permissions to handle the deployment. In Databricks, this concept is achieved with service principals. To make this work as expected, we created a service principal and a corresponding PAT.

Job Configurations

As stated earlier, we want to be storing job configurations in version control so all configurations go through the proper code review process. After all, we want to make sure jobs are owned by service principals, integrated with Git, and using optimal cluster policies (gotta save those costs, right?).

We decided to approach this by storing those job configurations as YAML. If you’re scratching your head and wondering how to translate a UI job into its proper YAML representation (trust me, I was confused at first as well), you can go to a job and click on “View JSON” in the upper right-hand corner; there are also helpful examples in the Databricks API documentation itself. If we store a job in a form along those lines, the API can do the rest and spin it up for us as requested.
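To make this concrete, here’s a hypothetical YAML configuration mirroring the shape of a Jobs API create payload. The field names follow the Databricks Jobs API documentation; the repository, notebook path, policy ID, and cluster values are invented for illustration.

```yaml
# Hypothetical job configuration, shaped like a Jobs API 2.1 create payload
name: nightly-etl
git_source:
  git_url: https://github.com/example-org/etl-notebooks   # placeholder repo
  git_provider: gitHub
  git_branch: main
tasks:
  - task_key: run_etl
    notebook_task:
      notebook_path: notebooks/nightly_etl
      source: GIT
    job_cluster_key: etl_cluster
job_clusters:
  - job_cluster_key: etl_cluster
    new_cluster:
      spark_version: 11.3.x-scala2.12
      node_type_id: m5.xlarge
      num_workers: 2
      policy_id: "0123456789ABCDEF"   # cluster policy ID, placeholder
```

Storing configs in this form checks all the boxes from above: Git integration via `git_source`, cost control via the cluster policy, and a reviewable diff for every change.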

Automation

So, how do we put all the pieces together?

  • Git repo stores all the job configurations in YAML form
  • Jenkins pipeline uploads these configurations to S3, only uploading changed configurations (s3 sync)
  • Lambda function runs off of S3 event notifications (using SNS and SQS as intermediaries as a result of our federated approach). This function in turn takes each notification and runs the corresponding create/update command in the Jobs API to get the jobs into Databricks.
Our design approach for Databricks CI/CD
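The trickiest part of the Lambda step is unwrapping the nested event: the SQS record body holds an SNS notification, whose message in turn holds the S3 event. Below is a minimal sketch of that unwrapping; the `deploy_job` call at the end is a hypothetical stand-in for fetching the YAML from S3 and hitting the Jobs API.

```python
import json

def extract_s3_objects(sqs_event):
    """Pull (bucket, key) pairs out of an SQS event whose message bodies
    wrap SNS notifications carrying S3 event records."""
    objects = []
    for record in sqs_event.get("Records", []):
        sns_envelope = json.loads(record["body"])       # SQS body -> SNS notification
        s3_event = json.loads(sns_envelope["Message"])  # SNS message -> S3 event
        for s3_record in s3_event.get("Records", []):
            s3 = s3_record["s3"]
            objects.append((s3["bucket"]["name"], s3["object"]["key"]))
    return objects

def handler(event, context):
    # For each changed config, a deploy_job helper (not shown) would fetch
    # the YAML from S3 and call the Jobs API create/update endpoint.
    for bucket, key in extract_s3_objects(event):
        print(f"deploying job config s3://{bucket}/{key}")
        # deploy_job(bucket, key)  # hypothetical
```

Keeping the event-parsing pure like this makes the function easy to unit test without touching AWS or Databricks.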

All in all, it’s a similar approach to how we handled automated insights earlier (which would make sense, since Phil was my partner in crime for that one).

There are other approaches we could have taken for the automation. For example, the Jenkins pipeline could have run the necessary commands itself. However, that would have required a lot of setup from our Jenkins team since it’s a new pipeline construct, so we went with the simpler, effective route of a Lambda function.

Another option would have been to use Terraform, which our admin team is already using for some aspects of the Databricks platform but not jobs. Perhaps the best answer is a self-serve Terraform repo, similar to how we already handle all our other Cloud infrastructure? I could definitely see that being a working reality.

Conclusion

Refining and adopting best practices has been my motto since we became power users of Databricks. At the end of the day, CI/CD was an area we wanted to tackle, and as it turns out, we just needed to wait for the right opportunity to take it on. Thanks to Phil for the collaboration on yet another great idea.

