Deploying and Managing Databricks Pipelines

Rudyar Cortes
TotalEnergies Digital Factory
4 min read · Apr 6, 2022
1. Introduction

The data side of your team is ready to start a new use case. Everyone is excited and wants to start building the first model as fast as possible. The team has made three important decisions:

  • Use Databricks: A distributed system is required to process the massive amount of data generated by the use case.
  • Use PySpark as the distributed computing framework
  • Production can wait, as getting results is more important at the moment.

As a Data Engineer, you start collecting data and making it available to Data Scientists in order to get the first results. After some weeks of work, the first model starts to show promising results, and a production-ready model is required by Friday. The minimal checklist to go into production looks as follows:

  • Schedule your data pipeline every morning in Databricks
  • Write unit/integration/e2e tests
  • Track your deployments and releases
  • Refactor the code
  • Write a CI/CD pipeline

You suddenly realise that it would take weeks to meet these production requirements and that being production-ready from day 0 would have been useful.

The good news is that Databricks Labs [1] provides the Databricks CLI eXtensions (a.k.a. dbx) [2], which accelerates delivery by drastically reducing time to production. Using this tool, data teams can deploy a pipeline to a target environment with a single command while following the deployment best practices recommended by Databricks.

In the next sections, the different components of dbx are detailed.

2. Architecture and workflow

Fig 2.1: Databricks deployment architecture

Fig 2.1 shows the architecture and workflow of a given Databricks deployment. A deployment consists of the following three components:

  • Job pipeline: composed of one or more jobs (single or multi-task)
  • Common Python package (wheel): the main Python package used by the job pipeline
  • MLflow experiment: associated with the job pipeline

Once a deployment is defined, it is deployed to a target environment using dbx.

Deploying a Databricks pipeline consists of four steps:

  • Getting a starting point template
  • dbx tool configuration
  • Deployment definition
  • Deployment execution

In the next sections, I will detail each of these steps.

3. Getting a starting point template

In order to start working with dbx, you need to install it and initialise a project template as follows:
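The snippet below is a minimal sketch of this step: dbx is installed from PyPI, and the project skeleton is generated from the Databricks Labs CI/CD template [2] via cookiecutter. The exact template source and prompts may differ depending on the dbx version you use.

```bash
# Install dbx and cookiecutter (used to generate the project skeleton)
pip install dbx cookiecutter

# Generate a new project from the Databricks Labs CI/CD template [2]
cookiecutter https://github.com/databrickslabs/cicd-templates.git
```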

Once you’ve initialised your template, you should get a directory structure as follows
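The exact layout depends on the template version, but for the Python basic template [3] it is roughly the following (the project name my_project is a placeholder):

```
my_project/
├── .dbx/
│   └── project.json          # dbx environment configuration
├── conf/
│   └── deployment.json       # deployment definition (jobs, clusters, schedule)
├── my_project/
│   ├── __init__.py
│   └── jobs/                 # job entry points live here
├── tests/
│   ├── unit/                 # unit tests
│   └── integration/          # integration tests
├── unit-requirements.txt     # dependencies needed to run the unit tests
└── setup.py                  # builds the common Python package (wheel)
```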

As you might notice, the project structure is provided for you, and you only need to focus on coding your jobs and unit/integration tests.

In order to start working and run your first unit tests, you must install the unit-test requirements and install the library in local development mode as follows:
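Concretely, that amounts to something like the commands below, run from the project root (the requirements file name follows the template sketched above):

```bash
# Install the dependencies required by the unit tests
pip install -r unit-requirements.txt

# Install the project package in local development (editable) mode
pip install -e .
```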

For further details about this part, please see [3].

4. Databricks CLI eXtensions tool configuration

Once you've followed the steps presented in Section 3 and have the Databricks template in your project, it's time to set up your environment and define your deployment.

The first step is to configure dbx. Since dbx uses databricks-cli [4] under the hood, you must first edit your ~/.databrickscfg configuration file with a default profile. Fig. 3.1 shows an example of a databricks-cli configuration file.

Fig 3.1: databricks-cli configuration file

The tag [DEFAULT] identifies a Databricks profile, which is composed of a host and a token. You can get details about how to generate your user token in [5].
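For reference, a profile of that shape looks roughly like the sketch below; the host and token values are placeholders to replace with your own workspace URL and personal access token.

```ini
[DEFAULT]
host = https://<your-workspace-url>
token = <your-personal-access-token>
```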

Once your Databricks profile is created, you can configure dbx with the following command

Fig. 3.2: profile configuration using dbx

This command will configure your dbx profile and ask you for your MLflow artifact location. The result will be written to the .dbx/project.json file. For further details about this step, please refer to [6].
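In its simplest form, the command looks like the sketch below; flag names have changed slightly across dbx releases, so check dbx configure --help for your installed version.

```bash
# Bind the dbx project to the DEFAULT databricks-cli profile
dbx configure --profile DEFAULT
```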

Once you've finished this step, you're ready to define your deployment configuration.

5. Deployment configuration

Configuring a deployment consists of writing a simple configuration file stored by default in conf/deployment.json (it can also be a YAML file if you prefer). The end goal of this file is to declare the jobs that compose the data pipeline and the resources to be used (cluster size, number of workers, runtime version, and so on).

Fig. 4.1 shows an example of a data pipeline that is scheduled every morning at 7 a.m. This data pipeline is composed of two jobs that execute sequentially. Every job creates a new job cluster, which uses resources only during job execution.

Fig 4.1: Deployment file example

When example_data_pipeline is executed, job_one starts by creating a single-node cluster running Spark runtime 9.1. Once job_one finishes, the job cluster associated with it is destroyed and job_two starts, as a dependency between the two jobs has been established using the key “depends_on”.
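A hedged sketch of what such a deployment file can look like is shown below. The job and task names match the example above, while the node type, entry-point paths, and timezone are placeholders to adapt to your workspace; the exact schema depends on your dbx version and mirrors the Databricks Jobs API fields [7].

```json
{
  "default": {
    "jobs": [
      {
        "name": "example_data_pipeline",
        "schedule": {
          "quartz_cron_expression": "0 0 7 * * ?",
          "timezone_id": "UTC"
        },
        "tasks": [
          {
            "task_key": "job_one",
            "new_cluster": {
              "spark_version": "9.1.x-scala2.12",
              "node_type_id": "Standard_DS3_v2",
              "num_workers": 0,
              "spark_conf": {
                "spark.master": "local[*]",
                "spark.databricks.cluster.profile": "singleNode"
              },
              "custom_tags": {"ResourceClass": "SingleNode"}
            },
            "spark_python_task": {
              "python_file": "file://my_project/jobs/job_one/entrypoint.py"
            }
          },
          {
            "task_key": "job_two",
            "depends_on": [{"task_key": "job_one"}],
            "new_cluster": {
              "spark_version": "9.1.x-scala2.12",
              "node_type_id": "Standard_DS3_v2",
              "num_workers": 0,
              "spark_conf": {
                "spark.master": "local[*]",
                "spark.databricks.cluster.profile": "singleNode"
              },
              "custom_tags": {"ResourceClass": "SingleNode"}
            },
            "spark_python_task": {
              "python_file": "file://my_project/jobs/job_two/entrypoint.py"
            }
          }
        ]
      }
    ]
  }
}
```

Declaring num_workers as 0 together with the singleNode Spark configuration is one common way to request a single-node job cluster for each task.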

Note that this is a minimalist example of a data pipeline. There are more options you can add, such as email notifications, concurrency constraints, and so on. For an exhaustive list of options, please refer to [7].

6. Your first deployment and launch

Once your deployment file is ready, you can deploy your data pipeline by simply running:

Fig 5.1: Deployment using dbx

This dbx command packages your pipeline and deploys it to the default profile defined above. Your job will appear in the “Jobs” section of your Databricks workspace.
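Assuming the deployment file sits at its default location (conf/deployment.json) and the default environment configured in Section 4, the step boils down to a single command:

```bash
# Build the project wheel and deploy the jobs defined in conf/deployment.json
dbx deploy
```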

Once your deployment is ready, you can launch it as follows

Fig 5.2: Launch data pipeline using dbx
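For example, to trigger the pipeline defined above (the job name matches the one declared in the deployment file; the flag syntax may differ slightly between dbx versions):

```bash
# Trigger a run of the deployed job
dbx launch --job example_data_pipeline
```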

7. Conclusion

This article explains how to leverage Databricks Labs CI/CD templates to accelerate the delivery of Databricks data pipelines. Data teams do not need to reinvent the wheel in order to deploy pipelines into production. Instead, they can use this template as a base and enrich it with custom tools such as code quality checks and security reviews.

References

[1] https://github.com/databrickslabs

[2] https://github.com/databrickslabs/cicd-templates

[3] https://dbx.readthedocs.io/en/latest/templates/python_basic.html

[4] https://docs.databricks.com/dev-tools/cli/index.html

[5] https://docs.databricks.com/dev-tools/api/latest/authentication.html

[6] https://dbx.readthedocs.io/en/latest/quickstart.html

[7] https://docs.databricks.com/dev-tools/api/latest/jobs.html
