Reusable GitHub Actions workflows for Databricks deployment with dbx

Vechtomova Maria · Marvelous MLOps · Jun 15, 2023

dbx is a great tool developed by Databricks Labs that simplifies Databricks job deployment by taking care of uploading all dependencies (Python files, wheel files). You no longer need Databricks job JSON definitions, which can become huge and hard to read: dbx supports Jinja, which brings a lot of flexibility. Tagging, passing environment variables, and passing Python parameters are all possible with dbx as well.

These are the main reasons we chose dbx deployment with a reusable GitHub Actions workflow to standardize ML model deployment on Databricks. When it comes to standardization, we want to keep things modular enough to accommodate more specific use cases. That is why we provide both reusable composite actions and reusable GitHub Actions workflows that work for internal and private repositories within the GitHub organization.

dbx & GitHub Actions = one love

Our Marvelous Actions & Workflow

We created a repository https://github.com/marvelousmlops/marvelous-workflows with the following structure:

├── README.md                          <- The top-level README
└── .github
    ├── workflows                      <- Reusable workflows folder
    │   ├── databricks_job_dbx.yml     <- Workflow for dbx deployment
    │   └── databricks_job_dbx.md      <- Workflow documentation
    └── actions                        <- Reusable composite actions
        ├── deploy_dbx                 <- Action for dbx deployment
        └── setup_env_vars             <- Action for setting up env vars

Reusable composite actions

  1. setup_env_vars

This action requires databricks_host and databricks_token as inputs and sets up four environment variables:

  • GIT_SHA (git commit hash)
  • PROJECT_NAME (repository name)
  • DATABRICKS_HOST
  • DATABRICKS_TOKEN
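Internally, the action boils down to writing these variables to the runner's environment file. The snippet below is a sketch of that mechanism, not the action's actual source; the input variable names and fallback values are assumptions for illustration.

```shell
# Sketch of what setup_env_vars might do internally (assumption, not the
# action's real implementation): append variables to $GITHUB_ENV so that
# all later steps in the job can read them.
GITHUB_ENV="${GITHUB_ENV:-/tmp/github_env}"        # set by the runner in CI
GITHUB_REPOSITORY="${GITHUB_REPOSITORY:-marvelousmlops/amazon-reviews-databricks}"
GITHUB_SHA="${GITHUB_SHA:-0000000000000000000000000000000000000000}"

{
  echo "GIT_SHA=${GITHUB_SHA}"                     # git commit hash
  echo "PROJECT_NAME=${GITHUB_REPOSITORY##*/}"     # repo name without the org
  echo "DATABRICKS_HOST=${INPUT_DATABRICKS_HOST:-https://example.cloud.databricks.com}"
  echo "DATABRICKS_TOKEN=${INPUT_DATABRICKS_TOKEN:-dummy}"
} >> "$GITHUB_ENV"
```

Anything appended to `$GITHUB_ENV` becomes an environment variable for every subsequent step in the same job, which is exactly why later actions can rely on DATABRICKS_HOST and DATABRICKS_TOKEN being present.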

This action can be used as:

- name: Setup env vars
  id: setup_env_vars
  uses: marvelousmlops/marvelous-workflows/.github/actions/setup_env_vars@v1
  with:
    databricks_token: ${{ secrets.DATABRICKS_TOKEN }}
    databricks_host: ${{ secrets.DATABRICKS_HOST }}

  2. deploy_dbx

This action takes the following inputs:

  • workspace-dir: Workspace directory on Databricks, needed for dbx configuration
  • artifact-location: Artifact location on Databricks, needed for dbx configuration
  • deployment-file: dbx deployment file
  • run-job-now: if set to "yes", runs the Databricks job directly after deployment

This composite action consists of the following steps:

  • setup python
  • checkout repository
  • install dependencies: dbx, databricks-cli
  • configure dbx project
  • get Databricks job name from dbx deployment file
  • deploy dbx & launch Databricks job if run-job-now input is “yes”

Note: This action requires DATABRICKS_TOKEN and DATABRICKS_HOST environment variables to be available on GitHub Actions runner.
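The steps above can be sketched as plain commands. This is an assumption based on the step list, not the action's source; the project.json layout follows the dbx project schema, and the dbx calls are guarded so the sketch is a no-op on a machine without dbx installed.

```shell
# Sketch of the commands behind the deploy_dbx steps (assumptions noted above).
# pip install dbx databricks-cli                   # install dependencies (in CI)

# Configure the dbx project: workspace-dir and artifact-location end up
# in .dbx/project.json.
mkdir -p .dbx
cat > .dbx/project.json <<'EOF'
{
  "environments": {
    "default": {
      "profile": "DEFAULT",
      "storage_type": "mlflow",
      "properties": {
        "workspace_directory": "/Shared/amazon-reviews-databricks",
        "artifact_location": "dbfs:/Shared/amazon-reviews-databricks"
      }
    }
  }
}
EOF

# Deploy, and launch only when run-job-now is "yes".
RUN_JOB_NOW="yes"
if command -v dbx >/dev/null 2>&1; then
  dbx deploy --deployment-file conf/dbx_deployment.j2
  if [ "$RUN_JOB_NOW" = "yes" ]; then
    dbx launch train_and_deploy_recommender_model
  fi
fi
```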

This action can be used as:

- name: Deploy dbx
  id: deploy_dbx
  uses: marvelousmlops/marvelous-workflows/.github/actions/deploy_dbx@v1
  with:
    workspace-dir: "/Shared/amazon-reviews-databricks"
    artifact-location: "dbfs:/Shared/amazon-reviews-databricks"
    deployment-file: "conf/dbx_deployment.j2"
    run-job-now: "yes"

Reusable GitHub Actions workflow

databricks_job_dbx.yml is a reusable workflow that consists of the following steps:

  • echo workflow inputs:

This step logs the workflow inputs, so that at a later stage you can find out exactly when the workflow ran and with which inputs.

  • generate token

This step generates a GitHub token from a GitHub App that has read permissions for all repositories within the organization. See GitHub's documentation on how to set up such an app. Store the GitHub App ID and private key as the organization secrets APP_ID and APP_PRIVATE_KEY. In this case, we recommend using organization secrets that have all repositories of the organization in their scope.

  • setup GIT_TOKEN as an environment variable
  • Checkout marvelousmlops/marvelous-workflows repository using GIT_TOKEN

This step is extremely important. If we want to use composite actions that are defined in the same repository as the GitHub Actions workflow from within that workflow, we cannot use a relative path. When the workflow is called from another repository, for example marvelousmlops/amazon-reviews-databricks, the GitHub runner would look for the action in the amazon-reviews-databricks repository and fail.

We could just reference our actions as marvelousmlops/marvelous-workflows/.github/actions/deploy_dbx@v1 and marvelousmlops/marvelous-workflows/.github/actions/setup_env_vars@v1, but this approach has a limitation: every time the workflow is updated and a new git tag is created, we would also need to update the action version in the workflow, which leads to a chicken-and-egg situation.

If we just reference them as @master or @develop, it will cause conflicts: if someone chooses to stay on version v1 of the workflow while the actions in the corresponding branch get updated, the workflow may start doing unexpected things. We want all versions to be stable.

That leaves us with the following solution: checking out the workflows repository at a specific reference. That way, we can ensure that version v1 of the workflow always stays the same.
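The pinned checkout amounts to roughly the following. This is a sketch of what actions/checkout with a repository and ref does under the hood, assuming GIT_TOKEN holds the token generated earlier; the clone is guarded so the sketch does nothing without a token.

```shell
# Sketch: check out the workflows repository at a pinned reference,
# so that version v1 of the workflow always uses the v1 actions.
WORKFLOWS_REPO="marvelousmlops/marvelous-workflows"
TOOLKIT_REF="v1"                                   # the pinned version
CLONE_URL="https://github.com/${WORKFLOWS_REPO}.git"

# Only attempt the clone when a token is available (e.g. on the runner).
if [ -n "${GIT_TOKEN:-}" ]; then
  git clone --depth 1 --branch "$TOOLKIT_REF" \
    "https://x-access-token:${GIT_TOKEN}@github.com/${WORKFLOWS_REPO}.git" \
    marvelous-workflows
fi
```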

  • Setup environment variables: this step will execute setup_env_vars action described above.

Values for DATABRICKS_TOKEN and DATABRICKS_HOST are taken from the corresponding secrets. In a separate article, we explain how to set up a long-lived DATABRICKS_TOKEN for a service principal for automation.

  • Deploy dbx: this step will execute deploy_dbx action described above

Deployment example with reusable workflow

To show how to use the reusable workflow, we need another repository. We used https://github.com/marvelousmlops/amazon-reviews-databricks.

name: "Train and deploy amazon models dbx reusable"
on:
  workflow_dispatch:

jobs:
  deploy_job:
    uses: marvelousmlops/marvelous-workflows/.github/workflows/databricks_job_dbx.yml@v1
    with:
      deployment-file: "recommender/dbx_recommender_deployment.yml.j2"
      toolkit-ref: v1
      run-job-now: "True"
    secrets: inherit
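Since the workflow is triggered by workflow_dispatch, it can also be started manually from the command line. The snippet below is a hypothetical example using the GitHub CLI; the workflow filename is an assumption, and the call is kept side-effect free by default.

```shell
# Hypothetical manual trigger for the workflow_dispatch event above,
# using the GitHub CLI (assumes `gh` is installed and authenticated).
REPO="marvelousmlops/amazon-reviews-databricks"
WORKFLOW_FILE="train_and_deploy.yml"               # hypothetical filename

DRY_RUN="${DRY_RUN:-yes}"                          # keep the sketch side-effect free
if [ "$DRY_RUN" = "no" ] && command -v gh >/dev/null 2>&1; then
  gh workflow run "$WORKFLOW_FILE" --repo "$REPO"
fi
```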

This is what the dbx_recommender_deployment.yml.j2 file looks like:

build:
  python: "pip"

environments:
  default:
    workflows:
      - name: "train_and_deploy_recommender_model"
        job_clusters:
          - job_cluster_key: "recommender_cluster"
            new_cluster:
              spark_version: "12.2.x-cpu-ml-scala2.12"
              num_workers: 1
              node_type_id: "Standard_D4s_v5"
              spark_env_vars:
                DATABRICKS_HOST: "{{ env['DATABRICKS_HOST'] }}"
                DATABRICKS_TOKEN: {{ '"{{secrets/keyvault/DatabricksToken}}"' }}
                GIT_SHA: "{{ env['GIT_SHA'] }}"
        tasks:
          - task_key: "train_model"
            job_cluster_key: "recommender_cluster"
            spark_python_task:
              python_file: "file://recommender/train_recommender.py"
              parameters: ["--run_id", "{{parent_run_id}}", "--job_id", "{{job_id}}"]
          - task_key: "deploy_model"
            job_cluster_key: "recommender_cluster"
            spark_python_task:
              python_file: "file://recommender/deploy_recommender.py"
              parameters: ["--run_id", "{{parent_run_id}}", "--job_id", "{{job_id}}"]
            depends_on:
              - task_key: "train_model"

For more details about the deployment of the amazon reviews models, refer to our other articles.

In the coming weeks, we will publish an article about using dbx and will explain the deployment file in detail.
