Dealing with private Python packages in Databricks Asset Bundles: part 1

Vechtomova Maria · Marvelous MLOps
Aug 11, 2024 · 4 min read

When working on multiple data projects, you start seeing some patterns emerge. Often, you use the same approach to clean, process, and validate your data. This is an indicator that you should build a private Python package that can be used across all these projects. Over time, you may develop multiple private packages that all your projects depend on.

Databricks recommends using DAB (Databricks Asset Bundles) for deploying Databricks jobs (see our article on getting started with Databricks Asset Bundles: https://marvelousmlops.substack.com/p/getting-started-with-databricks-asset). Please read it before proceeding, because we use DAB to deploy the Databricks job in this article.

Two main questions arise:

  • Where do you store all these Python packages?
  • How do you configure a DAB so the Databricks job can access these packages?

What is covered in the article

We puzzled over these questions quite a lot when designing our way of working and came up with four main ways to deal with private Python packages in Databricks Asset Bundles:

  • Use a private PyPI repository and cluster-scoped init scripts
  • Make private packages part of the bundle
  • Use Databricks Volumes to upload private packages
  • Initialize the Databricks job cluster from a custom Docker image

Even though all of these options work, none of them is perfect. We intend to help you make an informed decision about what works best for you.

In this article, we cover the first three options. Dealing with Docker images requires an article of its own (it will be covered in part 2).

Code for the article can be found here: https://github.com/marvelousmlops/dab_deployment/tree/master.

Use a private PyPI repository and cluster-scoped init scripts

Let's say we have a package called mlops-test stored in a private PyPI repository. For example, we use Azure DevOps and, for simplicity, authenticate with the service user's PAT (note: using an SPN and an Entra ID token is a more secure option; the Entra ID token can be generated in the init script if the SPN's credentials are stored in a secret scope).

The following is required:

  • A secret scope (in our case, "mlops") with a secret "pypi_token". The secret scope can be Key Vault-backed (recommended) or Databricks-backed.
  • An init script that specifies the extra-index-url.
  • A Unity Catalog Volume (in our example, /Volumes/mlops_test/mlops_volumes/init_scripts) where we upload the init script.
  • An environment variable that exposes the secret to the init script, because init scripts cannot access secret scopes directly.

This is what the init script looks like (let's call it extra_index_url.sh). The index URL can be found on Azure DevOps under Connect to feed -> pip.

#!/bin/bash
# Fail early if the PYPI_TOKEN environment variable is not set on the cluster
if [[ -z "$PYPI_TOKEN" ]]; then
  echo "PYPI_TOKEN is not set" >&2
  exit 1
fi
# Point pip to the private feed by writing an extra index URL to the global pip config
printf "[global]\n" > /etc/pip.conf
printf "extra-index-url =\n" >> /etc/pip.conf
printf "\thttps://$PYPI_TOKEN@<YOUR INDEX URL>\n" >> /etc/pip.conf

We need to copy this script to the Volume. This can be done with the Databricks CLI:

databricks fs cp extra_index_url.sh dbfs:/Volumes/mlops_test/mlops_volumes/init_scripts

When specifying the job cluster, we set the PYPI_TOKEN environment variable within spark_env_vars and add extra_index_url.sh to init_scripts.

This is what databricks.yml looks like:

bundle:
  name: demo-dab

artifacts:
  default:
    type: whl
    build: poetry build
    path: .

variables:
  root_path:
    description: root_path for the target
    default: /Shared/.bundle/${bundle.target}/${bundle.name}

resources:
  jobs:
    demo-job:
      name: demo-job
      tasks:
        - task_key: python-task
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: Standard_D4s_v5
            num_workers: 1
            spark_env_vars:
              PYPI_TOKEN: "{{secrets/mlops/pypi_token}}"
            init_scripts:
              - volumes:
                  destination: "/Volumes/mlops_test/mlops_volumes/init_scripts/extra_index_url.sh"
          spark_python_task:
            python_file: "main.py"
            parameters:
              - "--root_path"
              - ${var.root_path}
          libraries:
            - whl: ./dist/*.whl
            - pypi:
                package: mlops-test==1.0.0
                repo: <YOUR INDEX URL>

targets:
  dev:
    workspace:
      host: <YOUR DATABRICKS HOST>
      root_path: ${var.root_path}

We found an article explaining a similar setup via the UI: https://towardsdatascience.com/install-custom-python-libraries-from-private-pypi-on-databricks-6a7669f6e6fd. Note: it is a bit outdated (the init script is stored on DBFS).

Make private packages part of the bundle

The first option seems rather complicated, so here is another one: download the wheel first into a folder at the root level of the bundle (it will then be uploaded to the workspace together with all other bundle files) and refer to the location where DAB uploads the wheel.

Run the following commands at the root level of the bundle:

mkdir extra_dist
cd extra_dist
export PYPI_TOKEN=<YOUR TOKEN>
pip download mlops-test==1.0.0 --index-url https://$PYPI_TOKEN@<YOUR INDEX URL> --no-deps
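One thing to watch out for: DAB uploads the files under the bundle root to the workspace, but paths matched by your .gitignore are typically skipped during sync. If wheel files are gitignored (common for dist folders), you may need to include them explicitly. A minimal sketch of the sync configuration, assuming a recent Databricks CLI (the extra_dist folder name is just our example):

sync:
  include:
    - extra_dist/*.whl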

This is what databricks.yml would look like:

bundle:
  name: demo-dab

artifacts:
  default:
    type: whl
    build: poetry build
    path: .

variables:
  root_path:
    description: root_path for the target
    default: /Shared/.bundle/${bundle.target}/${bundle.name}

resources:
  jobs:
    demo-job:
      name: demo-job
      tasks:
        - task_key: python-task
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: Standard_D4s_v5
            num_workers: 1
          spark_python_task:
            python_file: "main.py"
            parameters:
              - "--root_path"
              - ${var.root_path}
          libraries:
            - whl: ./dist/*.whl
            - whl: /Workspace/${var.root_path}/files/extra_dist/mlops_test-1.0.0-py3-none-any.whl

targets:
  dev:
    workspace:
      host: <YOUR DATABRICKS HOST>
      root_path: ${var.root_path}

Use Databricks Volumes to upload private packages

Instead of making the private packages part of the bundle, you can upload them to a Unity Catalog Volume (in our example, /Volumes/mlops_test/mlops_volumes/packages).

First, copy the downloaded package to the Volume:

cd extra_dist
databricks fs cp mlops_test-1.0.0-py3-none-any.whl dbfs:/Volumes/mlops_test/mlops_volumes/packages

The only difference from the previous example is the wheel location, which becomes "/Volumes/mlops_test/mlops_volumes/packages/mlops_test-1.0.0-py3-none-any.whl". The rest is the same; the relevant part of the task definition is sketched below.
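For clarity, this is roughly what the libraries section of the task looks like in this case (only the wheel path changes compared to the databricks.yml above):

          libraries:
            - whl: ./dist/*.whl
            - whl: /Volumes/mlops_test/mlops_volumes/packages/mlops_test-1.0.0-py3-none-any.whl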

Conclusions and considerations

In this article, we showed three ways to deal with private packages. They differ slightly in complexity, and they all share one significant drawback: dependency resolution can be painful, and things can break over time.

To prevent that from happening, we can resolve dependencies locally and then inject the pinned versions of all dependencies into the libraries section, as sketched below. This results in a long and ugly databricks.yml definition.
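A rough sketch of what that could look like (the pinned packages and versions below are made up for illustration; in practice, the list would be generated from a local lock file, for example via poetry export or pip-compile):

          libraries:
            - whl: ./dist/*.whl
            - whl: /Volumes/mlops_test/mlops_volumes/packages/mlops_test-1.0.0-py3-none-any.whl
            # Pinned transitive dependencies, resolved locally (illustrative versions)
            - pypi:
                package: pandas==2.2.2
            - pypi:
                package: numpy==1.26.4
            - pypi:
                package: pydantic==2.8.2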

Docker image-based cluster initialization solves this problem. We will talk about it in our next article.
