Databricks Asset Bundles (DABs): Deploying and Managing Data & AI Assets

Rudyar Cortes
TotalEnergies Digital Factory
Mar 26, 2024 · 5 min read
image generated via Bing.

In recent years, there has been a concerted effort to simplify Databricks Data & AI asset development and deployment in order to deliver robust solutions. This endeavour involves adopting Software Engineering best practices across Data teams. These best practices encompass code versioning, testing, code packaging, and Continuous Integration and Delivery (CI/CD). By adhering to them, teams can confidently deploy Data & AI assets into production.

Achieving such a goal is a challenging task. Prior to the emergence of dedicated tools, numerous teams devised their own methods to adhere to Software Engineering best practices. They often resorted to developing in-house deployment tools, encountering numerous challenges along the way in their quest to deliver robust Data pipelines.

In 2020, Databricks Labs introduced Databricks CLI eXtensions (dbx), one of the pioneering tools designed to streamline Databricks workflows development and deployment (I previously authored an entire article on this topic). Leveraging such a tool, Data teams can efficiently develop, test and deploy complex data, analytics, and AI projects.

The increasing adoption of dbx paved the way for the release of Databricks Asset Bundles (DABs). Unlike dbx, DABs are integrated into the official Databricks CLI and stand as the recommended tool for development and deployment on the Databricks Data Intelligence Platform.

What are Databricks Asset Bundles?

Databricks Asset Bundles (DABs) allow teams to bundle Data & AI assets and to test and deploy them as a single package.

DABs provide a concise and declarative YAML syntax that makes it easier to manage deployments and environments. They facilitate continuous integration and deployment (CI/CD), and are especially useful in scenarios that involve multiple contributors, automation, and organizational standards.

With DABs, you can efficiently organize and manage your Databricks projects while adhering to Software Engineering best practices.

Initiating a project with DABs

In this section, we will initialize a project using DABs. Ensure that you have Databricks CLI version 0.205 or above. You can check your CLI version as follows:

databricks -v

If your CLI does not meet the required version, please install or update the CLI.
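For reference, two common ways to install or update it are Homebrew and the standalone install script published by Databricks; check the official installation documentation for the option that fits your platform.

brew tap databricks/tap && brew install databricks

curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh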

Once the CLI is installed, run the following command to initialise your repository.

databricks bundle init

This command sets up the initial structure of your repository, creating essential directories and files for managing your Databricks project. Here’s what the initialised repository looks like:


├── README.md
├── databricks.yml
├── fixtures
├── pytest.ini
├── requirements-dev.txt
├── resources
│   └── my_project_job.yml
├── setup.py
├── src
│   └── my_project
└── tests
    └── main_test.py
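Note that databricks bundle init prompts you to choose a project template; the tree above corresponds to the default Python template, which can also be selected non-interactively, for example:

databricks bundle init default-python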

Configuration files

The main DAB configuration files are the following:

├── databricks.yml
├── resources
│   └── my_project_job.yml

Deployment configuration is divided into two parts:

  • The deployment entry point databricks.yml, which contains the target environment definitions
  • The YAML configuration files under resources/, which contain your data asset definitions such as workflows, Delta Live Tables pipelines, and ML experiments

The following is an example of the databricks.yml file generated in the previous section. It declares two targets named dev and prod, and it imports the data asset definitions under resources/my_project_job.yml

# This is a Databricks asset bundle definition for my_project.
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
bundle:
  name: my_project

# Resources include the asset deployment declarations
include:
  - resources/my_project_job.yml

targets:
  # The 'dev' target, for development purposes. This target is the default.
  dev:
    # We use 'mode: development' to indicate this is a personal development copy:
    # - Deployed resources get prefixed with '[dev my_user_name]'
    # - Any job schedules and triggers are paused by default
    # - The 'development' mode is used for Delta Live Tables pipelines
    mode: development
    default: true
    workspace:
      host: DEV_WORKSPACE_URL

  ## Optionally, there could be a 'staging' target here.
  ## (See Databricks docs on CI/CD at https://docs.databricks.com/dev-tools/bundles/ci-cd.html.)
  #
  # staging:
  #   workspace:
  #     host: https://adb-5879118144375503.3.azuredatabricks.net

  # The 'prod' target, used for production deployment.
  prod:
    # We use 'mode: production' to indicate this is a production deployment.
    # Doing so enables strict verification of the settings below.
    mode: production
    workspace:
      host: PROD_WORKSPACE_URL
      root_path: /Shared/.bundle/prod/${bundle.name}
    run_as:
      # It is recommended to use a service principal here,
      # see https://docs.databricks.com/dev-tools/bundles/permissions.html.
      user_name: YOUR_SERVICE_PRINCIPAL

For further details about the configuration, please visit the official documentation.

Let’s take a look at the main job definition resources/my_project_job.yml

# The main job for my_project.
resources:
  jobs:
    my_project_job:
      name: my_project_job

      schedule:
        # Run every day at 8:37 AM
        quartz_cron_expression: '44 37 8 * * ?'
        timezone_id: Europe/Amsterdam

      tasks:
        - task_key: main_task
          job_cluster_key: job_cluster
          python_wheel_task:
            package_name: my_project
            entry_point: main
          libraries:
            # By default we just include the .whl file generated for the my_project package.
            # See https://docs.databricks.com/dev-tools/bundles/library-dependencies.html
            # for more information on how to add other libraries.
            - whl: ../dist/*.whl

      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: Standard_D3_v2
            autoscale:
              min_workers: 1
              max_workers: 2

This file defines a Databricks workflow composed of a single task (main_task) that runs every day at 8:37 AM. It runs on a job cluster with autoscaling enabled (one to two workers) that installs the wheel package built from the project.

Building and deploying

Once the configuration files are ready, you can package and deploy your code to a target environment by executing:

databricks bundle deploy -t YOUR_ENV
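You can also ask the CLI to check your bundle configuration for errors before deploying, using the validate command:

databricks bundle validate -t YOUR_ENV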

Once deployed, you can run your job as follows:

databricks bundle run my_project_job -t YOUR_ENV

Note that this command is especially useful for running integration tests that need components only available on the Databricks platform, such as the Databricks File System (DBFS).

Developers typically use the deploy and run commands to build, deploy, and run their assets from their local machine against the development environment.
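When a personal development deployment is no longer needed, the deployed resources can be removed from the workspace with the destroy command:

databricks bundle destroy -t YOUR_ENV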

On the other hand, CI/CD pipelines typically deploy against pre-production and production environments using a Service Principal as the deployment identity.
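As a minimal sketch of such a pipeline (assuming GitHub Actions, the databricks/setup-cli action, and a service principal whose OAuth credentials are stored as repository secrets; the workflow name and secret names below are illustrative), a production deployment workflow could look like:

# .github/workflows/deploy-prod.yml (illustrative sketch)
name: deploy-prod
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      # Check out the repository containing the bundle
      - uses: actions/checkout@v4
      # Install the Databricks CLI
      - uses: databricks/setup-cli@main
      # Deploy the bundle to the prod target using the service principal identity
      - run: databricks bundle deploy -t prod
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}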

For an extensive example of how to build a CI/CD pipeline using DABs, please refer to this GitHub repository.

Conclusion

In summary, Databricks Asset Bundles (DABs) emerge as a powerful tool for streamlining Continuous Integration and Continuous Deployment (CI/CD) of Data & AI assets on the Databricks Data Intelligence Platform.

Using DABs, Data teams can adopt Software Engineering best practices and deliver robust Data & AI solutions.

Feel free to customize and expand your repository based on your project requirements. Happy bundling!
