The Secret to Success in Large-Scale Data Engineering Projects

Automation in Databricks with Databricks Asset Bundles

Rafael Escoto
Slalom Build
10 min read · Jun 6, 2024


A team of Slalom Build data engineers working on data pipelines. Created by Slalom humans using AI.

In the evolving landscape of data engineering, selecting the right tools and methodologies is crucial for success. This article explores how Databricks Asset Bundles (DABs) can be leveraged for workflow implementation and automation in Databricks, significantly enhancing the efficiency and reliability of your DataOps.

Historically, data engineers using Databricks relied on a variety of tools for automation, including the Databricks UI, API, and CLI, Terraform, and the lesser-known dbx, an unofficial precursor to DABs. Although dbx was highly effective for workflow automation, its lack of official support posed risks that may be unacceptable for today's critical systems.

Each of these tools has played a critical role in evolving data management practices, offering unique advantages and functionalities. Understanding their evolution provides valuable insight into modern data engineering practices in Databricks and highlights the need for robust, officially supported tools when managing large-scale data projects.

Understanding Databricks Asset Bundles

In April 2024, Databricks Asset Bundles (DABs) achieved General Availability (GA), making them the officially supported tool for workflow automation. DABs adopt an infrastructure-as-code (IaC) approach, which is pivotal for managing the configuration of cloud environments through code rather than manual processes. This methodology makes infrastructure and code deployments repeatable, consistent, and far less error-prone.

DABs streamline the development and management of complex data projects on Databricks by packaging code and its associated infrastructure into bundles. Using simple Databricks CLI commands, a bundle packages both the workflow code and the configuration needed to execute it on Databricks.

Bundles are configured through YAML files, which are created during initialization and used at deployment time. These files specify parameters for your code as well as workflow attributes such as compute resources, permissions, alerts, and task orchestration.

This structured approach facilitates the implementation of continuous integration and continuous deployment (CI/CD) practices, enabling automated testing and deployment that can lead to faster release cycles and increased deployment reliability.

In practice, DABs are particularly advantageous in scenarios requiring frequent updates and rigorous testing of data pipelines. For example, in a financial analytics firm, DABs could be used to rapidly deploy updates to data models in response to changing market conditions, ensuring that the models operate with the latest data and logic without disrupting ongoing operations.

Working with Bundles

Deployment and Version Control

Deploying DABs requires the Databricks CLI, which facilitates direct interaction with Databricks from a developer's local machine. Once the CLI is configured, developers can initialize a bundle project, enabling them to package code and infrastructure into deployable units.

The integration of Git plays a pivotal role in managing DABs projects. Git allows for the versioned history of both the application code and the infrastructure code, enhancing governance and ensuring compliance with necessary standards. This version control system is crucial for tracking changes, reviewing code modifications, and maintaining a comprehensive audit trail.

While DABs can be deployed from various environments, such as your favorite IDE, a terminal, or directly within Databricks, CI/CD practices offer the most streamlined approach. A CI/CD pipeline automates the testing and deployment of bundles, enabling rapid iteration and ensuring that changes are systematically validated before being deployed.

My particular interest lies in the power of CI/CD to transform data engineering workflows. By automating the deployment process, CI/CD not only accelerates development cycles but also significantly reduces the potential for human error, leading to more reliable and efficient data projects.

Setup

To begin working with DABs, you first need to install and configure the Databricks CLI. The installation process varies slightly depending on your operating system, but it generally involves downloading the CLI tool from the official Databricks documentation and following the instructions.
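
For example, on macOS one of the documented options is Homebrew, using Databricks' official tap (on other operating systems, follow the installation method in the documentation):

brew tap databricks/tap
brew install databricks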

Since my focus is on automation, I recommend the token-based approach for authentication. This approach is particularly useful for CI/CD processes because it allows scripts and automation tools to access your Databricks environment securely without manual intervention from any developer.

I configured these environment variables locally, since this mirrors how a CI/CD pipeline authenticates (see the export example after the list):

  • DATABRICKS_CLI_PATH: Path to Databricks CLI
  • DATABRICKS_HOST: URL to my workspace
  • DATABRICKS_TOKEN: My personal access token (PAT)
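
As a minimal sketch, assuming a bash-like shell (the values below are placeholders; in a CI/CD pipeline these would come from the platform's secret store instead):

export DATABRICKS_CLI_PATH=/usr/local/bin/databricks
export DATABRICKS_HOST=https://<your-workspace>.cloud.databricks.com
export DATABRICKS_TOKEN=<your-personal-access-token>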

Service Principals

A service principal (SP) is a non-human identity configured in a system to perform automated tasks. These identities are not interactive and exist only to execute processes or scripts that require authentication within a system.

For proper automation of CI/CD pipelines, it is considered best practice to use service principals.

A service principal is an identity that you create in Databricks for use with automated tools …. You can grant and restrict a service principal’s access to resources in the same way as you can a Databricks user.

Whether the SP is created manually or by Terraform, you can store its PAT in a key vault and retrieve it within your CD pipeline without any manual intervention.

Initialization

The first step in working with DABs is to initialize your project. This process sets up the necessary project structure and files. To begin, open your terminal and run the following command:

databricks bundle init

After executing this command, you’ll encounter a Search prompt in the console. This is part of the initialization process where you can specify the type of project template you want to use:

You can just press Enter to select a default Python project and follow the instructions.

You will be presented with several options for your project template. For those new to DABs, selecting the default Python project is a straightforward choice. Here’s how you can navigate the options:

  • Notebooks: When asked if you want to include notebooks, select No if you prefer a pure Python project without Databricks notebooks.
  • DLT Tables: Similarly, select No for DLT tables if you do not plan to use them in this project.
  • Python Project: Select Yes to create a Python project, which is suitable for a wide range of data engineering tasks.

If you want notebooks or DLTs later on, you can manually add them or create a separate project for them.
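
As an aside for automation-minded readers, the same initialization can also be scripted non-interactively by naming the built-in template and supplying answers in a config file (the file name below is mine, and the parameter names are assumptions based on the default-python template's prompts, so double-check them against your CLI version):

databricks bundle init default-python --config-file ./bundle_init_config.json

where bundle_init_config.json might contain:

{
  "project_name": "for_medium",
  "include_notebook": "no",
  "include_dlt": "no",
  "include_python": "yes"
}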

At the end it should look something like this:

Understanding the Project

After completing the initialization, your project directory will include several folders and files. Something like this:

For this guide, you don't need the fixtures and scratch folders. They are created by default but aren't necessary for most projects; personally, I have never used them. Let's delete them for now.

At this point, we have successfully initialized a project and we are ready to start working.

The Wheel Package

In case you are not yet familiar with a setup.py file, its purpose is to build and distribute Python packages. It contains essential information about the package, such as its name, version, dependencies, and instructions for installation.

Every Python package found in this directory will be built into the wheel. For bundles, every function referenced in entry_points can become the entry point of a workflow, meaning you can have multiple jobs/workflows in one project.

I made a few modifications to the default setup for the purpose of these examples:

from setuptools import setup, find_packages

PACKAGE_REQUIREMENTS = [
    'numpy>=1.18',  # Example dependency
    'pandas>=1.0',  # Another example dependency
]

setup(
    name="for_medium",
    version="1.0.0",
    author="Rafael Escoto",
    description="wheel file based on for_medium/src",
    packages=find_packages(where='.', exclude=["tests", "tests.*"]),
    entry_points={
        "packages": [
            "entry_point_one=src.workflow_one.main:main",
            "entry_point_two=src.workflow_two.main:main",
        ]
    },
    setup_requires=["setuptools", "wheel"],
    install_requires=PACKAGE_REQUIREMENTS,
)

databricks.yml

This file sits at the root of your project. It serves as the central configuration file for your bundle. It contains settings related to your Databricks workspaces and how you want your bundle to be deployed.

It includes references to other essential components of the bundle, such as your job configurations found in the ./resources/ directory.

To simplify this demonstration, we can configure databricks.yml with a single target dev and add a variable just to demonstrate how to pass parameters down:

bundle:
  name: for_medium

variables:
  developerName:
    default: Rafael Escoto # default developer name

include:
  - resources/*.yml

targets:
  dev:
    default: true
    workspace:
      root_path: /Shared/.bundles/${bundle.target}/${bundle.name}

If you compare this file to the default one, you will notice that not only did we add a variable, but we also removed the host parameter from the workspace configuration. That is because, as mentioned above, the host is supplied through the DATABRICKS_HOST environment variable, which lets us leverage environment variables and remove hardcoded parameters where possible.
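
When you eventually need more than one environment, additional targets can be declared in the same file. The following is only a sketch: the prod workspace URL and service principal application ID are placeholders, and production deployments are typically run as a service principal rather than a user:

targets:
  dev:
    default: true
    workspace:
      root_path: /Shared/.bundles/${bundle.target}/${bundle.name}
  prod:
    mode: production
    workspace:
      host: https://<your-prod-workspace>.cloud.databricks.com
      root_path: /Shared/.bundles/${bundle.target}/${bundle.name}
    run_as:
      service_principal_name: <service-principal-application-id>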

Writing Jobs

The out-of-the-box template code uses Databricks Connect. Databricks Connect is out of scope for this conversation, so we will use a much simpler example for this guide, since we are focusing on bundles rather than on the code being executed or how it is executed.

Let’s create the two modules ./src/workflow_one and ./src/workflow_two to simulate two workflows, and a third module ./src/custom_logger for a logger class to simplify logging.

Let’s write two main.py scripts, one for each workflow, with very simplistic code. Note that I will be using the same code for both functions:

import sys

from src.custom_logger.logger import Logger


logger = Logger.get_logger(__name__)


def main():
    # The job passes '${var.developerName}' as the first positional argument.
    developerName = sys.argv[1]
    logger.info(f"Logging from Workflow One for: {developerName}...")


if __name__ == "__main__":
    main()

This is the same Python code you could write in the notebooks section of any Databricks workspace, meaning you can make these scripts as complicated as you need, just as if you were working in your workspace, with the added benefits of simple modularity, reusability, and CI/CD.
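
Because these entry points are plain Python functions, they are also easy to unit test before a bundle ever reaches Databricks. Here is a minimal sketch, assuming pytest and the layout above (the test name and sample value are mine, not part of the template):

import sys

from src.workflow_one.main import main


def test_main_logs_developer_name(monkeypatch, caplog):
    # Simulate the positional parameter the job passes to the entry point.
    monkeypatch.setattr(sys, "argv", ["main", "Test Developer"])
    with caplog.at_level("INFO"):
        main()
    assert "Test Developer" in caplog.text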

And as mentioned above, here is a sample logger class I often use for clear log messages:

import logging
import sys


class Logger:
    @staticmethod
    def get_logger(name):
        logger = logging.getLogger(name)
        handler = logging.StreamHandler(sys.stdout)
        formatter = logging.Formatter(
            "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
        )
        handler.setFormatter(formatter)
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)

        return logger

At this point, the ./src directory has two modules ready to be executed as workflows, plus the logger module we just discussed:

Configuring Jobs

Job configurations are typically stored in ./resources/. This directory contains YAML files that specify the settings and parameters for the resources your workflows need, such as jobs, pipelines, and experiments.

Technically, you can lump all your resources into one file. But for clear separation of concerns and ease of maintenance, it is better to give every item's configuration its own file. The result is that each resource has its own YAML file in ./resources/.

Let’s add two files in this directory: job_one_config.yml and job_two_config.yml. (There’s no need for any specific naming convention. If you look back at databricks.yml, the configured rule is resources/*.yml.)

Keeping each workflow's configuration in its own file keeps resources structured and organized, and makes them easier to manage.

Now, let’s go over the minimum configuration to describe a job and the required compute. For an exhaustive list of possible configurations, refer to the full list of configurations. As with the Python code, I will be using the same configuration for the workflows (adjusting any naming and the entry_point, of course):

resources:
  jobs:
    job_one:
      name: job_one # Shown as the job name in the UI

      tasks:
        - task_key: job_one_task # Shown as the task name in the UI
          job_cluster_key: job_one_cluster
          python_wheel_task:
            package_name: for_medium
            entry_point: entry_point_one
            parameters: ['${var.developerName}']
          libraries:
            # By default we just include the .whl file generated for the for_medium package.
            - whl: ../dist/*.whl

      job_clusters:
        - job_cluster_key: job_one_cluster
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: i3.xlarge
            custom_tags:
              ResourceClass: SingleNode
            num_workers: 0

As you can see, this configuration makes reference to the Python package and points to the script we want to execute when the job runs. package_name and entry_point are defined in the setup.py; here we are just referencing them.

Deploying

At this point, it is possible to deploy this bundle containing two jobs. Let’s run:

databricks bundle deploy
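
A couple of optional flags are worth knowing about (the values below are just examples): -t/--target selects which target to deploy to, and --var overrides a bundle variable at deploy time. You can also run databricks bundle validate first to catch configuration errors early:

databricks bundle validate
databricks bundle deploy -t dev
databricks bundle deploy -t dev --var="developerName=Someone Else"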

If all goes as planned, you should get something like this:

Recap

So far, we have done a number of configurations:

  • Project Creation: Ran databricks bundle init
  • Configured Databricks: Configured databricks.yml
  • Configured Jobs: Configured jobs in ./resources/<some_job>.yml
  • Wrote Job Code: Wrote the code for our Python modules
  • Deployed Jobs: Ran databricks bundle deploy

Inspect Results

Now let's inspect the created jobs. Open the Workflows section in Databricks, and you should find the two newly created jobs.

Let’s go inspect the tasks in job_one. Everything should match the configuration defined in ./resources/job_one_config.yml.

As expected, everything looks in order. Let’s run Job One:

And now Job Two:

With two jobs deployed from the command line, each configured with its own YAML config file, we can conclude this guide.
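
As a side note, deployed jobs can also be triggered from the same terminal using the resource key defined in ./resources/, which is handy for smoke testing (this assumes the default dev target):

databricks bundle run job_one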

Next Steps

A completely automated deployment process requires not only the project setup discussed in this article, but also continuous delivery set up in your code repository. Setting up your CI/CD pipelines is the other half of this process, which wasn't covered here in detail. The specifics vary depending on your Git provider (for example, GitHub, Azure DevOps, or Bitbucket).
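
To give a rough idea, here is a minimal sketch of what a deploy pipeline could look like on GitHub Actions, assuming the workspace URL and a token are stored as repository secrets; the workflow name, trigger, and secret names are assumptions, not part of the project above:

name: deploy-bundle

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - name: Deploy bundle to dev
        run: databricks bundle deploy -t dev
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}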

I recommend looking at how to do the following:

  • Generate service principal tokens automatically using Terraform
  • Have Terraform store these tokens in a key store
  • Pull these tokens from the key store during your CD process

These tasks essentially remove the need for any manual authentication, and ensure that the owner of your Databricks resources is always a service principal and not a user.


Rafael Escoto
Slalom Build

Data Architect @ Slalom Build. I help build data solutions and love technology. Thoughts shared are my own.