Demystifying Databricks Deployment Strategies: The Ultimate Guide for Novices

Introduction

Yashvi Shah
Epsilon Engineering Blog
7 min read · Jan 16, 2024

With the global growth of data-centric businesses, the volume and complexity of data continue to grow exponentially, creating a need for advanced tools and technologies to automate data engineering work. Databricks addresses these requirements as a powerful, reliable, and scalable cloud-based data engineering platform for analyzing, processing, and transforming huge volumes of data. At the heart of Databricks lies a deeply optimized Spark engine that provides low-latency processing and high throughput. Databricks abstracts away the complexity of setting up a distributed cluster infrastructure, allowing us to focus on generating value from our data.

This blog shares insights on getting started with Databricks, along with step-by-step instructions for using the Databricks SDK to accelerate deployments and automate common tasks such as:

1. Managing workspaces

2. Developing and executing notebooks

3. Provisioning jobs and clusters

4. Establishing access permissions

5. Configuring secrets

Databricks platform

Databricks SDK Setup

The Databricks SDK (Software Development Kit) is a library that provides a high-level abstraction over the Databricks REST APIs, letting us interact with Databricks programmatically across a wide range of operations. The REST APIs exposed by the platform serve the same goal, but require crafting raw HTTP requests for resource provisioning.
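
To make the contrast concrete, here is a small sketch comparing a raw REST call with the equivalent SDK call; the endpoint path and the listing operation are illustrative and may differ across API versions:

# Raw REST API: you build the HTTP request, headers, and response parsing yourself.
import requests

host = "<databricks-url>"
token = "<your token>"
resp = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
clusters = resp.json().get("clusters", [])

# SDK: the same operation as a typed, one-line call.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient(host=host, token=token)
clusters = list(w.clusters.list())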

The requirements to execute these scripts either locally or via Databricks notebooks are:

1. A Databricks Premium account

2. An AWS account for the underlying storage and compute

3. Python 3.8 or higher for local development

4. Install the SDK package: pip install databricks-sdk

Authentication

The SDK provides a convenient way to authenticate with Databricks using Personal Access Tokens. The Community Edition does not support this feature, but this tutorial can be followed with the 14-day free trial of the Premium version. Databricks recommends using a token associated with a service principal for production applications; however, for the sake of simplicity, we will generate a token for our own Databricks user. Here are the steps:

1. Login to Databricks

2. Go to the User Settings tab

3. User -> Developer tab

4. Access tokens -> Generate a new token with a configurable expiry time.

5. Store the token securely

Code Configuration

Throughout this blog, we will use this workspace client, which is configured with the authorization token. The code outputs will vary depending on the user's access level.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient(
    host="<databricks-url>",
    token="<replace with your token here>",
)
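
Hard-coding the token works for a quick demo, but the SDK's unified authentication can also resolve credentials from the environment, which keeps tokens out of source code. A minimal sketch, assuming the standard DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are set:

import os

from databricks.sdk import WorkspaceClient

# Assumes DATABRICKS_HOST and DATABRICKS_TOKEN are exported in the shell
# (set here only for illustration).
os.environ.setdefault("DATABRICKS_HOST", "<databricks-url>")
os.environ.setdefault("DATABRICKS_TOKEN", "<your token>")

w = WorkspaceClient()  # host and token are resolved from the environment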

Notebook Creation

A notebook is a UI-based runnable file that contains code, data visualizations, and runtime-generated outputs. Notebook cells may contain Bash scripts, file system commands, SQL, Python, R, or Scala via magic commands. For this demo, we will create a simple Python notebook.

import base64
from databricks.sdk.service.workspace import ImportFormat, Language

content = base64.b64encode("print('Hello world!')".encode("utf-8")).decode("utf-8")
w.workspace.import_(
    "/Test.py",
    content=content,
    language=Language.PYTHON,
    format=ImportFormat.SOURCE,
    overwrite=True,
)
Sample Python notebook
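
To confirm that the import worked, the notebook can be read back through the workspace export call; a short sketch, where the decoding mirrors the base64 encoding used above:

import base64

from databricks.sdk.service.workspace import ExportFormat

exported = w.workspace.export("/Test.py", format=ExportFormat.SOURCE)
print(base64.b64decode(exported.content).decode("utf-8"))  # print('Hello world!')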

Workspace Management

The workspace is where we manage all Databricks artifacts. CRUD (Create, Read, Update, Delete) operations such as listing folders and the files inside them, creating notebooks, and deleting notebooks are performed using the WorkspaceAPI class. Here is a sample that lists all objects under the root path:

for f in w.workspace.list("/"):
    print(f.path)
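
The same class covers the remaining CRUD operations; a brief sketch, where the /Demo folder path is only illustrative:

# Create a folder, list its contents, then remove it again.
w.workspace.mkdirs("/Demo")
for obj in w.workspace.list("/Demo"):
    print(obj.object_type, obj.path)
w.workspace.delete("/Demo", recursive=True)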

Job Scheduling with Cluster Creation

Jobs define the execution flow of data processing and computing applications. A job can consist of a notebook, a query, a script, and so on, and can be used to automate repetitive tasks. Job triggers, schedules, identification tags, email notifications, automatic retries, timeouts, parameters, and several other settings are available.

Cluster management follows the self-service Databricks model. A cluster definition must be supplied when creating an on-demand, one-time-run job. Here, a cluster is spun up for the execution of the job and terminated upon its completion. The cluster can be linked with an instance profile ARN to access AWS resources according to the permissions in that IAM role. Continuing our example, we will run the notebook created above using a job. The method returns a run_id, which is used to track the status of the job.

import time

from databricks.sdk.service import jobs, compute
from databricks.sdk.service.jobs import RunLifeCycleState


def create_cluster():
    job_cluster_info = {
        "num_workers": 2,
        "spark_version": "13.3.x-scala2.12",
        "spark_conf": {},
        "aws_attributes": {
            "first_on_demand": 1,
            "availability": "SPOT_WITH_FALLBACK",
            "zone_id": "us-east-1a",
            "spot_bid_price_percent": 100,
            "ebs_volume_count": 0,
        },
        "node_type_id": "m5d.large",
        "driver_node_type_id": "m5d.large",
        "custom_tags": {
            "process": "Demo-Notebook",
        },
        "spark_env_vars": {},
        "enable_elastic_disk": False,
        "cluster_source": "JOB",
        "init_scripts": [],
        "enable_local_disk_encryption": False,
        "data_security_mode": "SINGLE_USER",
        "runtime_engine": "STANDARD",
    }
    return compute.ClusterSpec.from_dict(job_cluster_info)


job = w.jobs.submit(
    run_name="TestTask",
    tasks=[
        jobs.SubmitTask(
            new_cluster=create_cluster(),
            notebook_task=jobs.NotebookTask(
                notebook_path="/Test.py",
                base_parameters={"env": "Test-Env"},
            ),
            task_key="TestTask",
        )
    ],
)
run_id = job.run_id
print(f"Job submitted with run_id: {run_id}")

# Poll until the run leaves the non-terminal states (PENDING/RUNNING/TERMINATING).
current_status = w.jobs.get_run(run_id=run_id).state.life_cycle_state
while current_status in (
    RunLifeCycleState.PENDING,
    RunLifeCycleState.RUNNING,
    RunLifeCycleState.TERMINATING,
):
    print("Job is in progress")
    time.sleep(120)
    current_status = w.jobs.get_run(run_id=run_id).state.life_cycle_state
print("Job completed")
Job run status monitoring
Job details
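
Recent SDK versions also return a blocking waiter from the submit call, which can stand in for the manual polling loop above; a hedged sketch, assuming the waiter API is available in your SDK version:

import datetime

# result() blocks until the run reaches a terminal state or the timeout expires.
waiter = w.jobs.submit(
    run_name="TestTaskWaiter",
    tasks=[
        jobs.SubmitTask(
            new_cluster=create_cluster(),
            notebook_task=jobs.NotebookTask(notebook_path="/Test.py"),
            task_key="TestTaskWaiter",
        )
    ],
)
run = waiter.result(timeout=datetime.timedelta(minutes=30))
print("Job finished with state:", run.state.result_state)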

View Reports

Databricks provides exhaustive job progress logs and metrics monitoring. After a job completes, we can download the run report using the following script. Our job takes around 8–10 minutes to finish.

exported_view = w.jobs.export_run(
    run_id=run_id, views_to_export=jobs.ViewsToExport.ALL
)
with open("Report.html", "w") as f:
    f.write(exported_view.as_dict()["views"][0]["content"])
print("Report generated")
HTML job run report

Grant Permissions

By leveraging the access control list (ACL) capabilities of Databricks, fine-grained permissions can be applied to all resources, such as clusters, tables, databases, secrets, and notebooks. Specific users and groups can be created to restrict access on a need-to-know basis. For this demo, let us update the permissions of our job by making it read-only for all users:

from databricks.sdk.service.iam import AccessControlRequest

run_details = w.jobs.get_run(run_id=run_id)
job_id = run_details.job_id

acl = [
    AccessControlRequest.from_dict(
        {
            "group_name": "users",
            "permission_level": "CAN_VIEW",
        }
    )
]
permissions = w.permissions.update(
    request_object_type="jobs",
    request_object_id=job_id,
    access_control_list=acl,
)
print("ACL updated for the job", permissions)
Permissions set to the job
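
The effective permissions can be read back to verify the change; a short sketch using the same permissions API:

current = w.permissions.get(
    request_object_type="jobs",
    request_object_id=job_id,
)
for entry in current.access_control_list:
    levels = [p.permission_level for p in (entry.all_permissions or [])]
    print(entry.group_name or entry.user_name, levels)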

Additional resources

Configure Secrets

Secrets are used to manage and store sensitive information like usernames and passwords. They are securely encrypted and can be accessed programmatically from within the code. Here, we will create a secret to store the authorization token. The secret value cannot be printed; nevertheless, we will list the secret keys to verify that the secret has been stored.

scope = "TestScope"
w.secrets.create_scope(scope=scope)
w.secrets.put_secret(scope=scope, key="Token", string_value="<secret_value>")
print(w.secrets.list_secrets(scope=scope))

The output obtained is:

[SecretMetadata(key='Token', last_updated_timestamp=1700853998752)]
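
Once stored, the secret is meant to be consumed at runtime rather than printed. A brief sketch of reading it back; the in-notebook dbutils call is standard, while the SDK-side dbutils handle depends on your SDK version:

# Inside a Databricks notebook cell; printed secret values show up as [REDACTED].
token = dbutils.secrets.get(scope="TestScope", key="Token")

# Depending on the SDK version, the workspace client may expose the same utility
# for scripts running outside a notebook:
token = w.dbutils.secrets.get(scope="TestScope", key="Token")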

Error Handling/ Logging

Adopting good coding practices ensures robustness in the system. Databricks raises specific exceptions with appropriate error codes that can be caught with a try/except block. The code snippet below raises the following error: Exception occurred: Path (/TestDummy.py) doesn’t exist.

from databricks.sdk.core import DatabricksError

try:
    path = "/TestDummy.py"
    w.workspace.get_status(path)
    w.workspace.delete(path, recursive=True)
except DatabricksError as e:
    print("Exception occurred:", e)

CI/CD Integration

Continuous Integration/Continuous Deployment is a software development methodology that focuses on frequently integrating development code, several times a day, into a shared repository and continuously deploying it to production. This reduces manual errors and improves collaboration across teams. The process can be streamlined with automation tools like GoCD and Jenkins to achieve faster iteration and swifter software delivery.

Git Integration on Databricks

It is possible to clone a Git repository and perform other Git operations on Databricks. The platform also offers ways to collaborate with other users in the workspace. The UI steps are listed below, followed by a scripted alternative.

1. Go to the Repos tab under Workspace

2. Add the Git Repo URL

3. Provide the repository access credentials

4. Click on Create Repo

5. Branches, commits, and pull requests can also be created from the Databricks UI

Git integration on Databricks
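
The same repo setup can also be scripted through the SDK's Repos API; a small sketch, where the repository URL and workspace path are placeholders:

repo = w.repos.create(
    url="https://github.com/<org>/<repo>.git",
    provider="gitHub",
    path="/Repos/<user>/<repo>",
)
print("Repo created with id:", repo.id)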

End-to-end CI/CD processes, from Dev to QA to UAT to Prod, can be established using these SDK-integrated Python scripts to deploy assets to the respective Databricks environments.

Typical CI-CD flow
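
As a hedged illustration of that flow, a deployment script can read the target environment's host and token from CI variables and push the same assets through the SDK; the environment variable names and paths below are assumptions, not a fixed convention:

import base64
import os

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.workspace import ImportFormat, Language

# The CI pipeline injects per-environment credentials (Dev, QA, UAT, or Prod).
target = WorkspaceClient(
    host=os.environ["DEPLOY_DATABRICKS_HOST"],
    token=os.environ["DEPLOY_DATABRICKS_TOKEN"],
)

# Deploy the notebook source checked out from Git into the target workspace.
with open("notebooks/Test.py", "r") as src:
    content = base64.b64encode(src.read().encode("utf-8")).decode("utf-8")

target.workspace.import_(
    "/Deployments/Test.py",
    content=content,
    language=Language.PYTHON,
    format=ImportFormat.SOURCE,
    overwrite=True,
)
print("Deployed notebook to", target.config.host)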

Conclusion

Databricks offers comprehensive capabilities to handle big data workloads, both batch and real-time, with just-in-time processing. In conclusion, the Databricks SDK is a powerful tool for automating workflows, integrating with external tools, and improving productivity, whether you are a software engineer, analyst, data engineer, or data scientist. The platform provides a unified, interactive, and collaborative UI with features spanning cluster management, multi-cloud data storage connectivity, data visualization, machine learning models, built-in governance, and programming interfaces for Scala, R, Python, Go, and Java.

We covered a few of the key features of the Databricks SDK with examples, and I hope this helps you get started with Databricks.
