Databricks SDK + Plotly Dash — the easiest way to get Jobs done

Published in Plotly · 8 min read · Nov 9, 2023

📌 Plotly on Databricks Blog Series — Article #6 📌

Author: Cal Reynolds (Plotly)
Contributors: Cody Austin Davis (Databricks), Sachin Seth (Lakeside Analytics), Dave Gibbon (Plotly)

Resources: GitHub Repository

TL;DR — Dash can be used as a front-end platform for the Databricks SDK, and in turn, the Jobs API. Here, we run a demand forecasting notebook on Databricks from the comfort of a Dash web application built with Python.

Content Plug

If you haven’t already, check out our Plotly Dash / Databricks webpage for more interesting use cases, points of inspiration, and joint customer stories.

We encourage you to hear directly from our customers (Molson Coors, Collins Aerospace, Holmusk, Ballard Power Systems, and more).

Introduction

In my role at Plotly, I assist organizations of all sizes and industries in addressing their pressing data challenges. A common theme I observe is that teams are searching for a means of empowering technical and non-technical users with the full force of the data stack and computational power that they have invested heavily in.

A standout solution to this challenge is the combination of Databricks and Plotly’s Dash. Much like a peanut butter and jelly sandwich, this duo epitomizes the perfect blend, uniting the best of both worlds in the data ecosystem.

In this partnership:

  • Databricks provides the platform — a unified plane where data professionals can harness vast computational power, leverage and train state-of-the-art ML models, store significant volumes of business-critical data securely, and much more.
  • Plotly’s Dash library and Dash Enterprise provide a sleek front-end interface — a control panel designed such that even the least technically-inclined users can derive meaningful insights and value from their data, models, and computational resources.

Background

The Databricks SDK is a relatively new offering that gives users the capability to interact with the platform in a multitude of ways through Python. It consolidates a wide range of features, such as manipulating DBFS, running scripts and notebooks, and managing jobs. In this article, our primary focus will be its interface with the Databricks Jobs API.

The Databricks Jobs API (aptly named) allows end-users to kick off jobs that can manipulate, run, or create data workflows. In this article, we leverage the Jobs API to run a machine learning workflow in a Databricks notebook, all from the front-end comfort of a Plotly Dash app.

In the end, although this article’s demo is short and sweet in terms of lines of code, its applications are vast. By integrating the Jobs API with a Plotly Dash app, your team stands to benefit in several ways:

  • Reduction in manual coding work over time.
  • Mitigation of scope creep and reduced model recompilation requests from non-technical teams.
  • Productionization of Databricks notebook workflows via a full-stack web application interface.

To get your hands on this demo

Check out this article’s associated GitHub Repository to get started. In it, you will find a comprehensive guide for installation in the README.md.

This project utilizes Databricks sample data that is included with every Databricks workspace, so it should run out of the box.

If you have any questions whatsoever or run into issues with its installation, please feel free to open an issue in the GitHub Repository.

The Project Walkthrough

To showcase the integration between Plotly Dash and the Databricks SDK and Jobs API, I created a mock forecasting notebook using Databricks’ retail sample dataset (named “retail-org”). The primary objective of this notebook is to forecast the number of product units required for future stocking, based on historical demand patterns.

Below, I’ve outlined the component architecture of this demo:

Here’s a step-by-step description of the workflow, where 1, 2, and 3 below correspond to the arrows in the above visual:

  1. Our Dash application, utilizing the Databricks SDK, initiates a call on user click to the Databricks Jobs API. This call prompts the execution of our demand forecasting notebook on Databricks. Dynamic parameters sourced from our Dash app are supplied for the forecast.
  2. Once the forecast is completed within the notebook, the results are visualized using a Plotly scatter chart. This scatter chart, crafted in the Databricks notebook, is then exported as a JSON file to Databricks File Storage (DBFS).
  3. With the forecasting process finalized and the Plotly graph stored in DBFS, our Dash app retrieves the graph from Databricks. This visualization, displayed within the app, is not only fully interactive but also retains the precise styling of the original graph from the Databricks notebook.

The Nitty Gritty Details

If you aren’t interested in delving into detailed code samples and a deeper exploration, feel free to skip to the “The Nitty Gritty Details: TL;DR” section below.

In the following section, we’ll spotlight three essential facets of this workflow, supplemented with code snippets.

  1. Notebook parameterization

Utilizing the dbutils library in Databricks, users can create interactive fields in their notebooks. These fields not only allow notebook users to re-parameterize their models easily, but they also open the door for Dash applications to supply dynamic parameters directly to Databricks.

This dynamic parameterization is crucial to Dash’s ability to reduce the notebook maintainer’s workload over time. Instead of having to manually recompile and re-parameterize their notebook each time they get a new request from the business, that notebook creator can serve the Dash app with dynamic inputs directly to other teams.

— Below, observe how you can use the dbutils.widgets.text() command to create text input fields in a Databricks notebook. When a user hits “Run Job” in the below Dash application, Dash will not just run the Databricks notebook, but also provide whatever inputs a user has entered as parameters.

— You can then utilize these dynamic parameters throughout your notebook for filtering data or assigning variables with the get() command:
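Put together, the pattern looks something like the following sketch of a Databricks notebook cell. The widget names `us-state` and `forecast-forward-days` match the parameter keys the Dash app sends in this project; the default values here are illustrative assumptions:

```python
# In the Databricks notebook: declare text widgets so the notebook can be
# parameterized interactively or via the Jobs API's base_parameters.
dbutils.widgets.text("us-state", "CA")
dbutils.widgets.text("forecast-forward-days", "30")

# Read the values supplied by the Dash app (or the defaults when run by hand).
us_state = dbutils.widgets.get("us-state")
forecast_days = int(dbutils.widgets.get("forecast-forward-days"))
```

Note that widget values always arrive as strings, so numeric parameters like the forecast horizon need an explicit cast before use.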

2. Databricks SDK in a Dash App for running Jobs

In this subsection, we cover how to leverage the Databricks SDK from our Dash app in a callback function. Callback functions in Dash allow applications to dynamically run Python functions based on user input from the UI. We will break down the callback function in this project into logical blocks in the following paragraphs:

@callback(
    Output("loading-form", "children"),
    Output("forecast-plot", "children"),
    State("state-dropdown", "value"),
    State("forecast-forward-days", "value"),
    Input("jobs-api-button", "n_clicks"),
    prevent_initial_call=True,
)
def invoke_jobs_api(state, forecast_days, n_clicks):
    # Null check - prevents the app from kicking off a job via Dash
    # hot-reloading during development.
    if n_clicks == 0:
        return no_update, no_update

    # Initialize the Databricks SDK client. It searches for cluster
    # credentials in ~/.databrickscfg.
    w = WorkspaceClient()

    # Spin up the cluster if it's down. If the config isn't working
    # correctly, output an error message.
    try:
        w.clusters.ensure_cluster_is_running(os.environ["DATABRICKS_CLUSTER_ID"])
    except Exception:
        print(
            "Your connection to Databricks isn't configured correctly. "
            "Revise your ~/.databrickscfg file."
        )

— First, we instantiate the Databricks SDK by initializing the WorkspaceClient() object. This object looks for Databricks credentials in the ~/.databrickscfg file on the user’s machine.
— Next, we call the ensure_cluster_is_running(cluster_id) method to spin up the specified Databricks cluster (whose ID is specified in the project’s .env file).
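For reference, a minimal ~/.databrickscfg looks something like the following sketch; the host URL and token shown are placeholders, and yours will differ:

```ini
[DEFAULT]
host  = https://your-workspace.cloud.databricks.com
token = <your-personal-access-token>
```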

# Pass parameters from Dash into the Databricks notebook.
params_from_dash = {"us-state": state, "forecast-forward-days": forecast_days}

# Location of the Databricks notebook on your Databricks instance.
# Replace notebook_name with your notebook's name.
notebook_path = f"/Users/{w.current_user.me().user_name}/" + notebook_name

# Configure our Databricks job to run the forecasting notebook as a single task.
created_job = w.jobs.create(
    name=f"sdk-{time.time_ns()}",
    tasks=[
        jobs.Task(
            description="Run Jobs API Notebook",
            existing_cluster_id=os.environ["DATABRICKS_CLUSTER_ID"],
            notebook_task=jobs.NotebookTask(
                notebook_path=notebook_path, base_parameters=params_from_dash
            ),
            task_key="test",
            timeout_seconds=0,
        )
    ],
)

# Run the Databricks job on your cluster and wait for it to finish.
w.jobs.run_now(job_id=created_job.job_id).result()

— After that, we create and run the job. To do so, we define the job with the Databricks SDK’s jobs.create() command, passing in the path to the notebook on our Databricks workspace along with our dynamic Dash parameters to feed into it, and then kick it off with run_now().

— At this point, we can navigate to the “Job Runs” tab in the Databricks navigation bar to verify that the job has kicked off correctly from the Dash app. Example below:

A UI that looks something like this means that your job has succeeded!

3. Databricks SDK for file movement

In this workflow, we transfer a Plotly chart from our Databricks notebook into DBFS. Then, we transfer the widget from DBFS to be displayed in our Dash app. We therefore effectively preserve the hard work of the notebook user who put time and energy into their exploratory data analysis, modeling, and charting. Let’s dive into how this works.

— Provided the Databricks notebook included with this project runs correctly, it should output something that looks like the following:

Note: this forecast will look different if you utilize different input parameters to the notebook.

— In this case, since we spent many lines of code in our notebook making our forecast scatter chart look professional, we want to preserve that work when we move it to our Dash app.
— So, to transfer the above widget to our Dash application with styling preserved, we will directly write it to DBFS using the following code in our Databricks notebook.
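That notebook cell can be approximated by the sketch below. It assumes `fig` is the styled Plotly forecast figure built earlier in the notebook, and it writes to /tmp/forecast_plot.json, the same DBFS path the Dash callback reads later:

```python
# Serialize the styled figure (traces, layout, template and all) to JSON,
# then write it to DBFS so the Dash app can fetch it after the job finishes.
fig_json = fig.to_json()
dbutils.fs.put("/tmp/forecast_plot.json", fig_json, overwrite=True)
```

Because to_json() captures the full figure specification, the chart re-created in Dash is pixel-for-pixel the one styled in the notebook.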

— Once the JSON is written to DBFS successfully, Dash will be waiting to ingest the Plotly widget JSON file.
— Then, we decode the JSON and return it from our callback function to our application’s layout as a Plotly graph object, as seen below:

fig_bytes = w.dbfs.read("/tmp/forecast_plot.json")

# Extract the content from the response.
content = fig_bytes.data

# Decode the base64-encoded byte content to get a JSON string.
decoded_content = base64.b64decode(content).decode("utf-8")

# Clean up: delete the one-off job now that it has completed.
w.jobs.delete(job_id=created_job.job_id)

# Load the decoded content into a Python dictionary.
fig_data = json.loads(decoded_content)

# Convert the dictionary to a Plotly Figure.
fig = go.Figure(fig_data)

return no_update, dcc.Graph(figure=fig)

The Nitty Gritty Details — TL;DR:

  1. Notebook parameterization enables Dash to provide dynamic inputs from application to notebook.
  2. Callback functions in Dash enable the web application to kick off Python code, notably via the Databricks SDK.
  3. The Databricks SDK helps in both our notebook and within our Dash app to move a fully styled Plotly visualization from one to the other, reducing manual transfer work.

Conclusion

Dash can serve as an effective control panel for the Databricks SDK, and in turn significantly reduce manual work for data engineers and scientists by empowering them and their teams with dynamic, interactive, full-stack Python data applications.

That was a mouthful. If you remember anything from this series, it should be this:

Plotly Dash and Databricks go together like PB&J — they’re better together!

Thank you for taking the time to read this article. If you have any questions or want to talk about how this workflow may apply to your team’s needs, please reach out to info@plotly.com.
