Plotly
Published in

Plotly

Building Plotly Dash Apps on a Lakehouse with Databricks SQL

Plotly on Databricks Blog Series — Article #1

Authors: Cody Austin Davis (Solutions Architect, Databricks), Hannah Ker (Solutions Architect, Plotly), contributions from Tammy Do & Daniel Anton Suchy

Dash with Databricks

TL/DR — For building Plotly Dash apps on Databricks, the integration process is identical to any data warehouse. Use the databricks-sql python connector (DBSQL) to create a jdbc/odbc connection to a DBSQL endpoint, or use an ORM such as SQLAlchemy. [Both examples will be covered in this blog.] Check out the Github repo & live app!

Overview

Python developers, data scientists and others wanting to connect a Plotly Dash analytics web application front end to a Databricks back end will be well-served by the Databricks SQL connector for Python (DBSQL) which deepens the integration potential between Databricks and Plotly to bring best-in-class interactive, flexible, scalable apps to the analytics community.

This DBSQL client connector allows developers to run queries on Databricks SQL Endpoints, which are highly performant clusters on Databricks, and which are specifically designed for data-warehousing style workloads that have high concurrency and low latency SLAs for SQL-based queries.

This library allows Plotly Dash apps to very quickly run and retrieve results of SQL queries that are executed on a Databricks SQL endpoint, thus allowing Databricks customers to extend and maximize their use of the platform for a wide range of analytics data app use cases, such as:

  • Data warehouse use cases, connecting a SQL Endpoint like any data warehouse
  • ORM (e.g. SQLAlchemy) integrations to perform advanced use cases and create more intuitive python code pip install sqlalchemy-databricks
  • Advanced waterfall filtering/visualizations
  • Dynamic sliders/filters/dependent visuals in a Dash app
  • For pushing complex/expensive SQL transformations to the Databricks SQL Photon engine for record-breaking performance
  • Real-time streaming dashboards ⁠ — using DBSQL with Dash’s dcc.interval component to live stream updates to your Dash app

[NB: Note that for the above, legacy back-end architectures historically operationalized to integrate Plotly Dash are greatly simplified (and costs reduced) to accommodate such use cases (e.g. avoiding external caching architecture work-arounds by leveraging Delta Caching natively in DBSQL)]

Details

We will walk through building a simple database on Databricks SQL using a sample dataset shipped by Databricks, so you can run the SQL in any Databricks environment. After the database has been created, we will create a Dash app built on top of a Databricks SQL endpoint to deliver and host a full-stack data application.

Feel free to follow along and re-create as you go through the blog by cloning the Git project.

Prerequisites to run this Repo

  1. A Databricks workspace with DatabricksSQL enabled (DBSQL enabled in Premium Workspaces or above)
  2. A DBSQL endpoint or Databricks cluster with 9.1 LTS or higher (data engineering cluster)
  3. A personal access token in Databricks to authenticate to a SQL Endpoint via API
  4. A Python development environment (>=v 3.8). We recommend VSCode for a local IDE and using conda or virtual env to manage dependencies, as well as black to automatically format your code.

High-Level Steps

  1. Spin up a DatabricksSQL (DBSQL) endpoint on either classic or serverless
  2. Copy and paste the SQL code under utils/BuildBackendIoTDatabase.sql into the DBSQL Query Editor and run it. (Note: You can also run this code in a notebook directly from an imported Repo in Databricks.)
  3. Clone the Git repo above into your local IDE.
  4. Install dependencies wherever you are running the Dash app with pip install -r requirements.txt in your IDE.
  5. Set environment variables of HOST_NAME, HTTP_PATH, ACCESS_TOKEN from your Databricks cluster. You can find this by selecting the SQL endpoint and clicking the “Connection Details” tab in the endpoint UI.
  6. Run your Dash app on a local server by running python app.py

Building the Back End to the Dash App

First, we are going to create a database in Databricks that we will then build a Dash app on top of. Before you get started, ensure that you have the permissions to create a database in your Databricks SQL environment. We will utilize the databricks-datasets repository of sample data sets that come with Databricks. This data comes in JSON and CSV format, so we can simply run a simple COPY INTO statement to incrementally read in raw data, and insert into a Delta Table. This ETL script will read in IoT data that represents Smartwatch reading from users. Just for fun, and to implement realistic database behavior, we can assume there may be updates to the raw data, so we can end the pipeline with a MERGE INTO statement to handle upsert logic as shown below. We will run this process to create 2 final tables: silver_sensors and silver_users. The ‘silver’ naming here indicates that the data has undergone some basic cleaning and is ready for BI Analytics.

Step 1 — Create Table

Step 2 — Incrementally read in raw data with COPY INTO

Step 3 — Upsert Data into final tables called silver_sensors

Step 4 — Select from the database

Set Up Plotly UI

Now that we have created our database, we can start building the Plotly Dash app. For this process, we go to our local IDE. The app.py file is the entry point to this application and contains Dash-specific code for the app layout and interactivity.

Dash app foundations

The app’s layout is defined through a series of nested components (such as dcc.Graph or html.Div) that allow you to create a fully configurable UI. See examples in these docs for more details.

Interactivity is added to your Dash app through callback functions. Callbacks are triggered on specific user actions (such as a button click) and can output changes to elements in your app’s layout. As you’ll see in this example, callbacks are used to trigger data retrieval functions and update the data displayed in the charts when a user selects a dropdown item. You’ll also notice that we load pre-filled filter values from Databricks in a dbx_utils.get_listofusers() function. This is one of many ways we can pass values into the Dash app from Databricks SQL.

Generate UI (Code)

Write Database Connection Functions

Before we set up our callback functions for our Dash app, we first need to write the code that interacts with our database. With our databricks-sql connector, or SQLAlchemy, we can write reusable functions to retrieve the databricks from DBSQL, such as the get_listofusers function above. It is defined in our dbx_utils library as follows:

Set variables

Import connector

Connect

We also generate functions to load a pandas dataframe of results to be put into a chart and served in the app with 2 functions as follows:

Get data from Databricks SQL Endpoint

Pass data into a function to generate chart in Plotly

So we first define a function to retrieve the results from Databricks, and then pass those results into a chart generation function handled by Plotly Express.

Bonus Features: SQLAlchemy + Real-time Apps

SQL Alchemy: In addition, you can also retrieve results using SQLAlchemy or pandas like you would any other data warehouse. You can run pip install sqlalchemy-databricks to register the dialect with SQLAlchemy and run queries with an ORM, as seen in this GitHub gist.

Real-time Apps

If you have real-time data flowing into your DatabricksSQL Data Lakehouse (either via Structured Streaming or Delta Live Tables), you can easily make your Dash App automatically poll for new data to keep the dashboard automatically updated and reactive to real-time data changes. This creates an especially powerful combination with Databricks and Plotly by combining the highly efficient Photon query engine on Databricks with the dcc.interval component on Plotly to create a lean and snappy real-time dashboard, even on extremely large data sets.

To make your Dash app real-time, you can easily combine a dcc.interval component with a callback in your app, and let Databricks take care of the rest.

Step 1 — Add dcc.interval component to app.layout

Step 2 — Add callback trigger

Use Functions in Callbacks:

Lastly, to make these functions interactive, we simply wrap them with a callback decorator and assign the inputs/output to the Dash components in the app.layout portion of the app. In the above example, we generated a scatter plot of our data with dynamic axes. Below is the callback function that ties all functions together and binds them to the UI:

This callback function lives in app.py and now waits for changes to filters/buttons in the Dash components called scatter-x and comparison, which are selectors that make the user able to compare metrics in the UI dynamically! In this example, the user can look at a risk score of a cohort of users and identify correlations with independent variables. In the example below, we show a comparison of a cohort of users of their risk score by their weights and cholesterol readings.

Run the App

Now we can finally tie it all together! Running python app.py in your terminal will start up a local server where you can see the finished product. This version of your app will also automatically update as you make changes to your code. When you’re happy with the app, you can deploy it (e.g. using Plotly’s Dash Enterprise offering) for others on the web to see!

Sample App Live Demo

Next steps

We loved talking about this at the Databricks Data + AI Summit 2022. By leveraging Plotly Dash and Databricks, the possibilities are endless. So, tell us what you think in the comments. Questions? Email info@plotly.com.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store