Building Plotly Dash Apps on a Lakehouse with Databricks SQL
Plotly on Databricks Blog Series — Article #1
📌 Questions about how Databricks and Dash work together? Reserve your spot for the live technical session on December 13.
TL;DR: For building Plotly Dash apps on Databricks, the integration process is identical to any other data warehouse: use the Databricks SQL connector for Python (`databricks-sql-connector`) to connect to a DBSQL endpoint, or use an ORM such as SQLAlchemy. [Both approaches are covered in this blog.] Check out the GitHub repo & live app, or watch the 8-minute tutorial!
Python developers, data scientists, and others wanting to connect a Plotly Dash analytics web-application front end to a Databricks back end will be well served by the Databricks SQL connector for Python (DBSQL). It deepens the integration between Databricks and Plotly, bringing best-in-class interactive, flexible, and scalable apps to the analytics community.
This DBSQL client connector allows developers to run queries on Databricks SQL endpoints: highly performant clusters on Databricks designed specifically for data-warehousing-style workloads with high-concurrency, low-latency SLAs for SQL-based queries.
This library allows Plotly Dash apps to quickly run and retrieve the results of SQL queries executed on a Databricks SQL endpoint, letting Databricks customers extend and maximize their use of the platform for a wide range of analytics data app use cases, such as:
- Data warehouse use cases, connecting to a SQL endpoint like any other data warehouse
- ORM (e.g. SQLAlchemy) integrations (via `pip install sqlalchemy-databricks`) to support advanced use cases and more intuitive Python code
- Advanced waterfall filtering/visualizations
- Dynamic sliders/filters/dependent visuals in a Dash app
- Pushing complex/expensive SQL transformations to the Databricks SQL Photon engine for record-breaking performance
- Real-time streaming dashboards, using DBSQL with Dash's `dcc.Interval` component to live-stream updates to your Dash app
[NB: For the use cases above, the legacy back-end architectures historically operationalized to integrate Plotly Dash are greatly simplified (and costs reduced), e.g. by avoiding external caching work-arounds in favor of the Delta Caching built natively into DBSQL.]
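As a minimal sketch of the first pattern, connecting to a SQL endpoint like any other warehouse (assuming `databricks-sql-connector` is installed and the connection environment variables described later in this post are set; the helper names here are our own, not part of the repo):

```python
import os


def fetch_rows(query: str):
    """Run a query against a Databricks SQL endpoint and return all rows.

    Reads connection details from the SERVER_HOSTNAME, HTTP_PATH, and
    ACCESS_TOKEN environment variables used throughout this walkthrough.
    """
    from databricks import sql  # requires `pip install databricks-sql-connector`

    with sql.connect(
        server_hostname=os.environ["SERVER_HOSTNAME"],
        http_path=os.environ["HTTP_PATH"],
        access_token=os.environ["ACCESS_TOKEN"],
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute(query)
            return cursor.fetchall()


def rows_to_records(columns, rows):
    """Shape raw cursor rows into a list of dicts for a DataFrame or Dash table."""
    return [dict(zip(columns, row)) for row in rows]
```

The connector follows the DB-API style of cursors and `fetchall()`, so code written against other warehouses ports over with little change.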
We will walk through building a simple database on Databricks SQL using a sample dataset shipped by Databricks, so you can run the SQL in any Databricks environment. After the database has been created, we will create a Dash app built on top of a Databricks SQL endpoint to deliver and host a full-stack data application.
Feel free to follow along and re-create as you go through the blog by cloning the Git project.
Prerequisites to run this Repo
- A Databricks workspace with DatabricksSQL enabled (DBSQL enabled in Premium Workspaces or above)
- A DBSQL endpoint or Databricks cluster with 9.1 LTS or higher (data engineering cluster)
- A personal access token in Databricks to authenticate to a SQL Endpoint via API
- A Python development environment (>=v 3.8). We recommend VSCode for a local IDE and using conda or virtual env to manage dependencies, as well as black to automatically format your code.
- Spin up a Databricks SQL (DBSQL) endpoint, either classic or serverless
- Copy and paste the SQL code in `utils/BuildBackendIoTDatabase.sql` into the DBSQL Query Editor and run it. (Note: You can also run this code in a notebook directly from an imported Repo in Databricks.)
- Clone the Git repo above into your local IDE.
- Install dependencies wherever you are running the Dash app with `pip install -r requirements.txt` in your IDE.
- Set the SERVER_HOSTNAME, HTTP_PATH, and ACCESS_TOKEN environment variables for your SQL endpoint. You can find the hostname and HTTP path by selecting the SQL endpoint and clicking the "Connection Details" tab in the endpoint UI; the access token is the personal access token from the prerequisites.
- Run your Dash app on a local server by running `python app.py`.
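Since a missing environment variable otherwise only surfaces as a cryptic connection error at query time, a small fail-fast check before the app starts can save debugging time. A sketch (the helper is our own addition, not part of the repo):

```python
import os

# The three connection variables listed in the setup steps above.
REQUIRED_VARS = ("SERVER_HOSTNAME", "HTTP_PATH", "ACCESS_TOKEN")


def missing_env_vars(env=None):
    """Return the names of any required connection variables that are unset."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit(f"Set these environment variables first: {', '.join(missing)}")
```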
Building the Back End to the Dash App
First, we are going to create a database in Databricks that we will then build the Dash app on top of. Before you get started, ensure that you have permissions to create a database in your Databricks SQL environment. We will utilize the `databricks-datasets` repository of sample data sets that ships with Databricks. This data comes in JSON and CSV format, so we can run a simple COPY INTO statement to incrementally read the raw data and insert it into a Delta table. The ETL script reads IoT data representing smartwatch readings from users. Just for fun, and to implement realistic database behavior, we assume there may be updates to the raw data, so we end the pipeline with a MERGE INTO statement to handle the upsert logic. This process creates two final tables: silver_sensors and silver_users. The 'silver' naming indicates that the data has undergone some basic cleaning and is ready for BI analytics.
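The pipeline's two key statements can be sketched as SQL strings, ready to paste into the DBSQL Query Editor or execute through the Python connector. The table names, source path, and join key below are illustrative stand-ins, not the repo's exact schema:

```python
# COPY INTO incrementally ingests only the source files it has not seen before.
COPY_INTO_SENSORS = """
COPY INTO bronze_sensors
FROM (SELECT * FROM '/databricks-datasets/iot-stream/data-device/')
FILEFORMAT = JSON
COPY_OPTIONS ('mergeSchema' = 'true')
"""

# MERGE INTO then upserts the freshly copied rows into the final silver table,
# updating matched records and inserting new ones.
MERGE_SENSORS = """
MERGE INTO silver_sensors AS target
USING bronze_sensors AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""
```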
Set Up Plotly UI
Now that we have created our database, we can start building the Plotly Dash app. For this process, we move to our local IDE. The `app.py` file is the entry point to this application and contains Dash-specific code for the app layout and interactivity.
The app's layout is defined through a series of nested components (such as `html.Div`) that allow you to create a fully configurable UI. See the examples in these docs for more details.
Interactivity is added to your Dash app through callback functions. Callbacks are triggered on specific user actions (such as a button click) and can output changes to elements in your app's layout. As you'll see in this example, callbacks are used to trigger data retrieval functions and update the data displayed in the charts when a user selects a dropdown item. You'll also notice that we load pre-filled filter values from Databricks with a `dbx_utils.get_listofusers()` function. This is one of many ways we can pass values into the Dash app from Databricks SQL.
Write Database Connection Functions
Before we set up the callback functions for our Dash app, we first need to write the code that interacts with our database. With the databricks-sql connector, or SQLAlchemy, we can write reusable functions to retrieve data from DBSQL, such as the `get_listofusers` function above. It is defined in our `dbx_utils` library as follows:
We also define two functions that together load a pandas DataFrame of results and serve it as a chart in the app: the first retrieves the results from Databricks, and the second passes those results into a chart-generation function handled by Plotly Express.
Bonus Features: SQLAlchemy + Real-time Apps
SQLAlchemy: In addition, you can retrieve results using SQLAlchemy or pandas as you would with any other data warehouse. Run `pip install sqlalchemy-databricks` to register the dialect with SQLAlchemy and run queries through an ORM, as seen in this GitHub gist.
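A sketch of the ORM route. The URI shape follows the `sqlalchemy-databricks` package's conventions; the default database name here is an assumption:

```python
import os


def databricks_uri(host: str, token: str, database: str = "default") -> str:
    """Build the SQLAlchemy connection URI for the sqlalchemy-databricks dialect."""
    return f"databricks+connector://token:{token}@{host}:443/{database}"


def make_engine():
    """Create a SQLAlchemy engine; the endpoint's HTTP path goes in connect_args."""
    from sqlalchemy import create_engine  # plus `pip install sqlalchemy-databricks`

    return create_engine(
        databricks_uri(os.environ["SERVER_HOSTNAME"], os.environ["ACCESS_TOKEN"]),
        connect_args={"http_path": os.environ["HTTP_PATH"]},
    )
```

With the engine in hand, `pd.read_sql("SELECT ...", engine)` works exactly as it does against any other SQLAlchemy-supported warehouse.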
If you have real-time data flowing into your Databricks SQL lakehouse (via Structured Streaming or Delta Live Tables), you can easily make your Dash app poll for new data, keeping the dashboard automatically updated and reactive to real-time data changes. This is an especially powerful combination of Databricks and Plotly: pairing the highly efficient Photon query engine on Databricks with the `dcc.Interval` component in Dash yields a lean and snappy real-time dashboard, even on extremely large data sets.
To make your Dash app real-time, you can simply combine a `dcc.Interval` component with a callback in your app, and let Databricks take care of the rest.
Use Functions in Callbacks:
Lastly, to make these functions interactive, we simply wrap them with a callback decorator and bind the inputs/outputs to the Dash components in the `app.layout` portion of the app. In the above example, we generated a scatter plot of our data with dynamic axes. Below is the callback function that ties all the functions together and binds them to the UI:
This callback function lives in `app.py` and waits for changes to the filters/buttons in the Dash component called `comparison`, a selector that lets the user compare metrics dynamically in the UI. In this example, the user can look at the risk score of a cohort of users and identify correlations with independent variables; below, we compare a cohort's risk scores against their weight and cholesterol readings.
Run the App
Now we can finally tie it all together! Running `python app.py` in your terminal starts a local server where you can see the finished product. This version of your app will also automatically update as you make changes to your code. When you're happy with the app, you can deploy it (e.g. using Plotly's Dash Enterprise offering) for others on the web to see!
We loved talking about this at the Databricks Data + AI Summit 2022. By leveraging Plotly Dash and Databricks, the possibilities are endless. So, tell us what you think in the comments. Questions? Email email@example.com.