Building Plotly Dash Apps on a Lakehouse with Databricks SQL (Advanced Edition)

Published in Plotly · 13 min read · Apr 26, 2023


Plotly on Databricks Blog Series — Article #4

Maximizing the value of Databricks-enabled Dash apps using new SQLAlchemy integration.

Authors: Sachin Seth (Virago Analytics), Cody Austin Davis (Databricks), Dave Gibbon (Plotly)

TL;DR — For building Plotly Dash apps on Databricks, the integration process is much the same as for any other data warehouse: use the Databricks SQL Connector for Python (DB SQL) to connect to a DB SQL endpoint, or use an ORM such as SQLAlchemy. Check out the live app!

Preamble

Since publishing the first article in this series, Databricks and Plotly have been working hard to make it easier than ever for Python developers to connect Plotly Dash web applications to Databricks warehouses, and to maximize the value of such integrations.

One example is Databricks’ SQL connector for Python (DB SQL) library, which has been recently extended to include SQLAlchemy support and unlocks a host of deepened integration possibilities. This “advanced edition” of our original article incorporates these and other enhancements.

Overview

Building production data applications with Dash & Databricks has never been easier and more powerful.

With the release of the Databricks SQL Connector for Python version 2.4.0, developers building Python applications on Databricks warehouses can now use the power of SQLAlchemy: not only to manage their databases with the schema-centric SQLAlchemy Core API, but also to push down highly complex SQL queries via the SQLAlchemy ORM. The ORM API builds on Core to map Python classes to database tables, so SQL queries can be constructed in terms of user-defined Python objects. What’s more, developers can create persistent metadata objects using the built-in session mechanism to interactively edit databases. These Pythonic ways of managing and manipulating databases make for highly customizable and powerful tools when paired with a Plotly Dash front end.

Indeed, the Dash library is a vehicle for realizing the full potential of SQLAlchemy, all while running on Databricks SQL endpoints specially optimized for data-warehouse-focused workloads. The synergy of these technologies makes a formidable case for Dash as the de facto way to build Databricks-powered, interactive, scalable, production-level data apps.

Details

In this article, we will walk through building an example production data application, with an emphasis on using the SQLAlchemy ORM to query our database interactively. We will begin with the fundamentals of getting set up and build our way toward more sophisticated uses of the SQLAlchemy dialect, introducing Alembic migrations toward the end. The goal is to give the reader a technical understanding of how to use Dash to explore and manage Databricks Lakehouses via the Databricks SQL Connector for Python (DB SQL), and in particular the generational improvement brought by the compatible SQLAlchemy dialect.

The databricks-sql-connector library allows Plotly Dash apps to quickly run and retrieve results of SQL queries that are executed on a Databricks SQL endpoint, thus allowing Databricks customers to extend and maximize their use of the platform for a wide range of analytics data app use cases, such as:

  • Data Warehousing Backend — serving an application with DB SQL Serverless or DB SQL Pro in place of more expensive and limiting data warehouses
  • Real-Time Streaming Dashboards — using DB SQL with Dash’s dcc.Interval component to live-stream updates to your Dash app
  • DDL / Object Mapping — ORM integrations to perform advanced use cases (such as DDL management / deployment) and create more intuitive Python code
  • Advanced Data Visualization — Advanced visuals such as waterfall filtering/visualizations, dynamic visualizations, conditional visuals, etc.
  • Dynamic Application Interaction — Dynamic sliders/filters/dependent visuals in a Dash app
  • Price / Performance — Push complex/expensive SQL transformations down to the Databricks SQL Photon engine for record-breaking performance
  • Embedded ML — Use ML Flow API, Jobs API to train models from apps, and Databricks Serverless Model Serving
  • 2-Way Write Backs — Write data back to the Lakehouse directly from the app, simply and effectively, using an ORM
  • ETL Interactive Workflows — Trigger Jobs via Dash callback in app

Prerequisites to run this Repo

  1. A Databricks workspace with Databricks SQL and Unity Catalog enabled (DB SQL enabled in Premium Workspaces or above)
  2. A DB SQL endpoint or a Databricks cluster running DBR 11.3 LTS or higher (data engineering cluster)
  3. A personal access token in Databricks to authenticate to a SQL Endpoint via API
  4. A Python development environment (>=v 3.8). In this article, we are using the Databricks VSCode extension, so we recommend using Visual Studio Code for a local IDE and using conda or virtualenv to manage dependencies, as well as black to automatically format your code.

High-Level Steps

  1. Spin up a Databricks SQL (DB SQL) endpoint, either classic or serverless
  2. Clone the Git repo above into your local IDE.
  3. Install dependencies wherever you are running the Dash app with pip install -r requirements.txt in your IDE.
  4. Set the SERVER_HOSTNAME, HTTP_PATH, and ACCESS_TOKEN environment variables for your SQL endpoint. You can find these values by selecting the endpoint and clicking the “Connection Details” tab in the endpoint UI.
  5. Build the back end by running the SQL code under utils/BuildBackendIoTDatabase.sql
  6. Run your Dash app on a local server by running python app.py

Building the Back End to the Dash App

We will build the back end for our app by simulating a batch-loaded IoT sample dataset that is publicly available in all Databricks workspaces.

Before you get started, ensure that you have permission to create a database in your Databricks SQL environment. We will use the databricks-datasets repository of sample data sets that ships with Databricks. This data comes in JSON and CSV format, so we can run a simple COPY INTO statement to incrementally read in the raw data and insert it into a Delta table. This ETL script reads in IoT data representing smartwatch readings from users. To mimic realistic database behavior, we assume there may be updates to the raw data, so we end the pipeline with a MERGE INTO statement to handle the upsert logic, as shown below. This process creates two final tables: silver_sensors and silver_users. The “silver” naming indicates that the data has undergone some basic cleaning and is ready for BI analytics.

Step 1 — Create Table
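The repo’s exact DDL lives in utils/BuildBackendIoTDatabase.sql; a sketch of the staging table for this step (the bronze_sensors name and column list are illustrative, inferred from the smartwatch data described above) might look like:

```sql
-- Sketch only: match names and columns to utils/BuildBackendIoTDatabase.sql
CREATE TABLE IF NOT EXISTS bronze_sensors (
  Id BIGINT,
  device_id INT,
  user_id INT,
  num_steps BIGINT,
  calories_burnt DOUBLE,
  timestamp TIMESTAMP
)
USING DELTA;
```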

Step 2 — Incrementally read in raw data with COPY INTO
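A sketch of this step, assuming a staging table named bronze_sensors (a hypothetical name) and the public databricks-datasets iot-stream files:

```sql
-- COPY INTO is incremental: it tracks files already ingested and loads only new ones
COPY INTO bronze_sensors
FROM 'dbfs:/databricks-datasets/iot-stream/data-device/'
FILEFORMAT = JSON;
```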

Step 3 — Upsert data into the final silver_sensors table
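A sketch of the upsert, with illustrative table names (bronze_sensors as the staging source, silver_sensors as the target) and an assumed Id key:

```sql
-- Update rows whose key already exists in the silver table; insert the rest
MERGE INTO silver_sensors AS target
USING bronze_sensors AS source
  ON target.Id = source.Id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```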

Step 4 — Select from the database
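A quick sanity check of the result (the timestamp column name is illustrative):

```sql
-- Confirm the upsert landed by inspecting the most recent silver rows
SELECT * FROM silver_sensors ORDER BY timestamp DESC LIMIT 10;
```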

The above steps can be executed from a Databricks notebook, but those using Visual Studio Code would be well advised to use the Databricks Visual Studio Code extension to run the BuildBackendIoTDatabase.sql file directly from the IDE.

Using SQLAlchemy Dialect

With our database built, we are ready to embark on building our production data application. The process for setting up and using the SQLAlchemy Dialect can be broken down into the following high level steps:

  1. Connect to the Database by Creating an Engine
  2. Create Metadata Table Objects in Python (optionally create ddl from metadata object)
  3. Write Python function to Load Table Using Reflection
  4. Return the Query Results as Pandas Dataframe and display in the app

1. Connect to Database and Create Engine

In order to use SQLAlchemy at all, we first need to construct an object called the Engine. The Engine is how we connect to a particular database, so it only needs to be defined once per database per application. To create the Engine, we use the create_engine() function, whose main argument is a string URL of the following form:
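For the Databricks dialect, the URL follows the documented databricks://token:&lt;token&gt;@&lt;host&gt;?http_path=…&amp;catalog=…&amp;schema=… shape. A minimal sketch of assembling it from environment variables (defaults here are placeholders, not real values):

```python
import os

# Read the five connection variables from the environment; never hard-code tokens
ACCESS_TOKEN = os.getenv("ACCESS_TOKEN", "<personal-access-token>")
SERVER_HOSTNAME = os.getenv("SERVER_HOSTNAME", "<workspace-hostname>")
HTTP_PATH = os.getenv("HTTP_PATH", "<sql-endpoint-http-path>")
CATALOG = os.getenv("CATALOG", "main")
SCHEMA = os.getenv("SCHEMA", "default")

# SQLAlchemy URL for the Databricks dialect (databricks-sql-connector >= 2.4.0)
engine_url = (
    f"databricks://token:{ACCESS_TOKEN}@{SERVER_HOSTNAME}"
    f"?http_path={HTTP_PATH}&catalog={CATALOG}&schema={SCHEMA}"
)
# engine = create_engine(engine_url)  # define once per database per application
```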

There are a total of 5 connection variables that you need to store in order to construct the string for the engine:

  • ACCESS_TOKEN
  • SERVER_HOSTNAME
  • HTTP_PATH
  • CATALOG
  • SCHEMA

If you have trouble locating a particular SCHEMA or CATALOG, we recommend using the Data Explorer in your Databricks workspace for a bird’s-eye view of your databases.

Now that the engine has been defined, we can use it to drive the creation of metadata objects from Python classes.

2. Creating Metadata Table Objects

At the core of SQLAlchemy, and often the main motivation for using it, is the SQLAlchemy Expression Language. The Expression Language is a way of representing database structures and expressions in Python such that the resulting SQL is back-end neutral and can be executed across different databases, so long as they have a consistent structure. Those database elements, like tables and columns, are represented as Python objects known collectively as metadata. We can create metadata tables in Python by defining a class like the one below:
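A sketch of such a class in the declarative style; the class and column names here are illustrative (only a few of the ten columns are shown), so match them to your own DDL:

```python
from sqlalchemy import BigInteger, Column, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class SilverUsers(Base):
    """Maps rows of the silver_users Delta table to Python objects."""
    __tablename__ = "silver_users"

    user_id = Column(BigInteger, primary_key=True)  # primary key, integer type
    gender = Column(String)      # remaining columns are illustrative --
    age = Column(BigInteger)     # define one property per column in your table,
    weight = Column(BigInteger)  # each with its matching SQLAlchemy type
```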

In this example, we store our Python table classes in a file called ddls.py. The __tablename__ property tells SQLAlchemy which table to map the class to. The user_id property identifies the primary key, with an integer type. The other nine properties each refer to a column in the table, defined by their respective types.

If we want to manage the schemas in our Lakehouse via objects, we can define our tables as engine-agnostic Python classes and let SQLAlchemy create the tables for us. This is a very popular method of DDL management, often used in conjunction with migration frameworks like Alembic. For our example below, we will not need this, since we just want to read and query data. For that purpose, we will utilize a SQLAlchemy feature called “reflection”: the SQLAlchemy engine automatically reads the metadata from the Databricks Delta table and creates a Python metadata object on the fly, so you do not need to define any DDLs in your BI application.

3. Write Python function to Load Table Using Reflection

Here’s where reflection comes in. We can automatically read the database and create/populate table objects on the fly based on what’s already there. We do that using a Python function of the following form:
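A sketch of such a reflection helper, with hypothetical table and column names (gold_sensors, timestamp) standing in for the repo’s own:

```python
import pandas as pd
from sqlalchemy import MetaData, Table, select

def get_sensor_trend(engine, yaxis: str) -> pd.DataFrame:
    """Reflect an aggregate table at call time and query one of its columns.

    Table/column names here are illustrative; swap in the gold-level
    aggregate table your ETL actually produces.
    """
    metadata = MetaData()
    # Reflection: SQLAlchemy reads the Delta table's schema from Databricks
    # on the fly, so no DDL class is needed in the app
    gold_table = Table("gold_sensors", metadata, autoload_with=engine)

    stmt = (
        select(gold_table.c.timestamp, gold_table.c[yaxis])
        .order_by(gold_table.c.timestamp)
        .limit(1000)
    )
    # read_sql_query compiles the statement for the engine's dialect and
    # returns the result set as a DataFrame
    return pd.read_sql_query(stmt, engine)
```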

There’s a lot going on in the code snippet above, so let’s break it down. After we have defined our y-axis condition, we build a table called “users_table” using reflection. This object is different from the Python objects in our ddls.py file, since it is not manually defined. Instead, using reflection, we automatically detect the columns and row values of the table in the database. This gold-level table is actually downstream of our manually created DDLs, holding aggregates that present a smoother picture of the noisy IoT data for our app.

From there we want to gather our column objects from the table into variables. We are going to use the column objects when building our SQL Expression Language based query.

Next, we build out a simple select statement using SQLAlchemy to select certain moving-average columns from our table. For this particular function we have a relatively simple query, but the repo for this article contains more complicated queries as well, which can be used as reference. More on the SQL Expression Language later.

Return Pandas DataFrame

Once we have confirmed that the SQL query is the one we want, we are ready to write its results to a Pandas DataFrame using pd.read_sql_query, whose arguments are the statement and the engine.

By building this into the function itself, we significantly streamline the process of querying the database. Every time the user triggers a callback, the application creates a metadata table object and pushes the SQL query down to the newly created table. This allows for efficient, reusable, and more expressive querying from the application without flooding your app with SQL text.

Using the Expression Language

Using the dialect, we can use the SQL Expression Language to run our queries. The reflection function from the previous section has a relatively simple SQL query, but there’s a lot more the Expression Language can do. For our purposes, we will use the select() construct the most, because we want to visualize what lives in the database. The Expression Language syntax is very similar to straight SQL by design, as it is meant to be used agnostically across databases. The Expression Language queries for this article demonstrate a wide range of functionality and can be used as a reference point when converting your own SQL into Expression Language statements.

It is important to keep in mind that the select() construct, like all other SQLAlchemy expressions, yields a SQL statement as its result. We can combine the Expression Language with read_sql_query to produce Pandas DataFrames. In the next section, we will talk about how to use those DataFrames in combination with Dash to build interactive visuals and push down SQL queries using callbacks.
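To make that concrete, here is a small, self-contained sketch: a hand-built table (a stand-in for a reflected one, with illustrative column names) and a select() that renders to ordinary SQL text.

```python
from sqlalchemy import Column, Integer, MetaData, Table, func, select

metadata = MetaData()
# Illustrative stand-in for the reflected silver_sensors table
sensors = Table(
    "silver_sensors", metadata,
    Column("user_id", Integer),
    Column("num_steps", Integer),
)

# Average steps per user, expressed entirely in Python objects
stmt = (
    select(sensors.c.user_id, func.avg(sensors.c.num_steps).label("avg_steps"))
    .group_by(sensors.c.user_id)
)

# The construct compiles to plain SQL for whatever dialect the engine uses
print(stmt)
```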

Building the Plotly UI

Now that the database has been created, the SQLAlchemy classes defined, and Pandas DataFrames returned, we can start building the Plotly Dash app. For this process, we go to our local IDE. So far, all the SQLAlchemy code we need for the application lives in the dbx_utils.py and ddls.py files. The app.py file, however, is the entry point to the application and contains Dash-specific code for the app layout and interactivity.

Dash app foundations

The app’s layout is defined through a series of nested components (such as dcc.Graph or html.Div) that allow you to create a fully configurable UI. See examples in these docs for more details.

Interactivity is added to your Dash app through callback functions. Callbacks are triggered on specific user actions (such as a button click) and can output changes to elements in your app’s layout. As you’ll see in this example, callbacks are used to trigger data retrieval functions and update the data displayed in the charts when a user selects a dropdown item.

By placing calls to the SQLAlchemy helpers in our callbacks, we can run custom queries on the database interactively from the Plotly UI, accessing huge datasets with minimal effort on the front end. We can also pass the query results into another function to make use of Plotly’s signature charting library. In our example app, the SQLAlchemy helpers all return Pandas DataFrames, so we can combine those DataFrames with Plotly charts to create callback-driven visualizations.

After the chart is defined, we can add it to our application by including it in our make_line() function.

Use Functions in Callbacks

To make these functions interactive, we simply wrap them with a callback decorator and assign the inputs/outputs to Dash components in the app.layout portion of the app. In the above example, we generated a scatter plot of our data with dynamic axes. Below is the callback function that ties the functions together and binds them to the UI:

This callback function lives in app.py and waits for changes to the filters/buttons in the Dash components called scatter-x and comparison, selectors that let the user compare metrics in the UI dynamically! In this example, the user can look at the risk score of a cohort of users and identify correlations with independent variables, such as comparing risk scores against the users’ weight and cholesterol readings.

Bonus Features: Write Back + Basic Alembic Workflow

The example we present in this article is just a subset of what is possible with the Databricks SQLAlchemy dialect. One of the main advantages of using an ORM is its back-end-neutral design pattern; consequently, a robust migration tool is a necessity. SQLAlchemy uses a library called Alembic to perform database migrations. To get started, run alembic init to generate the alembic.ini and env.py files. The alembic.ini file does not require any further configuration, but env.py must be modified to do the following:

  • import the metadata object against which you declared your models and set target_metadata equal to it
  • update run_migrations_offline to import your connection string and set the url equal to it
  • update run_migrations_online to use a connectable engine

Our own env.py file would look like this assuming we have stored the necessary information in a main.py file:
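A sketch along those lines, assuming the connection URL and declarative Base are importable from a main.py (hypothetical names), with Alembic’s generated boilerplate trimmed; treat it as a starting point rather than a drop-in file:

```python
# env.py -- sketch only; keep the logging/config boilerplate Alembic generates
from alembic import context
from sqlalchemy import create_engine

from main import engine_url, Base  # hypothetical: your URL + declarative models

target_metadata = Base.metadata    # lets --autogenerate diff models vs. database

def run_migrations_offline() -> None:
    # Render SQL scripts without connecting to Databricks
    context.configure(url=engine_url, target_metadata=target_metadata)
    with context.begin_transaction():
        context.run_migrations()

def run_migrations_online() -> None:
    # Apply migrations over a live connection via a connectable engine
    connectable = create_engine(engine_url)
    with connectable.connect() as connection:
        context.configure(connection=connection, target_metadata=target_metadata)
        with context.begin_transaction():
            context.run_migrations()

if context.is_offline_mode():
    run_migrations_offline()
else:
    run_migrations_online()
```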

Create the initial migration by running alembic revision --autogenerate -m "Initial". Running the command will generate a fresh revision in the versions directory, where you should see your model described. Running alembic upgrade head will then create the resulting tables in Databricks.

Using SQLAlchemy, we can also perform write-backs and database schema alterations. Bulk insertion via the SQLAlchemy ORM proceeds at a rate of approximately 10,000 records every 3 seconds.

In general, the Databricks-compatible SQLAlchemy dialect supports the following functionality:

  • Create and drop tables with SQLAlchemy Core
  • Create and drop tables with SQLAlchemy ORM
  • Read created tables via reflection
  • Modify column nullability
  • Insert records manually
  • Insert records with pandas.to_sql (note that this does not work for DataFrames with indexes)

We will examine how to use writeback, alembic, and other SQLAlchemy tools to manage database tables directly in the next article of this series.

Run the App

Now we can finally tie it all together! Running python app.py in your terminal will start up a local server where you can see the finished product. This version of your app will also automatically update as you make changes to your code. When you’re happy with the app, you can deploy it (e.g. using Plotly’s Dash Enterprise offering) for others on the web to see! Look out for our upcoming articles on how to use the SQLAlchemy dialect for Databricks to manage databases directly using Dash.

Want to get your hands on the code? Access the Github repo here.
