Build Real-Time Production Data Apps with Databricks & Plotly Dash
Plotly on Databricks Blog Series — Article #2
📌 To learn more about creating streaming data apps with Databricks and Dash, watch the recorded technical session.
TL;DR: You can build at-scale, interactive Plotly Dash analytics apps for streaming data on Databricks by combining Databricks Structured Streaming with the Databricks SQL connector for Python (DB SQL).
In Article #1 of this series, we walked through the process of connecting a Plotly Dash app front end to a Delta Lakehouse served from a Databricks SQL warehouse.
In particular, we used the Databricks SQL connector for Python (DB SQL) within a Plotly Dash app to run queries on Databricks SQL endpoints, demonstrating the efficacy of this method for managing data-warehousing style workloads with high concurrency and low latency SLAs.
This same workflow can be used to build at-scale interactive Plotly Dash analytics apps for streaming data, including workflows processing extremely large data sets in real time.
The purpose of this article is to cover the high-level processes for building such a streaming Plotly Dash application. Specifically, we will cover the steps for building real-time IoT data applications for any use case by leveraging Databricks’ Structured Streaming solution.
Why Databricks for Streaming Data?
Spark Structured Streaming on Databricks is a leading solution for architecting real-time streaming pipelines. Structured Streaming pipelines can also simplify an ETL architecture by applying the same paradigm to both batch and real-time workloads.
Before data even reaches a DB SQL warehouse, Databricks uses its Auto Loader engine within a Structured Streaming application, which can ingest billions of files efficiently and in real time while minimizing computational overhead. For this reason, the Databricks Delta Lake Structured Streaming pipeline is well suited to workflows that must process extremely large volumes of data while staying fast and keeping costs low. Auto Loader also handles changing or drifting file types through schema inference and evolution support. Moreover, Auto Loader notifies you when schema changes happen so that you can rescue data that would otherwise have been lost, which makes this solution a good fit for production projects that ingest many different file types and formats at once.
In this solution, we begin by ingesting raw data from a simulated Raspberry Pi with a BME280 sensor into cloud storage (Azure Data Lake Storage in this case), which we then use to construct a database in Databricks. Once the stream is ingesting new data and the database has been created, we connect a Plotly Dash application to the DB SQL endpoint and begin serving our streaming application. At a high level, the steps are:
- Data Source Integration: Gather data from the Raspberry Pi Azure IoT Online Simulator and insert the raw data into an Azure Data Lake Storage (ADLS) container
- Staging (Bronze): Incrementally read raw data from the ADLS container (or S3 bucket) and insert it into a Bronze Delta table
- Databases (Silver): Aggregate streaming data from the Bronze Delta table and use watermarking to manage aggregates and late-arriving data
- Analytics (Gold): Create a Gold-level view that smooths the Silver aggregated data into easier-to-read moving averages (raw IoT data is extremely messy)
- Front End: Set up Plotly Dash app UI
- Connect: Connect the Dash app to the Databricks database. Using callbacks and the Databricks SQL connector for Python, we can connect the backend lakehouse to the Dash app.
- Run the app!
To follow along, you will need:
- A Databricks workspace with Databricks SQL enabled (DB SQL is enabled by default in Premium Workspaces or above)
- A DB SQL endpoint or a Databricks cluster running Databricks Runtime 9.1 LTS or higher (data engineering cluster)
- A personal access token in Databricks to authenticate to the SQL endpoint via the API
- A Python development environment (version 3.8 or later). We recommend VSCode as a local IDE, Conda or a virtual environment to manage dependencies, and black to automatically format your code.
To get the app running:
- Spin up a Databricks SQL (DB SQL) endpoint on either classic or serverless
- Copy and paste the SQL code in utils/BuildBackendIoTDatabase.sql into the DB SQL Query Editor and run it. (Note: You can also run this code in a notebook directly from an imported Repo in Databricks)
- Clone the Git repo into your local IDE (Plotly’s Dash Enterprise solution also incorporates a web-based IDE Workspace experience).
- Install dependencies wherever you are running the Dash app with pip install -r requirements.txt in your IDE.
- Set the ACCESS_TOKEN environment variable from your Databricks cluster. You can find this by selecting the SQL endpoint and clicking the “Connection Details” tab in the endpoint UI.
- Run your Dash app on a local server by running python app.py.
1. Data Source Integration
Gather data from Raspberry Pi Azure IoT Online Simulator
You can follow the documentation below to create your own Azure IoT Hub and begin streaming sensor data from a simulated Raspberry Pi environment.
- Connect Raspberry Pi online simulator to Azure IoT Hub (Node.js)
- Create a Stream Analytics job by using the Azure portal
- Set and retrieve a secret from Azure Key Vault using the Azure portal
2. Staging (Bronze)
Read stream from cloud storage
After successfully ingesting the Azure IoT Hub data into your ADLS Blob storage container, you are ready to bring the data into Databricks for staging. Let’s build a database in Databricks to serve as the foundation for the Production Data Application (PDA). Databricks can process many data formats, and this particular dataset arrives as JSON files. Once the initial service credentials are in place, you can define a streaming input DataFrame that combines the Apache Spark Structured Streaming engine with Databricks Auto Loader. Auto Loader offers scalability and performance that are hard to reach when incrementally reading files with Spark alone. With the streaming input DataFrame defined, we then write the stream to a Delta table within our database.
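The Bronze step described above can be sketched roughly as follows. This is a minimal illustration rather than the code from the project repo: the function names, the table name `bronze_sensors`, and the path arguments are hypothetical, and it assumes a Databricks runtime where the `cloudFiles` (Auto Loader) source is available and a `spark` session is passed in.

```python
def autoloader_options(schema_location: str) -> dict:
    """Auto Loader options for incrementally reading JSON files.

    `cloudFiles.schemaLocation` is where Auto Loader tracks the inferred
    schema so that it can evolve as new fields appear.
    """
    return {
        "cloudFiles.format": "json",
        "cloudFiles.schemaLocation": schema_location,
    }


def start_bronze_stream(spark, source_path, checkpoint_path,
                        table_name="bronze_sensors"):
    """Read raw JSON with Auto Loader and append it to a Delta table.

    `spark` is an existing SparkSession on Databricks; the paths and the
    table name are placeholders (assumptions), not values from the article.
    """
    raw = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(f"{checkpoint_path}/schema"))
        .load(source_path)
    )
    # toTable() starts the streaming query and returns a StreamingQuery.
    return (
        raw.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .toTable(table_name)
    )
```

In a notebook, this might be called as `start_bronze_stream(spark, source_path, checkpoint_path)`, with both paths pointing at locations in your own storage account.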
3. Databases (Silver)
Load Delta table as a stream source
The Raspberry Pi sensor samples data at 400 Hz, which is far too noisy to visualize in a data app, so we need to roll the data up to the level of granularity we want our Production Data App to use. To do this, you can use watermarking in Databricks Structured Streaming to account for late-arriving data while aggregating windowed state, without holding an ever-growing table in memory.
We can accomplish all this with just a few lines of code.
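One way the watermarked roll-up could look is sketched below. It is an illustration under several assumptions, not the article's original code: the table names, the `temperature` and `humidity` columns, the one-minute watermark, and the one-second window are all hypothetical, and the Bronze table is assumed to have a `timestamp` column of timestamp type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SilverConfig:
    """Hypothetical settings for the Silver aggregation."""
    bronze_table: str = "bronze_sensors"
    silver_table: str = "silver_sensors"
    event_time_col: str = "timestamp"   # assumed to be a timestamp-typed column
    watermark_delay: str = "1 minute"   # how long to wait for late data
    window_size: str = "1 second"       # granularity of the roll-up


def start_silver_stream(spark, checkpoint_path, config=SilverConfig()):
    """Aggregate the Bronze stream into fixed windows and write to Delta."""
    # Imported here so this module can be loaded without PySpark installed.
    from pyspark.sql import functions as F

    bronze = spark.readStream.table(config.bronze_table)
    aggregated = (
        bronze
        # Rows arriving later than the watermark delay are dropped, which
        # lets Spark discard old window state instead of keeping it forever.
        .withWatermark(config.event_time_col, config.watermark_delay)
        .groupBy(F.window(F.col(config.event_time_col), config.window_size))
        .agg(
            F.avg("temperature").alias("avg_temperature"),
            F.avg("humidity").alias("avg_humidity"),
        )
        .select(
            F.col("window.start").alias("timestamp"),
            "avg_temperature",
            "avg_humidity",
        )
    )
    # In append mode a window is emitted once the watermark passes its end.
    return (
        aggregated.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .toTable(config.silver_table)
    )
```

A longer watermark delay tolerates later data at the cost of more state held in memory and a longer wait before each window is written.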
4. Analytics (Gold)
Create gold view
Once the stream is enabled for writing to the Delta table within the database, we can construct a gold view that queries the Delta table to calculate moving averages. This way the Plotly Dash app can easily connect to the results of the query by utilizing a simple SELECT * FROM gold_sensors. (More about connecting Databricks to a Dash app in the next section.)
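As a rough sketch of what such a view might contain, the helper below builds a `CREATE OR REPLACE VIEW` statement that computes moving averages with a SQL window function. The view name `gold_sensors` comes from the text; the Silver table name, its columns (`timestamp`, `avg_temperature`, `avg_humidity`), and the 60-row window are assumptions made for illustration.

```python
def gold_view_sql(silver_table="silver_sensors", view_name="gold_sensors",
                  window_rows=60):
    """Build SQL for a view of moving averages over the Silver table.

    Table and column names are hypothetical. The identifiers are
    interpolated directly, so pass only trusted, hard-coded names.
    """
    if window_rows < 1:
        raise ValueError("window_rows must be at least 1")
    preceding = window_rows - 1  # the current row counts toward the window
    frame = f"ORDER BY timestamp ROWS BETWEEN {preceding} PRECEDING AND CURRENT ROW"
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT\n"
        f"  timestamp,\n"
        f"  AVG(avg_temperature) OVER ({frame}) AS temperature_moving_avg,\n"
        f"  AVG(avg_humidity) OVER ({frame}) AS humidity_moving_avg\n"
        f"FROM {silver_table}"
    )
```

The resulting string can be pasted into the DB SQL Query Editor or run with `spark.sql(gold_view_sql())` in a notebook. With more than one device, the window would likely also need a `PARTITION BY` clause on the device identifier.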
5. Building the Front End for the Production Data Application
Set up Plotly UI
Now that the database has been constructed, we can start moving the data into the Plotly Dash app. For simplicity, we go to our local IDE and bring in the app.py file, which will serve as an entry point for the application. The app.py file contains Dash-specific code related to the layout and overall interactivity of the app. Callback functions are used to add interactivity to the application and can be triggered on specific user actions or automatically via an interval component. In this example, several callback functions are utilized to stream data.
6. Connect App and Run
Connect Dash App to Databricks database
The pipeline has been built so that the stream writes directly to our Delta table, and we can then pull data from that Delta table into the Dash app by using Dash’s built-in callback architecture. The code that interacts with the database is defined in the dbx_utils library as follows:
Step 1: Set Variables
Step 2: Import Connector
Step 3: Connect
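The three steps above might look roughly like the sketch below, which uses the Databricks SQL connector for Python (`pip install databricks-sql-connector`). It is not the contents of the project's dbx_utils module. The function names are hypothetical, and the `SERVER_HOSTNAME` and `HTTP_PATH` variable names are assumptions; only `ACCESS_TOKEN` is named in this article.

```python
import os


def load_connection_settings(env=None):
    """Step 1: read connection settings from environment variables.

    SERVER_HOSTNAME and HTTP_PATH are assumed variable names.
    """
    env = os.environ if env is None else env
    names = ("SERVER_HOSTNAME", "HTTP_PATH", "ACCESS_TOKEN")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "server_hostname": env["SERVER_HOSTNAME"],
        "http_path": env["HTTP_PATH"],
        "access_token": env["ACCESS_TOKEN"],
    }


def fetch_rows(query="SELECT * FROM gold_sensors", env=None):
    """Steps 2 and 3: import the connector, connect, and run a query."""
    # Imported here so the settings helper works without the connector installed.
    from databricks import sql

    with sql.connect(**load_connection_settings(env)) as connection:
        with connection.cursor() as cursor:
            cursor.execute(query)
            return cursor.fetchall()
```

Opening a new connection per query keeps the sketch simple. An app that polls frequently may prefer to reuse a connection.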
Once the connection has been established, we can make our Dash app poll for new data and keep the dashboard up to date by combining a dcc.Interval component with a callback function.
- Step 1: Add a dcc.Interval component to the app layout
- Step 2: Add a callback function that is triggered by the interval
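The two steps above can be sketched as follows, assuming Dash 2.x or later. The component ids, the five-second polling interval, the column names, and the `fetch_rows` argument (any function returning a list of dict-like rows from the gold view) are hypothetical choices for this illustration.

```python
def rows_to_series(rows, x_field="timestamp", y_field="temperature_moving_avg"):
    """Split dict-like query rows into x and y lists for plotting."""
    return [row[x_field] for row in rows], [row[y_field] for row in rows]


def build_app(fetch_rows, poll_ms=5000):
    """Create a Dash app that re-queries the data on a timer.

    `fetch_rows` is a caller-supplied function (an assumption) that returns
    the latest rows, for example by querying the gold view.
    """
    # Imported here so rows_to_series can be used without Dash installed.
    from dash import Dash, Input, Output, dcc, html
    import plotly.graph_objects as go

    app = Dash(__name__)
    app.layout = html.Div(
        [
            dcc.Graph(id="sensor-graph"),
            # Step 1: the Interval component increments n_intervals every poll_ms.
            dcc.Interval(id="poll-interval", interval=poll_ms, n_intervals=0),
        ]
    )

    # Step 2: each change to n_intervals triggers the callback and redraws the figure.
    @app.callback(Output("sensor-graph", "figure"),
                  Input("poll-interval", "n_intervals"))
    def refresh_graph(_n_intervals):
        x, y = rows_to_series(fetch_rows())
        return go.Figure(go.Scatter(x=x, y=y, mode="lines"))

    return app
```

The polling interval is a trade-off: a shorter interval makes the chart feel more live but sends more queries to the SQL endpoint.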
7. Run the App
Run the app by using the python app.py command in your local IDE.
We’re excited to continue the Plotly on Databricks series. Stay tuned for more content. Tell us what you think in the comments or reach out to firstname.lastname@example.org.