Molson Coors Streamlines Supply Planning Workflows with Databricks & Plotly Dash
Migrating from Excel using Databricks SQLAlchemy & Plotly Dash AG Grid to deliver editable data apps with write-back capabilities
Plotly on Databricks Blog Series — Article #3
Authors: Casey Emanuel (Molson Coors); Kevin Schmidt (Molson Coors), contributions from Sachin Seth (Lakeside Analytics), Cody Austin Davis (Databricks), Dave Gibbon (Plotly)
The objective for our web app was to replace an existing process which tracks products that have exceeded their target ship date. The legacy process involved loading manual database extractions on a weekly basis into an Excel spreadsheet shared between members of the supply planning team, who would each then manually edit existing or assign new reason codes for each SKU in the dataset. The process involved well over 60 steps, with the majority happening in Excel. This time-intensive, manual process was not only highly susceptible to human error, but simply did not leave enough time for the business to analyze the data and make timely decisions.
With our data recently migrated into an Enterprise Databricks Lakehouse on Delta Lake, our analytics team combined Dash and Databricks to create an interactive Production Data Application built on top of Databricks SQL Serverless that is easy to develop all the while looking like a top-notch web application
Specifically, the two newly-released features that made this possible are Plotly’s new open-source Dash AG Grid data table for clean and crisp tabular edit functionality, along with Databricks’ newly released databricks-sql-connector wchin enables SQLAlchemy to allow for native and scalable write-back functionality.
Ultimately the app cuts the 60+ steps down to fewer than 10, and gives our supply planning team both more confidence in the data, and more time to analyze and act against issues/developments.
In addition to being able to more quickly and reliably manage the manual steps of this process, getting this data into a database will allow for more valuable time-based analytics. The initial use case resulted in a weekly KPI scorecard, automating a majority of this process, and adding better data management is already creating additional demand from the business to do further analytics and identify more ongoing trends. Dash also allows us to keep the user experience in a single pane for data management and analytics.
As previously introduced, Python coders connecting Plotly Dash analytics web application front ends to Databricks back ends to retrieve data are well served by the Databricks SQL connector for Python, which exhibits strong performance and overall ease of use.
However, as requirements scale in size and complexity — and where both read and write-back workflows are envisioned — object-relational mapping (ORM) capabilities should be considered.
The Databricks SQL connector for Python will soon integrate Databricks SQL Alchemy Engine capabilites facilitating a two-way bridge which connects Plotly Dash apps with Databricks Warehouses. Users can not only read data from the warehouse to the Dash app, but also effectively write back to the warehouse at scale.
In this article, we will walk through building a simple database on a Databricks SQL. We will then perform some changes on the data and write the modified DataFrame back to the Databricks database.
Our partners at Molson Coors have kindly donated a sample of their inventory data which we will be using throughout this article to demonstrate what a real life production data application can look like.
Besides being much faster, and generally better suited for large datasets, the upcoming databricks-sql-connector release, version 2.4.0, has added some incredible features that makes building apps on Databricks and managing data on the Lakehouse much easier such as:
- Globally defined engine to read and write large amounts of data from Databricks.
- You don’t need to open a cursor connection every time you want to read data from Databricks. Instead define a SQL alchemy engine using Databricks constants SERVER_HOSTNAME, HTTP_PATH, AND ACCESS_TOKEN then use the engine to write your SQL query directly to a pandas data frame using pandas “read_sql_query” statement.
- The DBX SQL connector is suitable for thousands of records at a time and is slow compared to the SQL Alchemy ORM, where as many rows as necessary can be inserted.
- Ability to manage DDL as objects in SQL Alchemy. Users can now define tables as Python class objects and manage their databases/schemas directly from their ORM using SQLAlchemy / Alembic.Use the same engine to bulk write large amounts of data back to Databricks via the pandas “to_sql” method, or use an ORM to create and drop tables all together.
- Manage data by manually inserting records and modifying column null-ability.
- The SQL alchemy engine includes arguments to manage and modify data both when it is being taken out of Databricks and when writing to Databricks databases. Specifically, users can nullify individual columns and manually add records as needed.
- Seamless and complete integration with Dash AG Grid and other pandas compatible tools like Plotly’s Dashboard Engine.
- Call data into a Dash AG Grid table using just one line of code. No need for extensive schema and column definition inside your dash app. The SQL alchemy engine automatically detects schema and assigns column headers on the grid.
Feel free to follow along and re-create as you go through the blog by cloning the Git project.
Prerequisites to run this Repo
- A Databricks workspace with DatabricksSQL enabled (DBSQL enabled in Premium Workspaces or above)
- A Serverless SQL Warehouse or Databricks cluster with 11.3 LTS or higher (data engineering cluster)
- A personal access token in Databricks to authenticate to a SQL Endpoint via API
- A Python development environment (>=v 3.8). We recommend VSCode for a local IDE and using conda or virtual env to manage dependencies, as well as black to automatically format your code
- Unity Catalog enabled account wide — Unity Catalog is Required for full Data warehousing functionality when using an ORM like SQLAlchemy
- Spin up a DatabricksSQL (DBSQL) endpoint on either classic or serverless
- Copy and paste the SQL code under
utils/BuildBackendIoTDatabase.sqlinto the DBSQL Query Editor and run it. (Note: You can also run this code in a notebook directly from an imported repo in Databricks.)
- Clone the Git repo above into your local IDE.
- Install dependencies wherever you are running the Dash app with
pip install -r requirements.txtin your IDE.
- Set environment variables of SERVER_HOSTNAME, HTTP_PATH, ACCESS_TOKEN from your Databricks cluster. You can find this by selecting the SQL endpoint and clicking the “Connection Details” tab in the endpoint UI.
- Run your Dash app on a local server by running
Building the Back End to the Dash App
First, we are going to create a database in Databricks that we will then build a Dash app on top of. Before you get started, ensure that you have the permissions to create a database in your Databricks SQL environment. Using the SQL Query Editor, we can create a database and subsequent table which the Dash Application can point to. In this article, we are utilizing sample inventory data from Molson Coors. If you want to follow along feel free to utilize the
databricks-datasets repository of sample data sets that come with Databricks just make sure to populate the create table logic with the approriate column names.
Step 2 — Incrementally read in raw data with COPY INTO
Step 3 — Select from the database
While we are in our query editor we are also going to define the table we need for our write back procedure.
Step 4 — Create Table for Write-back
Write Database Connection Functions using SQL Alchemy
We have now created two tables on our database. One table- mc_aggrid_demo.bronze_sensors — contains our source data and the other table — mc_aggrid_demo.write_back — is empty but defined. The goal of our Dash App is to ingest the data from the bronze_sensor table, perform some manipulation, and then write back the modified data back to Databricks. First, We need to write the code that pulls the database data from Databricks into our Dash app. We connect to our database differently from the Databricks SQL connector for Python used in the past. We use the same connection variables however SQL Alchemy takes those variables to construct a globally defined engine. They are defined in our
dbx_utils file as follows:
We can use the engine in conjunction with a SQL statement to quickly define a pandas DataFrame according the following lines of code:
The next section will focus on setting up the Plotly UI and AG Grid.
Set Up Plotly UI and Dash AG Grid
Now that we have created our database and read that data to a pandas DataFrame, we can start building the Plotly Dash app. This particular app will feature some pre loaded charts as well as a Dash AG Grid component to view and manage the Databricks data. For this process, we go to our local IDE. The
app.py file is the entry point to this application and contains Dash-specific code for the app layout and interactivity. (To keep things organized, its best to put all the Databricks-specific connection information in a separate file named
Dash AG Grid Implementation
The app’s layout is defined through a series of nested components (such as
html.Div) that allow you to create a fully configurable UI. See examples in these docs for more details.
In this example, we are using our Dash app to host and interact with an AG Grid component. We can do that by defining the data from our database to a pandas DataFrame using Python. Our
app.py file is as follows
Generate UI (Code)
We can see in the code that the AG Grid components has several arguments, which control its state. We can code in callbacks, which control our AG Grid component directly by utilizing Dash UI control elements in the next section.
Use Callbacks to Enable Interactivity in Dash App:
Interactivity is added to your Dash app through callback functions. Callbacks are triggered on specific user actions (such as a button click) and can output changes to elements in your app’s layout. As you’ll see in this example, we can assign callbacks to UI elements that control which columns to display, sorting options, downloading row data to csv files, and even writing back to Databricks itself! Callbacks are a developers bread-and-butter when building data production apps on Plotly Dash, especially when dealing with large volumes of data, where versioning and control are important.
In fact, we are going to implement all of the functionality listed above through callbacks in this very Dash app. Below is an example of a callback which downloads row data from the AG Grid to csv.
Write Back to Databricks Database Using SQL Alchemy:
Writing back to the database itself is one of the most significant advancements we’ve seen because it enables a true data production workflow, whereby the user can see changes they’ve made in the Dash App carry over to the Databricks database where the data originally came from. To write data back to Databricks all you need is a “to_sql” statement as seen here:
The above line of code can be added to a callback so that it triggers when a button is pressed, for example. However, this method is intended to be a quick work around for very small data sets. For a more robust and scalable implementation we use SQL Alchemy ORM capabilities. SQL Alchemy Databricks is equipped to handle bulk inserting data into Databricks Delta Live Tables where inserting 10 thousand rows of data takes less than 3 seconds. We can create a table and insert records through the ORM as exemplified below:
We can check the Data tabe in our Databricks environment to confirm the write-back worked and see the new data on our write-back table.
Adding SQL Alchemy to the Databricks SQL Connector for Python
SQL Alchemy is the clear next step in the evolution of making sophisticated and scalable production data apps with interactive Dash featured front ends served by Databricks SQL warehouses on the back end. In the past, developers looking to meet these goals were well served by the original Databricks SQL Connector for Python, but it had several constraints. For example, each time users called data from a Databricks database a connection had to be opened and closed in the code manually as seen below:
Best practice dictates that the connection should close once the data has been read but closing the connection had to be hard coded and leaving it open could leave the cluster running unnecessarily creating hidden costs and security risks. Additional security risks as those imposed by SQL Injection are similarly mitigated by using the SQL Alchemy Engine ORM vs querying the database directly. Another advantage of using the ORM over raw SQL queries is in the Alchemy Engine’s ubiquitious application. You can change the database engine without worrying about breakages in query. Moreover, SQL Alchemy enables the user to bulk insert data on the order of 10k records at a rate of less than 3 seconds. The method for bulk inserting is slightly different to the simple insertion discussed above. The generic version for a bulk insert process can be seen here:
Run the App
Now we can finally tie it all together! Running
python app.py in your terminal will start up a local server where you can see the finished product. Any write-back implementation that lives outside of a Dash callback executes when the app.py file is run while those statments inside the callback will wait for some user action to execute. This version of your app will also automatically update as you make exchanges to your code. When you’re happy with the app, you can deploy it (e.g. using Plotly’s Dash Enterprise offering) for others on the web to see!
In the coming months, we’ll be sharing more Databricks-Plotly content via articles, videos, and webinars. (Look out for these on our website!) Interested in learning more about how Databricks SQL Alchemy and Dash work together? Email email@example.com.