Visualizing a Billion Points: Databricks SQL, Plotly Dash… and the Plotly Resampler

Plotly

Published in

DBSQL SME Engineering

11 min readOct 31, 2023

Plotly On Databricks Blog Series — Article #5

📌 To learn more about creating production data apps with Databricks and Dash, watch the recorded technical session.

Authors: Sachin Seth (Plotly), Dave Gibbon (Plotly), Cody Austin Davis (Databricks)

Introduction

Late last year, we published an article entitled “Building Real-Time Production Data Apps with Databricks and Plotly Dash,” where we built an example of a real-time IoT data application using the Databricks SQL Connector for Python. It used standard Plotly charting methods, and showcased inbound streaming data in appending increments. Building upon that foundation, in this article we focus on visualizing substantial time-series datasets, a critical step in unearthing deeper insights and trends embedded within Industrial IoT and other streaming use cases.

TL;DR — The process for building at-scale interactive Plotly Dash analytics apps for large quantities of industrial IoT time-series data in conjunction with the Databricks SQL Python connector (DB SQL). In Figure 1 above, nearly 200 million data points are being visualized simultaneously. Check out the GitHub Repo here.

The objective of this article is to provide an overall walkthrough of how we were able to visualize super large time-series data stored in Databricks using cutting edge aggregation and downsampling techniques found in the Plotly Resampler, and ultimately combine these two technologies to serve a Dash powered, Polars optimized, Plotly production data application.

In an effort to pick up where we left off, we will be employing Industrial IoT sensor readings which simulate an automobile. This approach not only facilitates a more detailed exploration of underlying patterns and trends but also offers a scalable solution for managing and interpreting data at an industrial scale. In the process, we will also go over some of the challenges posed by working with large time-series datasets and in general an overview of what it takes to query and visualize 10¹⁰ individual data points at a time.

Why Use the Plotly Resampler

The Plotly Resampler (https://arxiv.org/abs/2206.08703; Jonas and Jeroen Van Der Donckt) is an add-on for Plotly’s Python bindings, enhancing line chart scalability on top of an interactive toolkit (powered by Plotly + Dash!) by aggregating the underlying data depending on the current graph view. This can fundamentally change your experience when seeking to visualize large sets of sequential data.

Here’s why we decided to use it for our Industrial IoT streaming data example:

Performance enhancement for large datasets: By default, Plotly chart interactivity works best when handling datasets <= 100,000 data points. For large-scale time-series data where you want to interactively engage with all of the data, the Plotly Resampler dynamically aggregates data respective to the current scale, ensuring smoother interactions and faster load times.
Dynamic interaction with data: The tool utilizes callbacks to update and aggregate data as you interact with the plot, whether you’re zooming or panning, making the exploration of large datasets more intuitive and responsive.
Customizable aggregation algorithms: It offers interfaces for a variety of sequence aggregation methods, allowing you to choose or develop an algorithm that suits your data visualization needs.
Enhanced user experience: The Resampler alters the behavior of double-clicking within a line-chart area to trigger an “Autoscale” event.

Prerequisites

This article is primarily written for those who have large quantities of time- series data sitting in either ADLS2 Blob Storage or AWS S3 Buckets and are considering methods to visualize and otherwise explore their data. If you don’t have data of your own, you can use the Raspberry Pi Azure IoT Online Simulator to stream simulated data into Blob Storage (detailed in Article #2) or feel free to use the CSV included in the repo which you can upload to Databricks and write to Delta tables using the accompanying SQL or PySpark module.

In addition to a Delta table containing a time-series dataset, you will need the following:

A Databricks workspace with Databricks SQL enabled (DB SQL is enabled by default in Premium Workspaces or above)
A DB SQL endpoint or Databricks cluster with 9.1 LTS or higher (data engineering cluster)
A personal access token in Databricks to authenticate SQL Endpoint via API
A Python development environment (>=v 3.10). We recommend VSCode for a local IDE and using Conda or virtual env to manage dependencies, as well as black to automatically format your code.

High-Level Steps

For the purposes of this article, we are going to assume that the reader already has time-series data stored in the cloud, or has used one of the data generation methods described in the prerequisite section to create a dataset. From there the steps we followed were:

Use the Medallion Architecture to create a pipeline which ingests raw IoT data from the cloud/source and writes to three stages of Delta
Use the Databricks SQL Connector for Python to query the Delta table on the SQL Warehouse
Funnel data from SQL Warehouse into Arrow files (use Parquet if you want to replicate on your end) which live on the persistent file storage system offered with Dash Enterprise (similarly can just be stored in the root directory of your project folder at the cost of memory)
Load data from Arrow files into a Polars optimized version of the Plotly Resampler for visualization (or parquet files into the regular Plotly Resample)

For a more detailed breakdown of the Medallion Architecture refer to Article #2.

Step 1: Write to Delta

This section is intended for those who do not already have a large time- series dataset to visualize. If you already have your measurement data written to Delta tables you can skip to Step 2: Query SQL Warehouse.

As previously established, a best practice industrial IoT pipeline uses the Medallion Architecture to pre-proccess data in accordance with the diagram below

Figure 2: The proccess above outlines a Medalliion Structure Pipeline highlighting the importance of first staging the data, then computing moving averages, and then doing analytics on the data in that order.

For the purposes of this article its safe to treat the CSV generated by the Python script as your gold layer. Once the CSV is uploaded to your Databricks File System, you can write its contents to a Delta table using the SQL like in the previous article or even PySpark. The PySpark example is provided below and in the accompanying GitHub repo.

%python
# File location and type
file_location = "/FileStore/auto_iot_sensor_data.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

permanent_table_name = "main.resamplerdata.auto_iot_data_bronze_sensors"
df.write.format("delta").saveAsTable(permanent_table_name)

Figure 3: An expeditious way to write data from CSV files in Databricks to Delta tables in a SQL Warehouse

In Article #2, our gold layer contained moving averages over 15s and 60s to represent our data, however, since we are using the Plotly Resampler- we can accommodate much higher noise levels than usual without a loss to performance. For example, depending on the size of your dataset you may be able to plot your data without having to first represent it as moving averages at all, or at the very least over much smaller intervals than would have been otherwise possible.

Step 2: Query SQL Warehouse

After the pipeline has been built and data is being written to the Delta tables, the next step is using the Databricks SQL Connector for Python to query the SQL Warehouse and return the results of the query as a Parquet file within our Dash application. The following code describes how to create a connection engine using SQLAlchemy and your Databricks credentials. If you are following along user the GitHub repository, all you have to do is add you Databricks credentials to the .env file in the project folder.

from sqlalchemy import create_engine
import pandas as pd

engine_url = f"databricks://token:{token}@{host}/?http_path={path}&catalog=main&schema=information_schema"
engine = create_engine(engine_url)

# # Fetch data from the database
engine_temp_stmt = "SELECT Timestamp, EngineTemperature_C FROM main.resamplerdata.auto_iot_bronze_sensors ORDER BY Timestamp ASC LIMIT 10000000;"
oil_pressure_stmt = "SELECT Timestamp, OilPressure_psi FROM main.resamplerdata.auto_iot_bronze_sensors ORDER BY Timestamp ASC LIMIT 10000000;"
tire_pressure_stmt = "SELECT Timestamp, TirePressure_psi FROM main.resamplerdata.auto_iot_bronze_sensors ORDER BY Timestamp ASC LIMIT 10000000;"
battery_voltage_stmt = "SELECT Timestamp, BatteryVoltage_V FROM main.resamplerdata.auto_iot_bronze_sensors ORDER BY Timestamp ASC LIMIT 10000000;"


engine_temp_df = pd.read_sql_query(engine_temp_stmt, engine)
oil_pressure_df = pd.read_sql_query(oil_pressure_stmt, engine)
speed_df = pd.read_sql_query(speed_stmt, engine)
tire_pressure_df = pd.read_sql_query(tire_pressure_stmt, engine)
battery_voltage_df = pd.read_sql_query(battery_voltage_stmt, engine)

engine_temp_df.to_parquet(f"backend-data/engine_temp_df.parquet")
oil_pressure_df.to_parquet(f"backend-data/oil_pressure_df.parquet")
speed_df.to_parquet(f"backend-data/speed_df.parquet")
tire_pressure_df.to_parquet(f"backend-data/tire_pressure_df.parquet")
battery_voltage_df.to_parquet(f"backend-data/battery_voltage_df.parquet")

Figure 4: Code demonstrating how to pull data out of Databricks using SQLAlchemy engine into Parquet files.

In our enterprise example, we store the data in arrow files inside of a persistent file storage system offered with Dash Enterprise. The code above however, demonstrates how to store the data from Databricks into Parquet files in a folder within our project called back-end data. In this case we partitioned and stored the data by column. This protocol of caching the queried data prevents moving data unnecessarily over the network. Furthermore, the app can be made to ingest only new data according to a schedule, thereby reducing unnecessary bandwidth usage, enhancing performance, and optimizing resource allocation.

For more optimized SQL workflows, consider using SQL Alchemy dialect for the added performance and security benefits as detailed in Article #4.

Step 3: Plotly Resampler

Now that the data has been queried and cached, we can finally load our Arrow files into the Resampler for visualization. After our data comes out of Databricks it is stored as Arrow files in our persistent file system. Under the hood, the Resampler is using Dash callbacks to register inputs and dynamically update the figure, as demonstrated in the code below:

# --------- graph construction logic + callback ---------
@app.callback(
    [
        Output("coarse-graph", "figure"),
        Output("plotly-resampler-graph", "figure"),
        ServersideOutput("store", "data"),
    ],
    [Input("plot-button", "n_clicks"), *get_selector_states(len(name_folder_list))],
    prevent_initial_call=True,
)
def construct_plot_graph(n_clicks, *folder_list):
    it = iter(folder_list)
    file_list: List[Path] = []
    for folder, files in zip(it, it):
        if not all((folder, files)):
            continue
        else:
            files = [files] if not isinstance(files, list) else file_list
            for file in files:
                file_list.append((Path(folder).joinpath(file)))

    ctx = callback_context
    if len(ctx.triggered) and "plot-button" in ctx.triggered[0]["prop_id"]:
        if len(file_list):
            # Create two graphs, a dynamic plotly-resampler graph and a coarse graph
            dynamic_fig: FigureResampler = visualize_multiple_files(file_list)
            coarse_fig: go.Figure = go.Figure(
                FigureResampler(dynamic_fig, default_n_shown_samples=3_000)
            )

            coarse_fig.update_layout(title="<b>coarse view</b>", height=250)
            coarse_fig.update_layout(margin=dict(l=0, r=0, b=0, t=40, pad=10))
            coarse_fig.update_layout(showlegend=False)
            coarse_fig._config = coarse_fig._config.update(
                {"modeBarButtonsToAdd": ["drawrect"]}
            )

            dynamic_fig._global_n_shown_samples = 1000
            dynamic_fig.update_layout(title="<b>dynamic view<b>", height=450)
            dynamic_fig.update_layout(margin=dict(l=0, r=0, b=40, t=40, pad=10))
            dynamic_fig.update_layout(
                legend=dict(
                    orientation="h", y=-0.11, xanchor="right", x=1, font_size=18
                )
            )

            return coarse_fig, dynamic_fig, dynamic_fig
    else:
        return no_update

Figure 5: Callbacks demonstrating Resampler logic

There are a number of ways to use and further customize the Plotly Resampler; in this demo we are featuring a page with a simultaneous coarse and fine-grained view, a figure which supports multiple overlaying y-axis, and a page which enables the user to add traces as needed for comparison. One of the key features of Plotly Resampler is the ability to trigger updates to the graph based on user input. So, for example, you can see in coarse/fine-grain view, changes to the coarse graph trigger updates to fine-grained graph, automatically adjusting the view. In addition, for the purposes of our article we are passing a Polars dataframe into the Plotly Resampler for improved performance vs pandas- look out for a future article featuring true 0 MB visualizations using Polars and Plotly Resampler!

Figure 6: In this figure we can see two charts one which dynamically updates the other. Here we are visualizing a 6 month time period containing 10 million unique points and are able to zoom to the sub-second in real time.

Just like any Dash chart we can also completely customize the colors, themes, line sizes, legend visibility for added readability and aesthetic. Ultimately, using the Resampler we can sift through a dataset containing 6 months of time-series data- as many as 50 millions rows (or more depending on available memory)- and zoom in to the sub-second level with almost no latency. Furthermore, it facilitates the incorporation of multiple y-axes, fostering a rich cross-analysis of insights

Raising the Stakes with Polars

Out of the box, the Plotly Resampler performs very well for datasets containing around 150 million to 200 million points. If the datasets get much larger however, memory constraints, understandably, become an issue. Enter Polars. Using Polars dataframes instead of Pandas inside of the Resampler library led us to a significant reduction in resources consumed to the point we could visualize datasets with as many as 1 billion points. This allows for much larger datasets to be processed efficiently without running into memory bottlenecks. But how does Polars achieve this feat?

Polars, an alternative to Pandas, is engineered specifically for high-performance, multi-threaded data processing. A few reasons why Polars outperforms Pandas, especially for large datasets are:

Lazy evaluation: Polars employs lazy evaluation, which means computations aren’t executed immediately when an operation is called. Instead, they’re queued up and executed in a single pass. This optimizes the sequence of operations and reduces unnecessary intermediate calculations.
Arrow memory layout: Polars uses Apache Arrow for its memory layout. Arrow, a cross-language development platform, offers columnar memory, ensuring that data is stored contiguously in memory, leading to faster processing times.
Built-in multi-threading: While Pandas might require manual parallelization for optimal performance, Polars automatically utilizes multi-threading, maximizing the CPU utilization for data operations.

In the context of the Plotly Resampler, these performance enhancements become critical. When dealing with large datasets, the ability to quickly process and resample the data is paramount, and given the memory efficiencies of Polars, users can now tackle datasets that were previously out of reach without investing in additional hardware or complex optimizations.

Figure 7: An excerpt from our upcoming 1 billion point demo. Here five traces each containing 50 million data points are being manipulated in real time. Using the Plotly Resampler, you can zoom from a six month timeline all the way down to the sub-second for multiple sensor readings at the same time without loss of detail.

Switching to Polars inside the Resampler library not only unlocked the potential to work with more substantial datasets but also offered faster processing times for existing ones. Users no longer need to compromise on speed or scale. The Polars Plotly Resampler library is an ongoing project at Plotly with further enhancements coming soon. As of now, the Polars Plotly Resampler is in private preview, contact us if you want to test it out for yourself!

The Datashader Alternative

As an alternative to the Plotly Resampler, Datashader emerges as a top contender begin able to manage massive datasets effortlessly, a feature that stands as its foremost advantage. This ability stems from its efficient rasterization technique that renders data points as pixels, an approach that often proves to be more adept at handling enormous datasets.

Figure 8: This Datashader example features a line chart containing 24 million data points, which dynamically switches between using Datashader and Plotly traces as needed, depending on the x-range of the current zoom level.

Seamless integration with another revered visualization library, Bokeh, further enhances Datashader’s functionality, facilitating interactive plots that can engross viewers.. Especially in the domain of geospatial data visualization where large datasets are a norm, and where you want to downsample other 2D data representations (the Plotly Resampler is constrained to sequential 1D data such as time-series), Datashader finds its firm standing.

However, it is not without its share of disadvantages. For beginners, particularly those less acquainted with Python scripting, navigating Datashader can be a complex endeavor, whereas the Plotly Resampler can be deployed as a wrapper for a Plotly figure. Moreover, when standalone, it tends to offer lesser interactivity compared to other tools like Plotly. This restriction, coupled with a potential loss of finer data details in rasterized outputs, marks areas where Datashader seeks improvements. (Sidenote: We are looking at ways to integrate Datashader with Plotly charts / Dash to make it “easier”, and thus make this capability more accessible to our enterprise audience).

Conclusion

In our exploration of large-scale time-series data visualization, the Plotly Resampler’s capabilities, especially when integrated with Databricks ecosystems, shone through. It’s not merely the ability to handle vast datasets that distinguishes it, but also a signature dynamic interaction. These along with deep customization options, and notably enhanced performance makes it a pivotal tool for those dealing with expansive time-series data.

While Datashader is commendable, especially in its unique rasterization techniques and integration with Bokeh, it perhaps doesn’t provide the same level of intuitive interactivity that Plotly Resampler offers. This nuance makes the latter a preferred choice for industrial IoT time-series contexts.

Furthermore, the addition of Polars to the Resampler’s workflow accentuates the rapid progression and adaptability in the data science toolkit. As we continue to leverage emerging libraries, it becomes evident that the future holds no bounds for data professionals.

To wrap up, the combined strengths of Plotly’s charting library, Dash, Dash Enterprise, the Plotly Resampler and Databricks offer an unparalleled toolkit, empowering professionals to delve deeper into vast datasets and redefine the horizons of data analytics.