How to combine Dash and Dask for analytic apps that scale — and perform

By Plotly · Jul 7, 2021 · 8 min read

Hear directly from the CEOs of two of the industry’s most exciting startups, Plotly’s Jack Parmer and Coiled’s Matt Rocklin, while data visualization expert JP Hwang demos how to architect your own Dash + Dask apps.

📌 View the recorded webinar!

Introduction

The phrase “big data” has become truly ubiquitous over the last few years, escaping the specialist domain of data scientists. This evolution has not happened without reason: companies that can leverage the insight big data offers are dominating the marketplace, and analysts who can extract genuine insight from large and growing datasets command a premium.

Most data science and data visualization tools, however, are simply not designed to deal with datasets containing many millions of rows or terabytes of data. Yet that’s precisely where Dask and Dash shine — and why they are becoming the new standard tools for Python data scientists dealing with big data.

With Dash, Pythonistas can build rich full-stack analytic web applications without writing any JavaScript. With Dask, a data scientist can implement high-performance parallel and distributed data processing workflows in pure Python using familiar APIs.

In this article, we’ll explore:

  • How Dash and Dask make it possible to present interactive visualizations based on high-performance data analysis workflows to a wide audience
  • How to connect Dash and Dask and construct your app’s software architecture
  • How these projects complement each other to extend the reach of what a Python data scientist can accomplish
  • Two example use-cases featuring 40 million cell towers and 7.4 million ships

Dask

Dask is a flexible library for parallel and distributed computing in Python. At its core, Dask supports the parallel execution of arbitrary computational task graphs.

Built on this core, Dask provides a variety of convenient high-level interfaces that mimic popular PyData projects. In particular, it provides a distributed DataFrame that mimics Pandas, a distributed Array that mimics NumPy, and a machine learning library that mimics Scikit-Learn.

Diagram from Dask 2.15.0 documentation
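
For example, a Dask DataFrame exposes a familiar Pandas-like API while splitting the work across many partitions. A minimal sketch, with a hypothetical file path and column names:

import dask.dataframe as dd

# Read a directory of CSV files into a single Dask DataFrame;
# each file becomes one or more partitions that can be processed in parallel
df = dd.read_csv("data/towers-*.csv")

# Familiar Pandas-style operations build a lazy task graph...
mean_range = df.groupby("radio")["range"].mean()

# ...which only executes when .compute() is called
print(mean_range.compute())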

Additionally, Dask has been embraced by the larger PyData community and has been incorporated as an optional execution environment for many other projects including XArray, Datashader, XGBoost, HoloViews, Airflow, and more.

Dask provides several scheduler implementations to efficiently handle a wide range of hardware configurations, from single-core laptops to clusters with hundreds of machines. The most flexible and feature-rich scheduler is Dask.distributed, which can be used for multi-process parallelization on a single multi-core machine or distributed parallelization across a cluster.

A powerful feature of Dask.distributed is the ability to persist a distributed data structure (e.g. a Dask DataFrame) into memory across the workers in a cluster and then publish it to the cluster so that it can be accessed by other processes connected to the cluster.
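
In code, this pattern might look like the following sketch; the scheduler address, data location, and dataset name are illustrative:

from dask.distributed import Client
import dask.dataframe as dd

# Connect to a running Dask.distributed scheduler
client = Client("127.0.0.1:8786")

# Build a Dask DataFrame and persist it into memory across the workers
df = dd.read_parquet("s3://my-bucket/cell-towers/")  # hypothetical dataset
df = df.persist()

# Publish the persisted DataFrame under a name so that other processes
# connected to the same scheduler can retrieve it later
client.publish_dataset(towers=df)

# Later, from a different process (e.g. a Dash web worker):
# client = Client("127.0.0.1:8786")
# df = client.get_dataset("towers")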

Dash

Dash is a Python framework for creating rich analytic web applications from pure Python. The dashboard logic is implemented in Python, and so it has seamless access to the full PyData technology ecosystem. The document structure is also implemented in Python, and Dash presents this structure to users using React components and Plotly.js figures.
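
As a minimal illustration (the components, figure, and data here are placeholders, not taken from any of the apps below), a Dash app declares both its layout and its callbacks in pure Python:

from dash import Dash, dcc, html
from dash.dependencies import Input, Output
import plotly.express as px
import pandas as pd

app = Dash(__name__)

# Hypothetical demo data
df = pd.DataFrame({"x": list(range(10)), "y": [v ** 2 for v in range(10)]})

# The document structure is declared as a tree of Python components
app.layout = html.Div([
    dcc.Dropdown(
        id="scale",
        options=[{"label": s, "value": s} for s in ["linear", "log"]],
        value="linear",
    ),
    dcc.Graph(id="graph"),
])

# Callbacks wire user interactions to Plotly.js figure updates
@app.callback(Output("graph", "figure"), Input("scale", "value"))
def update_graph(scale):
    return px.line(df, x="x", y="y", log_y=(scale == "log"))

if __name__ == "__main__":
    app.run_server(debug=True)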

For more information on the relationship between Dash and React, see this previous post.

Dash’s programming model was specifically designed to support efficient horizontal scaling. Because all per-session application state is stored in the browser, server memory usage does not appreciably increase as the number of simultaneous clients increases. Another advantage of this architecture is that successive callbacks from the same user session do not have to be handled by the same server process, or even by the same server. This makes it possible for a relatively small pool of web worker processes to efficiently handle requests from a relatively large number of concurrent users.

As user demand grows, multiple servers, each with multiple web worker processes, can be configured to serve a Dash app using a load balancer like NGINX.

Connecting Dash to Coiled

Coiled, founded by Dask’s original author, is a hosted service that launches and manages Dask clusters in the cloud, which makes it a convenient way to back a Dash app with a remote cluster. A few practical considerations are worth covering.

Logging in

Locally, you will probably use the command-line tool coiled login and follow the prompts to log in to the service. As this may not be an option for a remote deployment, we suggest saving your Coiled token as an environment variable and running the script below to log in:

import os

# Read the Coiled token from an environment variable and log in
# non-interactively (equivalent to running `coiled login` at a prompt)
coiled_token = os.environ['COILED_TOKEN']
os.system(f"coiled login --token {coiled_token}")

Remote environments

Coiled’s cluster will need to run in a software environment that mimics your local one in order to perform its computations. However, the requirements for the remote environment may differ from those of your local one.
For example, you may have been loading the data from a local drive, but have set up the Coiled cluster to read it from an AWS S3 bucket instead. In that case, you would need to install s3fs so that the cluster can access S3. Add it to your list of required packages for conda or pip (see here for more details), as in the sketch below.
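
One way to do this is through Coiled’s Python API, along the lines of the following sketch; the environment name and package pins here are illustrative, not prescriptive:

import coiled

# Create (or update) a named software environment on Coiled that mirrors
# the packages the app needs, plus s3fs for reading data from S3.
# The environment name and package list below are illustrative.
coiled.create_software_environment(
    name="dash-dask-app",
    pip=[
        "dask[complete]",
        "distributed",
        "s3fs",
        "pandas",
    ],
)

# The cluster can then be launched against this environment, e.g.:
# cluster = coiled.Cluster(name="test-clust-1", software="dash-dask-app", ...)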

Cluster uptime

Although Coiled provides a generous free tier, it is obviously not unlimited. Meanwhile, unlike use-cases involving one-off computations, a Dash app needs the cluster to be running for the app to function.
So pass the shutdown_on_close=False parameter when initializing the cluster, and potentially also specify a longer idle timeout, as in the example below.

cluster = coiled.Cluster(
    name="test-clust-1",
    ...
    shutdown_on_close=False,
    scheduler_options={"idle_timeout": "1 hour"}
)

One option is to let the user restart the Coiled cluster manually. This can be achieved, for example, with a button on the Dash app and an associated callback function:

@app.callback(
    Output("placeholder", "children"),
    [Input("restart-btn", "n_clicks")],
)
def restart_coiled(n_clicks):
    if n_clicks is not None:
        global client
        client = get_client(client)
        global df
        df = load_df()
        return dash.no_update
    else:
        return dash.no_update

Deployment Architecture

The dashboard repository includes a launch.sh script that can be used to serve the dashboard from a single machine. This script has four steps:

Step 1: Launch Dask scheduler

First, a Dask.distributed scheduler is launched using the dask-scheduler command.

$ dask-scheduler --host 127.0.0.1 --port 8786 &

Step 2: Launch Dask workers

Second, a collection of Dask.distributed workers is launched using the dask-worker command. This command requires the address of a Dask.distributed scheduler. In this case the scheduler and workers are launched on the same machine, but in a cluster configuration the scheduler and workers would generally reside on different machines. The number of worker processes is controlled by the --nprocs flag.

$ dask-worker 127.0.0.1:8786 --nprocs 4 &

Step 3: Publish data

Third, a load_data.py Python script is executed. This script will connect to the Dask.distributed scheduler, load and preprocess the dataset as a Dask DataFrame, persist the Dask DataFrame so that it is stored in memory across the workers, and finally publish the DataFrame so that it can be accessed by name from any future processes connected to the scheduler.

$ python load_data.py 127.0.0.1:8786
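
A simplified sketch of what such a load_data.py script might contain is shown below; the data path and the published dataset name are illustrative:

# load_data.py (simplified sketch; paths and names are illustrative)
import sys

import dask.dataframe as dd
from dask.distributed import Client


def main():
    scheduler_address = sys.argv[1]  # e.g. "127.0.0.1:8786"
    client = Client(scheduler_address)

    # Load and preprocess the dataset as a Dask DataFrame
    df = dd.read_parquet("data/cell_towers/")
    df = df[df["lat"].notnull()]

    # Persist the DataFrame into memory across the workers, then publish it
    # so that other processes connected to the scheduler can look it up by name
    df = df.persist()
    client.publish_dataset(cell_towers=df)


if __name__ == "__main__":
    main()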

Step 4: Launch Web Workers

Fourth, gunicorn is used to bind a socket to an externally visible port (8050 in this case) and launch multiple worker processes that load and serve the dashboard app on this socket. The number of gunicorn workers is controlled by the --workers flag. From each of the gunicorn worker processes, an instance of the dashboard app connects to the Dask.distributed scheduler.

$ gunicorn "app:get_server()" --bind 127.0.0.1:8050 --workers 3 &
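
The "app:get_server()" argument tells gunicorn to call an application factory. A hedged sketch of what such a factory might look like follows; the scheduler address, dataset name, and the build_layout/register_callbacks helpers are illustrative placeholders:

# app.py (simplified sketch)
from dash import Dash
from dask.distributed import Client


def get_server():
    # Each gunicorn worker process connects to the same Dask.distributed scheduler
    client = Client("127.0.0.1:8786")

    # Retrieve the DataFrame that load_data.py persisted and published earlier
    df = client.get_dataset("cell_towers")

    app = Dash(__name__)
    app.layout = build_layout(df)        # hypothetical layout builder
    register_callbacks(app, client, df)  # hypothetical callback registration

    # gunicorn serves the underlying Flask server object
    return app.server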

Using this approach, a single in-memory Dask DataFrame can be shared across multiple web worker processes.

Enterprise Deployment

The supported number of concurrent dashboard users can be scaled by replicating servers behind a load balancer such as NGINX or Dash Enterprise Kubernetes (DEK).

Unlike other data visualization tools, Dash was designed with a stateless back end.

Because Dash servers are stateless input/output machines, DEK can add Dash servers dynamically without the overhead of plugging into a global back-end store.

Let’s take a look at two example use-cases for this tech.

World Cell Towers App

Code: https://github.com/plotly/dash-world-cell-towers
Link to app: https://dash-gallery.plotly.host/dash-world-cell-towers

In this example, we’ve used Dash and Dask together to build a dashboard that supports the interactive exploration of every cell tower in the OpenCelliD dataset. This dataset includes the latitude/longitude coordinates, cell radio type, construction date, and signal range for over 40 million cell towers across the world.

The radio, signal range, and construction year are each aggregated and displayed as histograms. Towers can be filtered by performing selections on any combination of these histograms.

Datashader is used to create an image of the latitude/longitude coordinates of the towers, colored by their radio type, and this image is overlaid on an interactive Mapbox map. Each time a user performs a zoom or pan action on the map, this image is regenerated for the new viewport. Thanks to Datashader’s built-in support for Dask DataFrames, these computations are performed in parallel across all of the available Dask workers in the cluster.
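
At its core, the Datashader step looks roughly like the sketch below; the column names, canvas size, and map extents are illustrative:

import datashader as ds
import datashader.transfer_functions as tf

# df is a Dask DataFrame of tower positions with a categorical "radio" column.
# The canvas size and coordinate ranges correspond to the current map viewport.
canvas = ds.Canvas(
    plot_width=700, plot_height=500,
    x_range=(-2.0e7, 2.0e7), y_range=(-8.0e6, 8.0e6),  # Web Mercator extents
)

# Aggregate tower counts per pixel, split by radio type. With a Dask DataFrame,
# this aggregation runs in parallel across all workers in the cluster.
agg = canvas.points(df, x="x_web_mercator", y="y_web_mercator",
                    agg=ds.count_cat("radio"))

# Shade the aggregate into an RGBA image that can be overlaid on the Mapbox map
img = tf.shade(agg, how="eq_hist")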

Click here to check out the app for yourself.

AIS Ship Data Explorer

The following app helps users to easily navigate the US Coast Guard Automatic Identification System (AIS) vessel-location dataset.

Each day, the US Coast Guard collects information from millions of vessels, such as their location, time, ship type, speed, and length. Aggregating this data by these variables can reveal valuable insights.

Over the course of just a few days, the volume of information collected adds up to gigabytes of data, and the aggregate data can easily reach terabytes over years.

Dask makes aggregating all this data possible through parallelization. In this app, millions of rows of data are processed and visualized virtually instantly, and scaling the cluster makes dealing with even larger datasets a breeze. A sketch of the kind of aggregation involved is shown below.
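
The column names and path in this sketch are illustrative of the AIS schema rather than taken from the app’s code:

import dask.dataframe as dd

# Read many daily AIS extracts in parallel (path and columns are illustrative)
ais = dd.read_csv(
    "s3://my-bucket/ais/AIS_2020_*.csv",
    usecols=["MMSI", "VesselType", "SOG", "Length"],
)

# Aggregate millions of position reports by vessel type:
# number of reports, mean speed over ground, and mean vessel length
summary = (
    ais.groupby("VesselType")
       .agg({"MMSI": "count", "SOG": "mean", "Length": "mean"})
       .compute()
)
print(summary)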

Conclusion

If your organization’s success depends on your data scientists using Python to solve ambitious big data problems and deploying those solutions easily while staying performant, Dash + Dask are what you need.

Scale to the cloud for computing power, ease, and speed, and use a data visualization tool that accomplishes what PowerBI and Tableau cannot, all tuned for the modern requirements of the enterprise.

Special thanks to Coiled’s Oren Herschander, Gus Cavanaugh and Fabio Rosado for their support.
