Building a Big Data Geographical Dashboard with Open-Source Tools

Mapping the U.S. power grid with Dash, Dask, and Datashader

JP Hwang
Plotly
16 min read · Sep 3, 2021


📌 Watch our previous webinar on the power grid.

The app in action (try it here)

Author: JP Hwang
Source code: https://github.com/databyjp/datashader_powergrid
App: https://dash.gallery/datashader-powergrid/

Introduction

Investing in infrastructure is critical for societies. Given our ever-growing need for power and more ambitious emissions reduction goals, it’s no surprise that there is significant recent buzz about power infrastructure.

We are not infrastructure experts. But we do know data, and we know that high-quality data and analyses are critical to decision-making. We built the above app to demonstrate how Dash can help researchers, developers and the public explore geographical data, which can present some unique challenges.

This Dash app is meant to empower users to easily discover insights from the dataset through its map, overlays and summary figures. Take a look at the next image:

Wind speed data overlaid with grid and current wind power plant locations

It shows the current wind power plant data and grid availability over the top of a layer showing potential renewable energy (i.e. mean wind speeds).

It reveals, at a glance, suitable locations for further large-scale wind installations. The equivalent data for solar power is also available with the click of one “Focus (Presets)” button.

But before delving into what the app can actually tell us, let’s first talk about the app itself.

Background & Motivation

This app was built with open-source tools, and is itself open-source (publicly available at this repository) to encourage you to explore and adapt it. Bill Gates was recently quoted touting the power of open-source tools in powering his philanthropic efforts.

“I have this effort to create an open-source model of electricity demand generation that includes weather models, so the countries that have made really aggressive commitments about renewable use can see that their grid is going to start being reliable. … The fact that I’m running an open-source model to test whether these aggressive goals are achievable, it blows the mind.” — Bill Gates

That the founder of Microsoft is advocating and marveling at the power of open-source software is a glowing endorsement of the open-source community’s efforts and its utility.

Another exciting feature that makes this app stand out is that it was built to be scalable. And we mean really scalable, as you will see later. Apps like this can easily handle far larger datasets, making real use of big data.

To accomplish this task, the app was built with a core technology stack that includes Dash, Dask and Datashader (with an additional option for using Coiled). We think this is a great solution for managing big geographical (or geometric) data, and here we discuss the why as well as the how.

If you haven’t used one or more of those technologies, don’t worry — this article will introduce them, briefly discuss their strengths and use cases, and present how they are used here to manipulate and present geographical data.

Again, everything here is open-source (see it here). We encourage you to download the app, follow along, and fork the code to your heart’s content!

Let’s get started!

Dashboards with Dash / Dask / Datashader

A data dashboard must do three things at a minimum. It needs to: 1) load and manipulate the data, 2) construct plots for the user to see, and 3) present a front end for the user to interact with. Each of these three tools essentially takes on one of those jobs for the present use case. Let’s briefly discuss each of them.

Dask

Many, if not most, Pythonistas would have used pandas for data manipulation.

Although it is a great tool, pandas is limited by the memory and compute power of the machine that it is running on. The same goes for the rest of the “PyData” stack, such as numpy. This means that on your laptop, you will likely not be able to load a dataset larger than your RAM, let alone perform operations on it.

Dask seeks to address this challenge by enabling distributed computing with similar interfaces to the PyData stack.

Dask dataframe enables processing of larger-than-memory datasets by breaking down the dataset into multiple dataframes and coordinating them, whether it be on a single machine, or distributed throughout a cluster. So where a pandas dataframe may be loaded with:
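(A minimal sketch; the file name is a placeholder.)

```python
import pandas as pd

# pandas reads the entire file into memory in one go
df = pd.read_csv("grid_data.csv")
```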

A Dask dataframe may be loaded with:
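(Again a sketch, with the same placeholder file name.)

```python
import dask.dataframe as dd

# Dask builds a lazy, partitioned dataframe rather than loading everything at once
df = dd.read_csv("grid_data.csv")
```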

This consistency of interface between pandas and the Dask dataframe is one of Dask’s strengths and core tenets. While it’s not quite as simple as replacing all your pd function calls with dd, it does come close. Adapting an app to use Dask instead of pandas does not require a complete rewrite.

Dask was chosen here for data analysis and manipulation.

Datashader

Datashader is another tool built to deal with very large datasets.

Rather than plotting individual points as many graphics packages do, Datashader aggregates data onto a grid before plotting. If that sounds unintuitive, think of it as converting the source data into a spatial histogram. Datashader efficiently generates visuals that convey impressions of patterns and properties of data regardless of how big a dataset it may be dealing with. For instance, see below.

300 million data points plotted with Datashader (Source: datashader.org)

The image here was plotted from hundreds of millions of rows of points from the 2010 census using Datashader’s default settings. It is a beautiful visualisation as well as an informative, impactful one. Notice how it provides an instant, unmistakable impression of population densities throughout the continental United States.

Accordingly, Datashader was used for a significant portion of our visualisation tasks. Below, the colour contour of wind speeds as well as the lines of grid segments are both rendered by Datashader.

Shade layers created in Datashader

Dash

Plotly’s Dash is a framework for building data visualization apps with custom user interfaces. It abstracts away the tasks required for building a web application, so that a user interface, or the front end, for an entire web app can be built in a short time without ever leaving Python. A small sampling of the available demo apps is shown below from our Dash Enterprise App Gallery.

A sample of apps built with Dash (Dash Enterprise App Gallery)

Although we may be biased, Dash is our tool of choice for building powerful data apps, and so it was selected to handle the front end.

In summary: Dask enables scaling the back end as datasets or compute requirements grow, Datashader allows rendering of figures from big, big datasets, and Dash enables scaling the front end, for example as the number of simultaneous users increases. Each performs a critical function in this stack to ensure that this app is robust and scalable.

Now, then, let’s talk more concretely about how the Power Grid Explorer app utilises these tools.

The app in detail

Managing big (geo)data with Dask

A few different datasets are used in the app. The solar/wind power plant data is relatively simple, containing their latitudes/longitudes and bibliographical information. The electric power transmission lines (i.e. power grid) dataset is large, and contains some unusual components. The wind speed and solar radiation data are somewhere in between.

Even though we provide a preprocessed version of the dataset through the source code, let’s spend some time reviewing the steps in the preprocessing.

The raw data provided by the U.S. Energy Atlas is in a shapefile (.shp) format, which looks like this:

The geometry column and their constituent Linestring objects stand out as unusual here. They are one of the classes designed to represent geometric data such as polygon, or multipolygon objects, and can contain any number of segments of lines. For instance, the first row here contains a series of (x-y) coordinates (31 pairs to be exact) as shown here.
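To make that concrete, here is a tiny illustrative example using shapely (the coordinates are made up):

```python
from shapely.geometry import LineString

# A LineString is simply an ordered series of (x, y) coordinate pairs
segment = LineString([(-101.84, 35.22), (-101.71, 35.19), (-101.65, 35.12)])

print(list(segment.coords))  # the constituent coordinate pairs
print(segment.length)        # length in the coordinate system's own units
```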

These are tricky to manage with vanilla pandas, and normally geopandas (like pandas, but for geometric data) would be a good solution. Just like pandas, however, geopandas is not suitable for use directly with Dask, and so we use spatialpandas here instead. We begin by building a spatialpandas GeoDataFrame:
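(A sketch of the gist; the shapefile path is a placeholder, and the full steps are in convert_data.py.)

```python
import geopandas as gpd
import spatialpandas as spd

# Read the raw shapefile with geopandas, then convert it into a
# spatialpandas GeoDataFrame, which plays well with Dask and Datashader
gdf = gpd.read_file("Electric_Power_Transmission_Lines.shp")
spd_gdf = spd.GeoDataFrame(gdf)
```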

One key feature of spatialpandas is its packing of partitions based on geometry. This means the resulting dataframe is split into spatially optimised (i.e. divided by their positions in space) partitions like so:
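(A sketch, assuming the spatialpandas Dask interface; the partition count here is arbitrary.)

```python
import dask.dataframe as dd

# Convert to a Dask-backed GeoDataFrame, then repack the partitions so that
# geometries close to each other in space end up in the same partition
ddf = dd.from_pandas(spd_gdf, npartitions=16)
ddf = ddf.pack_partitions(npartitions=16)
```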

Its impact becomes more and more important as the dataset size increases, as you can imagine the performance penalty of constantly loading different Dask partitions from disk into memory.

Additionally, the location data in longitudes/latitudes is converted to a different coordinate system (EPSG:3857) based on Eastings/Northings which we will use later. Geopandas provides the built-in .to_crs() method for this job, so we modify the “geometry” column with the appropriate coordinate system:
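(Roughly, assuming the CRS was read in from the shapefile.)

```python
# Re-project the geometry from longitude/latitude (EPSG:4326) to
# Web Mercator (EPSG:3857), the projection used by the map tiles later on
gdf["geometry"] = gdf["geometry"].to_crs(epsg=3857)
```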

Those are the main aspects in preprocessing, and now we can save the dataset (for full details, see convert_data.py).
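Saving it might look something like this (a sketch; the output path is a placeholder):

```python
# Write the packed, re-projected dataset out as Parquet for the app to load
ddf.to_parquet("data/Electric_Power_Transmission_Lines.parq")
```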

The dataset can be loaded in the main app with spatialpandas’ read_parquet_dask function (from spatialpandas.io). We note that we load most columns explicitly as 32-bit data types, to avoid the unnecessary memory use that 64-bit data types would incur.
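As a sketch (the path and column name are illustrative):

```python
from spatialpandas.io import read_parquet_dask

# Load the spatially partitioned Parquet dataset as a Dask-backed GeoDataFrame
df = read_parquet_dask("data/Electric_Power_Transmission_Lines.parq")

# Keep numeric columns in 32-bit form to avoid unnecessary memory use
df["VOLTAGE"] = df["VOLTAGE"].astype("float32")
```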

As Parquet files do not retain the variable’s category metadata (see more information here), we also process the categorical variables so that the appropriate metadata is generated:
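(A sketch; "TYPE" is an illustrative column name.)

```python
# Parquet files do not retain pandas' category metadata, so regenerate it after loading
df = df.categorize(columns=["TYPE"])
```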

Once all of that’s done, we can persist the processed dataframe into memory by executing df = df.persist(). This is a key difference for Dask dataframes compared to pandas. Dask is lazy by default, and as a result running df.persist() strategically to compute and retain the results in memory is good practice (read more here). Now our dataset is ready to be published for Dask clients to use.

That brings us to the last point on Dask — the nature of distributed computing means we do need to make a few more deliberate choices in how the actual distribution is carried out under the hood.

In our case, we set up a dask.distributed scheduler (read more) to which the dataset is published. This means the dataset is persisted across distributed memory, and remains shared and available to any number of Dash clients (read more), rather than each Dash client having to load the data individually, or a Dask server having to re-process the dataset every time it is requested. The dataset is published like so:
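(A minimal sketch; the scheduler address and dataset name are placeholders.)

```python
from dask.distributed import Client

# Connect to the running dask.distributed scheduler
client = Client("tcp://127.0.0.1:8786")

# Persist the processed dataframe in distributed memory and publish it
# under a name that other clients can look up
df = df.persist()
client.publish_dataset(grid_data=df)
```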

From the Dash app’s perspective, the data can be accessed by a simple line:
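(Assuming the same placeholder scheduler address and dataset name as above.)

```python
from dask.distributed import Client

# In the Dash app, connect to the same scheduler and fetch the published dataset
client = Client("tcp://127.0.0.1:8786")
df = client.get_dataset("grid_data")
```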

The datasets describing wind and solar potentials are loaded this way also. One key step to perform with those datasets is converting the TIF files from the NREL websites to CSV, for which we used raster2xyz. Once that’s done, you will have files with X-Y coordinates and output (Z) values.
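The exact invocation depends on the raster2xyz version; as an assumption about its Python interface, the conversion looks roughly like this (file names are placeholders):

```python
from raster2xyz.raster2xyz import Raster2xyz

# Convert the NREL GeoTIFF raster into a CSV of x, y, z rows
rtxyz = Raster2xyz()
rtxyz.translate("nrel_wind_speed.tif", "nrel_wind_speed_xyz.csv")
```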

The solar radiation data is in longitudes and latitudes, and the wind data is in the ESRI:102009 projection.

Just as an aside, provisioning local clusters may be troublesome, if not downright impossible, depending on your needs. In that case, Coiled might be a good solution; it provides scalable remote clusters for jobs like this.

Still with us? Good, because that was probably the most challenging part of this article. Now let’s get to the fun stuff and draw some pretty pictures with our data.

Rendering with Datashader

Let’s begin by importing Datashader, using the convention to alias it as ds like so:
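(The transfer_functions module is also imported here, as we will use it for shading later.)

```python
import datashader as ds
import datashader.transfer_functions as tf
```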

Then, the general steps are to create a canvas, aggregate the data (to lines in this case), and plot the aggregated data.
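A minimal sketch of those three steps (the dataframe, column names and canvas size here are placeholders, not the app’s actual values):

```python
import datashader as ds
import datashader.transfer_functions as tf

# 1. Create a canvas covering the area of interest at a chosen resolution
cvs = ds.Canvas(plot_width=700, plot_height=500)

# 2. Aggregate the line geometries onto the canvas grid
#    (df is a spatialpandas dataframe with a "geometry" column)
agg = cvs.line(df, geometry="geometry", agg=ds.count())

# 3. Shade the aggregation into an image
img = tf.shade(agg, cmap=["lightblue", "darkblue"])
```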

While there are obviously many variations and arguments to be passed, these are the three main steps to generate an image with Datashader.

More concretely, for our use-case, we will vary these arguments to assist turning data into insights.

Canvas size: ds.Canvas(plot_width=width, plot_height=height)

Although higher resolutions are almost always better when it comes to video games, our Datashader plot can benefit from lower resolutions in some situations. As Datashader aggregates data, a lower resolution means that the resulting image provides more of a “higher-level” overview.

For example, compare the below Datashader plots, where the colour indicates total power output (per aggregated data point) for solar power plants. The plots are clearly at two different resolutions, and while the figure on the right is arguably more aesthetically pleasing, the figure on the left makes it easier to see where the solar power capacity lies throughout the United States. Thus the choice of canvas size, or resolution, is a key factor in how the information is presented.

Total solar power capacities — at different Datashader canvas resolutions

That brings us to the next point, which is to specify what the figures are to aggregate.

Aggregation variable: cvs.line(agg=…)

This parameter specifies how the data is to be aggregated. For example, ds.any() will simply register the presence of any matching data (e.g. power plants) in that canvas grid cell, whereas ds.sum(“Total_MW”) will return the sum of the Total_MW column for the data in that cell. The figure below left simply shows counts while the one below right shows total capacities, and they leave distinctly different impressions.
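For instance (a sketch; plant_df and the x/y column names are placeholders, while “Total_MW” is the capacity column mentioned above):

```python
import datashader as ds

cvs = ds.Canvas(plot_width=300, plot_height=200)

# Mark the presence of any power plant in each canvas cell
agg_presence = cvs.points(plant_df, x="x", y="y", agg=ds.any())

# Sum the installed capacity (MW) of the plants in each cell
agg_capacity = cvs.points(plant_df, x="x", y="y", agg=ds.sum("Total_MW"))
```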

Power plants plotted by count (left) and by total capacity (right)

Once the Datashader aggregation is complete, we generate the image like so:
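(Continuing from the aggregation sketch above; the colours and scaling are placeholders.)

```python
import datashader.transfer_functions as tf

# Shade the aggregated values into an image, mapping low-to-high values
# onto the given colour range
img = tf.shade(agg_capacity, cmap=["lightyellow", "darkred"], how="linear")
```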

The same principles apply in generating the plot of wind speeds, or solar radiation incidence.

Map shaded by average solar radiation (left) or wind speed (right)

Having said all that, we’ve yet to discuss how the underlying map is generated. How does the app generate the map, know which portion of it is being shown, and align the map with the Datashader image?

With Dash, of course.

Front end with Dash

As mentioned above, Dash is a framework for building powerful data visualization apps in a short time without ever leaving Python. This also means that in Dash apps, user events can be captured and used in Python callback functions.

In our case, we use the relayoutData property from graphs rendered in Dash (with dcc.Graph objects), which relays information on layout-level edits by the user. So when the user zooms or pans, the dcc.Graph object sends back information that describes what is being shown: the x and y ranges for some charts, and for maps such as this, the centre and the zoom information.
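As a minimal sketch of that mechanism (the component IDs are illustrative, not the app’s actual ones):

```python
from dash import Dash, dcc, html, Input, Output

app = Dash(__name__)
app.layout = html.Div([dcc.Graph(id="map-graph"), html.Pre(id="view-info")])

@app.callback(Output("view-info", "children"), Input("map-graph", "relayoutData"))
def show_viewport(relayout_data):
    # For mapbox figures, relayoutData carries e.g. "mapbox.center" and "mapbox.zoom"
    return str(relayout_data)

if __name__ == "__main__":
    app.run_server(debug=True)
```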

But first we must generate a map. Because we will be overlaying the Datashader image to display the data, the map simply needs to show the underlying geometry. Accordingly, we determine the bounds of the geometry and generate a map like so:
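(A much-simplified sketch; the map style, centre and zoom here are placeholders for values derived from the data bounds.)

```python
import plotly.graph_objects as go

# An (empty) map figure; the data itself is overlaid later as Datashader image layers
fig = go.Figure(go.Scattermapbox())
fig.update_layout(
    mapbox_style="carto-positron",
    mapbox_center={"lat": 38.0, "lon": -96.0},
    mapbox_zoom=3,
    margin={"l": 0, "r": 0, "t": 0, "b": 0},
)
```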

This is where the earlier conversion from latitudes/longitudes was needed. The map is styled with Carto tiles, as you can see below (read more here):

As Carto (along with OpenStreetMaps) uses EPSG:3857 projection, the Datashader image must be rendered with the same projection. (If you were to generate the map with the wrong projection, the resulting misaligned image will have you checking your eyes for double vision.)

At the same time, we also use the relayout data to filter our dataframe to include only the data to be shown on the map. This is relatively trivial: the centre lat/lon coordinate and the zoom level are used to calculate the bounds of the map, and the dataframe is filtered like so:
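(A simplified sketch, assuming "x"/"y" columns in EPSG:3857 metres and an approximate viewport size in pixels.)

```python
import numpy as np

def filter_df_by_view(df, lat, lon, zoom, width_px=700, height_px=500):
    # Approximate metres-per-pixel for Web Mercator tiles at this zoom level and latitude
    m_per_px = 156543.03 * np.cos(np.radians(lat)) / (2 ** zoom)
    half_w = m_per_px * width_px / 2
    half_h = m_per_px * height_px / 2

    # Map centre converted from lon/lat to EPSG:3857 coordinates
    x0 = lon * 20037508.34 / 180
    y0 = np.log(np.tan((90 + lat) * np.pi / 360)) * 20037508.34 / np.pi

    # Keep only the rows that fall within the visible bounds
    return df[
        (df["x"] > x0 - half_w) & (df["x"] < x0 + half_w)
        & (df["y"] > y0 - half_h) & (df["y"] < y0 + half_h)
    ]
```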

This filtered dataframe is then passed to Datashader to be rendered, and the rendered image is wrapped in a layer dictionary and added to the Plotly figure as an additional layer.

The result of all this is the below.

Zooming fun with Dash and Datashader

As we zoom in and out (and pan), the Dash app dynamically responds to our inputs and renders the data on-the-fly. Notice that the same filtered data is also used to dynamically generate the legend for the map!

It’s simply magic. 🪄

One last nifty app feature we wanted to mention here is use of layers. Maps, more than perhaps any other forms of visualisation, often use layers to present the Goldilocks level of information to the user (not too much, not too little).

So in our app, each dataset is plotted as a discrete layer: Datashader renders multiple discrete images, rather than combining the underlying data and plotting one image. Plotly.py allows this type of figure construction (through the mapbox_layers argument shown below), and we use it to build toggles that switch individual layers on or off through Dash’s callback functions, as you see in the figure below.

Switching layers on/off

The general syntax for each layer is something like…
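(A sketch; the corner coordinates are placeholders, and img is an image produced by tf.shade as above.)

```python
# One mapbox layer dictionary per rendered Datashader image
layer = {
    "sourcetype": "image",
    # Flip vertically so the first image row is the top of the map, then convert to PIL
    "source": img[::-1].to_pil(),
    # Corner lon/lat pairs: top-left, top-right, bottom-right, bottom-left
    "coordinates": [
        [lon_min, lat_max],
        [lon_max, lat_max],
        [lon_max, lat_min],
        [lon_min, lat_min],
    ],
}
```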

and then updating the figure with…
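(Again a simplified sketch; fig is the Plotly figure built earlier.)

```python
fig.update_layout(mapbox_layers=layers_list)
```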

…where layers_list is the list of layers to be shown.

We won’t go into the rest of the app in detail; if you’ve used Dash before, the remaining parts will look relatively familiar (and good job on sticking with us all the way to here). But put plainly, there is a lot of information being displayed, as well as controls to power the users’ exploration.

As we mentioned, this could be easily scaled to datasets that are orders of magnitude larger. All of it is enabled by a core stack of Dask, Datashader and Dash.

The Power Grid App in all its glory.

Insights from the exploration

Now that we’ve spent all this time talking about how the app has been built, let’s spend just a little bit of time looking at what the app can tell us.

The below image hides everything but the power plants. It shows that wind power largely comes from the Midwest and Southwest, while solar power is dominated by the coastal regions.

Wind/solar power plants in the United States

These locations might seem slightly unintuitive at first, so let’s see if the other data layers can help us out. We begin by looking at available solar radiation as well as the power grid as captured in the image below.

Solar radiation, solar power plants and the grid

Clearly, many of the solar power plants shown here follow the high-voltage grid lines, as well as the high availability of sunlight in the southern and coastal areas (and Minnesota, for whatever reason). The same logic could be followed for identifying areas where potentially new solar plants could be built.

The app is designed to help with such tasks by allowing the user to select an area and then performing a number of calculations to assess its energy potential. The below image shows such an example, with an area of just under 380 km² selected in Arizona, just northeast of Phoenix.

Solar radiation and power grid

The app estimates that the selected area could theoretically generate over 1% of the current annual electricity demand of the entire United States.

By the way, some of you may have noticed that we changed the map style to better see the underlying topography. Plotly allows you to choose from a number of map tile types, and we have made the app adaptable to the publicly available styles.

The equivalent analysis of wind might be even more helpful.

To start, the map below, showing average wind speeds, immediately explains much of why so many wind power plants are located where they are.

Wind speeds and wind power plants

The corridor of higher average wind speeds through the middle of the country clearly stands out, and coincides with the concentration of wind power plants. And once the grid data is incorporated, it becomes much easier to identify potentially suitable areas for new wind farms. We can simply exclude areas where grid connectivity or sufficient wind speeds are not available.

Wind speeds, wind power plants and the grid

The data explorer shows that wind power plants and grid connections essentially go together, as one would expect. Again, this can help us to identify potential sites for further wind farms. For example, the area shown here in Wyoming might be a good, currently underutilized site, with relatively high wind speeds and proximity to the grid.

Wind speeds, wind power plants and the grid

Obviously, power potential and grid proximity are but two of innumerable factors in choosing a site for new power plants.

It would not be viable to replace an airport with a wind farm, or a highway with a solar array. But you can see how this app might really help to speed up the process significantly by identifying potential candidates.

Wrapping up

Geographical data can be extremely complex and voluminous, but also fascinating and extremely valuable. This type of data readily lends itself to visual exploration. We hope that this showcase app has whetted your appetite for how such data can be wrangled and turned into wisdom, faster.

Using the right tools in Dash, Dask and Datashader enables us to turn big data into true insights by processing it, aggregating it, and presenting it to its users in a visual, interactive format. Our app focuses on demonstrating scalability and we are confident that you can adapt its concepts and code to suit your own needs.

All in all, we hope to have lowered the barrier to entry in getting started on your own journey into managing big geographical data, as well as saved you some time. We’ve found packages like Dask and Datashader to be invaluable parts of the data science and visualisation toolkit, and have no doubt that you will too.

We can’t wait to see what you build next with these tools and Dash.

See you next time!

Data sources
