[Plotly + Datashader] Visualizing Large Geospatial Datasets
Introduction (what we’ll create):
Unlike the previous tutorials in this map-based visualization series, we will be dealing with a very large dataset in this tutorial (about 2GB of lat, lon coordinates). We will learn how to use the Datashader library to convert this data into a pixel-density raster, which can be superimposed on a Mapbox base-map to create cool visualizations. The image below shows what you will create by the end of this tutorial.
Structure of the tutorial:
The tutorial is structured into the following sections:
Pre-requisites:
This tutorial assumes that you are familiar with python and that you have python downloaded and installed in your machine. If you are not familiar with python but have some experience of programming in some other languages, you may still be able to follow this tutorial, depending on your proficiency.
It is very strongly recommended that you go through the Plotly tutorial before going through this tutorial. In this tutorial, the installation of plotly and the concepts covered in the Plotly tutorial will not be repeated.
Also, you are strongly encouraged to go through the ‘About Mapbox’ section in the [Plotly + Mapbox] Interactive Choropleth visualization tutorial. We will not repeat that section here, but it is very much a part of this tutorial.
About Datashader:
Quoting the official Datashader website,
Datashader is a graphics pipeline system for creating meaningful representations of large datasets quickly and flexibly
In layman terms, datashader converts the millions of lat-lon coordinates into a pixel-density map. Say you have a million lat-lon coordinates bound between latitudes [x,y] and longitudes [a,b]. Now, you create a 100x100 pixels image with the corners corresponding to the extreme lat-lon pairs. So you now have a total of 10,000 pixels. Each pixel corresponds to a physical tile of say 100 sq. km. (actual area will depend on the values of x,y,a,b). Now, if tile1 has 100 lat-lon coordinates within it and tile2 has 1000 coordinates, tile2 has a coordinate density 10 times higher than tile 1. Thus, the pixel corresponding to tile2 will be 10 times brighter than the pixel corresponding to tile1. So essentially, a million lat-lon coordinates now get converted into 10,000 pixel-density mappings. Essentially, the coordinates have been converted into a raster image. This is what makes datashader so powerful.
Installing datashader:
If you are using Anaconda,
conda install datashader
Else, you can use the pip installer:
pip install datashader
See the Getting Started guide on the datashader website for more information.
Getting started with the tutorial:
GitHub repo: https://github.com/carnot-technologies/MapVisualizations
Relevant notebook: DatashaderDemo.ipynb
View notebook on NBViewer: Click Here
Import relevant packages:
import dask.dataframe as dd
import datashader as ds
import plotly.express as px
Note the import of dask.dataframe instead of pandas. Because we are dealing with a large dataset, dask will be much faster than pandas. For perspective, the .read_csv() operation takes 19 seconds with pandas and less than a second with dask. Click here to know more about why dask is preferred for large datasets. The gist is that dask utilizes all the cores on your machine, which pandas is unable to do.
Import and clean data:
Since the relevant CSV for this tutorial is about 2 GB large (74 million + coordinates), it was not possible to host this on GitHub. It can be downloaded from this Google Drive link. It is recommended that you download this file and save it into your data folder. Once that is done, you can simply import it like any other CSV.
Note: Make sure that you don’t have any other heavy software open when you are loading this dataset, especially if your RAM is comparable to the file size.
df = dd.read_csv('data/lat_lon_data.csv')
Now, we will perform some basic cleaning of the data. Since our region of interest is India, we will make sure that all coordinates outside the lat-lon bounds of India are excluded.
#Remove any unwanted columns
df = df[['latitude','longitude']]#Clean data, remove any out of bounds points
df = df[df['latitude'] > 6]
df = df[df['latitude'] < 38]
df = df[df['longitude'] > 68]
df = df[df['longitude'] < 98]
Creating the datashader canvas:
cvs = ds.Canvas(plot_width=1000, plot_height=1000)
agg = cvs.points(df, x='longitude', y='latitude')
# agg is an xarray object, see http://xarray.pydata.org/en/stable/coords_lat, coords_lon = agg.coords['latitude'].values, agg.coords['longitude'].values# Corners of the image, which need to be passed to mapbox
coordinates = [[coords_lon[0], coords_lat[0]],
[coords_lon[-1], coords_lat[0]],
[coords_lon[-1], coords_lat[-1]],
[coords_lon[0], coords_lat[-1]]]
We have created a 1000 x 1000 canvas cvs
. Next, we projected the longitude and latitude from the dataframe onto the canvas, using cvs.points
. Then we fetch the projected coordinates and determine the corner points for the image.
Now that we have the canvas ready, let us define the colormap for the visualization. We will use the hot
colormap. You can use other alternatives, like fire, or any other color map of your choice.
from matplotlib.cm import hot
import datashader.transfer_functions as tf
img=(tf.shade(agg, cmap = hot, how='log'))[::-1].to_pil()#pil stands for Python Image Library
A couple of things to note here. We are using a transfer function to shade the projected coordinates, using the hot
colormap. We have specified the mapping methodology as log
. This is to ensure that even the low-intensity points get represented adequately in the visualization. If we chose the linear
mapping, then the high intensity points completely overshadow the low-intensity points.
Another mapping option is eq_hist
, which produces a result similar to the log transformation. You can read more about it here. A comparison of the outputs of the 3 transformations in shown below.
As you can see, almost nothing is visible with the linear transformation. This is because a couple of pixels with extremely high intensity have overshadowed all others. You will need to zoom-in to identify those hotspots.
Similar to the transformation, different color map options are also available. To get the list of all color maps, click here. Below, the examples with a few different color maps are shown.
Creating the visualization:
fig = px.scatter_mapbox(df.tail(1),
lat='latitude',
lon='longitude',
zoom=4,width=1000, height=1000)# Add the datashader image as a mapbox layer image
fig.update_layout(mapbox_style="carto-darkmatter",
mapbox_layers = [
{
"sourcetype": "image",
"source": img,
"coordinates": coordinates
}]
)
fig.show()
Here, we are plotting just one point from the dataframe (the last one), so that plotly can create the scatter visualization. We are using the carto-darkmatter style from Mapbox and overlaying the image output of datashader as a layer on top of the visualization. Congratulations!! Your visualization is ready!
When to use this library:
The answer is perhaps the simplest for this library. Use this when you have a very large data set. If you find this visualization aesthetically appealing as I do, then you can use this for smaller datasets as well, but the results will depend on the density distribution of your data. You won’t get high interactivity, because datashader essentially overlays an image on the Mapbox base-map. But you can still zoom and pan the visualization.
References:
- Datashader Website: https://datashader.org/
- Plotly With Datashader: https://plotly.com/python/datashader/
We are trying to fix some broken benches in the Indian agriculture ecosystem through technology, to improve farmers’ income. If you share the same passion join us in the pursuit, or simply drop us a line on report@carnot.co.in