Brick by Brick: Build a multi-page dashboard (Sankey diagrams)

Simi Talkar
Published in Analytics Vidhya
9 min read · Dec 20, 2020

Part 4: This is the fourth installment of a multi-part series on incrementally building a dashboard using Plotly Dash. We discuss data cleaning, data type recognition, formatting and preparation, especially with regard to creating a Sankey chart and a bar chart. The Sankey diagram will show the number and types of properties in a neighborhood.


Inside AirBnB provides publicly available data for AirBnB listings in Seattle, USA. We will use the dashboard framework we created in the first part of this series and include spatial data to explore the locations of the listings.

Goal

We will be creating a Sankey chart for the dashboard as shown below. Next to it, for each neighborhood group, we also have a bar chart segmented by neighborhood that shows the same information in a different style. In the Sankey chart, each link shows the number of listings of a certain property type in a certain neighborhood, and the links leaving a neighborhood together show how that neighborhood splits across property types. The aggregate count for a property type or a neighborhood is displayed upon hovering on its node. This is the additional information that the Sankey chart offers over the segmented bar chart, within the same space.

Two level Sankey
Sankey and bar within the dashboard

Installations

Follow the instructions in this Medium article to set up and run the Dash app server from a Jupyter notebook. The notebook on GitHub also lists the packages used, along with their versions, to help you get going.

Code and Downloads

The code for this series can be found at this GitHub repo. The application is run from a Jupyter notebook. There are .py (Python) files that contain the code for the layout and data manipulation. I used, and highly recommend, Visual Studio Code for working with these, as it has a number of handy extensions for formatting code, which is particularly helpful when building the HTML DOM structure and coding in Python.

The files used to create the dashboard with the maps are:
Notebooks:
RentalSankeyChart.ipynb (used in this article)
CreateDashboardBarSankey.ipynb (the dashboard implementation of the Sankey and bar charts)
.py files:
layout_bar_sankey.py
datamanipulation_bar_sankey.py
callback_bar_sankey.py
.html files:
IndvListingsMap.html
NeighborhoodCountMap.html

Concepts

A Sankey diagram is a visualization technique for displaying flows (traditionally of energy, money or materials). Several entities (nodes) are represented by rectangles or text. Their links are represented by arrows or arcs whose width is proportional to the size of the flow.
For further reference, you can visit this comprehensive catalog. Several levels of a flow hierarchy can be depicted in a single Sankey chart. In our application, we will use this visualization to show the contribution of listings from neighborhoods to a neighborhood group, and the contribution of listings of various property types to a neighborhood.

Data

The rental data can be viewed from several perspectives:
a) The renter
b) The landlord
c) The company AirBnB
d) The city council creating ordinances and regulations.

Slicing and segmenting this data in various ways enables us to provide insights to these stakeholders. Every dataset that lands in the lap of a data analyst needs some massaging before we can squeeze the information out of it. Cleaning the data and adding summary or cleaned-up columns is typically the first step.

Loading the data and the initial clean

Broadly, the createRentalDF function loads the listings_1.csv file, which was extracted from the listings.csv.gz file that we downloaded from the Inside AirBnB site. To build the visualizations for this dashboard I need only some of the columns, so after an exploratory data analysis I picked the columns of interest into the dataframe rental_df.

1. Column information and datatypes

While most of the column names are self-explanatory, I searched online for the meaning of the rest, looked into the datatypes in the dataframe loaded from the file, and marked the columns I would like to clean and modify for visualization purposes. More importantly, I distinguished the categorical from the numerical data types, the ordered from the nominal, and the discrete from the continuous.
For information about these types, here’s an article you will enjoy. Below is a screenshot of a document I prepared to guide my cleaning of the dataframe.

Datatypes and column understanding

Note: From license downwards in the above chart, all values are text.

2. Missing information

I would like to highlight a few data cleaning tasks performed on some of the columns. The host_response_rate is a string with a percent at the end. This has to be made numeric by stripping off the percent sign and converting to float. But the field also has information missing for certain hosts. Quite a number of hosts (about 1/5) have multiple listings. We can leverage this fact to fill the absent response rate with a “typical” response rate by finding the mean response rate for a host and filling in this value for that host where the value is missing.
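Note: The snippet below finds the hosts with multiple listings. Filtering the counts Series directly avoids depending on the column name that to_frame() produces, which differs between pandas versions.

# Count listings per host, then keep the hosts with more than one listing
host_counts = rental_df["host_id"].value_counts()
host_counts[host_counts > 1]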

For more information on how missing values can be dealt with, you will find this article useful.
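
The response-rate cleanup described above can be sketched as follows. This is a minimal illustration on made-up rows, not the code from the repo: the output column name host_response_rate_num is an assumption, and hosts whose every listing lacks a rate are simply left missing.

```python
import pandas as pd

# Toy stand-in for rental_df; the rows are invented for illustration
rental_df = pd.DataFrame({
    "host_id": [1, 1, 1, 2, 3],
    "host_response_rate": ["100%", "80%", None, "90%", None],
})

# Strip the trailing percent sign and convert to float (missing values stay NaN)
rate = pd.to_numeric(
    rental_df["host_response_rate"].str.rstrip("%"), errors="coerce"
)

# Mean response rate per host, broadcast back onto every listing of that host
host_mean = rate.groupby(rental_df["host_id"]).transform("mean")

# Fill a missing rate with the host's own mean where one exists
# (host_response_rate_num is a hypothetical column name)
rental_df["host_response_rate_num"] = rate.fillna(host_mean)
```

Host 3 has a single listing with no rate, so there is nothing to average and its value remains missing; a global mean or median could be used as a further fallback if needed.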

3. Converting string data to numerical

Some of the categorical text data can be converted to numeric values for further ML analysis. For instance, host_response_time falls into four categories that are clearly ordered, since a response “within an hour” has a higher value than “a few days or more”. So we assign a numerical value to each category to facilitate analysis.
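
One way to do this, sketched here under assumptions: the four category strings are the ones Inside AirBnB uses, but the numeric ranks (4 for the fastest response down to 1) and the output column name are my own choices for illustration and may differ from the notebook.

```python
import pandas as pd

# Hypothetical ranking: a higher number means a faster response
response_time_rank = {
    "within an hour": 4,
    "within a few hours": 3,
    "within a day": 2,
    "a few days or more": 1,
}

# Toy rows; a missing response time stays missing after mapping
rental_df = pd.DataFrame({
    "host_response_time": [
        "within an hour", "a few days or more", None, "within a day",
    ]
})

# Series.map looks each category up in the dictionary
rental_df["host_response_time_num"] = rental_df["host_response_time"].map(
    response_time_rank
)
```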

4. Cleaning up and consolidating the property types

The property_type column in rental_df has a wide range of values, as listed in the notebook RentalSankeyChart.ipynb. To find meaning in this data, we will consolidate conceptually similar properties so that we have a smaller number of property types to visualize and analyze. Rooms are categorized as Shared or Private, independent of the property they are within (house/apartment). When a property is described as “Entire”, the type of the property (Cabin/House/Condominium/Cottage) is brought to the forefront. Other consolidations can be seen in the function createPropertyTypeCol in the notebook.
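
The two rules spelled out above could look roughly like this. The helper consolidate_property_type is hypothetical and covers only those two rules plus a catch-all "Other"; it is not the createPropertyTypeCol function from the repo, which handles more cases.

```python
import pandas as pd

def consolidate_property_type(raw: str) -> str:
    """Hypothetical helper: collapse a raw property_type into a broader class."""
    text = raw.strip().lower()
    # Rooms are classed by privacy, regardless of the building they are in
    if text.startswith("shared room"):
        return "Shared room"
    if text.startswith("private room"):
        return "Private room"
    # "Entire condominium" -> "Condominium"
    if text.startswith("entire "):
        return text[len("entire "):].capitalize()
    # Everything else is lumped together in this sketch
    return "Other"

# Toy rows for illustration
rental_df = pd.DataFrame({
    "property_type": [
        "Private room in house",
        "Shared room in apartment",
        "Entire condominium",
        "Entire cottage",
        "Boat",
    ]
})
rental_df["property_type_class"] = rental_df["property_type"].apply(
    consolidate_property_type
)
```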

Setting up Sankey chart data

We have neighborhood groups forking into neighborhoods, which finally fork into property types. Our nodes are therefore the neighborhood groups, the neighborhoods and the property types we created above. Each link shows the number of listings along one connection: neighborhood group to neighborhood, or neighborhood to property type. These aggregate counts are created dynamically when the user selects a neighborhood group, using the pandas groupby and count functions.

Plotly's graph_objects module (imported as go) requires the data to be arranged as lists that communicate the nodes and the link weights at each level of the hierarchy or flow; these are passed as arguments to the go.Sankey function. I will show you how we can simplify the creation of these lists for source, target and link values. The code that creates the Sankey chart follows the description of its arguments below, and you can refer to the documentation for further details.

The two arguments, node and link, each of which is a dictionary object, are to be set up with desired attributes, and completely describe the chart.

node: The node names are specified within a dictionary in the label attribute. Here the list is populated with the names of the nodes for every level of the Sankey flow. We can customize the look of each node by selecting its thickness and the line enclosing it. We can also set a single color or provide a different color for every node (Hint: set the color argument as a list, with every element representing the color of a node). The hover tooltip is customized using a combination of “customdata” and “hovertemplate”. In customdata I have provided the node names, so when the user hovers on a node, the template can use the node name to build a string that displays the aggregate number of listings for that node (whether a neighborhood group, a neighborhood or a property type).

link: The link argument is another dictionary object in which three keys (source, target and value) hold the lists that carry the connection information. The color of each link is provided in the list passed to the color attribute. And just as in the case of the node, a hovertemplate can be provided to customize the tooltip (or to override the default tooltip). It uses the customdata we set up in the node.

import plotly.graph_objects as go

# label_list, color_node, source, target, values and color_link are the
# lists built from the dataframe, as described in the rest of this section
fig = go.Figure(data=[go.Sankey(
    node=dict(
        pad=15,
        thickness=20,
        line=dict(color="black", width=0.5),
        label=label_list,
        color=color_node,
        customdata=label_list,  # node names, reused by both hover templates
        hovertemplate="%{customdata} has %{value} listings<extra></extra>",
    ),
    link=dict(
        source=source,
        target=target,
        value=values,
        color=color_link,
        hovertemplate="Link from %{source.customdata}<br />"
        + "to %{target.customdata}<br />has %{value} listings<extra></extra>",
    ),
)])

For instance, using pandas groupby we create a dataframe for the first level of the Sankey for the neighborhood group “Queen Anne” (this is the user selection in our dashboard filter). When we group the data by the neighborhoods within Queen Anne, we get a dataframe as seen below. The column nbd_count_listings provides the list we need to pass to the value argument above. We can turn this column into a list to use its values, but how about the lists required for the source and target?

Dataframe with aggregate listing count and indexes
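
A first-level aggregate like the one pictured can be produced along these lines. This is a sketch on invented rows: neighborhood_cleansed and nbd_count_listings follow the names used in this article, while the neighborhood_group column name is an assumption and may be spelled differently in the actual data.

```python
import pandas as pd

# Toy listings; one row per listing, values made up for illustration
rental_df = pd.DataFrame({
    "neighborhood_group": ["Queen Anne"] * 5 + ["Ballard"],
    "neighborhood_cleansed": [
        "East Queen Anne", "East Queen Anne",
        "West Queen Anne", "West Queen Anne", "West Queen Anne",
        "Adams",
    ],
})

# The dashboard filter supplies the selected neighborhood group
selected_group = "Queen Anne"
in_group = rental_df[rental_df["neighborhood_group"] == selected_group]

# Count listings per neighborhood within the selected group
level1_df = (
    in_group.groupby(["neighborhood_group", "neighborhood_cleansed"])
    .size()
    .reset_index(name="nbd_count_listings")
)

# This list is what would be passed as the link values for the first level
values = level1_df["nbd_count_listings"].tolist()
```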

It is now time to create lists that map as seen below, with the indexes positioned to align with the values.

First level Sankey List Nodes and Links (Neighborhood Group to Neighborhood)

To create the source and target indexes, we start by assigning an index to every node. The neighborhood group selected by the user in the dashboard is placed first in “label_list”, which gives it index 0. Then get the list of unique neighborhood names in the group and add it to label_list; each neighborhood’s index is its position in this list. Finally, create a dictionary that maps each neighborhood name to its position in label_list.

This dictionary is worth its weight in gold, since we can now use it to map the column values (neighborhood_cleansed in our dataframe) to the indexes. The same technique is employed for the property types. We append all the available unique property types to the same list (label_list), thus assigning each of them a list index. This index is the value stored under the property type key in a dictionary that can be used to map the column property_type_class to an index value.

Code showing the creation of indexes to be used in the source and target lists
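
In plain Python, the indexing idea described above looks roughly like this. The node names are toy values and label_index is a hypothetical name for the dictionary; the sketch also assumes that no name appears at more than one level, since a duplicate would overwrite an earlier index.

```python
# Toy node names for illustration
selected_group = "Queen Anne"
neighborhoods = ["East Queen Anne", "West Queen Anne", "Lower Queen Anne"]
property_types = ["House", "Private room", "Apartment"]

# Index 0 is the selected group, followed by neighborhoods, then property types
label_list = [selected_group] + neighborhoods + property_types

# Map every node name to its position in label_list
label_index = {name: idx for idx, name in enumerate(label_list)}

print(label_index["Queen Anne"])       # 0
print(label_index["East Queen Anne"])  # 1
print(label_index["House"])            # 4
```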

Let’s see the procedure above repeated for the more granular level of neighborhoods and property types.

Start with the groupby dataframe containing the count of listings.

sankey_df

Imagine the links you want to create.

Create the dictionary with keys set as neighborhood names and values set as the index of the neighborhood in the label list. Use the dictionary to map the column in the dataframe to the index.

The columns of indexes created can now be turned into lists to be supplied to the source and target arguments in the go.Sankey function.

Converting column into lists for source, target and value
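
Put together, the mapping and list conversion for the neighborhood to property type level might look like this. The counts are invented, and the column names source_idx, target_idx and count_listings are assumptions made for the sketch.

```python
import pandas as pd

# Toy aggregate: listings per neighborhood and property type class
sankey_df = pd.DataFrame({
    "neighborhood_cleansed": ["East Queen Anne", "East Queen Anne", "West Queen Anne"],
    "property_type_class": ["House", "Private room", "House"],
    "count_listings": [4, 2, 5],
})

# Node labels and their positions, built as described earlier
label_list = ["Queen Anne", "East Queen Anne", "West Queen Anne", "House", "Private room"]
label_index = {name: idx for idx, name in enumerate(label_list)}

# Map names to node indexes; a name missing from label_index would become NaN
sankey_df["source_idx"] = sankey_df["neighborhood_cleansed"].map(label_index)
sankey_df["target_idx"] = sankey_df["property_type_class"].map(label_index)

# Plain lists, aligned row by row, ready for the link dictionary
source = sankey_df["source_idx"].tolist()
target = sankey_df["target_idx"].tolist()
values = sankey_df["count_listings"].tolist()
```

Because all three lists come from the same dataframe rows, the i-th source, target and value always describe the same link.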

This pattern is highly scalable, since we can simply append new nodes for a new level to the label list and use the list index to set indexes for any aggregate level dataframe we create. The colors we want to assign to the links are added as columns to the dataframe as well and can be set as per node (neighborhood) value.
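
As a rough sketch of how the levels stack and how link colors can ride along as a column: the two toy dataframes below are concatenated into one table of links, and each link is colored by the neighborhood it touches. The palette, the column names and the counts are all assumptions for illustration.

```python
import pandas as pd

label_list = ["Queen Anne", "East Queen Anne", "West Queen Anne", "House", "Private room"]
label_index = {name: idx for idx, name in enumerate(label_list)}

# Level 1: neighborhood group -> neighborhood
level1 = pd.DataFrame({
    "source_name": ["Queen Anne", "Queen Anne"],
    "target_name": ["East Queen Anne", "West Queen Anne"],
    "value": [6, 5],
})
# Level 2: neighborhood -> property type
level2 = pd.DataFrame({
    "source_name": ["East Queen Anne", "East Queen Anne", "West Queen Anne"],
    "target_name": ["House", "Private room", "House"],
    "value": [4, 2, 5],
})

# Stack the levels; a further level would just be one more dataframe here
links = pd.concat([level1, level2], ignore_index=True)
links["source"] = links["source_name"].map(label_index)
links["target"] = links["target_name"].map(label_index)

# Hypothetical palette: one semi-transparent color per neighborhood
nbd_color = {
    "East Queen Anne": "rgba(31, 119, 180, 0.4)",
    "West Queen Anne": "rgba(255, 127, 14, 0.4)",
}
# The neighborhood is the target on level 1 and the source on level 2
links["nbd"] = links["target_name"].where(
    links["target_name"].isin(list(nbd_color)), links["source_name"]
)
links["color"] = links["nbd"].map(nbd_color)

# Dictionary in the shape expected by the link argument of go.Sankey
link = dict(
    source=links["source"].tolist(),
    target=links["target"].tolist(),
    value=links["value"].tolist(),
    color=links["color"].tolist(),
)
```

Coloring by neighborhood keeps a neighborhood's incoming and outgoing links visually connected across both levels.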

And Finally

I have shown you how to build the data structure needed for a Sankey chart in Plotly. I suggest that you now solidify your understanding by expanding the chart to another level: fork the property types we consolidated earlier back out into their individual property types, giving renters a more granular view. Assist them in renting the property their heart desires!

Feel free to comment, question and reach out.


Simi Talkar
Analytics Vidhya

Certified DS Associate (DP-100, DA-100), pursuing Masters in University Of Michigan’s Applied Data Science Program https://www.linkedin.com/in/simi-talkar/