How To Create Sankey Diagrams from DataFrames in Python

A wrapper for Plotly’s Sankey Diagram to make life easier

By Ken Lok
May 8, 2019

Example of basic Sankey Diagram

The Sankey Diagram is a plot that can tell a story. It is a form of flow diagram in which the width of the flow arrows is proportional to the quantity of flow.

It is well suited to showing how an initial pot is apportioned into subsequent smaller pots. Great examples include Sankey Diagrams that tell the visual story of annual household spend or of a job search process.

So how do we create a Sankey Diagram? Let’s get to it.

What makes up a Sankey Diagram?

A Sankey Diagram is basically made up of source and target pairs. Flows run from the source to the target, and each flow has a value that sets its size.
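To make the source/target/value idea concrete, here is a minimal sketch in plain Python. The flow names and numbers are made up for illustration. It turns named flows into the index-based form that Plotly's Sankey trace expects, where each node has a label and each link refers to its nodes by position in the label list.

```python
# Hypothetical flows as (source, target, value); names and numbers are made up.
flows = [
    ("Income", "Rent", 1200),
    ("Income", "Food", 600),
    ("Income", "Savings", 400),
    ("Savings", "Holiday", 150),
]

# Each distinct node name gets an integer index, in order of first appearance.
labels = []
for source, target, _ in flows:
    for name in (source, target):
        if name not in labels:
            labels.append(name)

# A flow is then a pair of node indices plus the value that sets its width.
source_idx = [labels.index(s) for s, _, _ in flows]
target_idx = [labels.index(t) for _, t, _ in flows]
values = [v for _, _, v in flows]
```

Note that "Savings" appears both as a target and as a source, which is how multi-level diagrams chain together.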

The Data

Let’s take for example a set of data that we would like to visualize using a Sankey Diagram. In Python, it is common to have your data in a dataframe. Below is a sample of the dataset we will be working with.

Sample Dataset

A dataset in this format, with levels in separate columns, is easy to obtain with pandas’ groupby().agg(). How to use it is covered in the pandas documentation.
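As a rough sketch of how such a table might be produced, the example below aggregates a small set of raw records. The records, the 'item' column, and the three-level layout are all hypothetical and not the dataset used in this article. Named aggregation as written here assumes pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical raw records, one row per observation (not the article's data).
raw = pd.DataFrame({
    "lvl1": ["A", "A", "A", "A", "B", "B"],
    "lvl2": ["x", "x", "x", "y", "x", "y"],
    "lvl3": ["p", "p", "q", "p", "p", "p"],
    "item": ["i1", "i2", "i3", "i4", "i5", "i6"],
})

# Count the rows in each unique combination of levels, then move the
# group keys back into ordinary columns with reset_index().
df = (
    raw.groupby(["lvl1", "lvl2", "lvl3"])
       .agg(count=("item", "count"))
       .reset_index()
)
```

The result has one row per path through the levels plus a 'count' column, which is the shape the rest of this article works with.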

Example of basic Sankey Diagram

This is the final output we are trying to get. At a glance, one can see the flow of the data as well as the relative proportions of each of the flows.

Wrapper Function for Plotly’s Sankey

Below is the wrapper function that I used to generate the fig needed to create a Plotly Sankey Diagram.

In short, the wrapper function:
1. Takes in a dataframe specified by the user
2. Creates ‘source’ and ‘target’ pairs according to the columns specified by the user
3. Creates an aggregated ‘value’ for each ‘source’ and ‘target’ pair
4. Feeds those values into the fig object as defined by Plotly
5. Returns a fig object that can be used to create a Plotly Sankey Diagram
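The five steps above can be sketched roughly as follows. This is not the original genSankey code. It is a minimal illustration under a different, hypothetical name (gen_sankey_sketch), and it assumes at least two category columns and that no label appears in more than one column. The pad, thickness and font settings are arbitrary choices.

```python
import pandas as pd


def gen_sankey_sketch(df, cat_cols, value_col, title="Sankey Diagram"):
    """Hypothetical sketch: build a Plotly-style Sankey fig dict from df."""
    # Node labels: unique values of each column, in column order.
    # Assumption: a label never repeats across columns; if it did, both
    # occurrences would collapse into a single node here.
    labels = []
    for col in cat_cols:
        for name in df[col].unique():
            if name not in labels:
                labels.append(name)
    index_of = {name: i for i, name in enumerate(labels)}

    # Step 2: every pair of adjacent columns yields source/target pairs.
    pairs = []
    for src_col, tgt_col in zip(cat_cols[:-1], cat_cols[1:]):
        part = df[[src_col, tgt_col, value_col]].copy()
        part.columns = ["source", "target", "value"]
        pairs.append(part)

    # Step 3: sum the value over rows that share a source/target pair.
    links = (
        pd.concat(pairs, ignore_index=True)
          .groupby(["source", "target"], sort=False)["value"]
          .sum()
          .reset_index()
    )

    # Step 4: Plotly's Sankey trace refers to nodes by integer index.
    data = dict(
        type="sankey",
        node=dict(label=labels, pad=15, thickness=20),
        link=dict(
            source=links["source"].map(index_of).tolist(),
            target=links["target"].map(index_of).tolist(),
            value=links["value"].tolist(),
        ),
    )
    layout = dict(title=dict(text=title), font=dict(size=10))

    # Step 5: return a plain dict that Plotly can render.
    return dict(data=[data], layout=layout)


# Made-up sample in the same shape as a groupby().agg() result.
sample = pd.DataFrame({
    "lvl1": ["A", "A", "B"],
    "lvl2": ["x", "y", "x"],
    "lvl3": ["p", "p", "q"],
    "count": [3, 2, 4],
})
fig = gen_sankey_sketch(sample, ["lvl1", "lvl2", "lvl3"], "count", title="Demo")
```

Returning a plain dict keeps the sketch independent of any particular Plotly version. The dict can be passed to Plotly's plotting functions when it is time to render.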

How to use the genSankey function

genSankey(df,cat_cols=[],value_cols='',title='Sankey Diagram')

df is in the same format as our sample dataset shown above, which is the standard output of pandas’ groupby().agg().

cat_cols is a list of the columns that you want to include in your flow diagram. In this case it would be ['lvl1', 'lvl2', 'lvl3', 'lvl4'] if we want to include all 4 levels in our diagram.

value_cols is a string that should be the column name of your ‘values’ for each flow. In this case it would be ‘count’.

title is, as the name suggests, the title we would like to give to our Sankey Diagram.

In Python code, it would look like this:

Running the above script generates a Sankey Diagram in an HTML file that you can save to your local drive.

Voila! A quick and easy way to create a Sankey Diagram from your dataframe.

If you are interested in modifying the function yourself, see the Plotly documentation for the Sankey Diagram.

Do leave a comment and let me know if this helped you!
Look out for Part 2, where I will explain the genSankey function and what it does under the hood in greater detail.
