Sankey Diagrams in Python

Balakrishna Ch
Jul 18, 2022

For some reason, I have a liking for Sankey diagrams, though I have used one in real life only once. A Sankey diagram is a useful plot of how a whole gets split into parts as it progresses, something like a stream that splits into sub-branches as it moves along. It is a very intuitive visual for showing how a flow is structured.

The original Sankey Diagram from which this theoretical model evolved

Now, there is no hard reason why one should use a Sankey. After all, didn’t the US Bureau of Economic Analysis use this famous pie chart train?

If you plot the same data as a Sankey, this is how it looks.

Or, if I remove the absurdly high “Other” component:

This is a much cleaner image to look at. Everyone’s perception is different, so I will leave it to you to decide which one you prefer.

Now, coming to the Sankey itself: how do you plot it? The two main Python options are Bokeh (through HoloViews) and Plotly. There is also a pySankey library, but I wouldn’t prefer it, for two reasons:

  1. It is specific
  2. You need to specify colours as input

As you would have noticed by now, a Sankey diagram (and its variant, the alluvial diagram) needs three inputs: source, target and volume of flow.

There are two ways you can plot this in Python.

  1. Pass the node names as a part of the dataset
  2. Pass the node names as indexed numbers and maintain the node list separately.

Bokeh with HoloViews offers both options, while Plotly offers only the second.

Let’s look at the first option.

Consider the dataset below (the same data as in the image above).

There are three columns here: source, target and value. Remember that each value should be expressed as a share of the overall total, not as a share of its parent node. In other words, do not treat the parent as 100%. Consider the two plots below.

The first one treats Services by Condition as about 14% (82.80% of 17.40%), while the second one treats it as 82.8%. The same distortion carries into the subsequent levels, though it is not as prominent in the plot.
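The arithmetic behind this point can be sketched in a few lines. This is only a minimal illustration, assuming the edges are listed parents-first and that each share is a percentage of its source node. The function name `to_absolute` and the node names "Total" and "Parent" are placeholders I made up for the example; only the 17.40% and 82.80% figures come from the text above.

```python
def to_absolute(relative_edges, root_total=100.0):
    """Convert parent-relative percentages into shares of the overall total.

    relative_edges: list of (source, target, percent_of_source),
    ordered so that a node appears as a target before it is used as a source.
    """
    node_total = {}   # absolute share that has flowed into each node so far
    absolute = []
    for source, target, pct in relative_edges:
        # A source that was never a target is treated as the root (100%).
        parent_total = node_total.get(source, root_total)
        flow = parent_total * pct / 100.0
        node_total[target] = node_total.get(target, 0.0) + flow
        absolute.append((source, target, flow))
    return absolute


# Hypothetical node names; the percentages are the ones quoted in the text.
relative = [
    ("Total", "Parent", 17.40),
    ("Parent", "Services by Condition", 82.80),
]
absolute = to_absolute(relative)
for source, target, flow in absolute:
    print(f"{source} -> {target}: {flow:.2f}%")
```

The second edge comes out at roughly 14.4% of the total rather than 82.8%, which is the value the Sankey should receive.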

The plotting is straightforward.

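import pandas as pd
import holoviews as hv
# Use Bokeh as the plotting backend
hv.extension('bokeh')
# The CSV holds the three columns: source, target and value
edges = pd.read_csv('data/health-breakup2.csv')
sankey = hv.Sankey(edges, label='A Breakout of National Health Care Expenditures')
# Colour edges by their target node and nodes by their index
sankey.opts(label_position='left', edge_color='target', node_color='index', cmap='tab20')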

All you need to do is pass the dataframe containing the three columns to the Sankey function of HoloViews and then set some options: node colour, edge colour, width and height, colour map and so on.

The second option in Python is the node-index method referred to above. Here you need four fields rather than three: source index, target index and volume in one dataframe, and the node names in a separate dataframe. The node index starts at zero, and the source and target values in your data must match the index each node gets. Notice, though, that I made the same mistake of treating the parent as 100% here, because of which the graph got skewed.
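If your data comes with node names rather than indexes, the mapping described above can be built by hand. The following is a rough sketch in plain Python, not something taken from either library: `index_edges` is a hypothetical helper, and the rows are made-up sample data. It numbers each node from zero in order of first appearance and rewrites the edges in terms of those numbers.

```python
def index_edges(edges):
    """Turn (source_name, target_name, value) rows into index-based lists."""
    labels = []      # node names, positioned at their index
    index_of = {}    # node name -> index

    def idx(name):
        # Assign the next free index the first time a name is seen.
        if name not in index_of:
            index_of[name] = len(labels)
            labels.append(name)
        return index_of[name]

    source, target, value = [], [], []
    for s, t, v in edges:
        source.append(idx(s))
        target.append(idx(t))
        value.append(v)
    return labels, source, target, value


# Made-up rows purely for illustration.
edges = [
    ("Total", "A", 60.0),
    ("Total", "B", 40.0),
    ("A", "A1", 25.0),
    ("A", "A2", 35.0),
]
labels, source, target, value = index_edges(edges)
print(labels)   # node list, kept separately
print(source)   # source indexes into the node list
print(target)   # target indexes into the node list
```

The four lists have the same shape as the `labels`, `source`, `target` and `value` lists used in the Plotly code in this article.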

Even here, the process is straightforward. Define the source, target, value and additionally, the nodes.

import pandas as pd
# Flows (source index, target index, value) live in the "data" sheet
data = pd.read_excel('data/health-breakup2.xlsx', sheet_name="data")
# Node names live in the "labels" sheet
df_labels = pd.read_excel('data/health-breakup2.xlsx', sheet_name="labels")
source = data["source"].tolist()
target = data["target"].tolist()
value = data["value"].tolist()
labels = df_labels["labels"].tolist()

Create the nodes and flows. One irritant I see here is that you have to supply the colour names yourself (I still have to check whether there is an alternative, or a separate routine through which I can pass random colours).

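import plotly.graph_objs as go
# Cycle through a small palette so that every link gets exactly one colour
palette = ["green", "gold", "red", "blue", "teal"]
link = dict(source=source, target=target, value=value, color=[palette[i % len(palette)] for i in range(len(source))])
node = dict(label=labels, pad=15, thickness=5)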
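On the colour question, one possible workaround is to generate the colour strings instead of typing them. This is only a sketch using the standard library's `colorsys` module, not a Plotly feature: `make_colors` is a hypothetical helper, and the saturation, brightness and alpha values are arbitrary choices of mine. Plotly accepts CSS-style `rgba(...)` strings wherever it accepts colour names, so the result could be passed as the link colours.

```python
import colorsys


def make_colors(n, alpha=0.6):
    """Return n CSS rgba() strings with evenly spaced hues."""
    colors = []
    for i in range(n):
        # Spread the hues evenly around the colour wheel.
        r, g, b = colorsys.hsv_to_rgb(i / max(n, 1), 0.65, 0.9)
        colors.append(
            f"rgba({round(r * 255)}, {round(g * 255)}, {round(b * 255)}, {alpha})"
        )
    return colors


# One colour per link; 5 here stands in for len(source).
link_colors = make_colors(5)
print(link_colors)
```

With the lists from this article, `make_colors(len(source))` would give one colour per flow.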

Create the Sankey object and build the figure. Notice the difference: with HoloViews you passed only the dataframe, while here you pass both the flows and the nodes.

chart = go.Sankey(link=link, node=node, arrangement="snap")
fig = go.Figure(chart)
fig.show()

Either method is fine. I personally prefer a single dataframe containing everything, though external data generally comes with indexes.

