All about Alluvial Diagrams

Arnav Saxena
Oct 29, 2021


(and implementing them in R using ggalluvial)

As part of my master's in data science at Columbia University, I'm taking a course called Exploratory Data Analysis and Visualization. Recently, the instructor introduced us to an interesting graph type called the alluvial diagram, and in this article I'd like to shed some light on it. The idea is to give you some respite from the ubiquitous bar charts, histograms, scatter plots, and line charts and introduce you to this lesser-known data storytelling technique. Towards the end, we will also take a brief look at how to build these visualizations using the ggalluvial package in R. You can also build them in Python, Tableau, and even MS Excel (I'll leave links to some resources at the end).

What are Alluvial plots and when to use them?

An alluvial diagram is mainly used to show associations between categorical variables. It gives us a way to graphically display:

  1. flows between categorical variables or
  2. flows over time/phase

Data flow between categorical variables:

Let's take a look at how a simple alluvial diagram can help us explore the popular Titanic survival dataset.

As you can see, the diagram above helps us visualize the flow of data (passengers who survived vs. perished) across variables such as Class, Sex, and Age. It illustrates which groups of passengers survived the catastrophe and which unfortunately perished. Since women were given priority during the evacuation, most of them survived, except for those travelling in third class. You can also see that very few women were among the crew.

Data flow over time:

Source: New York Times

The NYT published this research in March 2018, showing how black and white boys raised in similarly wealthy families end up in different economic sections of society as they grow up.

Why is it called an Alluvial diagram?

I hope the header image gave you an idea of why these diagrams are called alluvial plots. In case it didn't, here's what Wikipedia has to say:

“In allusion to both their visual appearance and their emphasis on flow, alluvial diagrams are named after alluvial fans that are naturally formed by the soil deposited from streaming water”

Components of an alluvial diagram

Before we jump into building an alluvial diagram, it's important to understand its basic components. Let's use our Titanic example for this.

  1. Axes: On the x-axis we show the different states of being. The flow or movement is shown between these states. In our example, Class, Sex, and Age are the axes
  2. Strata: The categories within an axis are called strata (singular: stratum). For example, the strata within the Class axis are First, Second, Third, and Crew
  3. Alluvium: A single alluvial fan, or x-spline, that spans the entire graph and shows movement across all the variables
  4. Flow: The segments of the alluvia between pairs of adjacent axes are flows. For example, the x-spline running from Crew to Male to Adult is an alluvium consisting of two flows: Crew to Male and Male to Adult
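
To make these components concrete, here is a minimal sketch of how a Titanic diagram like the one discussed above could be drawn. It assumes R's built-in Titanic table and the ggalluvial package; the name titanic_wide and the styling choices are my own and not necessarily what produced the original image.

```r
library(ggplot2)
library(ggalluvial)

# Built-in Titanic table as a data frame with columns:
# Class, Sex, Age, Survived, Freq (one row per combination)
titanic_wide <- as.data.frame(Titanic)

titanic_plot <- ggplot(titanic_wide,
                       aes(axis1 = Class, axis2 = Sex, axis3 = Age, y = Freq)) +
  # Axes: label the three positions on the x-axis
  scale_x_discrete(limits = c("Class", "Sex", "Age"), expand = c(.2, .05)) +
  # Alluvia: one ribbon per row, coloured by survival
  geom_alluvium(aes(fill = Survived)) +
  # Strata: the category boxes stacked on each axis
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_minimal() +
  ggtitle("Titanic passengers by class, sex, and age")

titanic_plot
```

Each row of titanic_wide becomes one alluvium, and the piece of it between two neighbouring axes (say Class and Sex) is a flow.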

ggalluvial:

ggalluvial is an extension of ggplot2 for drawing alluvial diagrams in the tidyverse environment. Simply put, we place alluvium and stratum layers on top of a vanilla ggplot and our plot is ready.

ggalluvial accepts data in two formats for drawing alluvial diagrams.

Alluvial form

Think of this form as a tabular representation of an alluvial diagram (ggalluvial itself calls it the alluvia, or wide, format). Each row represents an alluvium and the columns represent the axes. The dummy dataset below is in alluvial form:

##   Class1 Class2 Freq
## 1  Stats French   30
## 2   Math French    5
## 3  Stats    Art   45
## 4   Math    Art   20

Lodes form

Imagine applying ‘gather’ from the ‘tidyr’ package (or unpivoting, if you come from an MS Excel background) to the alluvial form; the resulting data will be in lodes form. The data above would look like this in lodes form:

##   Freq      x stratum
## 1   30 Class1   Stats
## 2    5 Class1    Math
## 3   45 Class1   Stats
## 4   20 Class1    Math
## 5   30 Class2  French
## 6    5 Class2  French
## 7   45 Class2     Art
## 8   20 Class2     Art
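
If you'd like to try this reshaping yourself, the sketch below rebuilds the dummy dataset and converts it with ggalluvial's to_lodes_form(). The data frame names classes and classes_lodes are hypothetical, chosen only for this illustration. Note that the result should also carry an alluvium id column, which records which rows belonged to the same original row.

```r
library(ggalluvial)

# Dummy data in alluvial (wide) form: one row per alluvium, one column per axis
classes <- data.frame(
  Class1 = c("Stats", "Math", "Stats", "Math"),
  Class2 = c("French", "French", "Art", "Art"),
  Freq   = c(30, 5, 45, 20)
)

# Reshape to lodes (long) form: one row per alluvium-axis pair.
# By default the new columns are named alluvium, x, and stratum.
classes_lodes <- to_lodes_form(classes, axes = 1:2)
print(classes_lodes)

# Confirm each data frame is in the form we expect
is_alluvia_form(classes, axes = 1:2, silent = TRUE)
is_lodes_form(classes_lodes, key = "x", value = "stratum",
              id = "alluvium", silent = TRUE)
```

Four alluvia across two axes give eight lodes, which matches the eight rows in the table above.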

Code:

# Importing required libraries
library(ggplot2)
library(ggalluvial)

# Importing dataset
data(majors)
head(majors)
#   student semester curriculum
# 1       1    CURR1   Painting
# 2       2    CURR1   Painting
# 3       6    CURR1   Sculpure
# 4       8    CURR1   Painting
# 5       9    CURR1   Sculpure
# 6      10    CURR1   Painting

The “majors” data imported above tracks 10 students as they select their majors across 8 semesters. As seen above, the data appears to be in lodes form. We can use is_lodes_form() as shown below to check:

is_lodes_form(majors, key = "semester", value = "curriculum", id = "student")
# TRUE

This confirms that the data is in the lodes form.

Let’s see how to plot an alluvial diagram for data in the lodes format. We might be interested in seeing how students’ preferences changed while moving from one semester to another.

ggplot(majors, aes(alluvium = student, x = semester, stratum = curriculum)) +
  geom_alluvium(color = "black") +
  geom_stratum(color = "black", aes(fill = curriculum)) +
  # Vanilla ggplot2 from here onwards
  ggtitle("Majors opted across semesters") +
  scale_y_discrete() +
  ylab("Number of students enrolled") +
  theme_bw() +
  theme(axis.text = element_text(size = 7))

Code decoded

Since we were interested in watching the flow over semesters, we set x = semester. Next, we decided that each student should be one alluvium so that we can track their major preferences across semesters. Lastly, we used the curricula as the strata for every semester, so the size of a stratum shows how many students took a particular major that semester.

Note that geom_stratum() draws the strata on the Cartesian plane while geom_alluvium() draws the alluvia. Think of them as layers stacked in the order they are added (here the strata sit on top of the alluvia) to create the complete alluvial diagram.
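
As a quick sanity check on what the stratum sizes mean, we can count students per curriculum within each semester. This is only a sketch: stratum_sizes is a name I made up, and I pass useNA = "ifany" on the assumption that some curriculum values may be missing.

```r
library(ggalluvial)
data(majors)

# Rows are semesters, columns are curricula;
# each cell is the height of one stratum in the plot
stratum_sizes <- table(majors$semester, majors$curriculum, useNA = "ifany")
print(stratum_sizes)

# Number of student records behind each semester's stack of strata
rowSums(stratum_sizes)
```

Each cell of the table corresponds to one box in the diagram, and a row total is the full height of that semester's stack.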

Next, let's try converting the data to alluvial form. We can use to_alluvia_form() to do that.

majors_alluvia <- to_alluvia_form(majors, key = "semester", value = "curriculum", id = "student")
is_alluvia_form(majors_alluvia, tidyselect::starts_with("CURR"))
head(majors_alluvia)

As we can see, the data is now in alluvial form, with each CURR column representing a different axis. To draw an alluvial plot from this data, we use the following syntax.

ggplot(majors_alluvia, aes(axis1 = CURR1, axis2 = CURR7, axis3 = CURR13)) +
  geom_alluvium(color = "black", aes(fill = as.factor(student))) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 3, discern = TRUE)

Code decoded

Here we simply provide the names of the axes, and ggalluvial picks up the strata and the alluvia from the data itself, since the dataset is already laid out in the same format as the plot.

I hope this article has added another tool to your data visualization arsenal. Below are some sources where you can learn to make these diagrams in other languages and tools.

Sources for making alluvial diagrams in other tools:

  1. Python: https://plotly.com/python/sankey-diagram/
  2. Excel: https://exceloffthegrid.com/sankey-diagram-in-excel/
