A workable method for longitudinal data visualization

Published in

Inside the Tech by SoftServe

Feb 1, 2021

Data visualization is a tall order, especially for eHealthcare projects, which may contain large amounts of data of different types. Well-known methods do not always come in handy. Here's a step-by-step guide to a workable alternative.

Background

While working on eHealthcare projects, we measure many features at different times. For example, in a clinical trial we could have a thousand patients whose blood pressure (BP), biochemistry, pulse, etc. are checked weekly over a few months.

Then we need to visualize all the gathered data to gain insights or to present results. There are only a handful of common ways to do it:

Individual line trends
Boxplots
Violin plots
Sankey plots

All those methods have their flaws.

Individual trends can be a mess, especially if you have a lot of patients. Boxplots can be misleading and lose to violin plots in almost everything (e.g., 5 reasons you should use a violin graph). Violin plots do not show transitions between states, and Sankey plots can be hard to read.

Why does data visualization matter? A real-life case

We were once working with a Fortune 500 company on improving the treatment protocol for a certain chronic condition. The client had all the patient data divided into two clusters (which seemed pretty natural for such a case):

Responding to treatment
Non-responding to treatment.

However, after taking a closer look at a different timescale, we understood that there were two more groups of patients:

Those who responded to therapy at first but failed at the end
Those who failed at first but responded at the end.

It was a valuable insight with great business value that turned the whole project in a totally different direction.

Our challenge here was to make it clear to the client. No doubt a poor presentation of the finding would lead to misunderstanding, the client's dissatisfaction, and reputational losses. Moreover, decision makers usually don't have the time and/or expertise to dive into tangled DS Jupyter notebooks or a series of complicated plots for answers.

That's where proper data visualization comes in. And it totally worked for us. Below I'll walk you through our approach, which we have called the Vyshyvanka plot.

This is the first article on Medium that I created personally, so I was not aware of how poorly this platform suits technical articles with a lot of code (no syntax highlighting, no formulas, no spoiler blocks, no footnotes, etc.). Basically, anyone who wants to paste highlighted Python code has to create a gist and embed it here, which has at least two bad consequences: the author is left with a ton of gists, and readers get long unfolded sheets of code to scroll through.

I decided to carve out all the code to preserve readability. The occasional reader who wanders in here may find all the code in the accompanying repository or read the original article on hack.md.

Use case data generation

Consider the case: you have a thousand patients whose BP was measured 6 times at certain time points. You decide to sort the BP results into 5 groups (1...5) by range, where 1 could represent ‘very low’ and 5 ‘very high’. Now you want to visualize the results.

Here I generate a completely artificial dataset for illustrative purposes, with very strange behavior of BP levels.

It looks like this (columns are patients, rows are points in time, cell values are the BP group that the patient's BP fell into at a given time point):
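As a minimal sketch (not the code from the accompanying repository), a dataset with this layout could be generated as follows. The function name `generate_bp_groups`, the seed, the row and column labels, and the exact probabilities are assumptions chosen for illustration: groups 2 and 4 are made equally likely at every time point, and groups 1, 3, and 5 are rare.

```python
import numpy as np
import pandas as pd


def generate_bp_groups(n_patients=1000, n_timepoints=6, seed=42):
    """Return a table of BP groups: rows are time points, columns are patients."""
    rng = np.random.default_rng(seed)
    groups = np.arange(1, 6)
    # Assumed probabilities: groups 2 and 4 dominate, groups 1, 3 and 5 are rare.
    probs = [0.02, 0.47, 0.02, 0.47, 0.02]
    # Every cell is drawn independently, so at each step a patient is equally
    # likely to land in group 2 or group 4, whatever the previous group was.
    values = rng.choice(groups, size=(n_timepoints, n_patients), p=probs)
    return pd.DataFrame(
        values,
        index=[f"t{t}" for t in range(1, n_timepoints + 1)],
        columns=[f"patient_{p}" for p in range(1, n_patients + 1)],
    )


df = generate_bp_groups()
print(df.iloc[:, :5])  # first five patients across all six time points
```

Because every measurement is drawn independently here, the group a patient is in now says nothing about the next one, which is one simple way to get the "very strange" behavior described above.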

Why Vyshyvanka plot?

Let’s plot the generated data using each of the four common methods:

https://gitlab.com/banderlog/vyshyvanka_plot_article/-/blob/master/vyshyvanka_examples.ipynb

We can see that individual trends are a mess, boxplots are not very informative, and the violin plot suggests that patients are concentrated in the 2nd and 4th groups and that their distribution does not change significantly over time.

But this is misleading: we created the dataset so that each patient has an equal chance of appearing in group 2 or group 4 at the next step, so patients keep moving between groups. Will a Sankey plot help us?
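To see numerically why a stable distribution can hide movement, the sketch below compares the per-time-point group shares (roughly what a violin plot summarizes) with the transition probabilities between two consecutive time points. The toy data, the probabilities, and the helper names `group_shares` and `transition_probabilities` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed toy data: rows are time points, columns are patients, values are groups 1..5.
data = rng.choice([1, 2, 3, 4, 5], size=(6, 1000), p=[0.02, 0.47, 0.02, 0.47, 0.02])


def group_shares(data, n_groups=5):
    """Share of patients in each group at every time point."""
    counts = np.stack([np.bincount(row, minlength=n_groups + 1)[1:] for row in data])
    return counts / data.shape[1]


def transition_probabilities(data, step, n_groups=5):
    """P[i, j]: share of patients in group i+1 at `step` who are in group j+1 at `step + 1`."""
    counts = np.zeros((n_groups, n_groups))
    np.add.at(counts, (data[step] - 1, data[step + 1] - 1), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows for groups nobody was in stay all-zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


shares = group_shares(data)
step_1_to_2 = transition_probabilities(data, step=0)
print(shares.round(2))       # nearly identical rows: the distribution looks static
print(step_1_to_2.round(2))  # yet a patient in group 2 is about as likely to move to 4 as to stay
```

The shares table barely changes from row to row, while the transition matrix shows that close to half of the patients in group 2 move to group 4 at the next step. That second piece of information is what the distribution-based plots cannot show.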

https://gitlab.com/banderlog/vyshyvanka_plot_article/-/blob/master/sankey_plot.ipynb

Well, sort of (here the N_ numbers stand for the BP group and the _M numbers stand for the time point, e.g. 3_2 is the 3rd BP group at the 2nd time point). We are able to see some transitions, but the plot is too fiddly and obviously needs some tuning. You may find a separate chapter dedicated to Sankey plot problems in Python below.
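The labeling convention above can be sketched in plain Python. The snippet turns a table of group values into node labels such as 3_2 plus the source/target/value lists that Sankey implementations typically expect. The function name `build_sankey_links` and the tiny example table are hypothetical, and this is not the code from the linked notebook.

```python
from collections import Counter


def build_sankey_links(data):
    """data: list of rows (one per time point), each a list of group numbers per patient."""
    counts = Counter()
    for t in range(len(data) - 1):
        for src, dst in zip(data[t], data[t + 1]):
            # "3_2" means: group 3 at time point 2 (time points are 1-based).
            counts[(f"{src}_{t + 1}", f"{dst}_{t + 2}")] += 1

    def by_time_then_group(label):
        group, time = label.split("_")
        return int(time), int(group)

    labels = sorted({name for pair in counts for name in pair}, key=by_time_then_group)
    index = {name: i for i, name in enumerate(labels)}
    source = [index[src] for src, _ in counts]
    target = [index[dst] for _, dst in counts]
    value = list(counts.values())
    return labels, source, target, value


# Four patients, three time points (assumed toy values).
example = [
    [2, 2, 4, 4],
    [2, 4, 4, 2],
    [4, 4, 2, 2],
]
labels, source, target, value = build_sankey_links(example)
print(labels)
print(list(zip(source, target, value)))
```

Each link says how many patients moved from one node to another between consecutive time points, so the values always add up to the number of patients times the number of steps.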

Meanwhile, compare it with a better way to visualize this:

It shows BP group transitions over time. The line width and the opacity of the markers (black dots) correlate with the number of patients. Here we can see that the majority of patients started in groups 2 and 4, and these groups keep a stable number of patients during the whole period of observation. But it also shows that there were 2 types of transitions with roughly equal probabilities: remaining in the initial group or switching between the 2nd and 4th, with a small chance of switching to group 1, 3, or 5. Here we do not see patients’ individual pathways, but rather the transition probability trends.

And this could be a pretty valuable insight. Also, it is a convenient plot to present and explain.
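A plot of this kind can be approximated with matplotlib in a few lines. The following is only a sketch under stated assumptions, not the authors' implementation: the function name `vyshyvanka_plot`, the color, the marker size, and the scaling of widths and opacity are all illustrative choices, and the data is generated on the spot.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; drop in a notebook
import matplotlib.pyplot as plt
import numpy as np


def vyshyvanka_plot(data, n_groups=5, max_width=10.0, color="tab:red", ax=None):
    """Draw group transitions between consecutive time points.

    data: 2-D integer array, rows are time points, columns are patients,
    values are groups 1..n_groups.
    """
    data = np.asarray(data)
    n_steps, n_patients = data.shape
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 4))

    # Lines: one per observed (from_group, to_group) pair at each step,
    # with width proportional to the share of patients making that move.
    for t in range(n_steps - 1):
        counts = np.zeros((n_groups, n_groups), dtype=int)
        np.add.at(counts, (data[t] - 1, data[t + 1] - 1), 1)
        for i, j in zip(*np.nonzero(counts)):
            ax.plot([t + 1, t + 2], [i + 1, j + 1], color=color,
                    linewidth=max_width * counts[i, j] / n_patients,
                    solid_capstyle="round", zorder=1)

    # Markers: black dots whose opacity follows how many patients are in the group.
    occupancy = np.stack([np.bincount(row, minlength=n_groups + 1)[1:] for row in data])
    for t, g in zip(*np.nonzero(occupancy)):
        ax.scatter(t + 1, g + 1, s=60, color="black",
                   alpha=occupancy[t, g] / occupancy.max(), zorder=2)

    ax.set_xticks(range(1, n_steps + 1))
    ax.set_yticks(range(1, n_groups + 1))
    ax.set_xlabel("time point")
    ax.set_ylabel("BP group")
    return ax


rng = np.random.default_rng(1)
# Assumed toy data with the same shape as the use case: 6 time points, 1000 patients.
data = rng.choice([1, 2, 3, 4, 5], size=(6, 1000), p=[0.02, 0.47, 0.02, 0.47, 0.02])
ax = vyshyvanka_plot(data)
```

Aggregating patients into transition counts before drawing is the design choice that keeps the picture readable: the number of lines depends on the number of groups and time points, not on the number of patients.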

We invented this plot almost by accident, under the pressure of practical needs during our work with a real customer. Our Lviv teammates named it the “Vyshyvanka plot” because it resembles the pattern of the Ukrainian national costume.

Python problems with Sankey plot

There are not that many Python plotting libraries (link1, link2), and only a few of them can do Sankey plots: matplotlib, holoviews, and plotly (there is also a Jupyter widget, ipysankeywidget). But only one of them can create good-looking Sankey plots: plotly (through its JS frontend). Let’s plot our generated dataset again:

To make it more intuitive, we must force-order our BP groups as on the Vyshyvanka plot, so that we can keep track of group transitions much more easily:

Although everything seems OK in this example, plotly has a lot of problems with changing the default node order, which makes it hard to use on most data (see examples). Consider these GitHub issues (especially the last one):
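One way to attempt such force-ordering, sketched here under assumptions, is to give every node an explicit position through the `x` and `y` node attributes of plotly's Sankey trace and to set `arrangement` to `"fixed"`. The helper name `ordered_sankey_figure` and the example inputs are hypothetical. Keeping positions strictly inside (0, 1) is a precaution based on commonly reported behavior, not something stated in this article. The figure is built as a plain dictionary, so the snippet itself does not need plotly installed.

```python
def ordered_sankey_figure(labels, source, target, value, n_groups=5, eps=0.001):
    """Build a Plotly-style figure dict with nodes pinned to a time-by-group grid.

    labels follow the "group_timepoint" convention, e.g. "3_2".
    """
    groups = [int(name.split("_")[0]) for name in labels]
    times = [int(name.split("_")[1]) for name in labels]
    n_times = max(times)

    def scale(position, count):
        # Map 1..count onto (eps, 1 - eps), avoiding exact 0 and 1 (assumed precaution).
        if count == 1:
            return 0.5
        return eps + (1 - 2 * eps) * (position - 1) / (count - 1)

    x = [scale(t, n_times) for t in times]
    # Group 1 gets the smallest y; flip the mapping for the opposite vertical order.
    y = [scale(g, n_groups) for g in groups]
    return {
        "data": [{
            "type": "sankey",
            "arrangement": "fixed",
            "node": {"label": labels, "x": x, "y": y},
            "link": {"source": source, "target": target, "value": value},
        }],
        "layout": {"title": {"text": "BP group transitions"}},
    }


# Assumed toy input: two groups observed at two time points.
fig = ordered_sankey_figure(
    labels=["2_1", "4_1", "2_2", "4_2"],
    source=[0, 0, 1, 1],
    target=[2, 3, 2, 3],
    value=[1, 1, 1, 1],
)
# With plotly installed, this could be rendered with:
#     import plotly.graph_objects as go
#     go.Figure(fig).show()
```

Whether plotly honors the requested positions on a given dataset is a separate question, so treat this as a starting point rather than a guaranteed fix.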

More examples

Here I’ll be plotting only violin, Vyshyvanka, and Sankey plots, because they are obviously superior to individual line trends and boxplots. Also, to make it more general, we will drop the BP groups example and just talk about state transitions over time.

You may find all relevant Jupyter notebooks with all code from this post here.

State distribution changes with mode of 1->5 transition

State distribution changes, bi-modal 1->5 and 5->1 transitions

State distribution splits from initial 3 flowing to either 1 or 5 ultimately

Interactive Vyshyvanka plot

In rare cases, you might want to make the plot interactive so that you can see the exact number and percentage of transitions by clicking on a line. Interactive mode also helps in finding optimal values for opacity and line widths. For this, we provide you with the code below. Keep in mind that it works in Jupyter Notebook or JupyterLab once mplcursors and ipywidgets are installed (be careful with virtualenv); then try that notebook.
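The click-to-inspect part can be sketched with mplcursors as follows. This is an illustration under assumptions rather than the notebook's code: the helper names `transition_label` and `attach_transition_cursor` are hypothetical, and the percentage is assumed to be relative to all patients. mplcursors is imported inside the function so that the label helper can be used without it.

```python
def transition_label(src, dst, step, count, n_patients):
    """Text shown for one transition line: exact count and share of all patients."""
    percent = 100.0 * count / n_patients
    return f"{src} -> {dst} (step {step} to {step + 1}): {count} patients, {percent:.1f}%"


def attach_transition_cursor(lines, labels):
    """Show the matching label when a line is clicked.

    lines: matplotlib Line2D artists, one per transition.
    labels: texts in the same order as `lines`.
    """
    import mplcursors  # third-party package, imported lazily

    text_by_line = dict(zip(lines, labels))
    cursor = mplcursors.cursor(lines)

    @cursor.connect("add")
    def _show_label(sel):
        # sel.artist is the clicked line; replace the default annotation text.
        sel.annotation.set_text(text_by_line[sel.artist])

    return cursor


print(transition_label(2, 4, 1, 235, 1000))
```

In a notebook, `attach_transition_cursor` would be called with the line artists returned by the plotting calls and the labels built for the same transitions, with an interactive matplotlib backend enabled.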

You should get something like this: