User journey (sankey) diagram

Generate a user journey diagram using Python

Published in Multiply · 7 min read · Oct 10, 2019

This is the fourth part of the series Mobile Analytics using Python, describing how to calculate your key app usage metrics using nothing but Python.

Code snippets are provided in each section, and the GitHub repo can be found here.

Objective

In this section we will map out the most dominant user journey flows, all branching out from a starting point (event) that we define first.

Why bother?

As mentioned in the previous section, it’s quite usual for user journeys not to be unidirectional. A user can navigate around the app, carry out different actions in whatever order and end up at a specific point through different routes.

Mapping out journey flow diagrams helps in identifying and visualising the most dominant ones. This surfaces insights about UX design, and also shows which features of the app your users use the most.

Starting Point

Again, I will assume that you have your data in a pandas DataFrame, named events, with the following columns:

distinct_id: distinct user id
name: event name
properties: a dict object with the properties associated with the event
time: timestamp at which the event took place

Dummy data for illustration purposes (Note: “properties” is not needed for this section)
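If you want to follow along without your own data, a tiny stand-in DataFrame with the same four columns can be built as below. Every user id, event name and timestamp here is made up by me for illustration; it is not the dataset used in the rest of the article.

```python
import pandas as pd

# Hypothetical dummy data: all values are invented for illustration only.
events = pd.DataFrame(
    {
        "distinct_id": ["u1", "u1", "u1", "u2", "u2", "u3"],
        "name": [
            "EventId: 1", "EventId: 32", "EventId: 2",
            "EventId: 1", "EventId: 65",
            "EventId: 1",
        ],
        # Not needed for this section, so left empty
        "properties": [{} for _ in range(6)],
        "time": pd.to_datetime(
            [
                "2019-10-01 09:00:00", "2019-10-01 09:01:00", "2019-10-01 09:05:00",
                "2019-10-01 10:00:00", "2019-10-01 10:02:00",
                "2019-10-02 08:30:00",
            ]
        ),
    }
)

print(events.head())
```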

Requirements

Python (I am using v3.7)
Pandas: for data manipulation
Plotly: for visualisations*
*A note on Plotly: One could use matplotlib instead, but many more lines of code would be needed to make the plots look good, annotate them and offer interactivity. We will be creating plots that are easy to dive into and share with other people.

References / Logic

Some of KenLok’s code will be used here as the backbone for converting a DataFrame into one that plotly can use as input to draw the sankey diagram.

We will be adding to Ken’s code, including all the code for transforming our raw events DataFrame and for avoiding circular references, since the user journey is not unidirectional. If you do not address this point, the plotly diagram will not render.

The logic is basically to order all the events by time, define the starting point for the user flow and look for the first time each user performed that event. Our user flows will always have exactly one event in the first step of the diagram.

For the subsequent steps, we will display the top n events per step, with the rest being grouped together in an “Other” block. The number of events per step and the number of steps will both be function parameters that you can choose.

Note that the step number is prepended to the event name to avoid circular references.

Code

The code for generating the user journey flow should take into account the time order of the events.

Methodology to be implemented:

Define the starting_step of the user journey and return the first n_steps for each user starting from this event
If fewer than n_steps were performed, then mark the end of the journey with “End”
For each step, find the top events_per_step, and group the rest of the events in this step into an “Other” block
Transform the DataFrame, returning “event_n”: “event_n+1” pairs with a count for each

1) Define a starting point and return the first n_steps for each user

First, we need to define an event starting point. All the journeys will branch out of this one. We will write a function that looks for the first occurrence of the starting_step in a list of events and returns the first n_steps events from that point onwards. This function will then be called using an apply method on the events DataFrame.

Using the function above, we can transform our raw events DataFrame, such that we have one row per user, with the columns having the events in chronological order for that user.
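A minimal sketch of this step is shown below. The function name first_n_steps and the dummy events are my own placeholders rather than the code from the repo, and I am assuming that users who never performed the starting_step should simply be dropped.

```python
import pandas as pd


def first_n_steps(event_list, starting_step, n_steps):
    """Return up to n_steps events, beginning at the first occurrence of starting_step."""
    if starting_step not in event_list:
        return []
    start = event_list.index(starting_step)
    return event_list[start:start + n_steps]


# Hypothetical dummy events: (user, event name, minutes after midnight)
rows = [
    ("u1", "EventId: 7", 0),
    ("u1", "EventId: 1", 1),
    ("u1", "EventId: 32", 2),
    ("u1", "EventId: 2", 3),
    ("u1", "EventId: 67", 4),
    ("u1", "EventId: 3", 5),
    ("u2", "EventId: 65", 9),  # deliberately out of time order
    ("u2", "EventId: 1", 8),
    ("u3", "EventId: 9", 0),   # never performs the starting step
]
events = pd.DataFrame(rows, columns=["distinct_id", "name", "minute"])
events["time"] = pd.Timestamp("2019-10-01") + pd.to_timedelta(events["minute"], unit="min")

starting_step, n_steps = "EventId: 1", 4

# One chronologically ordered list of event names per user
journeys = (
    events.sort_values(["distinct_id", "time"])
    .groupby("distinct_id")["name"]
    .agg(list)
    .apply(first_n_steps, args=(starting_step, n_steps))
)

# Assumption: users without the starting step are left out of the diagram
journeys = journeys[journeys.map(len) > 0]

# One row per user, one column per step (0 .. n_steps - 1); short journeys are padded with NaN
flow = pd.DataFrame(journeys.tolist(), index=journeys.index).reindex(columns=range(n_steps))
print(flow)
```

The reindex call guarantees that there are always n_steps columns, even if no user in the data performed that many events after the starting point.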

Running the above with starting_step = ‘EventId: 1’ and n_steps = 4 gives:

2) If fewer than n_steps, mark end of journey with ‘End’

As seen from the output of the previous section, for some users not all of the n_steps exist, and thus we have NaN values. These will be replaced with ‘End’ which will mark the end of the journey for these users.

As seen in the first row, we would have the last two columns filled in with ‘End’. This is not an issue for us, as later we will be converting the DataFrame to source:target pairs, and filtering out the ‘End’: ‘End’ pairs.

We have now addressed the end of journeys for users. The other challenge is the circular reference of events. An example of this would be a user updating their address in the profile, making a purchase and then updating their address again.

I addressed this by prepending the step number to the event name in each column/step. This allows the same event to show up in any of the steps of the user journey, reflecting the true journey.
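Both fixes (marking the end of a journey and prefixing the step number) take only a few lines. The sketch below assumes a flow DataFrame with integer column names 0, 1, 2, ... and uses made-up rows.

```python
import pandas as pd

# Hypothetical flow DataFrame: one row per user, one column per step
flow = pd.DataFrame(
    [
        ["EventId: 1", "EventId: 32", None, None],
        ["EventId: 1", "EventId: 2", "EventId: 1", "EventId: 67"],
    ],
    index=["u1", "u2"],
)

# Journeys shorter than n_steps are padded with "End"
flow = flow.fillna("End")

# Prefix each event with its step number (column 0 is step 1), so that the same
# event in two different steps becomes two different nodes in the diagram
for col in flow.columns:
    flow[col] = f"{col + 1}: " + flow[col]

print(flow)
```

Note how “EventId: 1” appears in both step 1 and step 3 for the second user, yet ends up as two distinct labels.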

Incorporating the above, the flow DataFrame looks like:

Step number added as a prefix to the event name. Note that the prefixes start from 1, whereas the column numbers start from 0.

3) Find the top events_per_step

Since there are numerous possible paths at any point, we will be returning the top events_per_step and grouping the rest into an “Other” block. This avoids having too many nodes in the diagram, which would be distracting.
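One possible way to do the grouping is sketched below with made-up data. I am assuming here that “End” should always keep its own node and should not count towards the top events_per_step; adjust this if you prefer to rank it like any other event.

```python
import pandas as pd

# Hypothetical flow DataFrame, already padded with "End" and prefixed with step numbers
flow = pd.DataFrame(
    {
        0: ["1: EventId: 1"] * 7,
        1: ["2: EventId: 32"] * 3
        + ["2: EventId: 65"] * 2
        + ["2: EventId: 2", "2: End"],
    }
)
events_per_step = 2

for col in flow.columns:
    step = col + 1
    is_end = flow[col] == f"{step}: End"
    # Rank the real events of this step by frequency (assumption: "End" is not ranked)
    top_events = flow.loc[~is_end, col].value_counts().index[:events_per_step]
    keep = is_end | flow[col].isin(top_events)
    # Everything outside the top events is relabelled as "<step>: Other"
    flow[col] = flow[col].where(keep, f"{step}: Other")

print(flow[1].value_counts())
```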

Running the above, five users whose journey did not fall under the most frequent paths would look like:

We can now count identical journeys across users:
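A sketch of the counting step, assuming the step columns are the only columns in the flow DataFrame (the rows are invented):

```python
import pandas as pd

journey_a = ["1: EventId: 1", "2: EventId: 32", "3: End"]
journey_b = ["1: EventId: 1", "2: Other", "3: EventId: 2"]

# Hypothetical flow DataFrame: three users follow journey A, two follow journey B
flow = pd.DataFrame([journey_a, journey_a, journey_a, journey_b, journey_b])

# Group on all step columns and count how many users share each full journey
journey_counts = (
    flow.groupby(list(flow.columns))
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
    .reset_index(drop=True)
)

print(journey_counts)
```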

With the output being:

Wrapping everything up in a second function:

4) Transform the DataFrame to count source:target pairs

First, we will create the list of node labels and colours that will be passed to the sankey plotting function later. The label list will also be used when transforming the DataFrame to source:target pairs.
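The two lists can be built along the following lines. The colour choice (grey for the “Other” blocks, blue for everything else) mirrors the output shown in this section; the rows themselves are made up.

```python
import pandas as pd

# Hypothetical journey counts: one column per step plus a "count" column
flow = pd.DataFrame(
    [
        ["1: EventId: 1", "2: EventId: 32", "3: End", 3],
        ["1: EventId: 1", "2: Other", "3: EventId: 2", 2],
        ["1: EventId: 1", "2: EventId: 32", "3: EventId: 2", 1],
    ],
    columns=[0, 1, 2, "count"],
)

step_cols = [c for c in flow.columns if c != "count"]

# Collect the distinct labels step by step, in order of first appearance
label_list = []
for col in step_cols:
    label_list.extend(flow[col].unique().tolist())

# "Other" blocks are greyed out, every other node is blue
colour_list = ["grey" if label.endswith("Other") else "blue" for label in label_list]

print(label_list)
print(colour_list)
```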

Nothing interesting here. The output is just two lists:

['1: EventId: 1', '2: EventId: 32', '2: Other', '2: EventId: 65', '2: EventId: 2', '2: EventId: 66', '2: EventId: 67', '2: End', '3: EventId: 32', '3: EventId: 65', '3: Other', '3: EventId: 62', '3: EventId: 2', '3: EventId: 67', '3: End', '4: Other', '4: End', '4: EventId: 67', '4: EventId: 3', '4: EventId: 62', '4: EventId: 2', '4: EventId: 63']

['blue', 'blue', 'grey', 'blue', 'blue', 'blue', 'blue', 'blue', 'blue', 'blue', 'grey', 'blue', 'blue', 'blue', 'blue', 'grey', 'blue', 'blue', 'blue', 'blue', 'blue', 'blue']

We can now transform the DataFrame to source:target pairs with a count column:
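One way to get there is to take every pair of adjacent step columns, stack them on top of each other and sum the counts. The variable names and rows below are placeholders of mine.

```python
import pandas as pd

# Hypothetical journey counts: one column per step plus a "count" column
journeys = pd.DataFrame(
    [
        ["1: EventId: 1", "2: EventId: 32", "3: End", 3],
        ["1: EventId: 1", "2: EventId: 32", "3: EventId: 2", 2],
        ["1: EventId: 1", "2: End", "3: End", 1],
    ],
    columns=[0, 1, 2, "count"],
)
step_cols = [0, 1, 2]

# Each pair of adjacent steps (0 -> 1, 1 -> 2, ...) contributes one set of links
pairs = []
for src, tgt in zip(step_cols[:-1], step_cols[1:]):
    pair = journeys[[src, tgt, "count"]].copy()
    pair.columns = ["source", "target", "count"]
    pairs.append(pair)

# Identical source:target pairs coming from different journeys are summed up
links = (
    pd.concat(pairs, ignore_index=True)
    .groupby(["source", "target"], as_index=False)["count"]
    .sum()
)

print(links)
```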

The output so far looks like:

To plot the sankey diagram we need three lists: a list of source nodes, a list of target nodes and the link size (count). Using the label_list, we can add the node index for both the source and the target.

At this point we will also filter out any rows where both source and target are equal to ‘End’, as there are no such links in reality.
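As a sketch, with invented labels and counts, and assuming the “End” labels carry their step prefix (for example “3: End”):

```python
import pandas as pd

# Hypothetical inputs: node labels and source:target pairs with counts
label_list = ["1: EventId: 1", "2: EventId: 32", "2: End", "3: EventId: 2", "3: End"]
links = pd.DataFrame(
    {
        "source": ["1: EventId: 1", "1: EventId: 1", "2: EventId: 32", "2: EventId: 32", "2: End"],
        "target": ["2: EventId: 32", "2: End", "3: EventId: 2", "3: End", "3: End"],
        "count": [5, 1, 2, 3, 1],
    }
)

# Drop End -> End rows: once a journey has ended there is no real link to draw
both_end = links["source"].str.endswith(": End") & links["target"].str.endswith(": End")
links = links[~both_end].copy()

# Plotly refers to nodes by their position in the label list
label_index = {label: i for i, label in enumerate(label_list)}
links["source_id"] = links["source"].map(label_index)
links["target_id"] = links["target"].map(label_index)

print(links)
```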

Wrapping everything in a function:

Visualisation

We have already done all the hard work. All that is left is producing a plotly sankey chart, so let’s wrap everything in a single function to do exactly that.

In this function, on top of defining the node sources and destinations along with the size of each link, we are also dynamically defining the plot width, so that steps are evenly spread out and the node labels are shown without overlapping. If you choose to show 5 steps or more, you will be able to scroll horizontally.
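A rough sketch of the plotting part is given below. The function names, the width_per_step value and the node styling (pad, thickness, height) are my own assumptions, not the exact values from the repo. Plotly is imported only inside draw_sankey, so the inputs can be prepared and inspected without rendering anything.

```python
def build_sankey_spec(label_list, colour_list, source_ids, target_ids, counts,
                      n_steps, width_per_step=300):
    """Collect the arguments for a plotly Sankey trace and its layout as plain dicts."""
    trace_kwargs = dict(
        node=dict(label=label_list, color=colour_list, pad=15, thickness=20),
        link=dict(source=source_ids, target=target_ids, value=counts),
    )
    # Assumption: a fixed width per step keeps the steps evenly spread out
    layout_kwargs = dict(
        title_text="User journey",
        width=n_steps * width_per_step,
        height=600,
    )
    return trace_kwargs, layout_kwargs


def draw_sankey(trace_kwargs, layout_kwargs):
    """Build the plotly figure from the prepared dicts."""
    import plotly.graph_objects as go

    fig = go.Figure(go.Sankey(**trace_kwargs))
    fig.update_layout(**layout_kwargs)
    return fig


# Hypothetical inputs, as produced by the previous steps
trace_kwargs, layout_kwargs = build_sankey_spec(
    label_list=["1: EventId: 1", "2: EventId: 32", "2: End"],
    colour_list=["blue", "blue", "blue"],
    source_ids=[0, 0],
    target_ids=[1, 2],
    counts=[5, 1],
    n_steps=4,
)

# To render the chart:
# fig = draw_sankey(trace_kwargs, layout_kwargs)
# fig.show()
```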

This was generated using: