Sankey Diagram In SQL

Kadek Byan Prihandana Jati
6 min readSep 27, 2020

--

(If you are just looking for SQL reference, you could scroll to down of the page to get the snippet of Sankey Diagram in SQL).

Based on R Graph gallery, A Sankey diagram allows to study flows. Entities (nodes) are represented by rectangles or text. Which, when we want to know the user journey or overseeing flow of our application requests, we could use Sankey Diagram.

In this article, I want to show you how to create the sankey diagram on redash and the logics of data behind it.

The study case here is we want to show user flow through an application, so we need to know how many session that 1 user have in 1 day and what pages are visited in that 1 day sessions.

Sankey study case

Data Preparation

Stage 1: Raw session visit data

When you want to inspect the user flow, you do have a lot of session in 1 day from a lot of users that visit a lot of page combination. It could be they visit Home to Page A to Page B or from Home to Page C to Page D.

Table 1.1 This table could represent session in 1 day and the page visited

On the table 1.1, we have a lot of different visitor_id and session_id which access several pages at certain time. After this, we will transform that table into a more aggregated table, before we could transform the data into sankey diagram.

Stage 2: First aggregation from raw sessions

At this stage, we want to find a session start and session end per user_id per session_id. So we need to aggregate the raw sessions data first into table 1.1

Table 1.2 Aggregated version from table 1.1

On table 1.1 we have the information of session time per session_id, so by knowing this information we could gather more information about what kind of page that session_id has visit from session_start to session_end.

Stage 3: Find what page is visited given session_start and session_end

By having table 1.1 and table 1.2, we could perform a join table to find what page is visited given session_start and session_end. Because table 1.1 have the detailed page visited occurred at certain time per session_id and table 1.2 have the information about the start_time and end_time of session.

Table 1.3 page visited per session with number order

Having the table 1.3, we have the information that given 1 session, user has visited page A and page B in ordered event number. The interpretation is, user_id 0078321 in 2nd session has access page B and then page A (because page B is on event_number 1).

Stage 4: Group 1 session with first 5 events

Table 1.4 1 user_id has a number of session with 5 sequence events

This is the last transformation before we go to sankey definition. In table 1.4 we already map the session number to its first 5 events. The e1 column is represent max event in table 1.3 from event_number = 1, it might be on table 1.3, certain user has a lot of events occured in the same timestamp.

After this, we will do the final transformation to get the final form of sankey diagram data.

Table 1.5 Final form of data to shown as Sankey Diagram.

By having table 1.4, we already have a lot of combination from the first 5 events. So the final transformation is just group that combination across all users and count(*) the number of combination record across all users to be the value of column “value” at table 1.5.

Visualizing Sankey

Image 1.2 Sankey using table 1.5

Voilla! The Sankey Diagram is built upon table 1.5. There are also several points which need for me to add before creating a Sankey Diagram:

  1. Make sure your session and page event order is make sense. There are some cases when 1 session has unordered event pageview. If the pagview order is wrong, your Sankey visualization will be wrong to represents the flow of your users.
  2. Always make the Sankey Arm thicker by grouping some pages or the nodes to connect. If there are some connection between pages, we could create 1 group of page to make the Sankey Arm thicker and easier to read.
  3. Sankey on redash just limited for 5 stage visualization. If you want more, maybe you could use R or Tableau to create more stage of visualization. But for me, the more stage that you have will not add more value or insight to your visualization.
  4. Also make sure your data is clean. In my process to create this Sankey Diagram, there are a lot of record which dirty and needs to be cleaned like the app giving burst events when the app is launched that makes our Sankey is thicker at first.

If you have any question, just put some comments in comment section below.

References:

  1. https://discuss.redash.io/t/flow-sequence-queries-sankey-and-sunburst-visualizations/1703 (I get to learn the query and fit to my context from here, here the query is trying to create their own session).
  2. If you are looking for SQL references (also from the 1st reference), you could paste this SQL on redash (if you have one) and point the table to redash table.
WITH
events AS (
select user_id,
action || ' ' || object_type as event_name,
created_at as occurred_at
from events
where created_at > current_date - 30
and not ((action = 'view' and object_type='widget') or (action = 'view' and object_type='visualization')
or (action = 'view' and object_type='query')
or action = 'execute'
or action ='execute_query')
),
sequences as (
SELECT e1,
e2,
e3,
e4,
e5,
COUNT(*) AS value
FROM (
SELECT user_id,
session_number,
MAX(CASE WHEN event_number = 1 THEN event_name ELSE NULL END) AS e1,
MAX(CASE WHEN event_number = 2 THEN event_name ELSE NULL END) AS e2,
MAX(CASE WHEN event_number = 3 THEN event_name ELSE NULL END) AS e3,
MAX(CASE WHEN event_number = 4 THEN event_name ELSE NULL END) AS e4,
MAX(CASE WHEN event_number = 5 THEN event_name ELSE NULL END) AS e5
FROM (
SELECT e.user_id,
e.occurred_at,
s.session_number,
e.event_name,
ROW_NUMBER() OVER (PARTITION BY e.user_id, s.session_number ORDER BY e.occurred_at) AS event_number
FROM (
SELECT user_id,
occurred_at AS session_start,
COALESCE(LEAD(occurred_at,1) OVER (PARTITION BY user_id ORDER BY occurred_at),'2020-01-01') AS session_end,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY occurred_at) AS session_number
FROM (
SELECT user_id,
occurred_at,
EXTRACT(
'EPOCH' FROM occurred_at - LAG(occurred_at,1) OVER (PARTITION BY user_id ORDER BY occurred_at)
)/60 AS time_to_last_event
FROM events
) bounds
WHERE time_to_last_event > 30
OR time_to_last_event IS NULL
) s
JOIN events e
ON e.user_id = s.user_id
AND e.occurred_at >= s.session_start
AND e.occurred_at < s.session_end
) x
GROUP BY 1,2
) z
GROUP BY 1,2,3,4,5
ORDER BY 6 DESC
LIMIT 100
)
select * from sequences limit 100

--

--