Sankey Diagram Visualization

Published in

Splunk Engineering

10 min read · Mar 28, 2022

Introduction

This article talks about the Sankey visualization component published as part of Splunk’s visualization package (@splunk/visualizations). Here we will talk about:

An overview of Sankey diagrams and their uses
The technical details around the implementation
Steps to consume the Sankey component, with an explanation of the different customization options
Editor configurations available when the component is consumed via the Splunk Unified Dashboard Framework
Details about customization possibilities

What are Sankey Diagrams?

A Sankey diagram is a flow visualization used to depict the directed flow between nodes in a network. This visualization type is well suited to representing flows or processes and to comparing the relative share of each flow.

Image: Sankey diagram visualizing the flow of charity funding in non-profit organizations.

Sankey diagrams consist of three sets of elements: the nodes, the links, and the link values. The entities being connected are the nodes, and the connections between them are the links. The quantity of flow on each link is given by its link value. The width of each link is proportional to its flow quantity, so a link that is twice as wide as another represents double the flow.
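
To make the proportionality concrete, here is a minimal sketch. The `linkWidths` helper and the `pixelsPerUnit` scale factor are hypothetical names used only for illustration; they are not part of @splunk/visualizations.

```typescript
// Hypothetical helper (not part of @splunk/visualizations): converts link
// values to pixel widths using one shared scale factor.
function linkWidths(values: number[], pixelsPerUnit: number): number[] {
    return values.map((value) => value * pixelsPerUnit);
}

// Two links where the second carries double the flow of the first.
const widths = linkWidths([15, 30], 2);
console.log(widths); // [ 30, 60 ]
```

Because every link shares the same scale factor, the ratio between any two widths equals the ratio between their link values.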

Sankey diagrams are best used for representing:

Many-to-many mapping between two domains (such as universities and majors)
Multiple paths through a set of stages (how traffic flows from pages to other pages on a website)

Depending on the industry sector, data analysts and researchers use Sankey diagrams to visualize data and gain quick insights for different use cases. In the science and engineering field, it is a standard model to represent heat balance, energy flows, and material flows using Sankey diagrams. In the information technology field, a common use case is to analyze network flows and web traffic using this visualization.

Technology Stack

Here is the technical stack used for implementing the Sankey visualization component:

TypeScript: The primary programming language adopted by frontend teams that own medium-sized and large projects, chosen for its static type checking and better modularity.
React: The framework used to implement client-side apps and components.
SVG: Used for rendering the individual elements: nodes, links, and text labels.
Sankey layout plugin: The most challenging aspect of a Sankey visualization is computing the layout from the input data: determining each node’s height and position and the shape of the links (which are SVG paths built from a combination of lines and splines). Many third-party libraries were evaluated, such as d3-sankey, the Sankey chart from Google Charts, Plotly, and ECharts. Based on feature parity with our near-term and long-term requirements, we used an extended version of d3-sankey, refactored to support backward links and self-loops in the layout algorithm.
Splunk React UI components: Splunk has a library of UI components that implement the Splunk design language in React (@splunk/react-ui). It comes with default support for Splunk themes and type definitions. Some basic UI elements, such as text and tooltips, are rendered using this package.

In a nutshell, the Sankey component uses React for the rendering layer and data binding. Internally, it uses the d3-sankey algorithm for the link path generation and positioning of the nodes, which finally gets rendered as SVG elements.
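
As a rough illustration of one input to such a layout, the sketch below sizes each node from its links. It assumes a node's size is the larger of its total inflow and total outflow, which is a common convention in Sankey layouts; the `Link` type and `nodeSizes` function are hypothetical and do not reflect the component's internal code.

```typescript
interface Link {
    source: string;
    target: string;
    value: number;
}

// Sketch only: size each node as the larger of its total inflow and total
// outflow. This is an assumed simplification, not the component's real code.
function nodeSizes(links: Link[]): Map<string, number> {
    const inflow = new Map<string, number>();
    const outflow = new Map<string, number>();
    links.forEach(({ source, target, value }) => {
        outflow.set(source, (outflow.get(source) ?? 0) + value);
        inflow.set(target, (inflow.get(target) ?? 0) + value);
    });
    const sizes = new Map<string, number>();
    const record = (name: string) =>
        sizes.set(name, Math.max(inflow.get(name) ?? 0, outflow.get(name) ?? 0));
    inflow.forEach((_total, name) => record(name));
    outflow.forEach((_total, name) => record(name));
    return sizes;
}

const energyLinks: Link[] = [
    { source: 'Oil', target: 'Fossil Fuels', value: 15 },
    { source: 'Natural Gas', target: 'Fossil Fuels', value: 20 },
    { source: 'Coal', target: 'Fossil Fuels', value: 25 },
    { source: 'Coal', target: 'Electricity', value: 25 },
    { source: 'Fossil Fuels', target: 'Energy', value: 60 },
    { source: 'Electricity', target: 'Energy', value: 25 },
];
const sizes = nodeSizes(energyLinks);
console.log(sizes.get('Coal'), sizes.get('Energy')); // 50 85
```

A real layout would go on to assign each node a column and a vertical position and to generate the SVG path for each link; those steps are omitted here.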

We could also have used pure d3 rendering instead of React to glue the SVG/DOM elements together. However, rendering with React has several advantages over pure d3-based rendering:

There is more control over the rendering logic.
React’s virtual DOM offers performance advantages with SVG: only the DOM nodes that need updating after a data change get re-rendered.
React prop changes and reactivity are much cleaner, without any forced updates or re-renders.
There is more flexibility for styling (for example, styled-components offers a great way to do component-based styling in React).
In the future, switching internally to a high-performance canvas layer instead of SVG for larger volumes of data can be handled smoothly, keeping the same React rendering layer externally without any major refactoring.

Steps to consume Sankey

Splunk end-users and application developers can consume the Sankey component using one of these two options:

Installing the npm package (@splunk/visualizations) within any React-based consuming web application. This option is suitable for application developers.
Using Splunk’s Dashboard Studio, which comes built in with every Splunk Enterprise and Splunk Cloud Platform release starting from the 8.2.2109 release. This option is for Splunk end users.

Using NPM

Install @splunk/visualizations within the React app:

1. Install peer dependencies:

npm install react@^16 react-dom@^16 styled-components@5 @splunk/visualization-context --save

2. Install the visualizations package:

npm i @splunk/visualizations

Consume it within a JSX component as shown below.

Image: Code to consume Sankey within JSX component

In the source code above, the ‘dataSources’ property assigns the data source used to render the Sankey component:
- The ‘primary.data’ key of ‘dataSources’ holds the network activity data for page navigation within a website.
- Inside the ‘primary.data’ object, the ‘columns’ key holds a two-dimensional array of the navigation links between the website’s pages.

In the ‘columns’ array:

columns[0] lists the starting page name of each link.
columns[1] lists the target page name of each link.
columns[2] lists the navigation count from the source page to the target page for each link.

The ‘fields’ array represents the corresponding column field names.

This results in the following Sankey:

Image: Visualize network traffic using Sankey diagram
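
If the navigation data starts out as individual page-to-page transitions, the ‘columns’ layout described above can be built by counting repeated source and target pairs. The following is a minimal sketch; the `PageView` type, the `toSankeyColumns` function, and the page names are all made up for illustration.

```typescript
interface PageView {
    from: string;
    to: string;
}

interface SankeyData {
    columns: [string[], string[], number[]];
    fields: string[];
}

// Hypothetical helper: counts how often each (from, to) pair occurs and lays
// the result out as parallel source, target, and count columns.
function toSankeyColumns(views: PageView[]): SankeyData {
    const sources: string[] = [];
    const targets: string[] = [];
    const counts: number[] = [];
    const indexByPair = new Map<string, number>();
    views.forEach(({ from, to }) => {
        const key = JSON.stringify([from, to]);
        const existing = indexByPair.get(key);
        if (existing === undefined) {
            indexByPair.set(key, counts.length);
            sources.push(from);
            targets.push(to);
            counts.push(1);
        } else {
            counts[existing] = (counts[existing] ?? 0) + 1;
        }
    });
    return { columns: [sources, targets, counts], fields: ['source', 'target', 'count'] };
}

// Illustrative sample data, not taken from the article's screenshots.
const navigation = toSankeyColumns([
    { from: 'home', to: 'products' },
    { from: 'products', to: 'cart' },
    { from: 'home', to: 'products' },
    { from: 'home', to: 'about' },
]);
console.log(navigation.columns[2]); // [ 2, 1, 1 ]
```

The resulting object has the same shape as the ‘primary.data’ value passed through ‘dataSources’.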

Here is another simpler example with a full dataset in the code:

Image: Code to consume Sankey within JSX with sample data

Image: Default rendering of Sankey using the code above

To help build web applications faster, Splunk also provides UI toolkits as part of the Splunk Web Platform. The package @splunk/create generates scaffolding for minimal React components and applications. Consumers can also bootstrap a React project using the @splunk/create command, which installs dependencies such as @splunk/react-ui and @splunk/themes and creates a minimal app using SplunkThemeProvider and Splunk UI components. If Sankey or other visualization components from the library are needed within the app, just add @splunk/visualizations as an additional package dependency.

Generate a Sankey diagram in Splunk Dashboard Studio

Follow these steps to add Sankey diagram visualizations using Splunk Dashboard Studio:

Select the Sankey diagram using the visual editor by clicking the Add Chart button in the editing toolbar and either browsing through the available charts or by using the search option.
Select the chart on your dashboard so that it’s highlighted with the blue editing outline.
Set up a new data source by adding a search to the Search with SPL window.
To select an existing data source, close the Configuration panel and reopen it. In the Data Configurations section, click +Setup Primary Data Source and click +Create Search to create a new search from this window. You can also choose a new ID that describes the search better than the default.

Input DataSource Format

The input data is set as a two-dimensional array using the `primary.data.columns` key in the `dataSources` option. By default, the first two entries of the columns array (columns[0] and columns[1]) are treated as the source and target node arrays, respectively. The first column of number type is used by default as the link values array. The `fields` array gives the field name assigned to each column.

import React from 'react';
import Sankey from '@splunk/visualizations/Sankey';

export default () => (
    <Sankey
        dataSources={{
            primary: {
                data: {
                    columns: [
                        ['Oil', 'Natural Gas', 'Coal', 'Coal', 'Fossil Fuels', 'Electricity'],
                        ['Fossil Fuels', 'Fossil Fuels', 'Fossil Fuels', 'Electricity', 'Energy', 'Energy'],
                        [15, 20, 25, 25, 60, 25],
                    ],
                    fields: ['Source', 'Target', 'Value'],
                },
            },
        }}
    />
);

In the example above, the input data can be visualized in the form of a table, with each row representing a link (source, target, value).
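
The table view can be produced by transposing the columns into rows. The sketch below is illustrative only; `ColumnarData` and `toRows` are hypothetical names, not types or functions exported by the package.

```typescript
type Cell = string | number;

interface ColumnarData {
    columns: Cell[][];
    fields: string[];
}

// Hypothetical helper: turns parallel columns into one record per link,
// keyed by the field names.
function toRows(data: ColumnarData): Array<Record<string, Cell>> {
    const rowCount = Math.max(0, ...data.columns.map((column) => column.length));
    const rows: Array<Record<string, Cell>> = [];
    for (let i = 0; i < rowCount; i += 1) {
        const row: Record<string, Cell> = {};
        data.fields.forEach((field, columnIndex) => {
            const cell = data.columns[columnIndex]?.[i];
            if (cell !== undefined) {
                row[field] = cell;
            }
        });
        rows.push(row);
    }
    return rows;
}

const rows = toRows({
    columns: [
        ['Oil', 'Natural Gas', 'Coal', 'Coal', 'Fossil Fuels', 'Electricity'],
        ['Fossil Fuels', 'Fossil Fuels', 'Fossil Fuels', 'Electricity', 'Energy', 'Energy'],
        [15, 20, 25, 25, 60, 25],
    ],
    fields: ['Source', 'Target', 'Value'],
});
console.log(rows[0]); // { Source: 'Oil', Target: 'Fossil Fuels', Value: 15 }
```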

For those using an SPL query as the data source within a dashboard visualization, the `table` command comes in handy for producing the data format above:

index=_internal
| where isnotnull(bytes)
| table host sourcetype bytes
| head 100

Customizable options

There are several options that are customizable for Sankey like backgroundColor, linkOpacity, and linkValues. You can refer to the options tab in Sankey documentation for more information.

All the options support `dynamic option` strings using our visualization DSL (Domain Specific Language). This allows options to bind to data dynamically and provides a rich, data-driven visualization configuration experience. Read more about the DSL in this link.

For example, the option `linkValues` has a default dynamic option DSL string of `> primary | seriesByType('number')`. This automatically binds the linkValues option to the first column that is of number type. To bind `linkValues` to a specific field name such as ‘value2’, change the DSL string to `> primary | seriesByName('value2')`.

<Sankey
    options={{
        linkValues: '> primary | seriesByName(\'value2\')',
    }}
    dataSources={{
        primary: {
            data: {
                columns: [
                    ['Oil', 'Natural Gas', 'Coal', 'Coal', 'Fossil Fuels', 'Electricity'],
                    ['Fossil Fuels', 'Fossil Fuels', 'Fossil Fuels', 'Electricity', 'Energy', 'Energy'],
                    [15, 20, 25, 25, 60, 25],
                    [1, 20, 3, 4, 50, 6],
                ],
                fields: ['source', 'target', 'value', 'value2'],
            },
        },
    }}
/>
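
To illustrate the difference between the two selectors, here is a rough sketch of the behavior described above. `firstNumericSeries` and `seriesNamed` are hypothetical stand-ins written for this example; they are not the DSL's implementation, and the rule used here for deciding that a column is of number type (every cell is a number) is an assumption.

```typescript
type Cell = string | number;

interface ColumnarData {
    columns: Cell[][];
    fields: string[];
}

// Hypothetical stand-in for selecting by type: returns the first column
// whose cells are all numbers (assumed rule for "number type").
function firstNumericSeries(data: ColumnarData): Cell[] | undefined {
    return data.columns.filter(
        (column) => column.length > 0 && column.every((cell) => typeof cell === 'number')
    )[0];
}

// Hypothetical stand-in for selecting by name: returns the column whose
// position matches the given field name.
function seriesNamed(data: ColumnarData, name: string): Cell[] | undefined {
    const index = data.fields.indexOf(name);
    return index === -1 ? undefined : data.columns[index];
}

const sample: ColumnarData = {
    columns: [
        ['Oil', 'Natural Gas', 'Coal', 'Coal', 'Fossil Fuels', 'Electricity'],
        ['Fossil Fuels', 'Fossil Fuels', 'Fossil Fuels', 'Electricity', 'Energy', 'Energy'],
        [15, 20, 25, 25, 60, 25],
        [1, 20, 3, 4, 50, 6],
    ],
    fields: ['source', 'target', 'value', 'value2'],
};
console.log(firstNumericSeries(sample)); // the 'value' column
console.log(seriesNamed(sample, 'value2')); // the 'value2' column
```

With the default selector the link widths follow the ‘value’ column, whereas selecting by name switches them to ‘value2’.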

Coloring modes

Sankey supports two coloring modes:

Categorical
Dynamic

The option ‘colorMode’ specifies the coloring method used for the nodes and links.

Categorical

When ‘colorMode’ is set to ‘categorical’, the nodes and links are colored based on ‘seriesColors’. This is the default colorMode. Use ‘categorical’ if you want to color the Sankey by unique node: each node is assigned a color from the ‘seriesColors’ list, and each link takes the color of its source node.

A default ‘seriesColors’ list is provided, and it is configurable too.

Code for categorical coloring and seriesColors override:

<Sankey
    options={{
        seriesColors: ['#9980FF', '#45D4BA', '#FB865C', '#66AAF9', '#E85B79', '#88EE66', '#F0B000'],
    }}
    dataSources={{
        primary: {
            data: {
                columns: [
                    ['Oil', 'Natural Gas', 'Coal', 'Coal', 'Fossil Fuels', 'Electricity'],
                    ['Fossil Fuels', 'Fossil Fuels', 'Fossil Fuels', 'Electricity', 'Energy', 'Energy'],
                    [15, 20, 25, 25, 60, 25],
                ],
                fields: ['source', 'target', 'value'],
            },
        },
    }}
/>
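
The categorical rule can be sketched as follows. Two details here are assumptions rather than documented behavior: nodes receive colors in order of first appearance in the links, and the color list wraps around when there are more nodes than colors. The `categoricalColors` function is a hypothetical name.

```typescript
interface Link {
    source: string;
    target: string;
    value: number;
}

// Sketch only. Assumptions: colors are handed out in order of first
// appearance, and the list wraps around if nodes outnumber colors.
function categoricalColors(links: Link[], seriesColors: string[]): Map<string, string> {
    if (seriesColors.length === 0) {
        throw new Error('seriesColors must not be empty');
    }
    const colors = new Map<string, string>();
    links.forEach(({ source, target }) => {
        [source, target].forEach((name) => {
            if (!colors.has(name)) {
                colors.set(name, seriesColors[colors.size % seriesColors.length] as string);
            }
        });
    });
    return colors;
}

const palette = ['#9980FF', '#45D4BA', '#FB865C', '#66AAF9', '#E85B79', '#88EE66', '#F0B000'];
const links: Link[] = [
    { source: 'Oil', target: 'Fossil Fuels', value: 15 },
    { source: 'Natural Gas', target: 'Fossil Fuels', value: 20 },
    { source: 'Coal', target: 'Fossil Fuels', value: 25 },
    { source: 'Coal', target: 'Electricity', value: 25 },
    { source: 'Fossil Fuels', target: 'Energy', value: 60 },
    { source: 'Electricity', target: 'Energy', value: 25 },
];
const nodeColors = categoricalColors(links, palette);

// Each link takes the color of its source node.
const linkColors = links.map((link) => nodeColors.get(link.source));
console.log(nodeColors.get('Coal'), linkColors[3]); // both '#66AAF9'
```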

Dynamic

When ‘colorMode’ is set to ‘dynamic’, links are colored based on the dynamic option string assigned to ‘linkColors’. By default, linkColors is assigned the DSL string `> linkValues | rangeValue(linkColorRangeConfig)`. This means that the color of each link is chosen by looking up its value from the `linkValues` array in the `linkColorRangeConfig` setting. The `linkColorRangeConfig` setting can be passed in as part of the `context`. This is helpful when you want to showcase a second numeric dimension or highlight the first one based on specific thresholds.

Code for dynamic coloring:

<Sankey
    context={{
        linkColorRangeConfig: [
            { to: 20, value: '#D41F1F' },
            { from: 20, to: 40, value: '#D94E17' },
            { from: 40, to: 60, value: '#CBA700' },
            { from: 60, to: 80, value: '#669922' },
            { from: 80, value: '#118832' },
        ],
    }}
    options={{
        colorMode: 'dynamic',
    }}
    dataSources={{
        primary: {
            data: {
                columns: [
                    ['Oil', 'Natural Gas', 'Coal', 'Coal', 'Fossil Fuels', 'Electricity'],
                    ['Fossil Fuels', 'Fossil Fuels', 'Fossil Fuels', 'Electricity', 'Energy', 'Energy'],
                    [15, 20, 25, 25, 60, 25],
                ],
                fields: ['source', 'target', 'value'],
            },
        },
    }}
/>
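
The range lookup can be pictured with the sketch below. It assumes that `from` is inclusive, that `to` is exclusive, and that the first matching range wins; these boundary rules, the `colorForValue` function, and the fallback color are assumptions made for illustration, not the DSL's confirmed behavior.

```typescript
interface ColorRange {
    from?: number;
    to?: number;
    value: string;
}

// Sketch only. Assumptions: `from` is inclusive, `to` is exclusive, and the
// first matching range wins. The fallback color is also hypothetical.
function colorForValue(value: number, ranges: ColorRange[], fallback = '#000000'): string {
    const match = ranges.filter(
        (range) =>
            (range.from === undefined || value >= range.from) &&
            (range.to === undefined || value < range.to)
    )[0];
    return match ? match.value : fallback;
}

const linkColorRangeConfig: ColorRange[] = [
    { to: 20, value: '#D41F1F' },
    { from: 20, to: 40, value: '#D94E17' },
    { from: 40, to: 60, value: '#CBA700' },
    { from: 60, to: 80, value: '#669922' },
    { from: 80, value: '#118832' },
];
const rangeColors = [15, 20, 25, 25, 60, 25].map((v) => colorForValue(v, linkColorRangeConfig));
console.log(rangeColors[0], rangeColors[4]); // '#D41F1F' '#669922'
```

Under these assumptions, the link with value 15 falls in the first range and the link with value 60 falls in the fourth; values that sit exactly on a boundary are the ones most sensitive to the inclusive or exclusive choice.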

Additional Coloring Flexibility Using DSL

Let us consider an example where the link widths stay based on the ‘value’ field, but we want to color the links based on a different field, such as ‘rating’. You can do so using the following options:

Image: More customized example for coloring mode

<Sankey
    context={{
        ratingColorConfig: [
            { from: 0, to: 0.9, value: '#D41F1F' },
            { from: 1, to: 1.9, value: '#D97A0D' },
            { from: 2, to: 2.9, value: '#CBA700' },
            { from: 3, to: 3.9, value: '#9D9F0D' },
            { from: 4, value: '#118832' },
        ],
    }}
    options={{
        colorMode: 'dynamic',
        linkColors: '> primary | seriesByName(\'rating\') | rangeValue(ratingColorConfig)',
        linkValues: '> primary | seriesByName(\'value\')',
    }}
    dataSources={{
        primary: {
            data: {
                columns: [
                    ['Oil', 'Natural Gas', 'Coal', 'Coal', 'Fossil Fuels', 'Electricity'],
                    ['Fossil Fuels', 'Fossil Fuels', 'Fossil Fuels', 'Electricity', 'Energy', 'Energy'],
                    [15, 20, 25, 25, 60, 25],
                    [1, 2, 3, 4, 0.5, 5],
                ],
                fields: ['source', 'target', 'value', 'rating'],
            },
        },
    }}
/>

User Interactions

When you place the mouse cursor over a node, all the links and nodes connected to that node are highlighted. A tooltip also shows the node and its link values along with the associated source and target nodes.

When you place the mouse cursor over a link, that link and the nodes connected to it are highlighted. A tooltip shows the link value and the associated source and target nodes.
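
The node hover behavior can be approximated with the sketch below. It assumes that "connected" means the links directly attached to the hovered node plus the nodes at their other ends; whether the component also follows links further upstream or downstream is not stated here. `highlightForNode` is a hypothetical name.

```typescript
interface Link {
    source: string;
    target: string;
    value: number;
}

interface Highlight {
    links: Link[];
    nodes: string[];
}

// Sketch only. Assumption: only directly attached links and their endpoint
// nodes are highlighted.
function highlightForNode(links: Link[], hovered: string): Highlight {
    const attached = links.filter((link) => link.source === hovered || link.target === hovered);
    const nodes: string[] = [hovered];
    attached.forEach((link) => {
        [link.source, link.target].forEach((name) => {
            if (nodes.indexOf(name) === -1) {
                nodes.push(name);
            }
        });
    });
    return { links: attached, nodes };
}

const flowLinks: Link[] = [
    { source: 'Oil', target: 'Fossil Fuels', value: 15 },
    { source: 'Natural Gas', target: 'Fossil Fuels', value: 20 },
    { source: 'Coal', target: 'Fossil Fuels', value: 25 },
    { source: 'Coal', target: 'Electricity', value: 25 },
    { source: 'Fossil Fuels', target: 'Energy', value: 60 },
    { source: 'Electricity', target: 'Energy', value: 25 },
];
const highlight = highlightForNode(flowLinks, 'Fossil Fuels');
console.log(highlight.nodes); // [ 'Fossil Fuels', 'Oil', 'Natural Gas', 'Coal', 'Energy' ]
```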

Editor UI

If Sankey is used through Splunk’s Dashboard Studio, there is an editor interface that makes it easier to customize all the exposed options. The editor appears when you click the visualization component panel in edit mode.

Conclusion

The Sankey diagram is one of the best visualizations for showing the flow of big data from source nodes to destination nodes through one or more levels. The benefits are:

It is very useful when handling big data and visualizing the flow of data among many parameters.
It helps to uncover inconsistencies in data easily thanks to its simple, intuitive, and informative design.

I strongly encourage everyone to give it a try and provide your feedback at Splunk Ideas.

To learn more about other visualizations in Splunk’s visualization package, read our documentation.

If you are interested in exploring more, do check out Splunk’s UI toolkit packages page.

If you enjoyed reading this post and would love to work with us in one of Splunk’s cool product teams, apply to one of our many engineering positions through Splunk’s careers webpage.