Network graphs analysis (Part 1 of 2): Introduction

Published in

Data Science at Microsoft

6 min readApr 6, 2021

A few years ago, I was binge reading the Game of Thrones books and found myself having a hard time keeping track of all the characters in my head. (This is not surprising — there are more than 150 named characters in the series!) I was going back and forth between chapters or constantly looking up the wiki A Song of Ice and Fire to remember plotlines. I needed a mental map — surely there was a better way to visualize these characters? And behold: The Internet rewarded me with this incredible Network of Thrones analysis, published by Andrew Beveridge and Jie Shan. Thus started my fascination with network graphs — I was intrigued by the possibility of visualizing complex relationships among fictional characters as interconnected networks.

Even outside of pop culture and literature, network graphs can be a powerful — albeit sometimes underrated — tool within data science to analyze datasets. In the first of this two-part series of articles, I discuss the basics of network graphs and some real-world applications. In the second article, I walk through a step-by-step example of how you can build your own “social network” using Hamilton characters as an example (yes, that Hamilton) with R’s igraph library.

But first, what are network graphs?

Diving into the basics

Pictured here is a sample network graph from Wikipedia that illustrates contributions of Wikipedia editors to different languages. Using this example, here are some basics (or a quick refresher, if you’re already familiar) of graph theory concepts:

Circles representing the languages in which articles were written are the “vertices” of the graph (interchangeably, the “nodes”).
The “edges” are the lines connecting each pair of vertices. Each edge in the graph is determined through an incidence function that maps a pair of vertices to an edge.

In this example, each edge represents (by line weight, or thickness) the number of editors who have contributed to both the languages that the line connects.

This is what we call an undirected simple graph — “undirected” meaning {en--> fr} and {fr --> en} are identical, and “simple” meaning no more than one edge connects each pair of vertices. The graph is also “weighted,” meaning that the thickness of the edges is relative to the strength of relationship between the vertices. In this example the weighted incidence function could look something like this:

While the visual representation of graphs in this way is an intuitive approach to quickly show relationships so that they are easy to comprehend, there are even richer insights we can derive from representing a dataset as a graph object. Centrality is a key concept in graph theory to identify the significance of nodes:

Degree centrality: This is a measure of the number of edges connected to each node.
Eigen centrality: This represents a measure of how “well connected” a node is, how many links connections share, and so on through the network. It identifies nodes with influence over the whole network, not just those directly connected to it.
Betweenness centrality: This is literally how much a given node is between other nodes and acts as a “bridge” among various clusters of networks. It’s a measure of the “influence” of each of the vertices on the rest of the network.

Applications of network graphs

Business applications of network graphs are numerous:

Social networking sites make use of network graphs to create communities of similar users and offer targeted recommendations. A rudimentary implementation of the algorithm behind a “suggested friends” feature could look something like this: “Nine out of ten of Alice’s immediate friends are also friends with Bob -> recommend Bob as a potential friend for Alice.”
Applications that map the shortest distance from place X to place Y (such as maps, ride-sharing services, supply chain and logistics for delivery trucks, and so on) likely use variants of “shortest path” algorithms, popularly known in computer science as the traveling salesman problem.
Network theory is a crucial component of lexical and semantic processing within natural language processing (NLP), in turn used among chatbots and virtual assistants like Alexa, Cortana, Siri, and even IBM’s Watson winning Jeopardy!, a game of puns and wordplay that is far from straightforward.
Name-dropping party games like Six Degrees of Kevin Bacon that use network graphs.
In epidemiology, centrality measures may be used in identifying origins of pandemics or “super spreader” events.
If you think about it, the Internet is simply one gargantuan network of different websites. Search engines make use of knowledge graph measures to return the most relevant pages for a particular search query.

A business example: Grouping similar features using a network graph

Within the Microsoft Endpoint Manager data science team we have experimented with network graphs to make sense of certain complex datasets. Here’s an example:

Problem statement

Our dataset was a “wish list” of new features requested by existing customers. Each of the requested features had been manually entered into a CRM system by product managers working directly with customers. As is the case with any unstructured text data, the dataset was subjective and highly contextual and super tricky to parse. Different product managers talking to different customers may have entered their own description of a feature, but the desired outcome was ultimately the same. From a planning perspective, it was important to understand which features needed to be prioritized for engineering investment, as well as identify duplicate and related items to make sure development effort was not duplicated.

Implementation

First, we built an nXn sparse-weighted adjacency matrix for the number of features (n being the total number of requested features). A similarity index (SI) is defined for each pair in the matrix, based on how many similar customers had requested the same two features. We can hypothesize that the higher the SI number, the more related are the two features.

We then constructed a graph object from the adjacency matrix and plotted it. (I dive into this in more detail in Part 2, but I’ve included a few notes immediately below.)

Each circle represents the features (F1, F2, F3).The thickness of each edge is relative to the SI.

When we presented this graph, our stakeholders were able to quickly discern features that are strongly correlated. It is intuitive and easily interpretable — and prompted an immediate discussion on where it made sense to synchronize product development efforts.

Summary

I hope this article underlines the value that network graphs can provide in helping you understand your customer. In the forthcoming Part 2, I go into more detail, demonstrating how to transform a flat dataset of Hamilton lyrics into a social network graph of the characters.

Iswarya Murali is on LinkedIn.