Introduction to Graph Analytics

Selen Parlar
Analytics Vidhya
Published in
7 min readOct 9, 2019

In the Introduction to Graphs post, we have made a gentle introduction and we have seen some of the problems that can be modeled using graphs. In this post, we will be talking about social networks and perform some graph analytics on them.

Introduction

Any network with connections between individuals can be considered as social networks. Below is an example social network of male/female actors as nodes that are connected by edges if the two actors performed together in a movie. In this graph, the male/female actors, i.e. the nodes are Tom Hiddleston, Chris Hemsworth, Robert Downey Jr, Brie Larson, Jude Law, Zach Galifianakis, Michael Keaton, and Tom Holland. For instance, both Tom Hiddleston and Chris Hemsworth performed together in Thor movie and they performed with Robert Downey Jr in Avengers: EndGame together.

When we analyze this network we can get some insights about individuals. For instance, Robert Downey Jr has 4 different movies in common with other actors and one can propose that he might be a popular actor. We derive this by visual inspection but, how do we get this implication from a computer? We’ll see.

Creating and Testing a Network

Let’s analyze this network from a Computer Scientist perspective. To do so, first we need to create a graph structure and represent all the nodes and relations of the given network within that graph structure. In this post, to develop and analyze networks we will use NetworkX which is a Python package for the creation, manipulation, and study of the structure of complex networks. One can easily install NetworkX by using the following commands:

pip install networkx

or

conda install -c anaconda networkx

After the installation of the NetworkX, we create an empty graph G. Newly created graph is an undirected graph since the relationships in the above network are reciprocal.

import networkx as nx
import matplotlib.pyplot as plt
# Create an empty graph with no nodes and no edges
G = nx.Graph()

We can continue with adding nodes to this graph and then inserting necessary edges between the corresponding nodes. Here the nodes are the individuals in the networks and edges are the connection between individuals. To keep it short, we directly add edges between given nodes.

# Add edges to the graph G
G.add_edge('Tom Hiddleston','Chris Hemsworth')
G.add_edge('Tom Hiddleston','Robert Downey Jr')
G.add_edge('Chris Hemsworth', 'Brie Larson')
G.add_edge('Chris Hemsworth','Robert Downey Jr')
G.add_edge('Brie Larson', 'Jude Law')
G.add_edge('Robert Downey Jr', 'Jude Law')
G.add_edge('Robert Downey Jr', 'Zach Galifianakis')
G.add_edge('Robert Downey Jr', 'Tom Holland')
G.add_edge('Michael Keaton', 'Tom Holland')
G.add_edge('Michael Keaton', 'Zach Galifianakis')

Since we have a graph now we can visualize it and see its structure with the below code:

import matplotlib.pyplot as plt
nx.spring_layout(G) # for a better layout
nx.draw(G, with_labels=True, node_size=500, node_color="skyblue", node_shape="s", alpha=0.5, linewidths=40)
plt.draw()
A graph that is created a Python’s NetworkX package for the male\female actor graph.
Graph (G) created with NetworkX

Once we create a graph we can show some concepts of graphs such as degree, clustering coefficient, and distance.

  • The degree of a node is the number of connections a node has. Let’s see the degree of Robert Downey Jr from our graph G:
nx.degree(G, 'Robert Downey Jr')
>> 5
  • The local clustering coefficient for a node is the proportion of edges between nodes divided by the number of edges that could possibly exist between them. Let’s see the clustering coefficient of Robert Downey Jr from our graph G:
nx.clustering(G, 'Robert Downey Jr')
>> 0.1
  • The distance is the number of edges in the shortest path between two nodes in a graph. First, let’s see the shortest path between Brie Larson and Michael Keaton from our graph G:
nx.shortest_path(G, 'Brie Larson', 'Michael Keaton')
>>>['Brie Larson', 'Chris Hemsworth', 'Robert Downey Jr', 'Zach Galifianakis', 'Michael Keaton']

The output of the code means that; to go from Brie to Michael in the shortest way we should follow Chris, Robert and Zach nodes respectively. To find the distance between nodes we can use the breadth-first search (BFS) algorithm from NetworkX and get the BFS tree as an output.

T = nx.bfs_tree(G, 'Brie Larson')
nx.draw(T, with_labels=True, node_size=100, node_color="skyblue", node_shape="s", alpha=0.5, linewidths=40)
A BFS tree outputted for the given social network in the case of starting from Brie to every other node.
BFS Tree from Brie to every other node

Importance of a Node

Since we have a graph from now on we can venture into the world of graph analytics. In this post, we will be focusing on centrality based concepts.

Centrality is an important concept used for graphs to discover the most important nodes. For instance, one can identify the most influential actor in the social network that we have created beforehand using the centrality of that node. Let’s take a look at some social network analysis measures and see how they work and when they should be used in the network analysis applications.

Degree Centrality

Degree centrality assigns an importance score to the nodes according to the number of incoming and outgoing relationships from that node. Usually, there exists two measures of degree centrality in directed graphs; indegree and outdegree where the former one represents the number of incoming edges to that node and the latter one represents the number of outgoing edges from that node. If the edges represent positive relations like friendship, indegree can be interpreted as a form of popularity. Thus, the higher the degree of a node, the more important it is in a graph. Let’s see the degree centrality of nodes in our social network graph with NetworkX:

nx.degree_centrality(G)
>>
{'Tom Hiddleston': 0.2857142857142857,
'Chris Hemsworth': 0.42857142857142855,
'Robert Downey Jr': 0.7142857142857142,
'Brie Larson': 0.2857142857142857,
'Jude Law': 0.2857142857142857,
'Zach Galifianakis': 0.2857142857142857,
'Tom Holland': 0.2857142857142857,
'Michael Keaton': 0.2857142857142857}

As we have said in the introduction, Robert Downey Jr is the most popular actor amongst others with the degree centrality of 0.71.

Eigenvector Centrality

Eigencentrality is a measure of the importance of a node based on its neighbors. In contrast to degree centrality, we can identify nodes with influence over the whole network not just the connected ones to that node.

nx.eigenvector_centrality(G)
>>
{'Tom Hiddleston': 0.3590565590148091,
'Chris Hemsworth': 0.42675046506987974,
'Robert Downey Jr': 0.5830737258434554,
'Brie Larson': 0.25807944061957305,
'Jude Law': 0.2990832034539333,
'Zach Galifianakis': 0.2774831019335862,
'Tom Holland': 0.2774831019335862,
'Michael Keaton': 0.1973270236079402}

If a node is connected to highly important nodes, it will have a higher Eigencentrality value. Thus, we can see that Chris’s node gains importance since it is connected to an important node.

Closeness Centrality

Closeness centrality measures each node based on its nearness to all the other nodes within the network.

nx.closeness_centrality(G)
>>
{'Tom Hiddleston': 0.5384615384615384,
'Chris Hemsworth': 0.5833333333333334,
'Robert Downey Jr': 0.7777777777777778,
'Brie Larson': 0.4375,
'Jude Law': 0.5384615384615384,
'Zach Galifianakis': 0.5384615384615384,
'Tom Holland': 0.5384615384615384,
'Michael Keaton': 0.4117647058823529}

In our small graph G, we can say that Robert’s node is best placed to influence the network in the shortest time.

Betweenness Centrality

Betweenness centrality quantifies the number of times a node lies on the shortest path between other nodes. If the node has high betweenness centrality then it plays a significant role within the network.

nx.betweenness_centrality(G)
>>
{'Tom Hiddleston': 0.0,
'Chris Hemsworth': 0.14285714285714285,
'Robert Downey Jr': 0.6666666666666666,
'Brie Larson': 0.023809523809523808,
'Jude Law': 0.09523809523809523,
'Zach Galifianakis': 0.11904761904761904,
'Tom Holland': 0.11904761904761904,
'Michael Keaton': 0.023809523809523808}

For instance, Chris has the highest betweenness centrality amongst others, which means that his node acts as a bridge between nodes and influences the flow of this network.

PageRank

PageRank algorithm is a variant of eigenvector centrality which assigns scores based on nodes’ neighbor sand their neighbors’ neighbors.

nx.pagerank(G)
>>{'Tom Hiddleston': 0.09959071899664632,
'Chris Hemsworth': 0.14480484224961168,
'Robert Downey Jr': 0.23419151394355436,
'Brie Larson': 0.10333101666276434,
'Jude Law': 0.10247885385570858,
'Zach Galifianakis': 0.10415927270241394,
'Tom Holland': 0.10415927270241394,
'Michael Keaton': 0.10728450888688691}

With the PageRank algorithm, we can detect the most influential node of our network as Robert’s node.

All in all

Our social network example was so small with just 8 nodes and 10 edges, but by using the same codes you can analyze a larger graph. There are several sources like Stanford Large Network Dataset Collection where you can find network data to use in your applications. Even though it is a small network, we were able to see some concepts on this graph clearly. For instance, Robert Downey Jr is the most important node of this graph according to all mentioned centrality measures and we can say that he is the most popular actor amongst others.

The field of graph analytics is vast and has a lot of practical applications. The aim of this post was to give an insight to the reader about the centrality and graph analytics.

Graph Analytics Libraries

Some libraries that can be used for graph analytics are listed below:

References

Acknowledgment

Special thanks to Gökçe Uludoğan, Hakime Öztürk, and their great knowledge about movies with their help for creating a movie network :)

--

--