An introduction to Graph Analysis and NetworkX

Published in

Data Reply IT | DataTech

17 min readJan 24, 2024

Introduction

In this article, we embark on a exploration of graph theory and the powerful NetworkX library. The goal is to provide you with a thorough introduction to the foundational principles of graph theory and showcase its real-world applications. Throughout the journey, we’ll delve into the intricacies of representing and manipulating graphs efficiently using the Python library, NetworkX.

Now, let’s embark on a journey into the fascinating realm of graphs.

Fundamentals of Graph Theory

Graph definition

From a theoretical perspective, a graph can be regarded as a mathematical and graphical formalism essential for representing binary relationships within a collection of elements. Structurally, a graph comprises a set of nodes (V), also known as vertices, and a set of potential connections between these nodes, referred to as edges (E). To illustrate, nodes may symbolize entities or objects in the real world, while edges represents the connections between these nodes. This fundamental concept allows us to utilize graphs for modeling different systems, ranging from social networks to computer networks, transportation systems, and beyond.

Different types of graphs

Graphs can be categorized into various types depending on the characteristics of their edges. Among the common classifications are directed graphs, also referred to as digraphs, where edges possess a specific direction, and undirected graphs, where edges lack directionality. Furthermore, graphs can be weighted, enabling edges to convey additional information or significance.

Directed Graph

A directed graph, is represented by a pair D = (V, E), where V is the set of vertices of D and E is the set of oriented edges of D. An edge is oriented if it has directionality. Being endowed with a directionality, it is composed of a head (represented by an arrowhead), which reaches a vertex inbound, and a tail, which leaves it outbound. Figure 1.1 shows an example of an oriented graph.

Undirected Graph

An undirected graph, is represented by a pair U = (V, E), where V is the set of vertices of U and E is the set of not-oriented edges of U. This means that an undirected graph is a set of vertices and edges where the u-v connection has the same meaning as the v-u connection. Figure 2 illustrates an example of an undirected graph.

Weighted Graph

Both directed and undirected graphs can be characterized by a weight on the edge. In this case it is called a weighted graph. Typically, the weight represent the strength of the connection. In real-world contexts, it is usual to associate a weight with the edges, as they provide a more accurate representation of the relationships in the graph. For example, depending on the context, the weight might represent the distance between two cities or the travel time, or even the intensity of a friendship or the affinity between people. In figure 3 is shown an example of a directed weighted graph.

Connections between graph

In a graph, nodes are connected if there is a path connecting them. However, if no path exists between two nodes, they are unconnected.
A path P, is a sequence of nodes (v0, v1, …, vn), connected to each other by a sequence of edges ((v0, v1), (v1, v2), …, (vn-1, vn)). The first and last nodes of the path, denoted as vertices v0 and vn respectively, are called extremes of the path. If the extremes of the path coincide, we are in front of a closed cycle (or closed path). An example of a closed path is shown in Figure 4. The length l(P) of path P is given by the number of edges in it.

Figure 4: Example of a closed cycle; in this example, the lenght of the path is 3; l(P) = 3

Moreover, the connection between vertices is a mathematical relationship and, in particular, it results in an equivalence relationship. In fact, the transitive, reflexive and symmetric properties are respected (although we won’t delve into their demonstration here).

There are other concept correlated to the graph we need to understand. For example, a subgraph S of graph G is a graph that contains a subset of nodes and edges of G. Instead, the concept of neighborhood of a vertex v, is the set of all adjacent nodes to v. Finally, two nodes are adjacent nodes if there is an edge that connects them. Similarly, two edges are adjacent edges if they share a common nodes.

Adjacency Matrix for graphs representation

Up to this point, our discussion has revolved around the conceptual understanding of graphs and their representation. It’s only natural to ponder: how does one represent this type of data structure for a computer? While you might initially visualize graphs as circles interconnected by arrows, there exists a formal method for representation — namely, Adjacency Matrix. For any given graph, its adjacency matrix A comprises a square binary matrix with the nodes of the graph serving as both row and column indices. In the matrix, at position (i, j), a value of 1 signifies the presence of an edge from vertex i to vertex j in the graph, and 0 indicates otherwise. Figure 5 illustrates an example of an adjacency matrix for an unweighted graph with nodes labeled from A to E.

Figure 5: Example of an adiacency matrix for a not weighted graph

Despite its simplicity, the adjacency matrix comes with computational disadvantages. In scenarios involving sparse graphs or graphs with numerous nodes, an alternative approach based on adjacency lists proves advantageous. This representation involves constructing, for each vertex, a list enumerating its neighbors.

Main metrics of centrality

Within this section, we will embark on a deeper exploration of the analytical facet of graph theory. Our focus will be on introducing key metrics employed to evaluate the significance of nodes, commonly known as centrality metrics. While the concept of centrality is intricately linked with importance, it’s crucial to acknowledge that importance can assume several meanings across various analytical perspectives. Figure 6, in fact, illustrates distinct types of importance within a graph.

Fiigure 6: An example of different centrality in a graph

Don’t be afraid for figure above, now we will understand together the different metrics.

Degree Centrality

Degree Centrality, associated with a node, is the number of incident edge on the node. In other words, it’s equal to the degree of the vertex (which is the number of the neighboring vertices).

In undirected graphs, a node’s degree simply counts the number of edges that connect to it. Matematically, for an undirected graph G = (V,E), the degree centrality deg(v) of a node v ∈ V is:

Nodes with high degree centrality may play critical roles in the network, as they have many connections and can influence the flow of information or resources.

In directed graphs, the concept of degree centrality can be split into two categories:

In-Degree Centrality: This metric measures how many incoming edges a node has. In social networks, for instance, a node with a high in-degree may be considered influential, as it receives a significant amount of attention or information from others.
Out-Degree Centrality: Out-degree centrality quantifies the number of outgoing edges from a node. In a directed network, nodes with high out-degree may be seen as sources of information , as they have a greater ability to disseminate content to other nodes.

Closeness Centrality

The Closeness Centrality metric is a measure of proximity and is calculated as the reciprocal of the sum of the lengths of the shortest paths between a given node and all other nodes in the graph.

If we define the shortest path between two nodes u, v ∈ V , on an undirected graph G, as d(u, v), then the Closeness Centrality D(v) of a node v is defined as:

The key point is that, the more central a node is according to this metric the closer it is to all other nodes. In network analysis, closeness centrality is crucial as it reveals nodes that are key to minimizing communication or interaction distances, ensuring more effective and rapid information exchange within the graph.

Betweenness Centrality

The Betweenness Centrality is related to the number of times a node finds itself along the shortest path between other pairs of nodes in the graph. It measures the strategicity of a node in the graph between two important areas of the same. In other words, it is able to identify bridge nodes in the network. Formally, given a graph G = (V, E) and considering two nodes s and t we can define Betweenness Centrality as following.

where the numerator represents the number of shortest paths between s and t that pass through v, and the denominator represents the number of shortest paths from s to t.

In real scenarios, for example a traffic network, the significance of Betweenness Centrality lies in its ability to identify key roads through which a substantial number of shortest routes traverse. This capability enables the enhancement of traffic management and the optimization of network performance.

Eigenvector Centrality

Eigenvector Centrality operates on the principle that a node’s significance is not only dictated by the quantity of its connections but also by the quality of those connections. In essence, a node gains importance if it is linked to other important nodes.

Imagine a social network where individuals with high Eigenvector Centrality aren’t only connected to many people; they are connected to those who, in turn, have power over a wide network of connections. These individuals might not have the most friends, but they are friends with the most influential people.

Formally, we can represent the Eigenvector Centrality E(v) associated to the node v, as following.

Where:

λ is the largest eigenvalue of the adjacency matrix.
Aij is the elements of the adjacency matrix, indicating whether there is a direct link between nodes i and j (1 if there is a link, 0 otherwise).
Xj is the elements of the eigenvector vector associated with the largest eigenvalue.

However, the computation of a node’s importance in Eigenvector Centrality involves evaluating the importance of all other nodes in the graph, making this metric computationally demanding. An approximation and enhancement of this concept are provided by PageRank.

Pagerank

PageRank was initially developed to individuate the most important pages within the web pages. It calculates the importance of each vertex in the graph relative to every other vertex. Generally, the importance of a vertex increase if it is referred by a lot numbers of vertices and if it refers to an important vertex.

To best understand the algorithm, it is fair to understand these assumptions.
An ipotetic surfer navigates through the graph, from vertex to vertex, in a long number of turns. The starting point can be anywhere (there is no starting constraint). At each turn, the navigator selects the next vertex from connections outside the current position with a damping (d) probability.
However, there is a (1-damping) probability that the navigator selects the next vertex randomly in the entire graph structure.

The pagerank of a node A is defined as following.

Where:

T1, …, Tn are pages direct to A.
d is dumping factor or dumping probability.
C(T) is the number of outgoing link of page T.

It’s important to know that Pagerank is widely used in various real-life situations beyond its original application in web page ranking. Its utility extends to various contexts, including but not limited to biological networks and fraud detection.

Real Life Applications

Transitioning from abstraction to practicality, we move beyond the theoretical aspect to explore the tangible applications of graphs. While graphs may initially appear as abstract concepts, seemingly distant from our everyday reality, this section aims to contextualize them across various domains and practical scenarios. Through this exploration, we’ll reveal how graphs play a pivotal role in numerous disciplines, influencing various aspects of our daily lives. In the ensuing discussion, we’ll delve into three examples among different applications where graphs demonstrate their significance.

Logistic sector and Google Maps

In the Transportation and Logistic sector, graphs play a key-role to optimizing routes, reducing costs, and reduce the movement of goods and people.

For example, in the route optimization, graphs are used to model transportation networks, including roads, railroads, and air routes. By treating a route segment as an edge and locations as nodes, graph algorithms can calculate the shortest or most efficient paths. This is especially important for GPS navigation systems or delivery services. The algorithms help find the fastest route from point A to point B, avoiding traffic congestion and saving time. If you’re thinking of Google Maps, you’re absolutely correct! Google Maps, at its core, relies on graphs and implements the route optimization paradigm.

When it comes to finding the shortest path, map services rely on graph theory. Take, for instance, the scenario where ‘S’ is the initial point, ’E’ is the target, and all places in between are considered nodes. These nodes are interconnected both horizontally and vertically through weighted connections, representing potential routes. Essentially, these nodes and edges collectively constitute a road network, epitomizing the directed graph we explored earlier.

Social Network Analysis

One of the most classical examples of graph utilization is evident in the analysis of social networks. Platforms like Linkedin, Facebook or Instagram are social networks. In this scenario, from a modeling perspective, each user is a node in the graph, and the connections between users are the edges. Graph algorithms are used to discover friendship dynamics, identify communities of interest, predict user behavior or suggesting potential matches.

According to the concept of metrics, in a social network, if a user has high degree centrality, it is very likely that the user is an influencer. Instead, if it have high betweenness centrality, it is possible that they act as a bridge between two communities discussing or dealing with different topics.

Fraud Detection

Fraud detection is a critical application of graph theory in the financial and cybersecurity sectors. Graphs are used to represent and analyze complex networks of transactions and behaviors to identify fraudulent activities. For example, in the scenario related to transaction networks, each transaction between accounts can be represented as an edge in a graph, while the accounts/entities themselves are nodes. By analyzing the flow of money and connections, graph algorithms can identify patterns indicative of fraud.

Staying in the same scenario, it is also possible to realize anomaly detection; for example, if a user’s transaction behavior suddenly changes, such as a strong increase in the number of transactions, this anomaly can be detected. Likewise, if a credit card is used in locations that are geographically distant in a short time frame, a graph algorithm could flag it as a potential anomaly.

While there are myriad other scenarios, the above overview provides a glimpse into some of the potential applications. As promised, we will get down more and more into the practical side and now we will look at the most widely used Python library for graph analysis.

Introduction to NetworkX

NetworkX is a Python library designed to facilitate the creation, manipulation, and analysis of complex networks and graphs. It offers a user-friendly and adaptable interface for building graphs and conducting a range of graph-related tasks.

To employ NetworkX, it’s necessary to install it through the “pip” package manager. This can be accomplished by entering the following command in your terminal or command prompt:

pip install networkx

After NetworkX installation, you can start using it in a Jupyter notebook or other type of Python script.

Graph creation from scratch

At the beginning, you need to import the NetworkX and also Matplotlib packages, for a graph visualization. We start to create simple graph using NetworkX from scratch. In the following snippet of code, you can create your first undirected graph from scratch.

# Import of necessary packages
import networkx as nx
import matplotlib.pyplot as plt
 
# Creation of an empty undirected graph
G = nx.Graph()

# Adding nodes
G.add_node(1)
G.add_nodes_from([2, 3, 4])

# Adding edges
G.add_edge(1, 2)
G.add_edges_from([(2, 3), (3, 4), (4, 1)])

# graph visualization
nx.draw_networkx(G, with_labels=True)

In the code above, after importing the packages already discussed, we create an empty undirected graph using NetworkX’s Graph class. We add four nodes with numeric IDs using both the add_node() function and the add_nodes_from() function. The difference is that the former function adds nodes individually, the latter adds them from a set of nodes, convenient if we have a list or tuple of nodes. Next and similarly, we create edges using the add_edge() and add_edges_from() methods with the same logic as before. Finally, we use the draw_networkx() function of NetworkX to draw the graph with labels.

The resulting graph is shown in the Figure 7.

Figure 7: result of creating the directed graph as per the code

Moreover, directed graphs, denoted as DiGraph in NetworkX, offer an alternative avenue. Simply by substituting “G = nx.Graph()” with “G = nx.DiGraph()” in the same code provided earlier, we obtain the graph shown in Figure 8, illustrating the directional connections.

# Creation of an empty directed graph
G = nx.DiGraph()

# Adding nodes
G.add_node(1)
G.add_nodes_from([2, 3, 4])

# Adding edges
G.add_edge(1, 2)
G.add_edges_from([(2, 3), (3, 4), (4, 1)])

# graph visualization
nx.draw_networkx(G, with_labels=True)

Figure 8: result of creating the undirected graph as per the code

Graph Analysis

After exploring the creation of a simple graph, let’s now delve into a more complex graph for conducting analyses relevant to the content of this article. To do this, we will use the famous Zachary’s karate club dataset, which originated from the paper “An Information Flow Model for Conflict and Fission in Small Group”. The resulting graph is an undirected graph, shown in Figure 9.

# Importing the dataset
G = nx.karate_club_graph()

# Statistics of the graph
print(f"Number of nodes: {len(G.nodes())}")
print(f"Number of edges: {len(G.edges())}")
average_clustering = nx.average_clustering(G)
print(f"Average Clustering Coefficient: {average_clustering}")
diameter = nx.diameter(G)
print(f"Graph Diameter: {diameter}")
density = nx.density(G)
print(f"Graph Density: {density}")
# Number of nodes: 34
# Number of edges: 78
# Average Clustering Coefficient: 0.57
# Graph Diameter: 5
# Graph Density: 0.14

# Plotting graph with Matplotlib and Networkx
plt.figure(figsize=(7,5))
nx.draw_networkx(G, with_labels=True, font_weight='bold')
plt.show()

In the provided code snippet, we import the dataset using Networkx’s nx.karate_club_graph(). After, we display the main statistics of the graph. In addition to the number of nodes and edges, we also calculate other indicators, which are easily interpreted through the expressiveness of the name of networkx methods. The clustering coefficient measures the tendency of nodes to create strongly connected groups. Diameter represents the maximum length of the shortest paths between all pairs of nodes, and density measures the fraction of existing connections relative to the maximum potential connections. In essence, our graph has a tendency to form local clusters, the maximum distance between nodes is relatively short, and the network is not strongly interconnected.

To enhance visualization, we integrate matplotlib, applying a distinct font to the nodes.

Figure 9: resulting graph of Zachary’s karate club dataset

Now, let’s proceed to apply and interpret the various centrality measures.

Degree Centrality Analysis

Below, we calculate degree centrality through the nx.degree_centrality() method. In addition, we sort the resulting dictionary by values so that we have a decreasing sorting of degree centrality.

# Compute degree centrality
degree_centrality = nx.degree_centrality(G)

# Sorting and printing the degree centrality
sorted_degree_centrality = sorted(degree_centrality.items(), key=lambda x:x[1], reverse=True)
print("Degree Centrality:", sorted_degree_centrality)

Degree Centrality:
[
(33, 0.51), 
(0, 0.48), 
(32, 0.36), 
(2, 0.30), 
(1, 0.27), 
 ... 
]

From the analyses, it emerges that node 33 stands out as the most influential node in the network. This suggests that node 33 holds the potential to influence and motivate all other nodes to the greatest extent.

Closeness Centrality Analysis

Below, we calculate the closeness centrality through the nx.closeness_centrality() method. After, the usual sorting is performed.

# Compute closeness centrality
closeness_centrality = nx.closeness_centrality(G)

# Sorting and printing the closenees centrality
sorted_closeness_centrality = sorted(closeness_centrality.items(), key=lambda x:x[1], reverse=True)
print("Closeness Centrality:", sorted_closeness_centrality)

Closeness Centrality:
[
(0, 0.56),
(2, 0.55),
(33, 0.55),
(31, 0.54),
(8, 0.51), 
...
]

In this instance, node 0 is found to be “close” to numerous other nodes in the network. This indicates a high level of social accessibility and proximity to other club members.

Betweenness Centrality Analysis

Below, we calculate the closeness centrality through the nx.betweenness_centrality() method. After, the usual sorting is performed.

# Compute betweenness centrality
betweenness_centrality = nx.betweenness_centrality(G)

# Sorting and printing the closenees centrality
sorted_betweenness_centrality = sorted(betweenness_centrality.items(), key=lambda x:x[1], reverse=True)
print("Betweenness Centrality:", sorted_betweenness_centrality)

Betweenness Centrality: 
[
(0, 0.43), 
(33, 0.30), 
(32, 0.14), 
(2, 0.14), 
(31, 0.13),
...
]

Node 0, besides being the closest to all, also functions as a bridging node, serving as a social mediator. It likely plays a pivotal role in handling conflicts between different segments of the club, contributing to the maintenance of social cohesion. It’s worth noting that, in general, nodes with high betweenness also experience higher “stress” in the network.

Eigenvector Centrality Analysis

Below, we calculate the closeness centrality through the nx.eigenvector_centrality() method. After, the usual sorting is performed.

# Compute eigenvector centrality
eigenvector_centrality = nx.eigenvector_centrality(G)

# Sorting and printing the eigenvector centrality
sorted_eigenvector_centrality = sorted(eigenvector_centrality.items(), key=lambda x:x[1], reverse=True)
print("Eigenvector Centrality:", sorted_betweenness_centrality)

Eigenvector Centrality: 
[
(0, 0.43763528138528146), 
(33, 0.30407497594997596), 
(32, 0.145247113997114), 
(2, 0.14365680615680618), 
(31, 0.13827561327561325),
...
]

Furthermore, according to this metric, node 0 emerges as the most significant. Its importance arises from its association with individuals considered influential or respected by others. Given the unique topology and modest size of the club, the proximity of node 0 to all nodes, coupled with its role as a bridge, amplifies its significance, particularly due to its proximity to crucial nodes, such as node 33.

Pagerank Analysis

Below, we calculate the pagerank centrality through the nx.pagerank() method. After, the usual sorting is performed.

# Compute pagerank centrality
pagerank = nx.pagerank(G)

# Sorting and printing the Pagerank centrality
sorted_pagerank = sorted(pagerank.items(), key=lambda x:x[1], reverse=True)
print("Pagerank Centrality:", sorted_pagerank)

Pagerank Centrality: 
[
(33, 0.09698041880501741), 
(0, 0.08850807396280012), 
(32, 0.07592643687005646), 
(2, 0.06276686454603017), 
(1, 0.057414840497110056),
...
]

While Pagerank can be customized and its parameters adjusted in practical scenarios (though this isn’t the focus of the article), in this example, we observe how the resulting ranking aligns closely with that obtained through eigenvector centrality, as discussed in the theoretical section.

Community detection

Communities in a network refer to sets of nodes that demonstrate a higher density of connections among themselves compared to the rest of the network. This concept stems from the notion that individuals within a community engage in more frequent and meaningful interactions than those outside of it. Among the most widely used implementation approaches, Louvain’s algorithm and label propagation emerge as effective tools for detecting clusters within a social network. Below is an example for detecting communities.

# community detection with Louvain Community Detection Algorithm
communities = nx.community.louvain_communities(G)

# print all the communities
for i, community in enumerate(communities):
    print(f"{i+1}° Community : {community}")

In the code, the louvain_communities() method is used to implement community detection with the related algorithm. Next, the set of nodes identifying each detected community, in our case four, is printed.

1° Community : {1, 2, 3, 7, 12, 13}
2° Community : {0, 4, 5, 6, 10, 11, 16, 17, 19, 21}
3° Community : {23, 24, 25, 27, 28, 31}
4° Community : {32, 33, 8, 9, 14, 15, 18, 20, 22, 26, 29, 30}

End Note

In this article, we delved into graph analysis, beginning with theoretical concepts and gradually transitioning to a practical approach. We explored typical scenarios and demonstrated a simple networkx demo to illustrate the ease and potency of creating, manipulating, and utilizing graphs

Resources