Why Girvan-Newman?

Trevor Dohm · Published in smucs · Apr 10, 2022

Before we delve into the fundamentals and shortcomings of the Girvan-Newman Algorithm, note that this article is one of two parts: Gabriel Mongaras and I researched and analyzed several implementations of community detection, one of which is discussed below. His article, “Community Detection With Neural Networks”, is linked here:

Community Detection With Neural Networks

In this article, I will be covering the fundamental ideas behind community detection. See our GitHub Repository for the actual implementation:

Girvan-Newman Spring 22 Project 03

The Community Detection Problem:

Community Detection refers to the automated discovery of highly interconnected groups of vertices in a graph. This process depends on many factors, but for our purposes, we will be looking specifically at unweighted, undirected graphs and classifying their vertices into different communities.

For example, see the graph below:

https://towardsdatascience.com/community-detection-algorithms-9bd8951e7dae

In this graph, the communities are split into different sections, signified by the colors given to the groups of vertices. It is important to note that there exist edges between the communities, but this isn’t always the case.

There are many reasons why one would want to use community detection, but they will not be discussed in this article. Rather, we aim to look at one solution to this problem: the Girvan-Newman Algorithm.

The Girvan-Newman Algorithm:

With that said, let’s turn our attention to the focus of this article. What exactly is this algorithm, and how does it discover these clusters of vertices?

Developed by Michelle Girvan and Mark Newman, this algorithm relies on iteratively removing the edge that lies on the greatest number of shortest paths between pairs of vertices, a quantity known as edge betweenness, or betweenness centrality. By removing these edges one by one, the graph breaks apart into pieces, or communities, since edges inside a community carry fewer shortest paths, and therefore have lower betweenness, than the edges bridging communities.

The four main steps of this algorithm are described below, followed by a short code sketch of the loop:

1. Calculate betweenness for all edges in the network.

2. Remove the edge with the highest betweenness.

3. Recalculate betweennesses for all remaining edges.

4. Repeat from step 2 until no edges remain.
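To make these steps concrete, here is a minimal sketch of the loop in Python. It assumes the networkx library is available and uses its built-in karate club graph as a stand-in example; our actual implementation computes betweenness by hand and differs in its details.

import networkx as nx

def girvan_newman_pass(G):
    # Step 1 / Step 3: calculate betweenness for every remaining edge
    scores = nx.edge_betweenness_centrality(G)
    # Step 2: remove the edge with the highest betweenness
    top_edge = max(scores, key=scores.get)
    G.remove_edge(*top_edge)
    return top_edge

# Step 4: repeat until no edges remain, recording the communities after each pass
G = nx.karate_club_graph()  # stand-in example graph (any unweighted, undirected graph works)
splits = []
while G.number_of_edges() > 0:
    removed = girvan_newman_pass(G)
    communities = list(nx.connected_components(G))
    splits.append((removed, communities))
    print(f"Removed {removed}, now {len(communities)} communities")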

Ok, this is a lot to take in. Let’s look at an example of this actually working.

Starting Graph with Calculated Betweenness

The numbers above the edges in this graph represent the betweenness of each edge. Since the edge (C, D) has the largest betweenness, by step 2 we remove it from the graph and recalculate the betweenness for the remaining edges.

Graph after Edge Removal and Recalculation of Betweenness

As can be seen, two communities have clearly formed from the initial graph. So, that’s the entire algorithm, right? Well, there is a lot more behind the algorithm and the calculations needed to produce these values. And we still haven’t even mentioned the stopping criterion.

Note: For the purposes of this explanation, we will be looking at the graph seen below, since there are many different visualizations of the same graph that will be explored in the following sections:

Simple, Undirected, Unweighted Graph

Edge Betweenness:

The betweenness of an edge (A, B) is defined as the number of pairs of nodes whose shortest path passes through (A, B). When a pair of nodes has more than one shortest path, the credit for that pair is split among those paths, so an edge can receive fractional credit. As discussed before, a high score is bad and will cause the edge to be removed from the graph, which in turn suggests that A and B do not belong to the same community. To calculate this value, we perform a Breadth-First Search, or BFS, from each node. The best way to visualize this process is with a Directed Acyclic Graph, or DAG, as shown below:

DAG Visualization of Initial Graph

This visualization of the graph allows us to perform BFS, since each DAG edge, that is, each edge between two different levels of the graph, will be part of at least one shortest path from the given root. We can think of these edges as having parents and children, as shown in the graph above, and we can label each node by the number of shortest paths that reach it from the root. Note that while node G is labeled with 2 at this step, since two shortest paths reach it, it still begins the next part of the process with a credit of 1, like every other node.

From this point, there are many rules and calculation steps needed to find the true credit of each edge and node, but, for the sake of simplicity, we can think of starting from the leaves and gradually moving up the levels, calculating the credit of each node and edge from the credits of the nodes and edges below it.

DAG With Calculated Edge / Vertex Credits

The idea here is that we look at each node individually and at its shortest path to the root node E. For example, node A has only one shortest path to the root, A to B to D to E, so each edge along it, (A, B), (B, D), and (D, E), is credited with 1, and the intermediate nodes B and D gain 1 as well. Note how we only move from leaf to root, so that the same edge is never credited twice for the same shortest path. Now let’s look at a more interesting node. For a node with two shortest paths to the root E, such as node G, each path gets only a fraction of the credit: since a path from G to E can go two ways, both of the edges touching G are credited with 1/2 rather than a full credit. This same splitting is why the credits of the vertices and edges between G and the root end in 1/2. See the full process below:

Assume each node begins with a credit of 1. Starting from the DAG above, we look at leaf A first. Calculating A causes (A, B), B, (B, D), D, and (D, E) to increase by 1. C is similar, as (C, B), B, (B, D), D, and (D, E) all increase by 1. Now we look at B, since A and C are fully calculated. Calculating B increases (B, D), D, and (D, E) by 1. So here is what has been done so far:

A = C = (A, B) = (C, B) = 1, B = 3, (B, D) = 3, D = 4, (D, E) = 3.

Now, let’s look at the rest of the graph. Calculating node G causes (G, D), D, (D, E), (G, F), F, and (F, E) all to increase by 0.5, since there are two shortest paths from G to E and the credit gets split between them. Next, calculating D and F works just like before, with (D, E) and (F, E) each increasing by 1. So,

G = 1, (G, D) = (G, F) = 0.5, D = 4.5, F = 1.5, (D, E) = 4.5, (F, E) = 1.5

We notice that these values are identical to the ones calculated above.
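This leaf-to-root bookkeeping can be written as a short function. The sketch below is a hand-rolled version of the per-root credit step, using an adjacency list reconstructed from the example graph; it is an illustration rather than our repository's exact code. Running it with E as the root reproduces the numbers above.

from collections import deque

def credits_from_root(adj, root):
    # BFS down from the root: distance and number of shortest paths to each node.
    dist, paths, order = {root: 0}, {root: 1}, []
    queue = deque([root])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in dist:                  # first time w is reached
                dist[w], paths[w] = dist[v] + 1, 0
                queue.append(w)
            if dist[w] == dist[v] + 1:         # v is a DAG parent of w
                paths[w] += paths[v]
    # Walk back from the leaves, splitting each node's credit among its parent edges.
    node_credit = {v: 1.0 for v in order}
    edge_credit = {}
    for w in reversed(order):
        for v in adj[w]:
            if dist.get(v) == dist[w] - 1:     # (v, w) is a DAG edge
                share = node_credit[w] * paths[v] / paths[w]
                edge_credit[tuple(sorted((v, w)))] = share
                node_credit[v] += share
    return edge_credit

# Adjacency list reconstructed from the example graph above (nodes A through G).
adj = {
    "A": ["B"], "C": ["B"], "B": ["A", "C", "D"],
    "D": ["B", "E", "G"], "G": ["D", "F"],
    "F": ["G", "E"], "E": ["D", "F"],
}
for edge, credit in sorted(credits_from_root(adj, "E").items()):
    print(edge, credit)
# ('A', 'B') 1.0, ('B', 'C') 1.0, ('B', 'D') 3.0, ('D', 'E') 4.5,
# ('D', 'G') 0.5, ('E', 'F') 1.5, ('F', 'G') 0.5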

Whew! Now we are finally done with node E. However, to complete the betweenness calculation, we have to repeat this calculation for every node as the root and sum the contributions. We then divide by 2 to get the true betweenness for each edge, since every path is explored twice here.
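If networkx is available, the full calculation (every root, contributions summed, each pair counted once) can be checked directly with its built-in edge betweenness function. The edge list below is again reconstructed from the example graph.

import networkx as nx

# Same example graph, rebuilt as a networkx Graph.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("C", "B"), ("B", "D"),
                  ("D", "E"), ("D", "G"), ("G", "F"), ("F", "E")])

# normalized=False keeps the raw pair counts; networkx already splits credit
# between multiple shortest paths and counts each unordered pair once.
betweenness = nx.edge_betweenness_centrality(G, normalized=False)
for edge, score in sorted(betweenness.items(), key=lambda item: -item[1]):
    print(edge, score)
# (B, D) comes out on top, so it is the first edge the algorithm removes.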

Once we do all of this, we end up with the final calculated graph below (which you may note has similar edge credits to the DAG above).

Final Betweenness Credit Calculations in Original Graph

All we have to do from here is remove the edge with the highest credit (betweenness), which in this example happens to be edge (B, D). This leaves us with two communities, which would, in theory, be a good stopping point. However, how do we decide when to terminate the iteration of the algorithm? If we let it go on for too long, we will end up with something like this:

What is Left When the Algorithm Runs Too Long

Well, this is where modularity comes into the picture.

Modularity:

In order to correctly discover the communities in a given graph, we must stop the removal process at the right time so that the community detection algorithm works well. To do so, we calculate a value known as Modularity, which measures the quality of a given split into communities; the removal process is stopped where Modularity is maximized. The theory behind it uses eigenvectors and matrix mathematics, as in the partition shown below.

Eigenvector-Based Modularity Determination Partition

So, how do we actually calculate Modularity?

Well, it’s complicated. Perhaps looking at code will help us understand:

sum1 = 0
sum2 = 0

# Iterate over Nodes in Old Graph (i)
for i in list(oldG.nodes):

    # Get the Neighbors of Node i
    neighbors_i = list(G.neighbors(i))
    k_i = len(neighbors_i)

    # Find the Communities Associated with Node i
    comm = []
    findCommunities(G, i, comm)

    # Iterate over Nodes in New Graph (j)
    for j in list(G.nodes):

        # Get the Neighbors of Node j
        neighbors_j = list(G.neighbors(j))
        k_j = len(neighbors_j)

        # Find Number of Edges Between Node i and Node j
        A_ij = 1 if ((i, j) in G.edges or (j, i) in G.edges) else 0

        # Calculate B Value
        B = A_ij - (k_i * k_j) / (2 * m)

        # Calculate s Value: (s_i * s_j) + 1
        s = 1 if j in comm else -1
        s += 1

        # Compute Final B Values for This Iteration
        B_1 = B * s
        B_2 = B

        # Sum B Values
        sum1 += B_1
        sum2 += B_2

# Compute the Final Q Value
Q = (1 / (2 * m)) * (0.5 * sum1 - sum2)

The actual process occurring here is as follows:

# Process:
# - B_ij = A_ij - (k_i * k_j) / 2m
# - A_ij = Number of Edges Between Node i and j (1 or 0)
# - k_i = Degree of Node i
# - k_j = Degree of Node j
# - m = Number of Edges in Old Graph
# - s_i * s_j = 1 if i and j are in the Same Group, -1 Otherwise
# - sum1 = The First Summation: sum_ij[ B_ij * (s_i * s_j + 1) ]
# - sum2 = The Second Summation: sum_ij[ B_ij ]

Thankfully, the code makes what is actually going on much clearer than the definition of Modularity shown above. To find Modularity, we iterate over each node, its neighbors, and its predicted community, all of which tell us about the state of the removal. However, this still doesn’t go in depth on the mathematics behind the method; if you want to know more about why it works, please refer to our sources, since that is beyond the scope of this article.
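To make the stopping rule concrete, here is a rough sketch that runs the removal loop on the example graph from earlier and keeps the split with the highest Modularity. It leans on networkx's built-in modularity function as a stand-in for the hand-rolled Q above, so it illustrates the idea rather than our repository's implementation.

import networkx as nx
from networkx.algorithms.community import modularity

# Example graph from earlier in the article.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("C", "B"), ("B", "D"),
                  ("D", "E"), ("D", "G"), ("G", "F"), ("F", "E")])

original = G.copy()  # modularity is always measured against the full graph
best_Q, best_partition = float("-inf"), None

while G.number_of_edges() > 0:
    # Steps 1-3: remove the current highest-betweenness edge.
    scores = nx.edge_betweenness_centrality(G)
    G.remove_edge(*max(scores, key=scores.get))

    # Score the resulting split and remember the best one seen so far.
    partition = list(nx.connected_components(G))
    Q = modularity(original, partition)
    if Q > best_Q:
        best_Q, best_partition = Q, partition

print(best_Q, best_partition)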

Final Notes:

So what have we learned from looking at this process? Well, Girvan-Newman is really, really slow: recomputing betweenness after every removal gives a worst-case running time of roughly O(m²n) for a graph with m edges and n vertices. In our implementation, we optimized many of the steps, for example by working on subsets of the original graph, but the overall time to discover communities is still considerable, and the accuracy of the algorithm is often not worth the time commitment. And just imagine how long it would take to compute something like this:

What Is This Nonsense…

So, what other ways can we implement community detection, so that we can increase accuracy or cut down the overall time that the algorithm takes?

In part two of this article, we look at a Neural Network implementation of community detection, and the upsides and downsides of doing so.
