Photo by Merakist on Unsplash

Girvan–Newman — The Clustering Technique in Network Analysis Part 1

Jeffery chiang
Analytics Vidhya
Published in
5 min readOct 11, 2021

--

Girvan-Newman method separates the network based on the betweenness of the edges.

Introduction

Nowadays, the importance of Network Analysis is attracting people’s attention as the technology advances. Many systems take the form of network, and we can employ the network analysis in the network related data applications. For example, we can use the network analysis in the friend suggestion in the social media platform, like Facebook or Twitter. Also, we might be able to find potential customers based on the CRM (customer relationship management system).

Girvan-Newman method is one of the classic community clustering techniques. By using the algorithm, we are able to separate the network into communities, and the community detection can be used as a good start of data preprocessing.

Reference: Girvan M, Newman ME (2002) Community structure in social and biological networks. Proc Natl Acad Sci USA 99(12):7821–7826

Community Detection

Girvan-Newman will remove the edges with the largest edge betweenness in every iteration.

Edge betweenness: Number of shortest paths passing the edge.

Figure 1. GN example

As we can see from the picture, the edge {D,E} will have the largest edge betweenness. By removing the edge, it will form two communities.

Girvan-Newman Iteration:

First we need to compute the edge betweenness of every edge in the graph

  1. Select a node X, and perform BFS to find number of shortest path from the node X to each node, and assign the numbers as score to each node.
  2. Starting from the leaf nodes, we calculate the credit of edge by (1 + (sum of the edge credits to the node) )*(score of destination node / score of starting node)
  3. Compute the edge credits of all edges in the graph G, and repeat from step 1. until all of the nodes are selected
  4. Sum up all of the edge credit we compute in step 2 and divide by 2, and the result is the edge betweenness of edges.

Next, we remove the edges with the highest edge betweenness, and repeat until we find the good community split.

5. Remove the edges with the highest edge betweenness

6. Compute the modularity Q of the communities split

7. Repeat from step 1, if Q is greater than 0.3–0.7.

Modularity Formula

Example

We will take the graph below as example.

Figure 2. example graph

First, we select node H and find the number of shortest path from the node H to each of the node.

Figure 3. Step 1

We rearrange the graph into a tree-like shape, so it can be visualize easier.

It is obvious that there is only one shortest path from node G to H, which is {GH}. Therefore, we assign 1 to the node G as the node score.

Likewise, there are 2 shortest paths from node F, which are {FEH, FGH}.

The edge EG and AC can be ignored because their nodes are in the same depth in the BFS tree, and they won’t be taken into account in counting the shortest path.

Next, we need to calculate the credit of each edge starting from the leaf node (Node B) using the following formula, where the score is computed in step 1.

Figure 4. Step 2

We start from the leaf node (node B), there is no incoming edge to node B, we split the initial 1 into two edges, so BA = BC = 0.5. Following the formula, we can calculate each of the edge credit as in the figure above.

Next, we pick a different node and repeat the same process until all of the nodes are selected. Figure 5 shows when node G is selected.

Figure 5. Node G is selected

After all the nodes are selected, we sum up all of the edge credit we compute in step 2 (the red number in graph), and then divide by 2. Figure 6 shows the result edge betweenness centrality of the graph.

Figure 6. Edge Betweenness Centrality

As expected, DE has the highest edge betweenness, thereby we remove the edge DE, and forming two communities.

Next, we need to compute the modularity Q, and see if it is a valid partition. We repeat from step one using the updated graph if Q does not satisfy the modularity threshold.

Modularity Formula

, where

  • A_ij = 1 if the node i is connected to node j in the original graph
  • m = the number of edges in the original graph
  • k_i, k_j = degree of the node i, j in the updated graph

Conclusion

There are a lot of clustering methods, Girvan-Newman is one of the classic division clustering methods using network analysis. In practice, we are able to utilize the clustering technique to simplify the dataset, and further improve the model performance.

Thank you for reading, and wish you a great day.

--

--

Jeffery chiang
Analytics Vidhya

Data Science | Machine Learning | Mathematics | DevOps