smucs
Published in

smucs

Girvan Newman Algorithm — Community Detection in Network (Part 2)

Ziyu Sun is a senior majoring in computer science and minoring in civil engineering at Southern Methodist University.

Paige Weng is a second-year computer science student at Southern Methodist University.

Introduction

The Girvan Newman Algorithm is one of classic community detection algorithms proposed by Michelle Girvan and Mark Newman, which belongs to the split hierarchical clustering algorithm. This algorithm removes the edge with the longest betweenness by computing the betweenness of all edges in the graph. The process of calculating betweenness and removing edges is repeated until all edges are removed. In 2004, Newman proposed the weighted Girvan Newman algorithm, which references the concept of Q-function to help the Girvan Newman algorithm distinguish more stable communities.

If you don’t understand communities, complex networks and betweenness etc. It is recommended that you head to the previous article we wrote, it should help you.

Before looking at how to improve the stability of the community divided by the Girvan Newman algorithm, let us first understand the definition of modularity.

Modularity

The modularity function is a global objective function proposed by Newman and Girvan for evaluating community partitioning. There will be some tightly connected regions in the network, these regions are called communities. The main feature of the community is that the internal connection is tight, while the external is relatively sparse. Community detection groups nodes into communities and determines the accuracy of communities by computing modularity. The higher the degree of modularity, the higher the accuracy of the community.

The following is the equation for modularity:

Photo by Wikipedia
  • Where m is the total number of edges in the graph.
  • The degree of any node is k, and kv is the degree of vertex v.
  • Avw is the adjancency matrix, it shows the connection between vertices v and w. Avw will be 1 if vertices v and w are connected, and 0 otherwise.
  • The blue part is called the Kronecker delta function, it will be 1 if vertices v and w are in the same community, and 0 otherwise.

Q is the degree of modularity. The greater the degree of modularity, the better the effect of community division. The range of Q values is [-0.5, 1). Clustering works very well when the Q value is between 0.3 and 0.7.

Computing modularity requires the Kronecker delta function, but how do we know if two nodes are in a community?

Ways to divide the community

Community: A community is defined as a subset of nodes in a graph, and any two nodes in the subset can be connected to each other. A network can have multiple communities, that is, a graph can have multiple subsets. An independent node can also be seen as a community.

First, we start Breadth-first search by choosing an arbitrary node from the graph as a starting node, recording all node of direct or indirect paths to other nodes, and these nodes are in the same community. Nodes of a community are stored in a vector and all communities are stored in a 2D vector.

We then exclude passed nodes and repeat the previous step for the remaining nodes until all nodes are recorded in the community vector.

Finally, to judged whether the two nodes belong to a community, we just need to look at the 2D vector used to store the node community. We check if the two nodes are in the same vector, that is, the same row.

Basic Flow of Girvan-Newman Algorithm with Q-Function

The following is the pseudocode of the Girvan-Newman algorithm with Q-Function:

1. Calculate the edge betweenness of all edges in the network;

2. Find the edge with the highest betweenness and remove it from the network;

3. Calculate the modularity of the network after edge removal;

4. Repeat steps 2 and 3 until the degree of modularity is greater than 0.5.

Girvan-Newman Algorithm Implementation:

The graph below is a small dataset we ran, a network consisting of 21 nodes and 30 edges.

Before implementing the Girvan-Newman algorithm

After running the Girvan-Newman Algorithm to remove edges until the Q value is greater than 0.5, the remaining network with 21 nodes and 25 edges is shown in the figure below. Different colors represent different communities.

After implementing the Girvan-Newman algorithm

Analyze the Advantages and Disadvantages

The advantage of the Girvan-Newman algorithm is that it iterates the entire network, so it can provide more accurate community differentiation results. The disadvantage is that the time complexity is large, because the entire network is iterative. At the same time, we must also know the number of communities or the modularity of computation to determine the accuracy of community differentiation.

Summary

At the end, by testing on the Football Conference 2000 dataset and other randomly generated datasets, we find that the Girvan-Newman algorithm provides high community accuracy. However, the time complexity is greatly increased due to too many repeated computations in the operation. If the network has m edges and n nodes, the time complexity of the Girvan-Newman algorithm will be O(n³). This algorithm is very time-consuming and is not suitable for large networks with many nodes and edges.

We are still looking for ways to optimize the Girvan-Newman algorithm and reduce the time complexity.

Thanks for reading, I hope this article will be helpful to you!

Code Implementation

If you need the code of our c++ Girvan-Newman algorithm implementation, please check our GitHub.

Test Dataset

The following link is the Football Conference 2000 dataset used for testing, http://www-personal.umich.edu/~mejn/netdata/

References

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store