Community Detection: Label Propagation vs. Girvan-Newman (Part 1)

Braiden Hook
smucs
Published in
6 min readApr 13, 2022

This article, where my partner Michael Doherty and I explore different community detection algorithms, has been split into two parts. This part explores community detection and offers a brief overview of the Girvan-Newman and Label Propagation algorithms. The second part quantitively compares the two algorithms and explores their advantages and disadvantages. The second article can be found below:

Community Detection: Label Propagation vs. Girvan-Newman (Part 2)

Community Detection

Communities play a significant role in a person’s life; from our familial communities and work communities to our friend communities and college communities, the communities we’re a part of define who we are. In our data driven society, we can pose the question: given a graph representing connections between different groups, can we deduce the community each group is a part of? Indeed we can, and the method to do so is called community detection.

Definitions

To start, I’ll define a graph and its properties as it relates to community detection. A graph is a collection of data points that are connected in some way. Each data point is represented by a node (or vertex), and a connection between two nodes is represented by an edge. For example, a graph could consist of a group of Facebook users, some of whom are friends with each other. Each Facebook user would be represented by a node, and an edge between two nodes would mean those two Facebook users are friends. For the purposes of this project, we dealt with only undirected and unweighted graphs. Undirected means all edges are bidirectional, and unweighted means all edges are equal in value. A neighbor node is a node that is connected to another node; for example, if an edge exists between node A and node B, then node B is node A’s neighbor node (and vice-versa).

For this project, two Community Detection algorithms were implemented: the Girvan-Newman algorithm and the Label Propagation algorithm.

Girvan-Newman Algorithm

The Girvan-Newman algorithm, developed by Michelle Girvan and Mark Newman, detects communities by determining the shortest path between every node and removing the edges with the most shortest paths passing through them. This breaks down the graph into smaller clusters, which are the communities the algorithm generates. The Girvan-Newman algorithm essentially consists of four steps:

  1. For every edge in the graph, calculate its betweenness (which is how many shortest paths pass through it)
  2. Remove the edge (or edges) with the highest betweenness from the graph
  3. Recalculate the betweenness for the remaining edges
  4. Repeat steps 2 and 3 until no edges remain (or the modularity is reached)

This process can be visualized below:

Our starting graph

Consider the above graph. The first step for the Girvan-Newman algorithm is to calculate the shortest path between every node, thus giving us the betweenness for each edge. After doing so, we end up with the following graph:

Our graph after calculating betweenness

As the above graph shows, the edge between node B and node D has the highest betweenness value. These values are gotten through this code below, which increments the betweenness of each edge for each time there is a different source node (looping through each vertex in the graph):

// Calculates edge betweenness
// edgePaths contain for each loop through true the vertices visited from the src vertices
// Then they add up the node score and edge credit from each edge from each node
// Which will then result in the total edge betweenness for each edge
// This will be computed for each vertex in the graph after being ran through BFS
for (int i = edgePaths.size() - 1; i > -1; i--){
auto levelNodes = edgePaths[i];

for (auto const& j : levelNodes){
auto edges = j.second;
for (int k = 0; k < edges.size(); k++){
double btwVal = vScore.at(j.first).second *
vScore.at(edges[k]).first /
vScore.at(j.first).first;
pair <Graph::vertex_descriptor,
Graph::vertex_descriptor> w;
w.first = j.first;
w.second = edges[k];
if (edge_centralities.count(w) == 0) {
w.first = edges[k];
w.second = j.first;
}
edge_centralities.at(w) += btwVal;
vScore.at(edges[k]).second += btwVal;

}
}
}

Thus, the algorithm removes the edge with the highest betweenness, resulting in the following graph:

Our graph after removing the edge with the highest betweenness

As the graph indicates, there are now two communities that have been formed, community C₁ consists of nodes A, B, and C, while community C₂ consists of nodes D, E, F, and G. Seems simple enough right? Well, this is the result of only one iteration of the Girvan-Newman algorithm. How does the algorithm know when to stop? That’s where the concept of modularity comes into play.

Modularity

Essentially, modularity measures how strong the current division of a graph into separate communities is. A graph with a high modularity has many connections between nodes within a community and few connections between nodes in different communities. The modularity is calculated using the following formula:

Essentially, we just calculated the number of edges within a group and subtracted the number of expected edges within that group from it. If the calculated modularity + 0.002 is less than the previous modularity, then the algorithm stops and the graph reverts back to the previous graph (as we want as high a modularity as possible).

Label Propagation Algorithm

The Label Propagation algorithm is another community detection algorithm; its community detection process involves initializing every node with a unique label and then iterating through every node, assigning each node the label that appears most frequently among its neighbor nodes. Label propagation essentially consists of five steps:

  1. Initialize each node with its own unique label (For example, initialize node 0 with label 0, node 1 with label 1, etc.)
  2. Randomize the order in which the nodes will be visited
  3. Calculate the most frequent label among the current node’s neighbor nodes, and set the current node’s label to be that label. Any ties are broken uniformly randomly
  4. Repeat steps 2 and 3 until every node has the label that most frequently appears among its neighbors

The process can be visualized below:

The Label Propagation process. Each new graph represents the updating of a node’s label

Ideally, the Label Propagation process would continue until no node changes its label; however, there are some graphs that will result in nodes having multiple labels tied for most frequent among its neighbors. In these situations, the node could repeatedly change its label after each iteration. This is illustrated below:

After the initial graph (t — 1) goes through Label Propagation, the graph essentially splits itself into two halves. It’s possible that the graph could repeatedly bounce back and forth between being graph t and graph t + 1 after each Label Propagation iteration.

Thus, Label Propagation stops when every node has the most frequent label among its neighbors (even if its tied for most frequent), preventing this issue from occurring.

So how do these two algorithms compare? In the second part of this article we examine the advantages and disadvantages of each algorithm, including their efficiency and accuracy when generating communities.

References

[1] M. Girvan and M. E. J. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences, vol. 99, no. 12, pp. 7821–7826, Jun. 2002, doi: 10.1073/pnas.122653799.

[2] Raghavan, Usha Nandini, Réka Albert, and Soundar Kumara. “Near linear time algorithm to detect community structures in large-scale networks.” https://arxiv.org/pdf/0709.2938.pdf

[3] “Girvan-Newman algorithm.” https://networkx.guide/algorithms/comm unity-detection/girvan-newman/

[4] “Write steps of Girvan- Newman Algorithm. Explain clustering of Social-Network Graphs using GN algorithm with example?” https://www.ques10.co m/p/42323/write-steps-of-girvan-newman-algorithm-explain-clu/

[5] Chiang, Jeffrey. “Girvan-Newman — The Clustering Technique in Network Analysis Part 2.” https://medium.com/analytics-vidhya/girvan-newman-the-clustering-technique-in-network-analysis-part-2-a62dfdde11e

Author

My name is Braiden Hook, a junior studying computer science at Southern Methodist University. I am currently testing the waters of AI and Machine Learning.

--

--

Braiden Hook
smucs
0 Followers
Writer for

Computer Science student at Southern Methodist University