Community Detection — Girvan Newman

Jeremywaibel
6 min readApr 14, 2022

--

Introduction

This serves as an introduction into our three part series on community detection algorithms. In this particular article, we will introduce community detection and the Girvan Newman algorithm. In part two, we will discuss the Louvain Method for community detection. Part three will serve as a comparison between these two algorithms.

Girvan-Newman Community Detection

Much of today’s computer systems and tools rely on networks. The organization of these networks is important for data analysis and so much more. When dealing with networks, we often organize the nodes into modules, clusters, or communities.

The ability to identify these communities through a process called community detection is arguably the most important and practical application of computer algorithms and graph theory. In our project, we created a graph to hold the nodes from a given dataset. In this example, the goal of community detection was to create communities that held densely linked edges while also having less edges spanning between community to community. Many different algorithms have successfully found communities in a network, one such algorithm is the Girvan-Newman algorithm.

For our project, we implemented the Girvan-Newman algorithm to detect communities from a given dataset ourselves.

Graphs

Before diving into the Girvan Newman algorithm, let’s cover some essential pieces of information that is needed to understand how it works.

What is a graph? A graph is a collection of nodes called vertices (V), connected by edges (E). In programming, they can be stored in an adjacency matrix (Left Representation) or in an adjacency list (Middle Representation).

In the image above, vertex 0 (or A) has an edge with vertices 1, 2, and 3. However, this does not mean that vertex 0 cannot reach vertex 7. Multiple edges can be put together to form a path from one vertex to another. In this case, to get from 0 to 7, the shortest path to take would be 0->3->7. More on shortest paths in the following section.

Shortest path — Breadth First Search (BFS)

In the example above, there are multiple ways to get from vertex 0 to vertex 7. The shortest path being 0->3->7, but other paths such as 0->2->6->7 exist. How do we know which path is the shortest? We can use a method called Breadth First Search.

A Breadth First Search is an algorithm that starts at a given vertex, and visits all the connected vertices before moving on to whatever is connected with that level.

We will first need to choose a starting vertex, let’s say 0. We know from either the adjacency list or adjacency matrix that 0 connects with 1, 2 and 3. Now we have visited 1, 2, and 3 and know that the shortest path is 0->1, 0->2, and 0->3 for these vertices. Now we repeat these steps with the vertices we visited, however it is crucial to not visit the same vertex a second time. Let’s say we look at vertex 1 and see that it connects to vertex 0, 4, and 5. We do not want to visit vertex 0 again, as we already started from there. The shortest path from 0 to 0 is NOT 0->1->0. Continuing the algorithm on 1 will allow us to determine that the shortest path from 0 to 4 is 0->1->4. We know that this is the shortest path because we visited the closest vertices to 0 first, then continued on the chain. We will then continue until all of the vertices are reached, or there are no more vertices to reach.

At the end of this algorithm’s execution, we will have the shortest path from 0 to all of the other vertices, but do make sure to take into consideration cases where a vertex cannot reach another vertex (not applicable in our example).

In the example given, the shortest path from 0 to 1 is 0->1 and the shortest path from 0 to 4 is 0->1->4. When we introduce edge betweenness later, it is important to understand that the edge 0->1 is counted multiple times as both part of the path for 0 to reach 1 and for 0 to reach 4.

How does Girvan-Newman actually find communities?

The Girvan-Newman algorithm works with undirected, unweighted graphs and heavily relies on the concept of edge betweenness to identify communities.

Girvan-Newman works in two major steps, repeating until no edges are left:

  1. Calculate betweenness of edges
  2. Remove the edge with the highest betweenness

It is important to note that betweenness of edges must be recalculated every time an edge is removed.

Computing Edge Betweenness

What is edge betweenness and how is it calculated? Edge betweenness is a measure of centrality of graphs. Note that we have to calculate the shortest path from one vertex to another. We have to answer the question “what path is traveled the most to get from one vertex to another”.

In the example above the edge between vertex 7 and vertex 8 is traveled the most. In order for any of the left side vertex to get to a right side vertex, this path must be traveled. Ex: 1->3->7->8->12 would be the shortest path for vertex 1 to reach vertex 12.

To compute the shortest path between nodes, we use a breadth-first search from all vertices, to all other vertices. There are many different methods to do so and to hold the results, but in the end we will have a list of which edges were traversed, and how many times they were traversed.

The edge that has been traversed the most will then be removed, since it has the highest measure of centrality. After removing the most-traversed edge, we will have to repeat this process until all of the edges are removed.

Selecting Number of Communities

Wait, the Girvan Newman algorithm removes all of the edges in the graph. How is this even useful to find communities then? If we have an graph with no edges, then we will have |V| communities and we did all of this computation when we could have just removed all of the edges. When do we stop?

One simple method of choosing when to stop is to break out of the algorithm when there are n distinct groups. This is only applicable when the number of groups are known, or else multiple different graphs must be produced to see how clustered the groups are.

Another more robust method of choosing when to stop is to calculate the modularity of the graph. For each group we compute the difference between the number of edges between a group and expected number of edges within a group. The graph that produces the highest modularity will be the best graph, with the most dense connections between the distinct groups.

Next Steps

Girvan-Newman is a popular and efficient community detection algorithm however, in search for the best algorithm we wanted to explore other community detection algorithms and believe the Louvain Method as an interesting algorithm to compare with Girvan-Newman. We will detail the Louvain Method’s technique and implementation in the following article.

-> https://medium.com/@hazeleroy/an-alternative-method-of-community-detection-850f9578fb4a <-

Check out our implementation of the Girvan-Newman algorithm https://github.com/smu-cs-3353/22s-pa03-girvan-newman-kaspereroy2

--

--