Improving Performance for Girvan-Newman Community Detection

Landon Wood
Published in smucs · Nov 17, 2021

Landon Wood is a third-year Computer Science and Mathematics student at Southern Methodist University. His hobbies include playing video games and spending time with his dogs.

Maria Harrison is a junior CS student also at Southern Methodist University. She likes AVL Trees but despises B+ trees.

Community Detection and the Girvan-Newman Algorithm

Community detection is one of the most important applications of graph theory. By iterating over a graph containing potentially thousands of nodes and edges, we can extract information showing which groups of nodes are more densely connected to one another than to the rest of the graph. This information is invaluable for real-world applications, such as social networks.

For our third CS 3353 project, we were tasked with implementing the Girvan-Newman algorithm to detect communities of nodes in an unweighted, undirected graph. The Girvan-Newman algorithm takes a divisive clustering approach, repeatedly identifying the most important edges in the graph and removing them. This repeats until there are no more edges in the graph.

How do we identify which edges are the most important? We use a concept called edge betweenness centrality. As defined by Girvan and Newman, an edge's betweenness centrality is "the number of the shortest paths that go through an edge in a graph or network." Each edge has its own betweenness centrality value, and we recalculate these values at each iteration of the algorithm.
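To make the idea concrete, here is a minimal sketch (not our project code) of computing edge betweenness for an unweighted, undirected graph, using the BFS-based path counting and dependency accumulation from Brandes' algorithm. The function name and adjacency-list representation are ours for illustration:

```cpp
#include <algorithm>
#include <map>
#include <queue>
#include <utility>
#include <vector>

// Edge betweenness for an unweighted, undirected graph: run a BFS from every
// source, count shortest paths (sigma), then walk the BFS order in reverse,
// giving each edge its share of the paths that pass through it.
std::map<std::pair<int, int>, double>
edgeBetweenness(int n, const std::vector<std::vector<int>>& adj) {
    std::map<std::pair<int, int>, double> bc;
    for (int s = 0; s < n; ++s) {
        std::vector<int> dist(n, -1);
        std::vector<double> sigma(n, 0.0), delta(n, 0.0);
        std::vector<std::vector<int>> pred(n);
        std::vector<int> order;  // vertices in BFS (non-decreasing distance) order
        std::queue<int> q;
        dist[s] = 0;
        sigma[s] = 1.0;
        q.push(s);
        while (!q.empty()) {
            int v = q.front(); q.pop();
            order.push_back(v);
            for (int w : adj[v]) {
                if (dist[w] < 0) { dist[w] = dist[v] + 1; q.push(w); }
                if (dist[w] == dist[v] + 1) {  // v lies on a shortest path to w
                    sigma[w] += sigma[v];
                    pred[w].push_back(v);
                }
            }
        }
        // Accumulate dependencies from the BFS frontier back toward s.
        for (auto it = order.rbegin(); it != order.rend(); ++it) {
            int w = *it;
            for (int v : pred[w]) {
                double share = sigma[v] / sigma[w] * (1.0 + delta[w]);
                int a = std::min(v, w), b = std::max(v, w);
                bc[{a, b}] += share;
                delta[v] += share;
            }
        }
    }
    // Each undirected pair (s, t) is counted from both endpoints.
    for (auto& kv : bc) kv.second /= 2.0;
    return bc;
}
```

On a small "barbell" graph of two triangles joined by a single bridge, the bridge edge carries every cross-community shortest path and therefore has the highest betweenness, which is exactly why removing it splits off a community.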

Here is the pseudocode for the Girvan-Newman algorithm:

1. For every edge in the graph, calculate the edge betweenness centrality.
2. Remove the edge with the highest edge betweenness centrality.
3. Calculate the betweenness centrality for every remaining edge.
4. Repeat steps 2-3 until there are no more edges left in the graph.

Our implementation of the Girvan-Newman algorithm is a slight modification of the original. We followed the idea proposed by Despalatović, Vojković, and Vukičević: instead of removing only the one edge with the highest betweenness centrality, we remove all edges with the highest betweenness centrality (as more than one edge can share the same maximum value). Our algorithm now becomes:

1. For every edge in the graph, calculate the edge betweenness centrality.
2. Remove all edges with the highest edge betweenness centrality.
3. Calculate the betweenness centrality for every remaining edge.
4. Repeat steps 2-3 until there are no more edges left in the graph.
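The modified step 2 can be sketched as a small helper. This is an illustrative fragment rather than our actual implementation; it assumes betweenness values are kept in a map keyed by edge:

```cpp
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

// Return every edge tied at the maximum betweenness value; the modified
// step 2 removes all of them in a single iteration instead of just one.
std::vector<Edge> edgesToRemove(const std::map<Edge, double>& bc) {
    std::vector<Edge> out;
    if (bc.empty()) return out;
    double best = std::max_element(bc.begin(), bc.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; })->second;
    for (const auto& [e, c] : bc)
        if (c == best) out.push_back(e);  // exact tie; a tolerance could be used
    return out;
}
```

Since the betweenness values come from floating-point arithmetic, a small tolerance when testing for ties may be safer than exact equality.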

The only difference is in step 2. However, this small tweak reduces the total number of betweenness calculations and iterations the algorithm must perform. Additionally, to further increase performance, we utilized all the cores of our processors. Both of our computers have 8-core Intel processors: an i7-9700 and an i9-series chip. In our IDE, CLion, the default settings do not use all of the CPU's cores. By changing an advanced setting, we can tell CLion to use every core, resulting in faster overall performance. The combination of these two improvements gave us a noticeable increase in execution speed; more on that below.

Results

Now, let's take a look at the results of our implementation. To test our algorithm, we used two graphs: a simple example graph we found online, and a graph built from a dataset of college football games played in the Fall 2000 season. The goal is to identify communities within these datasets. The simple graph has 6 vertices and 7 edges, and the football graph has 114 vertices and 612 edges.

Here is our algorithm’s output for the football conference dataset:

We can clearly see that the nodes in the graph are grouped into different communities. These communities represent which teams played each other and similar opponents during the season, indicating conference alignments.

Both the original and the improved Girvan-Newman algorithm produce the same resulting graph. However, as stated earlier, our improved algorithm computes that same result faster. Using the C++ std::chrono library to time our execution, we got the following data:
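As a rough sketch of how such timing can be set up with std::chrono (the helper name and the placeholder for the algorithm call are ours, not the project's code):

```cpp
#include <chrono>

// Time any callable and return elapsed milliseconds. steady_clock is
// monotonic, so it cannot jump backwards mid-measurement the way the
// wall-clock system_clock can.
template <typename F>
long long timeMs(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
        .count();
}

// Usage (girvanNewman is a stand-in for the actual algorithm call):
//   long long ms = timeMs([&] { girvanNewman(graph); });
```

Wrapping the whole run in one measurement like this captures every betweenness recalculation, which is where the two variants differ.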

As you can see, execution for the football graph is around 2.3 seconds faster when utilizing our improvements. For the simple graph, execution is very fast either way, but our improvements still shave off some time. For smaller datasets the difference is somewhat negligible, but the gap grows with graph size, as the football dataset already shows; on truly massive datasets we would expect an even greater improvement.

Unfortunately, we were unable to test on massive datasets. Although we attempted to time execution on massive datasets, such as those provided by the Stanford Network Analysis Project, our local computers are simply not powerful enough to process them in a timely fashion (that is, before the project deadline). Had we finished our project earlier, we would have liked to use SMU's ManeFrame II to see how our improved algorithm truly holds up.

A GitHub repository containing our implementation can be found here.
