Comparing the Performance of Community Detection Algorithms in Python’s NetworkX and C++’s Boost Graph Library

Published in

smucs

4 min readApr 30, 2022

Paige McFarlain is a Sophomore at Southern Methodist University, pursuing a Bachelor of Science in Computer Science and a minor in Business.

I was recently tasked with implementing the Girvan Newman algorithm for community detection relying primarily upon C++’s Boost Graph Library and during my countless hours of desperate googling, I realized I was asking a question that no one cared enough to answer. No one was going out of their way to implement Girvan Newman in C++ because it had already been done in a shinier, newer language: Python.

What is the Girvan-Newman algorithm?

The Girvan-Newman algorithm is a community detection algorithm developed by Michelle Girvan and Mark Newman. The primary functionality of the algorithm is centered around finding edges that are most common in between communities and removing said edges from the network. This algorithm relies heavily on the concept of edge betweenness, a measurement of how likely a given edge is to be found between the shortest paths from one point in the network to another. The measurement of edge betweenness is not authoritatively decided upon but in my implementation of the Girvan-Newman algorithm, I decided to go with a more basic heuristic. To calculate edge betweenness, my algorithm uses Breadth-First Search to identify the nodes used in the shortest paths from any given node on the map to another. With this data, it then identifies which edge is most commonly used in these shortest paths by traversing the shortest paths and counting each edge that is used. The edge that is featured the most is hereby removed in accordance with the Girvan-Newman algorithm and the process repeats.

Another important concept to the Girvan-Newman algorithm is that of modularity. If one follows the Girvan-Newman algorithm through to its entirety, they will be left with a disconnected cluster of nodes. To decide when to pause the algorithm to ensure the ideal clustering of communities, one must calculate a modularity value. The best equation for this number is also somewhat contested so I decided to implement my own (albeit limited) heuristic for this value. I believe that an ideal stopping point for the Girvan-Newman algorithm is when it begins to stabilize. The Girvan-Newman algorithm focuses on removing edges that may be deemed outliers in a given community. Take this simple graph below for example.

Edges (A, B) and (E, F) have only 1 shortest path passing through them. Whereas edge (C, D) has 9, more than double any of the other edges. This makes (C, D) an outlier. Thus, following the Girvan-Newman algorithm, this edge is removed resulting in the graph below.

As you can see, each of the edges have identical edge betweenness. This is what I would consider a “stable” graph. To determine the stability of a graph, I keep track of the minimum and maximum edge betweenness values. When the minimum edge betweenness divided by the maximum edge betweenness equals 1, the graph is 100% stable and the algorithm can exit.

Of course, not every situation is quite so simple. For most graphs, there is no equilibrium for them to reach. For others, the equilibrium lies only in impossibly small communities. So through trial and error, and comparisons against pre-existing community data [1], I determined that the ideal stability lies around 70%.

So clearly, I made quite a few changes to the average heuristic followed in a Girvan-Newman implementation. After these changes, I was curious to see how my implementation performed against Python’s NetworkX Girvan-Newman algorithm.

While the Python implementation consisted of 30 lines of code including code to output a graph visualization, my C++ implementation spanned 3 files with about 300 lines of code altogether, not including the Python file needed to convert GML to GraphML.

Python, however, performed consistently faster and more accurate than my algorithm. This is not even a disappointing realization, however, considering the power of technology that can perform such quick and accurate calculations with minimal effort.

References:

[1] http://www-personal.umich.edu/~mejn/netdata/

Comparing the Performance of Community Detection Algorithms in Python’s NetworkX and C++’s Boost Graph Library

Written by Paige McFarlain