Community Detection in Networks — Taking edge betweenness centrality two steps further — Pt. 1

Published in smucs · Apr 30, 2022

By: Josh Hascall and Alex Shockley

Girvan-Newman algorithm in action! Source: football dataset

We are always interested in improving and optimizing everything we do, especially our code. We got some particularly interesting results while trying to improve one specific aspect of the Girvan-Newman algorithm. Ultimately, we wrote three algorithms for the same task, calculating edge betweenness centrality, and tested them for speed and efficiency. What we found shocked us.

When we wrote our first custom implementation of the Girvan-Newman algorithm, we were particularly fascinated, and stumped, by one of the most critical inputs of the algorithm and the best way to calculate it. Edge betweenness centrality, while a mouthful, is also a relatively complicated concept in networks. Edge betweenness centrality is a measure of how important an edge is in connecting all the nodes together. More mathematically, the edge betweenness centrality of e is defined as:

c_B(e) = Σ_{s,t ∈ V} 𝞂(s,t|e) / 𝞂(s,t)

Source: Networkx

where V is the set of nodes, 𝞂(s,t) is the number of shortest (s,t)-paths, and 𝞂(s,t|e) is the number of those paths passing through edge e.
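
For a concrete feel for these numbers, networkx computes exactly this quantity out of the box. A minimal sketch (the karate club graph here is just a stand-in for any dataset):

```python
import networkx as nx

# Any graph works; the karate club graph is just a convenient stand-in.
G = nx.karate_club_graph()

# networkx computes the definition above for every edge.
ebc = nx.edge_betweenness_centrality(G, normalized=True)

# Edges that lie on many shortest paths between node pairs score highest.
for edge, score in sorted(ebc.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(edge, round(score, 4))
```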

What we were most interested in was seeing how algorithms for calculating this figure would perform, in terms of both time and accuracy, as datasets became randomized and sufficiently large. Since both of us are interested in large social networks, we were curious how to analyze and detect communities within these networks without requiring an infeasible amount of computing power. The three iterations of this calculation, and our analysis of them, were an attempt to find what would work best in the real world.

When we started implementing our first Girvan-Newman iteration, we tried to brute-force our way to calculating edge betweenness centrality. We devised a breadth-first search (BFS) approach that enumerated the shortest paths between every pair of nodes in the graph and credited each edge with its share of them.
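
In rough Python terms, the brute-force idea looks something like this (a simplified sketch rather than our exact implementation; the function name is ours):

```python
from collections import defaultdict
from itertools import combinations
import networkx as nx

def brute_force_edge_betweenness(G):
    """For every pair of nodes, enumerate all shortest paths (a BFS-style
    search for unweighted graphs) and credit each edge with the fraction
    of those paths that pass through it."""
    centrality = defaultdict(float)
    for s, t in combinations(G.nodes, 2):
        try:
            paths = list(nx.all_shortest_paths(G, s, t))
        except nx.NetworkXNoPath:
            continue  # a disconnected pair contributes nothing
        for path in paths:
            for u, v in zip(path, path[1:]):
                centrality[tuple(sorted((u, v)))] += 1.0 / len(paths)
    return dict(centrality)
```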

The pros of this method were that it was easy to understand and quick to implement; you are also guaranteed to get the correct centrality numbers because you have checked every node. However, this method for calculating edge betweenness centrality is pretty terrible. As you can see from the sketch, this algorithm runs in O(v³) time. As soon as graphs get larger than a few dozen edges, calculating the edge betweenness with this algorithm becomes computationally infeasible. The main contributor to this large time complexity is how computationally expensive it is to sum pair-dependencies.

Seeing the obvious pitfalls of the brute-force approach, we decided to iterate further and optimize this calculation. We then “stumbled” upon Brandes’ algorithm. Going off of the paper, we wrote the following pseudocode:
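
In Python-flavored form, the unweighted version looks roughly like this (a simplified sketch of our reading of the paper, not our exact code):

```python
from collections import deque

def brandes_edge_betweenness(G):
    """Brandes' algorithm for unweighted, undirected graphs: one BFS per
    source node, then a reverse pass that accumulates partial sums of
    pair-dependencies onto each edge."""
    centrality = {tuple(sorted(e)): 0.0 for e in G.edges}
    for s in G.nodes:
        # BFS phase: count shortest paths from s and record predecessors.
        stack, queue = [], deque([s])
        pred = {v: [] for v in G.nodes}
        sigma = {v: 0 for v in G.nodes}
        dist = {v: -1 for v in G.nodes}
        sigma[s], dist[s] = 1, 0
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in G.neighbors(v):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        # Accumulation phase: walk back from the farthest nodes,
        # summing pair-dependencies onto each edge as we go.
        delta = {v: 0.0 for v in G.nodes}
        while stack:
            w = stack.pop()
            for v in pred[w]:
                c = (sigma[v] / sigma[w]) * (1.0 + delta[w])
                centrality[tuple(sorted((v, w)))] += c
                delta[v] += c
    # Each (s, t) pair was counted from both endpoints, so halve the totals.
    return {e: c / 2.0 for e, c in centrality.items()}
```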

Brandes’ algorithm tries to reduce the computational cost of summing pair-dependencies by doing a partial sum of them in each iteration of the breadth-first search. While it’s more complicated to implement, it’s substantially quicker. Instead of O(v³) time, Brandes’ runs in O(vm) and O(vm + v² log v) time for unweighted and weighted graphs, respectively, where m is the number of links. This algorithm is so good that it’s what the networkx library, the library that handles most graph and network work in Python, uses. Indeed, our implementation is significantly faster than the brute-force approach, especially as datasets get larger.

As we thought about it, Brandes’ algorithm really isn’t feasible for extremely large networks; even O(vm) time becomes too large quickly! We wanted to see if there was another potential way to calculate edge betweenness centrality. After some digging, we found a paper by German researchers (Geisberger, Sanders, and Schultes). Instead of calculating edge betweenness centrality exactly from every node, the paper argues that the betweenness centrality of a network can be approximated. Further, with a little bit of advanced statistics and extremely large datasets, this method for calculating betweenness centrality becomes remarkably accurate and fast.

Because neither of us is a statistics major, the math was slightly above our comprehension (see brain-melting chart below)

A slight step up in complexity from our implementation! Source: Geisberger, Sanders, Schultes

and we didn’t have multiple gigabytes of network data to test on, we didn’t implement their algorithm exactly. That being said, we wanted to capture the essence of their algorithm and develop a heuristic for estimating, rather than calculating, betweenness centrality for a given graph. We wanted to check whether giving each vertex a 50% chance of being skipped would improve runtime while keeping the accuracy of our output within an acceptably high range.
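
The same trade-off can be sketched with networkx’s built-in estimator, which samples k source vertices instead of using all of them. That is close in spirit to skipping each vertex with a coin flip, though it is not our exact heuristic:

```python
import networkx as nx

G = nx.les_miserables_graph()  # stand-in dataset; any reasonably sized graph works

# Exact values: every node is used as a source.
exact = nx.edge_betweenness_centrality(G)

# Estimate from roughly half the nodes: k samples k source vertices,
# which mimics "skip each vertex with probability 0.5" on average.
approx = nx.edge_betweenness_centrality(G, k=len(G) // 2, seed=42)

# Compare the two on the highest-ranked edges.
for e in sorted(exact, key=exact.get, reverse=True)[:5]:
    print(e, round(exact[e], 4), round(approx[e], 4))
```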

Our thinking was that an algorithm that skips half the vertices would be half as good; however, this was NOT the case.

In part two, we’ll discuss the results and our main takeaways from the testing and implementation of these algorithms.
