Community Detection in Networks — Taking edge betweenness centrality two steps further — Pt. 2

Alex Shockley · Published in smucs · Apr 30, 2022

In part 1, we discussed the process for implementing three algorithms to calculate the edge betweenness centrality of each edge in a graph. In this part, we'll analyze those algorithms for speed and accuracy, and discuss what the results mean for real-world network analysis.

Analysis Methodology

Conducting our time analysis was fairly simple. We generated many random graphs using the Erdős–Rényi model from the NetworkX library and measured how long each of the three algorithms took to run the full Girvan-Newman algorithm on them. Some examples of these graphs, with their detected communities color-coded, are shown below!

Randomly Generated Graph #1
Randomly Generated Graph #2
Randomly Generated Graph #3
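As a rough sketch of how a single timing run looked, here's a minimal harness, assuming NetworkX for graph generation. The function name time_girvan_newman is ours for illustration; any of the three betweenness implementations from part 1 would be passed in as betweenness_fn:

```python
import time

import networkx as nx


def time_girvan_newman(betweenness_fn, n, p, seed=42):
    """Time one full Girvan-Newman run on an Erdős–Rényi graph.

    betweenness_fn maps a graph to an {edge: centrality} dict -- any of
    the three implementations from part 1 would be plugged in here.
    """
    G = nx.erdos_renyi_graph(n, p, seed=seed)
    start = time.perf_counter()
    while G.number_of_edges() > 0:
        # Recompute edge betweenness and remove the highest-scoring edge.
        scores = betweenness_fn(G)
        top_edge = max(scores, key=scores.get)
        G.remove_edge(*top_edge)
    return time.perf_counter() - start


# Example: time NetworkX's reference Brandes implementation on a 30-node graph.
print(time_girvan_newman(nx.edge_betweenness_centrality, n=30, p=0.2))
```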

Something we didn't anticipate was that randomly generated graphs rarely contain strong community structure. It's interesting to realize that networks aren't random structures: although they're difficult to visualize and occur naturally, there isn't much randomness to them. Graphs built from real-world data, by contrast, form much more obvious communities, as seen in some of the networks we ran below:

Football Dataset Clustering
Political Books Clustering

Because the real-world datasets contain genuine community structure, we used these graphs to test the accuracy of our third implementation against the first two. The first two implementations always returned the same result, so we calculated accuracy as the percent error between the third (predictive) implementation and the implementation that used the exact Brandes algorithm, as follows:
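One way to write this, with AM_Brandes and AM_approx as our labels for the average modularity each implementation's clustering achieves, is:

$$\text{percent error} = \frac{AM_{\text{Brandes}} - AM_{\text{approx}}}{AM_{\text{Brandes}}} \times 100$$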

where AM is the average modularity across all clusters in the graph, and modularity is defined as:
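Assuming the standard Newman formulation that the Girvan-Newman method is built around, this is:

$$Q = \frac{1}{2m}\sum_{ij}\left(A_{ij} - \frac{k_i k_j}{2m}\right)\delta(c_i, c_j)$$

where $A_{ij}$ is the adjacency matrix, $k_i$ is the degree of node $i$, $m$ is the total number of edges, and $\delta(c_i, c_j) = 1$ exactly when nodes $i$ and $j$ are placed in the same community.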

This left us with a percent error value where positive values signify that the traditional Brandes algorithm was better and negative values signify that the predictive implementation was better.

Time Analysis

We found the most telling time analysis came from our runs on 30-node and 100-node graphs. As you can see with the 100-node set, the steep growth in these algorithms' run time really started to take its toll, especially as the graphs got denser!

We were fascinated by how much worse Brandes' performed as the edge probability increased. This can be attributed in part to the fact that the timing tests were run on the Girvan-Newman algorithm as a whole: since Girvan-Newman keeps removing edges until none remain, the total run time grows at least linearly with the number of edges, and a higher edge probability simply means more edges to remove.

What we liked most in this implementation was the performance of our approximation algorithm. In every case it took about half as much time to run, and as the run-time graph shows, that makes a huge difference!

Accuracy analysis

While our Brandes approximation was promising in terms of run time, we were still skeptical: if it's half as accurate, the time saved isn't really worth it. To make sure that wasn't the case, we compared the output of our approximation against the standard Brandes algorithm using the percent error defined above.
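As a concrete sketch of that check, assuming each implementation's output is a partition of the nodes and using NetworkX's built-in modularity function as the scorer (the helper name percent_error is ours for illustration):

```python
import networkx as nx
from networkx.algorithms import community


def percent_error(G, brandes_partition, approx_partition):
    """Percent error of the approximation relative to exact Brandes.

    Each partition is a list of node sets (e.g. one level of the
    dendrogram Girvan-Newman produces). Positive values mean the exact
    Brandes run found the higher-modularity clustering.
    """
    q_brandes = community.modularity(G, brandes_partition)
    q_approx = community.modularity(G, approx_partition)
    return (q_brandes - q_approx) / q_brandes * 100


# Toy example: a barbell graph with an obvious two-community split.
G = nx.barbell_graph(5, 0)
good = [set(range(5)), set(range(5, 10))]
off = [set(range(3)), set(range(3, 10))]
print(percent_error(G, good, off))  # positive: the first partition is better
```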

Surprisingly, on the dataset of football conferences our approximation actually produced a better clustering than the standard Brandes algorithm (a negative percent error). Even when its error was positive, it was shockingly small for something that takes less than half the time to compute. We were intrigued by this result, and it's definitely worth looking into further, especially the probability with which the approximation selects a given vertex from the graph; further experimentation would be necessary to test whether a different selection technique would yield better results.
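For context on that selection question: the flavor of approximation we're describing estimates betweenness from a random sample of source vertices instead of all of them. NetworkX exposes the same idea through its k parameter, which is a convenient stand-in here even though our part-1 implementation differs in its details:

```python
import networkx as nx

G = nx.karate_club_graph()

# Exact edge betweenness: Brandes' algorithm run from every source vertex.
exact = nx.edge_betweenness_centrality(G)

# Approximate edge betweenness: shortest paths from only k sampled vertices.
approx = nx.edge_betweenness_centrality(G, k=10, seed=1)

# Girvan-Newman only cares about which edge scores highest, so compare the
# first edge each version would remove.
print(max(exact, key=exact.get))
print(max(approx, key=approx.get))
```

The trade-off lives entirely in how those sample vertices are chosen, which is exactly the selection technique we'd want to experiment with.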

Real-world implications

Network analysis of increasingly large datasets is a necessity for computing in the modern era. The applications for network analysis are vast; as the German paper outlined, there are applications in transportation (roads, public transit, walking pathways), social networks, computer networks, and bioinformatics (e.g., protein interaction networks).

While, in a perfect world, running the full Girvan-Newman algorithm would give the best output, our implementation showed that this becomes infeasible for a dense network of just a few hundred nodes. Instead of exact edge-betweenness calculations, approximations need to be made to analyze these networks. If our implementation can keep error under 15%, then a more statistically sound implementation should be able to reduce error and run time substantially, opening the door to network analysis of extremely large, and important, networks.

If we had more time to focus on this specific problem, we would throw our implementations at increasingly large real-world datasets and see how they perform. We would also like to test different methods for determining the most important edges to evaluate, so that edge betweenness can be estimated without the full, prohibitively slow computation.
