Effects of Parallelization on the Girvan-Newman Algorithm’s Runtime

Rayaan Irani
Published in smucs · Nov 17, 2021

Rayaan Irani is a sophomore majoring in Computer Science at SMU and pursuing a minor in Business Administration. He is passionate about machine learning, cloud computing, and the interaction between humans and machines.

Saaketh Koka is a junior pursuing a Bachelor of Science in Computer Science at SMU. He loves working with large financial datasets and developing algorithms to trade stocks on his behalf.

By utilizing the multi-threading facilities of the C++ standard library during the calculation of edge betweenness, we were able to reduce the run time of the Girvan-Newman community detection algorithm on weighted graphs by 43%.

The Girvan-Newman Algorithm

The Girvan-Newman community detection algorithm seeks to find groups of nodes within a graph that are highly interlinked (communities). The algorithm does so by calculating the edge betweenness centrality of every edge in the graph, removing the edge with the highest centrality, and repeating until the graph has been divided into a desired number of communities. The algorithm works because edge betweenness centrality is in effect a measure of how useful an edge is for travelling between any two nodes in a graph. As a result, edges with high betweenness are unlikely to lie within the center of a community and instead tend to be the connectors between communities. By removing these connectors, Girvan-Newman causes the communities to slowly become isolated from each other.
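The loop just described can be sketched in code. The following is a minimal, self-contained illustration on an unweighted graph; the adjacency-list representation and function names are assumptions for illustration, not our actual implementation, and betweenness is approximated here by counting edge appearances on one shortest-path tree per source:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <queue>
#include <utility>
#include <vector>

using Adj = std::vector<std::vector<int>>;  // adjacency list, undirected
using Edge = std::pair<int, int>;           // smaller endpoint stored first

// Approximate edge betweenness: tally how often each edge appears on one
// BFS shortest-path tree per source, and return the most-used edge.
Edge mostCentralEdge(const Adj& adj) {
    int n = static_cast<int>(adj.size());
    std::map<Edge, int> counts;
    for (int src = 0; src < n; ++src) {
        std::vector<int> parent(n, -1), dist(n, -1);
        std::queue<int> q;
        dist[src] = 0;
        q.push(src);
        while (!q.empty()) {
            int u = q.front();
            q.pop();
            for (int v : adj[u]) {
                if (dist[v] == -1) {
                    dist[v] = dist[u] + 1;
                    parent[v] = u;
                    q.push(v);
                }
            }
        }
        // Walk each node's path back to the source, counting edges.
        for (int v = 0; v < n; ++v)
            for (int cur = v; parent[cur] != -1; cur = parent[cur])
                counts[{std::min(cur, parent[cur]),
                        std::max(cur, parent[cur])}]++;
    }
    Edge best{-1, -1};
    int bestCount = -1;
    for (const auto& [e, c] : counts)
        if (c > bestCount) { bestCount = c; best = e; }
    return best;
}

// Number of connected components (the communities, once edges are removed).
int componentCount(const Adj& adj) {
    int n = static_cast<int>(adj.size()), count = 0;
    std::vector<bool> seen(n, false);
    for (int s = 0; s < n; ++s) {
        if (seen[s]) continue;
        ++count;
        std::vector<int> stack{s};
        seen[s] = true;
        while (!stack.empty()) {
            int u = stack.back();
            stack.pop_back();
            for (int v : adj[u])
                if (!seen[v]) { seen[v] = true; stack.push_back(v); }
        }
    }
    return count;
}

// Remove an undirected edge from both endpoints' adjacency lists.
void removeEdge(Adj& adj, Edge e) {
    auto& a = adj[e.first];
    a.erase(std::find(a.begin(), a.end(), e.second));
    auto& b = adj[e.second];
    b.erase(std::find(b.begin(), b.end(), e.first));
}

// Girvan-Newman: repeatedly delete the most central edge until the graph
// has split into the desired number of communities.
void girvanNewman(Adj& adj, int targetCommunities) {
    while (componentCount(adj) < targetCommunities)
        removeEdge(adj, mostCentralEdge(adj));
}
```

On a graph made of two tight clusters joined by a single bridge, the bridge edge is deleted first and the two clusters fall out as separate components.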

Calculating Betweenness

Using Girvan-Newman to detect communities requires calculating the betweenness of every edge in the graph at every iteration. This is the most computationally taxing portion of the Girvan-Newman algorithm. In unweighted graphs, the metric is calculated by determining the number of shortest paths between all pairs of nodes which pass through the edge in question. We apply this concept to weighted graphs by using Dijkstra's algorithm to find the least costly path between any two nodes. Our method for calculating betweenness takes advantage of the fact that the computation needed to find the betweenness of any one edge is roughly equivalent to the computation needed to find it for every edge in the graph, so we calculate all of the values at once. This optimization is coupled with the observation that the edge with the highest betweenness is simply the edge present on the greatest number of shortest paths, since the number of possible shortest paths is the same for all edges in a particular graph.

Our Method

Our implementation iterates through all vertices in the graph and executes Dijkstra's algorithm, which returns the least costly path between that vertex and every other node in the graph. We then iterate through all of the resulting paths to count the number of times each edge appears on them. A final scan identifies the edge with the highest number of appearances on a shortest path, which is returned to the Girvan-Newman algorithm to be deleted.
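A sketch of this method for a weighted graph is below. The Adj alias and function names are illustrative assumptions, and each source contributes one shortest-path tree (a full betweenness calculation would also account for ties between equally short paths):

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <limits>
#include <map>
#include <queue>
#include <utility>
#include <vector>

// Weighted, undirected graph: adj[u] holds (neighbor, weight) pairs.
using Adj = std::vector<std::vector<std::pair<int, double>>>;
using Edge = std::pair<int, int>;  // smaller endpoint stored first

// Dijkstra's algorithm from src; returns one shortest-path tree as a
// predecessor array (parent[v] is v's predecessor on a least-cost path).
std::vector<int> dijkstraParents(const Adj& adj, int src) {
    const double kInf = std::numeric_limits<double>::infinity();
    std::vector<double> dist(adj.size(), kInf);
    std::vector<int> parent(adj.size(), -1);
    using State = std::pair<double, int>;  // (distance, node)
    std::priority_queue<State, std::vector<State>, std::greater<State>> pq;
    dist[src] = 0.0;
    pq.push({0.0, src});
    while (!pq.empty()) {
        auto [d, u] = pq.top();
        pq.pop();
        if (d > dist[u]) continue;  // stale queue entry
        for (auto [v, w] : adj[u]) {
            if (dist[u] + w < dist[v]) {
                dist[v] = dist[u] + w;
                parent[v] = u;
                pq.push({dist[v], v});
            }
        }
    }
    return parent;
}

// Run Dijkstra's from every source, count how often each edge appears on
// the resulting least-cost paths, and return the most-used edge.
Edge highestBetweennessEdge(const Adj& adj) {
    std::map<Edge, int> counts;
    int n = static_cast<int>(adj.size());
    for (int src = 0; src < n; ++src) {
        std::vector<int> parent = dijkstraParents(adj, src);
        for (int v = 0; v < n; ++v)
            for (int cur = v; parent[cur] != -1; cur = parent[cur])
                counts[{std::min(cur, parent[cur]),
                        std::max(cur, parent[cur])}]++;
    }
    Edge best{-1, -1};
    int bestCount = -1;
    for (const auto& [e, c] : counts)
        if (c > bestCount) { bestCount = c; best = e; }
    return best;
}
```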

This method is computationally expensive: finding all of the shortest paths from each node to every other node takes a significant amount of compute, and the per-node computations are independent of one another, so they do not need to finish in any particular order. As a result, this process is a prime candidate for parallelization.

Adding Parallelization

The addition of parallelization substantially decreases the amount of time the program takes to execute. While the total amount of computation needed does not change, large portions of that work can be placed on different cores which run simultaneously with the main thread, reducing the total wall-clock time. The best candidates for parallelization are tasks which are performed in large numbers, are computationally complex, and do not depend on the order in which they complete. For these reasons, we decided to parallelize the runs of Dijkstra's algorithm for each node in the graph.

To implement this parallelization, we used the C++ standard library's std::thread, alongside the std::future and std::promise facilities for returning values from threads, which allows us to use the additional computing cores present in all modern computers. Because there is a large constant-time cost to creating a thread and returning a value from it, the number of new threads created should be minimized. To do so, we give each core of the device running the program an equal number of nodes to compute. The splitting of the nodes into equal lists and the combination of the threads' results are done in the parallel Dijkstra's method, which uses a thread-runner helper method as the set of instructions each thread executes.
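The splitting and merging pattern can be sketched as follows. Here countsFromSource is a trivial stand-in for one Dijkstra run plus path tallying, so the sketch focuses purely on the threading pattern; all names are illustrative, not our actual code:

```cpp
#include <cassert>
#include <future>
#include <map>
#include <thread>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;
using EdgeCounts = std::map<Edge, int>;

// Stand-in for one Dijkstra run plus path tallying from a single source.
// (The real program would return edge-appearance counts for all least-cost
// paths rooted at src; this placeholder just returns one distinct edge.)
EdgeCounts countsFromSource(int src) {
    return {{{src, src + 1}, 1}};
}

// Thread runner: each thread processes an equal slice of the sources and
// fulfills its promise with the merged local tallies.
void threadRunner(int begin, int end, std::promise<EdgeCounts> result) {
    EdgeCounts local;
    for (int src = begin; src < end; ++src)
        for (const auto& [e, c] : countsFromSource(src))
            local[e] += c;
    result.set_value(std::move(local));
}

// Split the sources as evenly as possible across the available cores,
// launch one thread per core, then merge the per-thread results.
EdgeCounts parallelCounts(int numSources) {
    if (numSources <= 0) return {};
    int cores = static_cast<int>(std::thread::hardware_concurrency());
    if (cores < 1) cores = 1;
    if (cores > numSources) cores = numSources;

    std::vector<std::thread> threads;
    std::vector<std::future<EdgeCounts>> futures;
    int chunk = numSources / cores, extra = numSources % cores;
    int begin = 0;
    for (int i = 0; i < cores; ++i) {
        int end = begin + chunk + (i < extra ? 1 : 0);
        std::promise<EdgeCounts> p;
        futures.push_back(p.get_future());
        threads.emplace_back(threadRunner, begin, end, std::move(p));
        begin = end;
    }

    EdgeCounts merged;
    for (auto& f : futures)
        for (const auto& [e, c] : f.get())
            merged[e] += c;
    for (auto& t : threads) t.join();
    return merged;
}
```

The promise is moved into the thread, which sets its value when its slice is done; the main thread blocks on each future in turn and combines the partial tallies.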

Comparison of Times and Analysis

As projected, the parallelized version executes substantially faster than the non-parallelized version. To test the run times, we compared how long each version took to identify the communities within a dataset of collegiate football games played during the 2000 season. In collegiate football, many teams are members of conferences and tend to play conference opponents more often than outside teams. As a result, natural communities exist within the dataset, making it well suited to the Girvan-Newman algorithm.

The difference in execution time between the two versions was substantial. Using timings from the standard library's chrono facilities, the parallelized version ran in 13.83 seconds on a 2020 M1 MacBook Pro, which has 8 cores, while the non-parallelized version ran on the same hardware in 24.28 seconds, a reduction in run time of approximately 43%.

This result tells us several things. First, it indicates that running Dijkstra's algorithm for each node is a computationally significant task: for parallelization to have a large impact, the parallelized process must account for a significant share of the program's run time, and a 43% reduction shows that it does. However, the size of the remaining gap also indicates that substantial computational work was needed outside the Dijkstra calculations. Since the machine the program ran on has 8 cores, if all of the work had been in the parallelized portion, the expected reduction in run time would have been around 87% (1 - 1/8 = 87.5%), far more than the decrease we observed.
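The 87% figure follows from Amdahl's law: if a fraction p of the run time is parallelizable across n cores, the overall speedup is 1 / ((1 - p) + p / n). A small sketch (illustrative only, not part of our measurement code) that also back-solves our measured timings for the implied parallel fraction:

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: if a fraction p of the run time is parallelizable across
// n cores and the rest is serial, the overall speedup is:
double amdahlSpeedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

// Inverting the formula gives the parallelizable fraction implied by a
// measured speedup on n cores.
double impliedParallelFraction(double speedup, int n) {
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / static_cast<double>(n));
}
```

Under this simplified model, our measured speedup of roughly 1.76x on 8 cores implies a parallelizable fraction of about 0.49, consistent with roughly half of the run time lying outside the Dijkstra calculations.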

Conclusion

In conclusion, by multi-threading the calculation of Dijkstra’s algorithm for all nodes in a graph, we were able to reduce the time needed for the Girvan-Newman algorithm to execute by 43%.
