Most of the systems running businesses today can be represented in the form of a network. In this digital age, networks are an appropriate way of representing entities because all the systems today are well connected. Data represented in the form of a network is also known as Graph. Graphs are a much better way of studying multilevel relationships networks in social media, organised crime links, geographical links, etc compared to relational databases which would need a much larger space to store these relationships. Moreover, the computational complexity of searching for the linked entities would also increase due to the redundant storage of linked elements. In this blog, I will try to explain the basic terminologies of Graph Theory and Community Detection using Louvain’s Algorithm in GPU. Happy reading!!
2. Applications of Graphs :
- Social Media Network: Users are considered as vertices and the friendship between them is treated as an existence of edge between the vertices ( users ). This is useful in friend recommendations.
- Retail Product Recommendations in Retail: Representing the products as nodes and the volume of purchases in the same invoice as edges can unfold new multilevel relationships between the products. These relationships drive the products recommendation for the customers.
- Retail Fraud Networks: Analysis of customers using same credit cards, phone numbers, IP Addresses, devices can be done using graphs to identify the frauds related to collusion among customers involved in stock shrinkage, payments fraud, etc.
- Geographical Maps : Every map application uses graphs to calculate the shortest path between two locations( vertices ) based on the traffic congestion and actual road distance( edges ).
- Unsupervised Machine Learning algorithms like K-Means clustering stores the elements in the form of graphs for calculating the pair wise distances faster.
3. Graph Terminologies Required For Understanding Louvain’s Algorithm
In this section, I will walk you through the graph terminologies which are pre- requisites for understanding Louvain’s Algorithm.
What is a Graph?
A graph is a non-linear data structure containing nodes connected by edges where edges could be directed or undirected. All the terminologies will be explained taking the below sample graph as a reference.
Individual points in a graph which are connected by lines are called nodes. Nodes are labelled from 1 to 8 in the above figure.
Lines or arcs that connect any two nodes in the graph are called edges.
The adjacency matrix also known as connection matrix of a simple labelled graph is a matrix with rows and columns labelled by graph nodes with a 1 or 0 in position ( ni, nj ) based on whether the nodes are adjacent or not. In the below matrix, the row and column labels represent the nodes and the matrix elements represent the existence of an edge. We can observe that the edges are marked with 1 and pair of vertices with no link/edge between them are marked as 0.
Directed and Undirected Graphs:
Graphs for which edges do not have a direction are called as undirected graphs while directed graphs have edges with a particular direction associated with them. In any social network website, if individuals are nodes then friendship between them will be an undirected link and hence the graph will be called undirected graph. If the edges represent who sent or received the friend request, it will result into a directed graph.
In any graph structure if the nodes can form multiple groups such that the nodes are much more associated/linked to nodes within the groups compared to nodes in the other groups then these groups are said to form communities. Below figure represents a graph with two communities identified in blue and orange respectively. The edge length represents the weight of the edge. Based on the edge length, nodes 1,2,3 and 4 in blue communities are much closer to each other as compared to nodes 5,6,7,8 and 9. This also applies to orange nodes.
In the below sections we will explore one of the most commonly used community detection algorithms which is known as Louvain’s Algorithm.
4. Louvain’s Algorithm for Community Detection:
Louvain’s algorithm was proposed by Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte and Etienne Lefebvre in this paper in 2008. It was named after the city in Belgium where one of the proposers Etienne Lefebvre first developed this algorithm during his Master thesis at UCL( Louvain-la-Neuve ) in March,2007.
Introduction and Cost Function
Any unsupervised or supervised machine learning algorithm needs a loss/optimisation/cost function to decide on the convergence criterion. In the community detection scenario, Modularity is one most suitable optimisation metric. Louvain’s algorithm is based on optimising the Modularity very effectively. Before discussing the steps followed in the algorithm, let us understand Modularity concept first.
The quality of the communities referred as partitions hereafter is measured by Modularity of the partition. Modularity Q is defined as the formula shown in the below figure.
To explain it better, let us take an example of a small graph.
As per the definition of adjacency matrix Aij explained in the previous sections, it will look like as shown below for this graph.
Consider the partition assignment as:
partition 1– 1 , 3 , 4
partition 2– 2 , 5
Number of Links, m = 5
ki, kj are the degree of the respective nodes ( 2 in our example)
d(x,y) is 0 if nodes belong to the same partition else it is 1.
Using the formulae of modularity,
Q = 1 / (2 * 5) ( (0–2 * 2 / (2 * 5) ) * 1 + # node 1 to 1 -> absent, same membership
(1–2 * 2 / (2 * 5) ) * 0 + # node 1 to 2 -> present, different membership
(0–2 * 2 / (2 * 5) ) * 1 + # node 1 to 4 -> absent, same membership
(0–2 * 2 / (2 * 5) ) * 0 + # node 1 to 5 -> present, different membership
(1–2 * 2 / (2 * 5) ) * 1 + # node 1 to 3 -> present, same membership
Continuing for the rest of the matrix eventually simplifies to:
Q = 1 / 10 ( 7 * (0–2 / 5) + 6 * (1–2 / 5) ) = 4 / 50 = 0.08
Thanks to this answer from Matthew Joseph in ResearchGate to share a dry run of modularity calculation. The explanation is adopted from the same example. The modularity score ranges between -1 and 1 and higher the modularity score means the communities detected are good and more tightly knit.
Steps followed in the algorithm:
Algorithm broadly follows two steps (A and B) which are repeated iteratively.
A. Let there be N nodes in a graph network. In our example, we have N = 9 nodes as shown in the below figure. Each edge has certain weight assigned to it.
- To start with each node is assigned to a different community or partition. The number of partitions is equal to number of nodes N.
2. For each neighbour j of node i, it is checked if the overall modularity increases by moving i from its partition to partition j. The node i is moved to partition j for which the gain of modularity is highest ( gain should always be positive ). In case, none of the gains observed are positive, i retains the original partition label.
3. Step 2 is repeated for all the nodes in a sequential manner. This is called one iteration. The iteration is also repeated till no improvement in modularity can be achieved. The idea is to reach a local maximum of modularity after which no further increase in modularity can be possible. Please note that a node can be and most likely visited more than once to evaluate the change in modularity by moving its neighbours to different partitions.
B. The second step involves rebuilding a new network by grouping the nodes together labelled in the same community (merging individual nodes ) in step A.
The weights of the edges between the two new communities ( containing more than 1 node ) are determined by adding up the weights of edges from every node in one community to another. The below images will complement my explanation of the algorithm steps.
Edges within the nodes of same community form self-loops. This means even the diagonal elements in adjacency matrix will be non-zero.
One iteration of steps A and B is called a pass. After the first pass, algorithm attempts the second pass and so on. After every pass, the number of communities reduces. This continues till there is no change in community label changes are observed and a maximum modularity is reached. Most of the computation happens in the first few passes and reduces exponentially in the final passes.
5. Implementation in GPU
In this section, I will walk you through the implementation steps of Louvain’s Algorithm in CUDA RAPIDS GPU installed on Microsoft Azure.
GPU flavour: Volta-100 GPU
CUDA- version 9.2
Rapids- version 0.9
Cloud Service Provider : Microsoft
Azure Instance Type : N-Series NCsv3
We will use a dummy dataset which consists of number of messages exchanged between pairs of customers of a messaging service/app. The customers are the nodes and the number of messages exchanged between the customers ( nodes ) are the respective edge lengths/weights.
Two CUDA libraries CUDF and CUGRAPH will be used. CUDF is the CUDA equivalent of Pandas in vanilla python which is used for working with data frames and exploratory data analysis. CUGRAPH contains the graph based algorithm implementations available in CUDA RAPIDS. Refer the link to know more about working in machine learning in RAPIDS.
- Import the required libraries
2. Import dataset
Fields Inviter and Invitee are the nodes and the MsgCount field contains the edge weights. Field Num represents the edges/link between nodes.
3. The nodes should be input as integers. We will have to remove the character ‘P’ from Inviter and Invitee columns.
Check the data. src_node and des_node are the new fields after removing ‘P’.
4. Convert the datatypes of nodes to integers and edge weights to float.
5. Louvain algorithm implementation of CUGRAPH requires nodes starting at zero. We will have to subtract each node by 1 as it starts for 1 in our dataset.
src_0 and dest_0 are the fields representing the nodes to be fed to the Louvain algorithm.
6. Next step is to convert this structured tabular data into a Graph object .
We will create an empty graph.
Then add nodes and edges from the dataframe df_chat.
The total number of nodes in our graph is 131.
7. And the moment of reckoning!! Run Louvain community detection algorithm and store the community information in df_chat_partition object and the modularity score in variable mod.
The modularity score of 0.94 means that we have detected very closed communities :) as discussed in the section Introduction and Cost Function of this blog before.
Every node will have a partition/community assigned to it.
9. Check the number of partitions/communities and explore members of each community.
47 communities are detected.
I have shown few samples of communities which are formed. For example nodes 15,76 and 81 belong to community ‘Partition 9’ which means members P15, p76 and P81 tend to exchange more messages among each other as compared to any other member.
This is how you can apply Louvain’s algorithm to any dataset of your choice to cater to similar requirements. The code structure is very similar to python based pandas code and can be easily adopted by anyone who is comfortable in data analysis with python.
I hope this blog gave you a good overview of community detection in Graphs in general and Louvain’s Algorithm in particular. Currently, I am exploring the strategies to community tag the nodes newly added to the dataset on an incremental basis. I will try to share more details on the same in one of my future blogs. Till then, explore the possibilities of community detection in your respective organisation, institution and business wherever you belong to. Thanks a ton for reading!!