Graph Analytics: Detecting communities in the graph using Neo4j

Louvain algorithm, Label propagation, graph components, Triangle Count & Local Clustering Coefficient

Mehul Gupta
Data Science in your pocket
9 min readJun 25, 2022

--

After covering graph databases basics, pathfinding & centrality algorithms with Neo4j implementation, it's time to move towards Community detection in graphs which is of great use in the real world be it recommending friends on social media apps, detecting frauds, etc.

You should go through the previous 3 parts for a better understanding

Graph analytics basics

Pathfinding algorithms in graphs

Finding important nodes in graphs

Before getting started, we must start off with understanding graph components

Graph Components

A graph component is a sub-graph within a graph such that all the nodes in the subgraph are

Have a path to another node in the subgraph (this require some clarification that we will discuss later in the post).

Can’t reach nodes of other sub graphs in the graph

So, it can be the case that in graph ‘G’, we may have multiple components where any node from a component ‘A’ can’t reach any node in other components. There are 2 types of graph components depending upon the path considered between the nodes of a subgraph for declaring the

  1. Weakly Connected Components (WCC): When the direction of edges between the nodes are ignored while considering the path between 2 nodes, it forms a weakly connected component. So if we have A →B, we consider it as A — B (bidirectional) & then look for paths if available
  2. Strongly Connected Components (SCC): When the direction is considered between the edges given the nodes for forming a component, it is called a Strongly connected component

Confusing? Let’s understand by an example. Assume this to be a Social Media network where each node

Now, if we wish to figure out

  1. Weakly components: (Michael, Bridget, Alice, Charles), (Mark, Douglas)

We have 2 WCC in the above graph as there exists at least one path (irrespective of the direction of the edge) to reach from one node to another. So if you wish to reach Charles to Michael, we can as

Charles — Alice — Michael (remember to ignore edge direction)

But there exists no path from any node of the Component (Mark, Douglas) to the other component hence the graph has 2 weakly connected components

2. Strongly Connected component

(Michael, Bridget, Alice), (Charles), (Mark, Douglas)

Why Charles is out of the component which it was a part of in the case of WCC?

In the case of SCC, each node of the component should be reachable by another node (considering the direction of the edge). As Charles has no outgoing edge, you can’t move any edge from Charles to any node hence it is left out. In the remaining 2 components, each node is reachable by all other nodes (after considering edge directions also)

Let’s query our old airline dataset to determine weakly connected components. Before this, we need to load the dataset from a CSV in some DB & form a projection as well as we did in my past blogs (all explanations around the cypher queries can be found here)

The dataset

Loading data from CSV

CALL apoc.load.csv("flight_network_spicejet.csv")YIELD list as itemMERGE(n:city{name:item[0],lat:toInteger(item[3]),long:toInteger(item[4])})MERGE (m:city{name:item[1]})MERGE (n)-[:flight{distance:toInteger(item[2])}]->(m)

Creating a projection

CALL gds.graph.project("airline",{city:{properties:["lat","long"]}},{flight:{properties:"distance",orientation:'UNDIRECTED'}})

resulting in

I have already discussed the above queries in detail in my last blogs & hence skipping any explanation this time.

Let’s determine Weakly Connected Components in our graph

CALL gds.wcc.write('airline', { writeProperty: 'community' }) YIELD nodePropertiesWritten, componentCount;

This query will help us write the community detected for each node in the graph as a property ‘community’ for the node. Do you remember the difference between Write & Stream? time to revisit if no

In our graph, we got just one community hence the entire graph is a single weakly connected component. Do observe the output below

As we can see in the above video

  • Only 1 weak component was detected. hence, all nodes belong to one community only
  • A new property has been added to each node ‘community’=0 depicting the weak component id. If we have had multiple communities detected, this community property would have taken values from 0-n
  • Similarly, we can detect Strongly connected components by replacing gds.wcc.write with gds.scc write in the above query.

Find the vlog version for WCC & SCC below:

Label Propagation algorithm

One important utility of community detection is in semi-supervised learning & data annotation where we can assign labels to unlabeled data using a small labeled dataset as reference. Though, it works perfectly fine when no labels are provided as well.

The algorithm goes like this

  • Assign every node a unique community such that total communities = total nodes
  • For every node, using majority voting, assign a label based on the labels of its neighbors. In case we have multiple contenders (no single label being common or 2 or more labels with the same frequency in neighbors), we can have an alternate rule like 1) distance from node 2)random assignment, etc.

What is majority voting?

Nothing but neighbors with the most common label will be assigned to the current node. Can be considered something similar on the lines of a K-Nearest Neighbors.

  • We run the same process for multiple iterations where in one iteration, we go through all nodes & update their labels based on Majority voting.

After n iterations, nodes with the same labels are clubbed from the same community.

In the case of semi-supervised learning, instead of going with each node with a unique label, we can use labels from the dataset & continue the entire process as above.

Let’s run an iteration for label propagation for data labeling over the below graph

Assuming we have labels for A, B & C & wish to label out E & D. Now for E

  • We have 2 edges to A & C respectively, hence by majority voting, we will assign it the ‘Red’ Group
  • For D, we have a tie break as it has edges to A & B which belong to different groups. By, say, we follow Random as a tie-breaker policy, D is assigned the ‘Green’ group after the iteration

let's run label propagation on our airline network. We would be considering all nodes in different communities & run label propagation using weights (distance between the nodes we already have)

CALL gds.labelPropagation.stream('airline')YIELD nodeId, communityId AS CommunityRETURN gds.util.asNode(nodeId).name AS Name, CommunityORDER BY Community, Name

I haven’t posted the results as like in the case of WCC, here also every node gets assigned to a single community

Louvain Algorithm

As discussed in my last post in quite a detail, we won’t be deep-diving into its explanation but would definitely see how to use it with Neo4j. You can find the vlog version below.

CALL gds.louvain.write('airline', { writeProperty: 'community' })YIELD communityCount, modularity, modularities

let’s see how many communities we get using Louvain

╞════════════════════╪═════════════╡
│"Belgaum" │8 │
├────────────────────┼─────────────┤
│"Bengaluru" │8 │
├────────────────────┼─────────────┤
│"Calicut" │8 │
├────────────────────┼─────────────┤
│"Chennai" │8 │
├────────────────────┼─────────────┤
│"Coimbatore" │8 │
├────────────────────┼─────────────┤
│"Gwalior" │8 │
├────────────────────┼─────────────┤
│"Hubli" │8 │
├────────────────────┼─────────────┤
│"Hyderabad" │8 │
├────────────────────┼─────────────┤
│"Kochi" │8 │
├────────────────────┼─────────────┤
│"Madurai" │8 │
├────────────────────┼─────────────┤
│"Mangalore" │8 │
├────────────────────┼─────────────┤
│"Tirupati" │8 │
├────────────────────┼─────────────┤
│"Vijayawada" │8 │
├────────────────────┼─────────────┤
│"Visakhapatnam" │8 │
├────────────────────┼─────────────┤
│"Pondicherry" │8 │
├────────────────────┼─────────────┤
│"Port Blair" │8 │
├────────────────────┼─────────────┤
│"Rajahmundry" │8 │
├────────────────────┼─────────────┤
│"Thiruvananthapuram"│8 │
├────────────────────┼─────────────┤
│"Tuticorin" │8 │
├────────────────────┼─────────────┤
│"Bagdogra" │15 │
├────────────────────┼─────────────┤
│"Dibrugarh" │15 │
├────────────────────┼─────────────┤
│"Guwahati" │15 │
├────────────────────┼─────────────┤
│"Kanpur" │15 │
├────────────────────┼─────────────┤
│"Kolkata" │15 │
├────────────────────┼─────────────┤
│"Patna" │15 │
├────────────────────┼─────────────┤
│"Varanasi" │15 │
├────────────────────┼─────────────┤
│"Lilabari" │15 │
├────────────────────┼─────────────┤
│"Silchar" │15 │
├────────────────────┼─────────────┤
│"Ahmedabad" │34 │
├────────────────────┼─────────────┤
│"Amritsar" │34 │
├────────────────────┼─────────────┤
│"Bhopal" │34 │
├────────────────────┼─────────────┤
│"Goa" │34 │
├────────────────────┼─────────────┤
│"Jaipur" │34 │
├────────────────────┼─────────────┤
│"Jodhpur" │34 │
├────────────────────┼─────────────┤
│"Surat" │34 │
├────────────────────┼─────────────┤
│"Udaipur" │34 │
├────────────────────┼─────────────┤
│"Pune" │34 │
├────────────────────┼─────────────┤
│"Aurangabad" │39 │
├────────────────────┼─────────────┤
│"Darbhanga" │39 │
├────────────────────┼─────────────┤
│"Dehradun" │39 │
├────────────────────┼─────────────┤
│"Delhi" │39 │
├────────────────────┼─────────────┤
│"Jabalpur" │39 │
├────────────────────┼─────────────┤
│"Jammu" │39 │
├────────────────────┼─────────────┤
│"Kandla" │39 │
├────────────────────┼─────────────┤
│"Leh" │39 │
├────────────────────┼─────────────┤
│"Mumbai" │39 │
├────────────────────┼─────────────┤
│"Srinagar" │39 │
├────────────────────┼─────────────┤
│"Rajkot" │39 │
├────────────────────┼─────────────┤
│"Porbandar" │39 │
└────────────────────┴─────────────┘

As neo4j doesn’t support visualization based on property, we are skipping this

Triangle Count & Local Clustering Coefficient

As the name suggests, Triangle count is nothing but a community of 3 nodes such that they form a triangle in a graph i.e. have an edge over one another irrespective of the direction.

Let’s see the triangle count for each node in our airline network

The query

CALL gds.triangleCount.stream('airline')YIELD nodeId, triangleCountRETURN gds.util.asNode(nodeId).name AS name, triangleCountORDER BY triangleCount DESC

The output

│"name"              │"triangleCount"│
╞════════════════════╪═══════════════╡
│"Delhi" │157 │
├────────────────────┼───────────────┤
│"Mumbai" │146 │
├────────────────────┼───────────────┤
│"Bengaluru" │142 │
├────────────────────┼───────────────┤
│"Hyderabad" │128 │
├────────────────────┼───────────────┤
│"Chennai" │122 │
├────────────────────┼───────────────┤
│"Ahmedabad" │105 │
├────────────────────┼───────────────┤
│"Kolkata" │97 │
├────────────────────┼───────────────┤
│"Jaipur" │80 │
├────────────────────┼───────────────┤
│"Patna" │73 │
├────────────────────┼───────────────┤
│"Guwahati" │55 │
├────────────────────┼───────────────┤
│"Varanasi" │52 │
├────────────────────┼───────────────┤
│"Pune" │43 │
├────────────────────┼───────────────┤
│"Surat" │40 │
├────────────────────┼───────────────┤
│"Bagdogra" │28 │
├────────────────────┼───────────────┤
│"Bhopal" │25 │
├────────────────────┼───────────────┤
│"Kochi" │25 │
├────────────────────┼───────────────┤
│"Udaipur" │24 │
├────────────────────┼───────────────┤
│"Goa" │21 │
├────────────────────┼───────────────┤
│"Visakhapatnam" │20 │
├────────────────────┼───────────────┤
│"Amritsar" │17 │
├────────────────────┼───────────────┤
│"Dehradun" │16 │
├────────────────────┼───────────────┤
│"Coimbatore" │15 │
├────────────────────┼───────────────┤
│"Jabalpur" │15 │
├────────────────────┼───────────────┤
│"Kanpur" │15 │
├────────────────────┼───────────────┤
│"Mangalore" │14 │
├────────────────────┼───────────────┤
│"Aurangabad" │10 │
├────────────────────┼───────────────┤
│"Madurai" │10 │
├────────────────────┼───────────────┤
│"Srinagar" │10 │
├────────────────────┼───────────────┤
│"Thiruvananthapuram"│10 │
├────────────────────┼───────────────┤
│"Vijayawada" │8 │
├────────────────────┼───────────────┤
│"Jammu" │7 │
├────────────────────┼───────────────┤
│"Calicut" │6 │
├────────────────────┼───────────────┤
│"Darbhanga" │6 │
├────────────────────┼───────────────┤
│"Hubli" │6 │
├────────────────────┼───────────────┤
│"Port Blair" │6 │
├────────────────────┼───────────────┤
│"Jodhpur" │5 │
├────────────────────┼───────────────┤
│"Tirupati" │4 │
├────────────────────┼───────────────┤
│"Belgaum" │3 │
├────────────────────┼───────────────┤
│"Gwalior" │3 │
├────────────────────┼───────────────┤
│"Leh" │3 │
├────────────────────┼───────────────┤
│"Rajahmundry" │3 │
├────────────────────┼───────────────┤
│"Tuticorin" │3 │
├────────────────────┼───────────────┤
│"Dibrugarh" │1 │
├────────────────────┼───────────────┤
│"Kandla" │1 │
├────────────────────┼───────────────┤
│"Pondicherry" │1 │
├────────────────────┼───────────────┤
│"Rajkot" │1 │
├────────────────────┼───────────────┤
│"Lilabari" │1 │
├────────────────────┼───────────────┤
│"Silchar" │1 │
├────────────────────┼───────────────┤
│"Porbandar" │0 │
└────────────────────┴───────────────┘

The metropolitan cities are a part of numerous triangles depicting their importance in the airline network while Porbandar isn’t a part of a single triangle

Extending Triangle Count, the Local clustering coefficient determines how well connected are the neighboring nodes of a node using Triangle Count using the formula

Local Clustering Coefficient(n) = 2xT(node)/d(n)(d(n)-1)

Where

  • T() gives the total frequency of Triangles a node n is part of in a network
  • d() is the degree of node n

Let’s see a quick example

Here, triangle count for node 3= 1(as part of 2 triangles, 4–3–5) & local clustering co-efficient = 2*1/3*2=0.33

A degree(3)=3

Let’s see which nodes have the highest clustering coefficient in our network

CALL gds.localClusteringCoefficient.stream('airline')YIELD nodeId, localClusteringCoefficientRETURN gds.util.asNode(nodeId).name AS name, localClusteringCoefficientORDER BY localClusteringCoefficient DESC

The output

════════════════════╤════════════════════════════╕
│"name" │"localClusteringCoefficient"│
╞════════════════════╪════════════════════════════╡
│"Aurangabad" │1.0 │
├────────────────────┼────────────────────────────┤
│"Bagdogra" │1.0 │
├────────────────────┼────────────────────────────┤
│"Belgaum" │1.0 │
├────────────────────┼────────────────────────────┤
│"Calicut" │1.0 │
├────────────────────┼────────────────────────────┤
│"Coimbatore" │1.0 │
├────────────────────┼────────────────────────────┤
│"Darbhanga" │1.0 │
├────────────────────┼────────────────────────────┤
│"Dibrugarh" │1.0 │
├────────────────────┼────────────────────────────┤
│"Hubli" │1.0 │
├────────────────────┼────────────────────────────┤
│"Jabalpur" │1.0 │
├────────────────────┼────────────────────────────┤
│"Kandla" │1.0 │
├────────────────────┼────────────────────────────┤
│"Kanpur" │1.0 │
├────────────────────┼────────────────────────────┤
│"Leh" │1.0 │
├────────────────────┼────────────────────────────┤
│"Madurai" │1.0 │
├────────────────────┼────────────────────────────┤
│"Pondicherry" │1.0 │
├────────────────────┼────────────────────────────┤
│"Port Blair" │1.0 │
├────────────────────┼────────────────────────────┤
│"Rajahmundry" │1.0 │
├────────────────────┼────────────────────────────┤
│"Thiruvananthapuram"│1.0 │
├────────────────────┼────────────────────────────┤
│"Tuticorin" │1.0 │
├────────────────────┼────────────────────────────┤
│"Rajkot" │1.0 │
├────────────────────┼────────────────────────────┤
│"Lilabari" │1.0 │
├────────────────────┼────────────────────────────┤
│"Silchar" │1.0 │
├────────────────────┼────────────────────────────┤
│"Mangalore" │0.9333333333333333 │
├────────────────────┼────────────────────────────┤
│"Bhopal" │0.8928571428571429 │
├────────────────────┼────────────────────────────┤
│"Udaipur" │0.8571428571428571 │
├────────────────────┼────────────────────────────┤
│"Jodhpur" │0.8333333333333334 │
├────────────────────┼────────────────────────────┤
│"Amritsar" │0.8095238095238095 │
├────────────────────┼────────────────────────────┤
│"Vijayawada" │0.8 │
├────────────────────┼────────────────────────────┤
│"Varanasi" │0.7878787878787878 │
├────────────────────┼────────────────────────────┤
│"Dehradun" │0.7619047619047619 │
├────────────────────┼────────────────────────────┤
│"Goa" │0.75 │
├────────────────────┼────────────────────────────┤
│"Surat" │0.7272727272727273 │
├────────────────────┼────────────────────────────┤
│"Visakhapatnam" │0.7142857142857143 │
├────────────────────┼────────────────────────────┤
│"Patna" │0.6952380952380952 │
├────────────────────┼────────────────────────────┤
│"Kochi" │0.6944444444444444 │
├────────────────────┼────────────────────────────┤
│"Jaipur" │0.6666666666666666 │
├────────────────────┼────────────────────────────┤
│"Tirupati" │0.6666666666666666 │
├────────────────────┼────────────────────────────┤
│"Srinagar" │0.6666666666666666 │
├────────────────────┼────────────────────────────┤
│"Pune" │0.6515151515151515 │
├────────────────────┼────────────────────────────┤
│"Guwahati" │0.6043956043956044 │
├────────────────────┼────────────────────────────┤
│"Gwalior" │0.5 │
├────────────────────┼────────────────────────────┤
│"Jammu" │0.4666666666666667 │
├────────────────────┼────────────────────────────┤
│"Ahmedabad" │0.45454545454545453 │
├────────────────────┼────────────────────────────┤
│"Kolkata" │0.383399209486166 │
├────────────────────┼────────────────────────────┤
│"Chennai" │0.3475783475783476 │
├────────────────────┼────────────────────────────┤
│"Hyderabad" │0.2752688172043011 │
├────────────────────┼────────────────────────────┤
│"Delhi" │0.2638655462184874 │
├────────────────────┼────────────────────────────┤
│"Bengaluru" │0.2531194295900178 │
├────────────────────┼────────────────────────────┤
│"Mumbai" │0.23174603174603176 │
├────────────────────┼────────────────────────────┤
│"Porbandar" │0.0 │
└────────────────────┴────────────────────────────┘

The results are indeed interesting, there are a few points to highlight

  • Some tier 2 cities have a coefficient. This may indicate 2 things
  1. They really don't have many neighbors
  2. Their neighbors are indeed well connected

The metro cities can be found in the footer of the list may be due to the fact that they have many neighbors that are mostly tier 2 cities and hence aren’t as well connected as they are

--

--