Message Passing in Graphs
In this blog post, I will discuss the message passing algorithm, which serves as the backbone for graph neural network algorithms. This algorithm facilitates information processing through the nodes of the graph, enabling the network to learn various graph attributes. Consequently, we can perform tasks such as node classification and link prediction. Let’s dive in!
Message Passing
Let’s explore message passing through the lens of a node classification task. So, what exactly is node classification?
Node Classification: Given a network with labels on some nodes, how do we assign labels to all other nodes in the network?
Example: In a network where some nodes are identified as fraudsters and others as fully trusted, the challenge is to identify additional fraudsters and trustworthy nodes based on their interactions and behaviors within the network.
Node classification typically falls under the category of semi-supervised learning. It leverages both labeled and unlabeled nodes within a graph to predict the labels of unlabeled nodes.
First, let’s discuss the intuition behind the message passing framework.
Message passing in graphs entails nodes exchanging information (messages) with their neighbors iteratively. This process updates each node’s state based on aggregated information, enabling nodes to integrate knowledge from their local neighborhoods. Exploiting these correlations across the graph supports tasks such as node classification and link prediction.
Simply put, nodes that are similar are connected in some manner. Node classification is addressed using a technique called collective classification, where labels are assigned to all nodes in a network simultaneously.
Now let’s discuss the different message passing techniques. We will focus on three of them:
- Relational classification
- Iterative classification
- Belief propagation
We will be discussing them in detail one by one. But before that, let’s discuss some important things that we will be using throughout this blog.
- Individual behaviors are correlated in the network
- Correlation: nearby nodes tend to have the same color (i.e., belong to the same class)
There are mainly two types of dependencies that lead to correlation:
- Homophily
- Influence
Let’s discuss both of them.
1. Homophily: Homophily refers to the tendency of individuals to associate and bond with others who are similar to themselves in characteristics such as beliefs, interests, or demographics.
- Example 1: Researchers who focus on the same research area are more likely to establish a connection (meeting at conferences, interacting in academic talks, etc.)
- Example 2: Online social network
— Nodes = people
— Edges = friendship
— Node color = interests (sports, arts, etc.)
2. Influence: Social connections can influence the individual characteristics of a person.
- Example: I recommend my favorite music to my friends, and eventually some of them grow to like the same genres!
How do we leverage this correlation observed in networks to help predict node labels?
Intuition: Similar nodes are typically close together or directly connected in the network
Approach:
— Guilt-by-association: If I am connected to a node with label 𝑋, then I am likely to have label 𝑋 as well.
Example:
— Malicious/benign web page: Malicious web pages link to one another to increase visibility, look credible, and rank higher in search engines.
Classification label of a node 𝑣 in network may depend on:
- Features of 𝑣
- Labels of the nodes in 𝑣’s neighborhood
- Features of the nodes in 𝑣’s neighborhood
So, how do we find the labels? Let’s understand the task first.
Given:
— Graph
— Few labeled nodes
Find: class (red/green) of remaining nodes
Main assumption: There is homophily in the network
Let’s try to solve it using a semi-supervised learning approach.
Example task from the above graph:
- Let 𝑨 be an 𝑛×𝑛 adjacency matrix over 𝑛 nodes
- Let 𝒀 ∈ {0, 1}ⁿ be a vector of labels:
- 𝑌ᵥ = 1 means node 𝑣 belongs to Class 1
- 𝑌ᵥ = 0 means node 𝑣 belongs to Class 0
- There are unlabeled nodes that need to be classified
- Goal: Predict which unlabeled nodes are likely Class 1, and which are likely Class 0
As mentioned earlier, this problem is solved by collective classification. Let’s learn how to perform collective classification.
Collective Classification
- Intuition: Simultaneous classification of interlinked nodes using correlations.
- Probabilistic framework
- Markov Assumption: The label 𝑌ᵥ of one node v depends on the labels of its neighbors 𝑁ᵥ
P(Yᵥ) = P(Yᵥ | Nᵥ)
Collective classification involves 3 steps:
- Local Classifier: Used for initial label assignment
- Predicts label based on node attributes/features
- Standard classification task
- Does not use network information
- Relational Classifier: Captures correlations
- Learns a classifier to label one node based on the labels and/or attributes of its neighbors
- This is where network information is used
- Collective Inference: Propagates the correlation
- Apply the relational classifier to each node iteratively
- Iterate until the inconsistency between neighboring labels is minimized
- Network structure affects the final prediction
So far I have covered all the basic concepts required to understand the message passing algorithm. Before going back to the three message passing algorithms mentioned above, let’s first define a problem statement that we will use as a running example.
Problem Statement
- How do we predict the labels 𝑌ᵥ for the unlabeled nodes 𝑣 (in grey)?
- Each node v has a feature vector 𝑓ᵥ
- Labels for some nodes are given (1 for green, 0 for red)
- Task: Find 𝑃(𝑌ᵥ) given all features and the network
- We focus on semi-supervised node classification.
- Intuition is based on homophily: Similar nodes are typically close together or directly connected.
Relational Classification
Basic idea: The class probability 𝑃(𝑌ᵥ) of node 𝑣 is a weighted average of the class probabilities of its neighbors.
Steps:
- For labeled nodes 𝑣, initialize label 𝑌ᵥ with ground-truth label 𝑌ᵥ*
- For unlabeled nodes, initialize 𝑌ᵥ = 0.5
- Update all nodes in a random order until convergence or until maximum number of iterations is reached
- Update for each node 𝑣 and label 𝑐 (e.g., 0 or 1):
𝑃(𝑌ᵥ = 𝑐) = ( Σ_{𝑢 ∈ 𝑁ᵥ} 𝐴ᵥ,ᵤ · 𝑃(𝑌ᵤ = 𝑐) ) / ( Σ_{𝑢 ∈ 𝑁ᵥ} 𝐴ᵥ,ᵤ )
- If edges have strength/weight information, 𝐴ᵥ,ᵤ is the edge weight between 𝑣 and 𝑢; otherwise 𝐴ᵥ,ᵤ comes from the binary adjacency matrix
- 𝑃(𝑌ᵥ = 𝑐) is the probability of node 𝑣 having label 𝑐
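The update rule above can be sketched in a few lines of Python (a minimal sketch; the function name and setup are my own, not from any particular library):

```python
import numpy as np

def relational_classification(A, labels, max_iters=100, tol=1e-4):
    """Relational classification: iteratively average neighbor probabilities.

    A      : (n, n) adjacency (or edge-weight) matrix
    labels : dict {node_index: 0 or 1} for the labeled nodes
    Returns P, where P[v] = P(Y_v = 1).
    """
    n = A.shape[0]
    P = np.full(n, 0.5)                  # unlabeled nodes start at 0.5
    for v, y in labels.items():          # labeled nodes keep ground truth
        P[v] = float(y)

    for _ in range(max_iters):
        P_old = P.copy()
        for v in np.random.permutation(n):   # random update order
            if v in labels:                  # labeled nodes stay fixed
                continue
            w = A[v]                         # weights A[v, u] to neighbors
            if w.sum() > 0:
                P[v] = (w @ P) / w.sum()     # weighted average of neighbors
        if np.abs(P - P_old).max() < tol:    # stop when scores stabilize
            break
    return P
```

With edge weights in 𝑨, this performs the weighted-average update; with a binary adjacency matrix it reduces to a plain average over the neighborhood.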
Challenges:
- Convergence is not guaranteed
- Model cannot use node feature information
Example:
Initialization:
- All labeled nodes are initialized with their ground-truth labels
- All unlabeled nodes are initialized to 0.5 (belonging to class 1 with probability 0.5)
- Let 𝑃ᵥ denote 𝑃(𝑌ᵥ = 1)
Update for the 1st Iteration:
- For node 3, 𝑁₃ = {1, 2, 4}
- For node 4, 𝑁₄ = {1, 3, 5, 6}
- For node 5, 𝑁₅ = {4, 6, 7, 8}
- After Iteration 1 (a round of updates for all unlabeled nodes)
- After Iteration 2
- After Iteration 3
- After Iteration 4
- All scores stabilize after 4 iterations. We therefore predict:
- Nodes 4, 5, 8, and 9 belong to class 1 (𝑃ᵥ > 0.5)
- Node 3 belongs to class 0 (𝑃ᵥ < 0.5)
Iterative classification
Relational classifiers do not use node attributes. How can one leverage them? Main idea of iterative classification: Classify node v based on its attributes 𝒇ᵥ as well as labels 𝒛ᵥ of neighbor set 𝑵ᵥ.
- Input: Graph
- fᵥ : feature vector for node v
- Some nodes 𝑣 are labeled with 𝑌ᵥ
- Task: Predict the labels of the unlabeled nodes
- Approach: Train two classifiers:
- 𝜙₁(𝑓ᵥ) = Predict node label based on node feature vector 𝑓ᵥ
- 𝜙₂(𝑓ᵥ, 𝑧ᵥ) = Predict label based on node feature vector 𝑓ᵥ and summary 𝑧ᵥ of the labels of 𝑣’s neighbors
Now the question is how to compute this summary. How do we compute the summary 𝒛ᵥ of the labels of 𝑣’s neighbors 𝑵ᵥ?
- Idea: 𝒛ᵥ is a vector summarizing the neighborhood labels, e.g.:
- Histogram of the number (or fraction) of each label in 𝑁ᵥ
- Most common label in 𝑁ᵥ
- Number of different labels in 𝑁ᵥ
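These summary choices could be computed with a small helper like the following (an illustrative sketch; the function name and return layout are my own):

```python
from collections import Counter

def neighbor_label_summary(neighbors, labels, classes=(0, 1)):
    """Summarize the labels of v's neighbors N_v as a feature vector z_v.

    neighbors : iterable of v's neighbor ids
    labels    : dict {node: predicted or known label}
    Returns [fraction of each class..., most common label, #distinct labels].
    """
    counts = Counter(labels[u] for u in neighbors if u in labels)
    total = sum(counts.values()) or 1
    fractions = [counts[c] / total for c in classes]        # label histogram
    mode = counts.most_common(1)[0][0] if counts else -1    # most common label
    distinct = len(counts)                                  # distinct labels
    return fractions + [mode, distinct]
```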
Now, let’s discuss the overall process of training iterative classifiers.
- Phase 1: Classify based on node attributes alone
- On a training set, train classifier (e.g., linear classifier, neural networks, …):
- 𝜙₁(𝑓ᵥ) to predict 𝑌ᵥ based on 𝑓ᵥ
- 𝜙₂(𝑓ᵥ, 𝑧ᵥ) to predict 𝑌ᵥ based on 𝑓ᵥ and summary 𝑧ᵥ of the labels of 𝑣’s neighbors
- Phase 2: Iterate till convergence
- On the test set, set labels 𝑌ᵥ based on the classifier 𝜙₁, compute 𝑧ᵥ, and predict the labels with 𝜙₂
- Repeat for each node 𝑣:
- Update 𝑧ᵥ based on 𝑌ᵤ for all 𝑢 ∈ 𝑁ᵥ
- Update 𝑌ᵥ based on the new 𝑧ᵥ (using 𝜙₂)
- Iterate until class labels stabilize or max number of iterations is reached
- Note: Convergence is not guaranteed
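The two-phase loop above might look like this in Python (a sketch under assumptions: 𝜙₁ and 𝜙₂ are already-trained classifiers passed in as plain functions, and all names are illustrative):

```python
def iterative_classification(graph, feats, labels, phi1, phi2,
                             summarize, max_iters=10):
    """Iterative classification sketch.

    graph     : dict {node: set of neighbors}
    feats     : dict {node: feature vector f_v}
    labels    : dict {node: ground-truth label}, labeled nodes only
    phi1(f)   : trained classifier on node features alone
    phi2(f,z) : trained classifier on features + neighbor-label summary z_v
    summarize : function (neighbors, current_labels) -> z_v
    """
    # Bootstrap: assign labels to unlabeled nodes with phi1
    Y = dict(labels)
    for v in graph:
        if v not in Y:
            Y[v] = phi1(feats[v])

    # Iterate until labels stabilize (convergence is not guaranteed)
    for _ in range(max_iters):
        changed = False
        for v in graph:
            if v in labels:                  # ground-truth labels stay fixed
                continue
            z = summarize(graph[v], Y)       # update z_v from current labels
            new_label = phi2(feats[v], z)    # re-classify with phi2
            if new_label != Y[v]:
                Y[v], changed = new_label, True
        if not changed:
            break
    return Y
```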
Let’s understand the above process with an example: web page classification.
- Input: Graph of web pages
- Node: Web page
- Edge: Hyper-link between web pages
- Directed edge: a page points to another page
- Node features: Webpage description
- For simplicity, we only consider two binary features
- Task: Predict the topic of the webpage
Steps:
- Baseline: train a classifier (e.g., linear classifier) to classify pages based on binary node attributes
- Each node maintains vectors 𝒛𝒗 of neighborhood labels:
- 𝐼 = Incoming neighbor label information vector
- 𝑂 = Outgoing neighbor label information vector
- E.g., 𝐼₀ = 1 if at least one of the incoming pages is labeled 0; similar definitions hold for 𝐼₁, 𝑂₀, and 𝑂₁
- On a different training set, train two classifiers:
- Node attribute vector only (green circles): 𝜙₁
- Node attribute and link vectors (red circles): 𝜙₂
- On the test set:
- Use trained node feature vector classifier 𝜙₁ to set 𝑌ᵥ
- Update 𝑧ᵥ for all nodes
- Re-classify all nodes with 𝜙₂
- Continue until convergence
- Update 𝑧ᵥ
- Update 𝑌ᵥ = 𝜙₂(𝑓ᵥ, 𝑧ᵥ)
- Stop iteration
- After convergence or when maximum iterations are reached
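The incoming/outgoing link vectors 𝐼 and 𝑂 described above could be computed with a helper like this (a sketch; the function name and argument layout are assumptions):

```python
def link_vectors(in_nbrs, out_nbrs, labels):
    """Compute the neighborhood link vector z_v = [I0, I1, O0, O1] for a page.

    in_nbrs  : ids of pages linking TO this page
    out_nbrs : ids of pages this page links to
    labels   : dict {node: current label}
    I[c] = 1 if at least one incoming page is labeled c; O likewise
    for outgoing pages.
    """
    I = [int(any(labels.get(u) == c for u in in_nbrs)) for c in (0, 1)]
    O = [int(any(labels.get(u) == c for u in out_nbrs)) for c in (0, 1)]
    return I + O
```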
And that’s how the iterative classification approach works. Now let’s discuss the last approach, known as belief propagation.
Belief Propagation
Belief propagation is a dynamic programming approach to answering probability queries in a graph (e.g., the probability of node 𝑣 belonging to class 1). It is an iterative process in which neighboring nodes “talk” to each other by passing messages. When consensus is reached, each node calculates its final belief.
Introduction:
- Task: Count the number of nodes in a graph
- Condition: Each node can only interact (pass message) with its neighbors
Example: Path Graph
Algorithm:
- Define an ordering of nodes (one that results in a path)
- Edge directions follow the order of nodes
- The edge direction defines the order of message passing
- For node 𝑖 from 1 to 6:
- Compute the message from node 𝑖 to 𝑖 + 1 (the number of nodes counted so far)
- Pass the message from node 𝑖 to 𝑖 + 1
- Condition: Each node can only interact (pass messages) with its neighbors
- Solution: Each node listens to the message 𝑚 from its neighbor, updates it, and passes it forward
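The path-graph counting algorithm is tiny when written out (a sketch; the function name is my own):

```python
def count_nodes_on_path(n):
    """Count nodes on a path graph 1-2-...-n by forward message passing.

    Each node i receives message m from node i-1 (the number of nodes
    counted so far), adds itself, and passes m + 1 on to node i + 1.
    """
    m = 0                        # no nodes counted before node 1
    for i in range(1, n + 1):    # visit nodes in path order
        m = m + 1                # node i listens, updates, passes forward
    return m                     # message arriving after the last node
```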
Generalizing to a tree
- We can perform message passing not only on a path graph, but also on a tree-structured graph.
- Define order of message passing from leaves to root.
- Update beliefs in tree structure
Question: What message will 𝑖 send to 𝑗?
- It depends on what 𝑖 hears from its neighbors
- Each neighbor passes a message to 𝑖 stating its belief about the state of 𝑖
Let’s first define some notations before going into the algorithm
- Label-label potential matrix 𝝍 : Dependency between a node and its neighbor. 𝝍(𝑌ᵢ, 𝑌ⱼ) is proportional to the probability of a node 𝑗 being in class 𝑌ⱼ given that it has neighbor 𝑖 in class 𝑌ᵢ.
- Prior belief 𝝓: 𝜙(𝑌ᵢ) is proportional to the probability of node 𝑖 being in class 𝑌ᵢ.
- 𝑚ᵢ→ⱼ(𝑌ⱼ) is 𝑖’s message / estimate of 𝑗 being in class 𝑌ⱼ.
- ℒ is the set of all classes/labels
Steps:
- Initialize all messages to 1
- Repeat for each node, sending a message to each neighbor:
𝑚ᵢ→ⱼ(𝑌ⱼ) = Σ_{𝑌ᵢ ∈ ℒ} 𝜓(𝑌ᵢ, 𝑌ⱼ) 𝜙(𝑌ᵢ) Π_{𝑘 ∈ 𝑁ᵢ \ 𝑗} 𝑚ₖ→ᵢ(𝑌ᵢ)
After convergence:
- 𝑏ᵢ(𝑌ᵢ) = node 𝑖’s belief of being in class 𝑌ᵢ:
𝑏ᵢ(𝑌ᵢ) ∝ 𝜙(𝑌ᵢ) Π_{𝑗 ∈ 𝑁ᵢ} 𝑚ⱼ→ᵢ(𝑌ᵢ)
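The full loop (message initialization, repeated updates, final beliefs) can be sketched as follows (an illustrative implementation; messages are normalized for numerical stability, and all names are my own):

```python
import numpy as np

def loopy_bp(edges, phi, psi, n_classes=2, n_iters=20):
    """(Loopy) belief propagation sketch over an undirected graph.

    edges : list of undirected edges (i, j)
    phi   : dict {i: prior belief vector over classes}, i.e. phi_i(Y_i)
    psi   : (n_classes, n_classes) label-label potential matrix
    Returns normalized beliefs b[i] for every node.
    """
    nbrs = {}
    for i, j in edges:
        nbrs.setdefault(i, set()).add(j)
        nbrs.setdefault(j, set()).add(i)

    # Step 1: initialize all messages m_{i->j}(Y_j) to 1
    msg = {(i, j): np.ones(n_classes) for i in nbrs for j in nbrs[i]}

    # Step 2: repeatedly recompute every message
    for _ in range(n_iters):
        new = {}
        for (i, j) in msg:
            # product of prior and messages into i from neighbors except j
            prod = phi[i].copy()
            for k in nbrs[i] - {j}:
                prod *= msg[(k, i)]
            # m_{i->j}(Y_j) = sum over Y_i of psi(Y_i, Y_j) * prod(Y_i)
            m = psi.T @ prod
            new[(i, j)] = m / m.sum()        # normalize for stability
        msg = new

    # After convergence: b_i(Y_i) is proportional to phi_i * incoming messages
    beliefs = {}
    for i in nbrs:
        b = phi[i].copy()
        for j in nbrs[i]:
            b *= msg[(j, i)]
        beliefs[i] = b / b.sum()
    return beliefs
```

On a tree this converges to the exact marginals; on a graph with cycles it is the loopy variant discussed below.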
Let’s understand this with an example that includes a graph with cycles. Running belief propagation on a graph with cycles is also called Loopy Belief Propagation.
Messages from different subgraphs are no longer independent!
We can still run BP, but it will pass messages around the loops.
- Beliefs may not converge
- Message 𝑚ᵤ→ᵢ(𝑌ᵢ) is based on the initial belief of 𝑖, not on separate evidence for 𝑖
- The initial belief of 𝑖 (which could be incorrect) is reinforced by the cycle 𝑖 → 𝑗 → 𝑘 → 𝑢 → 𝑖
- However, in practice, Loopy BP is still a good heuristic for complex graphs that contain many branches
Challenges:
- Messages loop around and around: 2, 4, 8, 16, 32, … The nodes become more and more convinced that these variables are T (true)!
- BP incorrectly treats a message arriving via a cycle as separate evidence that the variable is T
- It multiplies the two messages as if they were independent
- But they don’t actually come from independent parts of the graph
- One influenced the other (via a cycle)
Advantages:
- Easy to program and parallelize
- General: can be applied to any graph model with any form of potentials
- Potentials can be higher order, e.g. 𝝍(𝑌ᵢ, 𝑌ⱼ, 𝑌ₖ, 𝑌ᵥ, …)
Issues:
- Convergence is not guaranteed (when to stop?), especially if there are many closed loops
- Potential functions (parameters) require training to estimate
That’s all for the message passing algorithm. Next, we will discuss graph neural networks in an upcoming blog. Stay tuned!