<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Dhaval Taunk on Medium]]></title>
        <description><![CDATA[Stories by Dhaval Taunk on Medium]]></description>
        <link>https://medium.com/@taunkdhaval08?source=rss-a8e9132e2784------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*8XaMg4MXuvkqy7HU4Rm51g.jpeg</url>
            <title>Stories by Dhaval Taunk on Medium</title>
            <link>https://medium.com/@taunkdhaval08?source=rss-a8e9132e2784------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 15 May 2026 15:51:39 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@taunkdhaval08/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Message Passing in Graphs]]></title>
            <link>https://medium.com/analytics-vidhya/message-passing-in-graphs-0e0a7b12b52a?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/0e0a7b12b52a</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[graph]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[graph-neural-networks]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Thu, 04 Jul 2024 18:20:02 GMT</pubDate>
            <atom:updated>2024-07-04T18:20:02.077Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dFYvNVPeUAQ8p2gICh6SGQ.png" /></figure><p>In this blog post, I will discuss the message passing algorithm, which serves as the backbone for graph neural network algorithms. This algorithm facilitates information processing through the nodes of the graph, enabling the network to learn various graph attributes. Consequently, we can perform tasks such as node classification and link prediction. Let’s dive in!</p><h3>Message Passing</h3><p>Let’s explore message passing through the lens of a node classification task. So, what exactly is node classification?</p><blockquote><strong>Node Classification</strong>: Given a network with labels on some nodes, how do we assign labels to all other nodes in the network?</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3vIt7bET8FUJJcxKYSP6GA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>Example</strong>: In a network where some nodes are identified as fraudsters and others as fully trusted, the challenge is to identify additional fraudsters and trustworthy nodes based on their interactions and behaviors within the network?</p><blockquote>Node classification typically falls under the category of semi-supervised learning. It leverages both labeled and unlabeled nodes within a graph to predict the labels of unlabeled nodes.</blockquote><p>First let’s discuss the intuition behind message passing framework.</p><p>Message passing in graphs entails nodes exchanging information (messages) with their neighbors iteratively. This process updates each node’s state based on aggregated information, enabling nodes to integrate knowledge from their local neighborhoods. This correlation in graph networks supports tasks such as node classification or link prediction.</p><p>Simply put, nodes that are similar are connected in some manner. Node classification is addressed using a technique called collective classification, where labels are assigned to all nodes in a network simultaneously.</p><p>Now let’s discuss different techniques of message passing algorithms.</p><p>We will be discussing majorly 3 different message passing techniques.</p><ol><li>Relational classification</li><li>Iterative classification</li><li>Belief propagation</li></ol><p>We will be discussing them in detail one by one. 
But before that, let’s discuss some important things that we will be using throughout this blog.</p><ul><li>Individual behaviors are <strong>correlated</strong> in the network</li><li><strong>Correlation:</strong> nearby nodes have the same color (belonging to the same class)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/684/1*C489e2SL8u6Plym3UuMNog.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>There are mainly two types of dependencies that lead to correlation:</p><ol><li>Homophily</li><li>Influence</li></ol><p>Let’s discuss both of them.</p><ol><li><strong>HomoPhily</strong>: Homophily refers to the tendency of individuals to associate and bond with others who are similar to themselves in characteristics such as beliefs, interests, or demographics.<br> - <strong>Example 1</strong>: Researchers who focus on the same research area are more likely to establish a connection (meeting at conferences, interacting in academic talks, etc.)<br> - <strong>Example 2</strong>: Online social network<br> — Nodes = people<br> — Edges = friendship<br> — Node color = interests (sports, arts, etc.)</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/912/1*HQ-zx3VOPRDL6YC5E8_drA.png" /><figcaption>People with the same interest are more closely connected due to homophily — Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>2. Influence</strong>: Social connections can influence the individual characteristics of a person.<br>- <strong>Example</strong>: I recommend my musical preferences to my friends, until one of them grows to like my same favorite genres!</p><p>How do we leverage this correlation observed in networks to help predict node labels?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M7bgusi-UujO1hdzF3O_YQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>Intuition: </strong>Similar nodes are typically close together or directly connected in the network</p><p><strong>Approach:</strong> <br> — <strong>Guilt-by-association: </strong>If I am connected to a node with label 𝑋, then I am likely to have label 𝑋 as well.</p><p><strong>Example</strong>: <br> —<strong> Malicious/benign web page: </strong>Malicious web pages link to one another to increase visibility, look credible, and rank higher in search engines.</p><p>Classification label of a node <strong>𝑣</strong> in network may depend on:</p><ul><li>Features of <strong>𝑣</strong></li><li>Labels of the nodes in <strong>𝑣</strong>’s neighborhood</li><li>Features of the nodes in <strong>𝑣</strong>’s neighborhood</li></ul><p>So, how to find label? 
Let’s understand the task first.</p><p><strong>Given</strong>:<br> — Graph<br> — Few labeled nodes</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/674/1*XMQ3yUHoARwZlIvJ33eIKw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>Find:</strong> class (red/green) of remaining nodes</p><p><strong>Main assumption</strong>: There is homophily in the network</p><p>Let’s try to solve it using semi-supervised learning approach.</p><p>Example task from above graph:</p><ol><li>Let 𝑨 be a 𝑛×𝑛 adjacency matrix over 𝑛 nodes</li><li>Let Y = {0, 1}ⁿ be a vector of labels:<br>- Y<em>ᵥ</em> = 1 belongs to Class 1<br>- Y<em>ᵥ</em> = 0 belongs to Class 0<br>- There are unlabeled node needs to be classified</li><li><strong>Goal</strong>: Predict which unlabeled nodes are likely <strong>Class 1</strong>, and which are likely <strong>Class 0</strong></li></ol><p>As mentioned earlier, this problem is solved by <strong>collective classification. </strong>Let’s learn how to perform collective classification.</p><h4>Collective Classification</h4><ol><li><strong>Intuition:</strong> Simultaneous classification of interlinked nodes using correlations.</li><li>Probabilistic framework</li><li><strong>Markov Assumption:</strong> The label 𝑌<em>ᵥ</em> of one node <em>v</em> depends on the labels of its neighbors 𝑁<em>ᵥ</em></li></ol><blockquote>P(Yᵥ) = P(Yᵥ | Nᵥ)</blockquote><p>Collective classification involves 3 steps:</p><ol><li><strong>Local Classifier: </strong>Used for initial label assignment<br>- Predicts label based on node attributes/features<br>- Standard classification task<br>- Does not use network information</li><li><strong>Relational Classifier: </strong>Capture correlations<br>- Learns a classifier to label one node based on the labels and/or attributes of its neighbors<br>- This is where network information is used</li><li><strong>Collective Inference: </strong>Propagate the correlation<br>- Apply relational classifier to each node iteratively<br>- Iterate until the inconsistency between neighboring labels is minimized<br>- Network structure affects the final prediction</li></ol><p>Till now I have covered all the basic concepts required to understand the <strong>message passing</strong> algorithm. 
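To make these three steps more concrete, here is a rough, library-agnostic sketch of the collective classification loop. The feature dictionary <em>X</em>, the label dictionary <em>y_known</em>, and the simple majority vote standing in for the relational classifier are illustrative assumptions, not part of any particular framework.</p><pre>import networkx as nx<br>from sklearn.linear_model import LogisticRegression<br><br>def collective_classification(G, X, y_known, n_iters=10):<br>    # Step 1: local classifier - initial labels from node features alone<br>    local_clf = LogisticRegression().fit([X[v] for v in y_known], list(y_known.values()))<br>    labels = {}<br>    for v in G.nodes():<br>        labels[v] = y_known[v] if v in y_known else local_clf.predict([X[v]])[0]<br>    # Steps 2 and 3: relational classifier + collective inference<br>    for _ in range(n_iters):<br>        for v in G.nodes():<br>            if v in y_known:<br>                continue<br>            nbr = [labels[u] for u in G.neighbors(v)]<br>            # a majority vote over neighbor labels plays the role of the relational classifier<br>            if nbr:<br>                labels[v] = max(set(nbr), key=nbr.count)<br>    return labels</pre><p>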
Before going back to 3 message passing algorithms I mentioned above, let’s first define a problem statement that we will take as example to understand message passing.</p><p><strong>Problem Statement</strong></p><ol><li>How to predict the labels 𝑌<em>ᵥ</em> for the unlabeled nodes <em>v</em>(in grey color)?</li><li>Each node <em>v </em>has a feature vector 𝑓<em>ᵥ</em></li><li>Labels for some nodes are given (1 for green, 0 for red)</li><li><strong>Task</strong>: Find 𝑃(𝑌<em>ᵥ</em>) given all features and the network</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1Vd7NAAUTpaKnXGJpbg45A.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ol><li>We focus on semi-supervised node classification.</li><li>Intuition is based on <strong>homophily</strong>: Similar nodes are typically close together or directly connected.</li></ol><h4>Relational Classification</h4><p><strong>Basic idea</strong>: Class probability 𝑌<em>ᵥ</em> of node 𝑣 is a weighted average of class probabilities of its neighbors.</p><p><strong>Steps:</strong></p><ul><li>For labeled nodes 𝑣, initialize label 𝑌<em>ᵥ</em> with ground-truth label 𝑌<em>ᵥ*</em></li><li>For unlabeled nodes, initialize 𝑌<em>ᵥ</em> = 0.5</li><li>Update all nodes in a random order until convergence or until maximum number of iterations is reached</li><li>Update for each node 𝑣 and label 𝑐 (e.g. 0 or 1)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FouG0JCgahbt0WweW1WNCQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>If edges have strength/weight information, 𝐴<em>ᵥ,ᵤ</em> can be the edge weight between <em>v </em>and <em>u</em></li><li>𝑃(𝑌<em>ᵥ</em> = 𝑐) is the probability of node <em>v </em>having label <em>c</em></li></ul><p><strong>Challenges:</strong></p><ul><li>Convergence is not guaranteed</li><li>Model cannot use node feature information</li></ul><p><strong>Example:</strong></p><p><strong>Initialization:</strong></p><ul><li>All labeled nodes with their labels</li><li>All unlabeled nodes 0.5 (belonging to class 1 with probability 0.5)</li><li>Let 𝑃ᵧ₁ = 𝑃(𝑌₁ = 1)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tug6-vjRUGlRjAZkkTvtdg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>Update for the 1st Iteration:</strong></p><ul><li>For node 3, 𝑁₃ = {1, 2, 4}</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-NmkVJMFFpjnpDFfyG69Mg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>For node 4, 𝑁₄ = {1, 3, 5, 6}</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lnQ7Nry6N2ik0Do705A_9g.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>For node 5, 𝑁₅ = {4, 6, 7, 8}</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eHXevYT_-z_ktRoAw_roxg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>After Iteration 1 (a round of updates for all 
unlabeled nodes)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Q0fR5QEhzeuZlofKabt_8w.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>After Iteration 2</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fXAAwUbReXhsBCKOlCETnw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>After Iteration 3</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AEe3_K-JmuLUiHYxxvSyKA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>After Iteration 4</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*o23RPCx4FYsIwlXFRyQeXg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>All scores stabilize after 4 iterations. We therefore predict:<br>- Nodes 4, 5, 8, 9 belong to class 1 (𝑃<em>ᵧᵥ</em> &gt; 0.5)<br>- Nodes 3 belong to class 0 (𝑃<em>ᵧᵥ</em> &lt; 0.5)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Kjqq8bAYw4ECLgHqA0OEeA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><h4>Iterative classification</h4><p>Relational classifiers do not use node attributes. How can one leverage them? Main idea of iterative classification: Classify node v based on its attributes 𝒇<strong><em>ᵥ</em></strong> as well as labels <em>𝒛</em><strong><em>ᵥ</em></strong> of neighbor set 𝑵<strong><em>ᵥ.</em></strong></p><ul><li>Input: Graph<br>- <em>f</em><strong><em>ᵥ</em></strong> : feature vector for node <em>v</em><br>- Some nodes 𝑣 are labeled with 𝑌<strong><em>ᵥ</em></strong></li><li>Task: Predict label of unlabeled nodes</li><li>Approach: Train two classifiers:</li><li>𝜙%(𝑓<strong><em>ᵥ</em></strong>) = Predict node label based on node feature vector 𝑓<strong><em>ᵥ</em></strong></li><li>𝜙’ (𝑓<strong><em>ᵥ</em></strong>, 𝑧<strong><em>ᵥ)</em></strong> = Predict label based on node feature vector 𝑓<strong><em>ᵥ</em></strong> and summary 𝑧<strong><em>ᵥ</em></strong> of labels of <em>v</em>’s neighbors.</li></ul><p>Now the question comes about computing summary. 
How do we compute the summary <strong>𝒛<em>ᵥ</em></strong> of labels of <strong><em>v</em>’s</strong> neighbors 𝑵<strong><em>ᵥ</em></strong>?</p><ul><li>Idea: <strong>𝒛<em>ᵥ</em> = vector<br>- </strong>Histogram of the number (or fraction) of each label in 𝑁<strong><em>ᵥ</em></strong><br>- Most common label in 𝑁<strong><em>ᵥ</em></strong><br>- Number of different labels in 𝑁<strong><em>ᵥ</em></strong></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PaY6BlNE2DcJvfvrOlDJKg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Now, let’s discuss the overall process of training iterative classifiers.</p><ol><li><strong>Phase 1</strong>: Classify based on node attributes alone<br>- On a training set, train classifier (e.g., linear classifier, neural networks, …):<br>- 𝜙₁(𝑓<strong><em>ᵥ</em></strong>) to predict 𝑌<strong><em>ᵥ</em></strong> based on 𝑓<strong><em>ᵥ</em></strong><br>- 𝜙₂<em>(f</em><strong><em>ᵥ</em></strong><em>, 𝑧</em><strong><em>ᵥ)</em></strong> to predict 𝑌<strong><em>ᵥ</em></strong> based on 𝑓<strong><em>ᵥ</em></strong> and summary 𝑧<strong><em>ᵥ</em></strong> of labels of <em>v</em>’s neighbors</li><li>Phase 2: Iterate till convergence<br>- On test set, set labels 𝑌<strong><em>ᵥ</em></strong> based on the classifier 𝜙₁, compute 𝑧<strong><em>ᵥ</em></strong> and predict the labels with 𝜙₂<br>- Repeat for each node 𝑣<br>- Update 𝑧<strong><em>ᵥ</em></strong> based on 𝑌<strong><em>ᵥ</em></strong> for all 𝑢 ∈ 𝑁<strong><em>ᵥ</em></strong><br>- Update 𝑌<strong><em>ᵥ</em></strong> based on the new 𝑧<strong><em>ᵥ</em></strong> (𝜙₂)<br>- Iterate until class labels stabilize or max number of iterations is reached<br>- Note: Convergence is not guaranteed</li></ol><p>Let’s understand the above process with an example. We will take example of web-page classification for this.</p><ul><li><strong>Input:</strong> Graph of web pages</li><li><strong>Node:</strong> Web page</li><li><strong>Edge:</strong> Hyper-link between web pages</li><li><strong>Directed edge:</strong> a page points to another page</li><li><strong>Node features:</strong> Webpage description<br>- For simplicity, we only consider 2 binary features</li><li><strong>Task:</strong> Predict the topic of the webpage</li></ul><p><strong>Steps:</strong></p><ul><li><strong>Baseline</strong>: train a classifier (e.g., linear classifier) to classify pages based on binary node attributes</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CNgC8q1xUnow8_VSWO6WqQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>Each node maintains vectors 𝒛𝒗 of neighborhood labels:<br> - 𝐼 = Incoming neighbor label information vector<br> - 𝑂 = Outgoing neighbor label information vector</li><li><em>I₀ </em>= 1 if at least one of the incoming pages is labelled 0. 
Similar definitions for <em>𝐼₁, 𝑂₀</em>, and <em>𝑂₁</em></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dY3m65BkmHbMOWKK6R4_-Q.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>On a different training set, train two classifiers:<br> - Node attribute vector only (green circles): 𝜙<em>₁</em><br> - Node attribute and link vectors (red circles): 𝜙₂</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_0m8a1FCj6Uj_WD8wpxCJA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>On the test set:<br> - Use trained node feature vector classifier 𝜙<em>₁ </em>to set 𝑌<strong><em>ᵥ</em></strong></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vPRKqgsaInwUQvw8Vah5Ww.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*N6rLpzcIruY0ljLGCNo1Cg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>Update 𝑧<strong><em>ᵥ</em></strong> for all nodes</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PF407pdjofZLVW4PJRQXAA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>Re-classify all nodes with 𝜙₂</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zwNbtCbOKFLpQLQuLy1Bag.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>Continue until convergence<br> - Update 𝑧<strong><em>ᵥ</em></strong><br> - Update 𝑌<strong><em>ᵥ </em></strong>= 𝜙₂(𝑓<strong><em>ᵥ</em></strong>, 𝑧<strong><em>ᵥ</em></strong>)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*m39yw_a2IO6dNBTFGvpEXA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>Stop iteration<br> - After convergence or when maximum iterations are reached</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*x3I219PvNxUe2tyxSVMS9Q.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>And that’s how you perform iterative classification approach. Now let’s discuss the last approach known as <strong>Belief Propagation.</strong></p><h4><strong>Belief Propagation</strong></h4><p>Belief Propagation is a dynamic programming approach to answering probability queries in a graph (e.g. probability of node <em>v </em>belonging to class 1). Iterative process in which neighbor nodes “talk” to each other, passing messages. 
When consensus is reached, calculate final belief.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VmHSx3LxH06vuQJg4PDJnw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>Introduction:</strong></p><ul><li>Task: Count the number of nodes in a graph*</li><li>Condition: Each node can only interact (pass message) with its neighbors</li></ul><p><strong>Example: </strong>Path Graph</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-9XVavD8EkMYbd681WEUWw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>Algorithm:</strong></p><ul><li>Define an ordering of nodes (that results in a path)</li><li>Edge directions are according to order of nodes<br> - Edge direction defines the order of message passing</li><li>For node 𝑖 from 1 to 6<br> - Compute the message from node 𝑖 to 𝑖 + 1 (number of nodes counted so far)<br> - Pass the message from node 𝑖 to 𝑖 + 1</li><li><strong>Condition</strong>: Each node can only interact (pass message) with its neighbors</li><li><strong>Solution</strong>: Each node listens to the message from its neighbor, updates it, and passes it forward <em>m</em>: the message</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ENzttCyIYsCuCrLKCsgKLQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>Generalizing to a tree</strong></p><ul><li>We can perform message passing not only on a path graph, but also on a tree-structured graph.</li><li>Define order of message passing from leaves to root.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/1*I4dW3h2xQbflffNKfeeVvg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>Update beliefs in tree structure</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pKCMWac4CGwQz3hqneo65A.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hztUJLeC2L7vEp9ex0OzWQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>Question:</strong> What message will 𝑖 send to 𝑗?</p><ul><li>It depends on what 𝑖 hears from its neighbors</li><li>Each neighbor passes a message to 𝑖 its beliefs of the state of 𝑖</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/408/1*3M42j0lchGmj61elfPzjTQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Let’s first define some notations before going into the algorithm</p><ul><li><strong>Label-label potential matrix 𝝍</strong> : Dependency between a node and its neighbor. 
𝝍(𝑌ᵢ, 𝑌ⱼ) is proportional to the probability of a node 𝑗 being in class 𝑌ⱼ given that it has neighbor 𝑖 in class 𝑌ᵢ.</li><li><strong>Prior belief 𝝓</strong>: 𝜙(𝑌ᵢ) is proportional to the probability of node 𝑖 being in class 𝑌ᵢ.</li><li>𝑚ᵢ →ⱼ(𝑌ⱼ) is 𝑖’s message / estimate of 𝑗 being in class 𝑌ⱼ.</li><li>ℒ is the set of all classes/labels</li></ul><p><strong>Steps:</strong></p><ul><li>Initialize all messages to 1</li><li>Repeat for each node:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/562/1*5bvc7mtxBOQY9w2foukABA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*soU87mDN_7dx_EvBDjWxpg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>After convergence:</strong></p><ul><li>𝑏ᵢ(𝑌ᵢ) = node 𝑖’s belief of being in class 𝑌ᵢ</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vQ6k6-9yGGeSVdIyDBHFWQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Let’s understand this with an example. We will include a graph with cycles as well. This process of using Belief Propagation in cycled graph is also called as <strong>Loopy Belief Propagation.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/598/1*qbNOJfolZAKJOT8GZlmIaQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><blockquote>Messages from different subgraphs are no longer independent!</blockquote><blockquote>But we can still run BP, but it will pass messages in loops.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/744/1*PPLoxd6FgSIF7aJLpG1iFQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>Beliefs may not converge<br>- Message 𝑚ᵤ → ᵢ (𝑌ᵢ) is based on initial belief of 𝑖, not a separate evidence for 𝑖<br>- The initial belief of 𝑖 (which could be incorrect) is reinforced by the cycle <em>i </em>→ 𝑗 → 𝑘 → 𝑢 → 𝑖</li><li>However, in practice, Loopy BP is still a good heuristic for complex graphs which contain many branches.</li></ul><p><strong>Challenges:</strong></p><ul><li><strong>Messages loop around and around:</strong> 2, 4, 8, 16, 32, … More and more convinced that these variables are T!</li><li>BP incorrectly treats this message as <strong>separate evidence</strong> that the variable<br>is T (true).</li><li>Multiplies these two messages as if they were <strong>independent</strong>.<br> - But they don’t actually come from independent parts of the graph.<br> - One influenced the other (via a cycle).</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/884/1*WOiMbMxh3LUbgOFF2jzJ9A.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li><strong>Advantages:<br> - </strong>Easy to program &amp; parallelize<br> - General: can apply to any graph model with any form of potentials<br> - Potential can be higher order: e.g. 
𝝍(𝑌ᵢ, 𝑌ⱼ, 𝑌ₖ, 𝑌ᵥ … )</li><li><strong>Issues</strong>:<br> - Convergence is not guaranteed (when to stop), especially if there are many closed loops</li><li><strong>Potential functions (parameters):</strong><br> - Require training to estimate</li></ul><p>That’s all for the message passing algorithm. Next, I will discuss graph neural networks in an upcoming blog post. Stay tuned for that!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=0e0a7b12b52a" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/message-passing-in-graphs-0e0a7b12b52a">Message Passing in Graphs</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Graph Representational Learning: Creating node and graph embeddings — Part 2]]></title>
            <link>https://medium.com/analytics-vidhya/graph-representational-learning-creating-node-and-graph-embeddings-part-2-33817c5ce7f3?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/33817c5ce7f3</guid>
            <category><![CDATA[graph]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[graph-neural-networks]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Mon, 01 Jul 2024 15:15:04 GMT</pubDate>
            <atom:updated>2024-07-01T15:15:04.595Z</atom:updated>
            <content:encoded><![CDATA[<h3>Graph Representational Learning: Creating node and graph embeddings — Part 2</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*UeJu9Rg0wvdFBzJ1.jpeg" /></figure><p>In the previous blog post, I covered various techniques for node-level and graph-level embeddings, explaining their intuition and training methods. Now, we’ll delve into coding some of these techniques in Python. Let’s begin!</p><h3>1. Node2Vec</h3><p>We’ll begin with Node2Vec. We’ll use NetworkX to create a random graph, then train the <strong>Node2Vec</strong> algorithm, which generates random walks over the graph and feeds them to <strong>Word2Vec</strong> from the <strong>gensim</strong> package.</p><ol><li>Install the necessary packages</li></ol><pre>pip install networkx node2vec</pre><p>2. Next, we create the input graph using the NetworkX package.</p><pre>import networkx as nx<br><br>G = nx.fast_gnp_random_graph(n=100, p=0.5)</pre><p>The above code creates a graph with 100 nodes. The parameter <em>p </em>defines the probability of two nodes being connected to each other. Therefore, this graph won’t contain every possible edge. Instead, it should have approximately half the edges of a complete graph on 100 nodes. You can adjust the <em>p</em> value as desired.</p><p>3. Now, we will initialize a Node2Vec class that takes the generated graph as input and generates random walks over the graph.</p><pre>from node2vec import Node2Vec<br><br>node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=200, workers=4)</pre><p>In the above code, you can see that I am creating random walks of length 30, with 200 walks starting from each node. The embedding size is set to 64 in this case. Feel free to adjust these parameters according to your specific use-case.</p><p>4. Next, let’s fit the created graph and train the model on it.</p><pre>model = node2vec.fit(window=10, min_count=1, batch_words=4)</pre><p>5. The next step will involve saving the trained model in order to extract embeddings from it.</p><pre>model.wv.save_word2vec_format(&quot;embeddings_node2vec.txt&quot;)</pre><p>6. To extract embeddings from the model, you can use the following code:</p><pre>embeddings = {str(node): model.wv[str(node)] for node in G.nodes()}</pre><p>Now feel free to experiment with it as you like.</p><h3>DeepWalk</h3><p>The second algorithm we’ll discuss is DeepWalk. The coding approach remains mostly the same as Node2Vec. The difference lies in the walking strategy: DeepWalk uses plain, unbiased random walks, whereas Node2Vec uses the biased random walks discussed in the previous post.</p><ol><li>Installing packages</li></ol><pre>pip install networkx karateclub</pre><p>2. Importing required packages</p><pre>from karateclub import DeepWalk<br>import networkx as nx</pre><p>3. Creating a graph</p><pre>G = nx.fast_gnp_random_graph(n=100, p=0.5)</pre><p>4. Initialize the DeepWalk class</p><pre>model = DeepWalk(dimensions=64, walk_length=30, walk_number=200, workers=4)</pre><p>5. Fitting the model</p><pre>model.fit(G)</pre><p>6. Get the embeddings</p><pre>embeddings = model.get_embedding()</pre><p>So that’s how you create DeepWalk embeddings. Feel free to experiment with this approach.</p><h3>Graph2Vec</h3><p>The last algorithm I’m going to discuss is Graph2Vec. It differs slightly from the previous two algorithms because it creates graph-level embeddings instead of node-level embeddings.</p><ol><li>Installing required packages</li></ol><pre>pip install karateclub networkx</pre><p>2.
Importing the packages</p><pre>import networkx as nx<br>from karateclub import Graph2Vec<br>import os<br><br>os.makedirs(&#39;graphs&#39;, exist_ok=True)</pre><p>3. Creating the graph</p><pre>for i in range(5):<br>    G = nx.fast_gnp_random_graph(n=10 + i, p=0.5)<br>    nx.write_gml(G, f&#39;graphs/graph_{i}.gml&#39;)</pre><p>4. Creating a list of graphs for training</p><pre>graphs = []<br>for i in range(5):<br>    G = nx.read_gml(f&#39;graphs/graph_{i}.gml&#39;)<br>    graphs.append(G)</pre><p>5. Fitting the graph using Graph2Vec algorithm</p><pre>model = Graph2Vec(dimensions=64, wl_iterations=2, attributed=False)<br>model.fit(graphs)</pre><p>6. Extracting the embeddings</p><pre>embeddings = model.get_embedding()<br><br>for idx, embedding in enumerate(embeddings):<br>    print(f&#39;Embedding for graph_{idx}: {embedding}&#39;)</pre><p>So that’s how you learn Graph2Vec embeddings.</p><p>For now, that’s it from my side. I hope you enjoyed this blog. Stay tuned for more important topics in the next one.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=33817c5ce7f3" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/graph-representational-learning-creating-node-and-graph-embeddings-part-2-33817c5ce7f3">Graph Representational Learning: Creating node and graph embeddings — Part 2</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Graph Representational Learning: Creating node and graph embeddings]]></title>
            <link>https://medium.com/analytics-vidhya/graph-representational-learning-creating-node-and-graph-embeddings-25915b5a5aca?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/25915b5a5aca</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[graph-representation]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[graph]]></category>
            <category><![CDATA[graph-neural-networks]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Thu, 20 Jun 2024 17:08:34 GMT</pubDate>
            <atom:updated>2024-06-20T17:09:09.263Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AaHGu6bc6rpmQ5fjfnLrZQ.jpeg" /></figure><p>In the previous blog post, we covered traditional methods for extracting features from input graphs. In this post, we’ll delve into various approaches for generating node and graph-level embeddings. This includes techniques such as DeepWalk and Node2Vec for node embeddings, as well as Anonymous Walk for graph-level embeddings. Let’s dive in!</p><h3>Node Embedding</h3><p>The primary challenge with traditional hand-crafted features lies in their time-consuming creation process and their task-specific nature. Consequently, there is a pressing need for a more efficient method to encode input features. What are our next steps?</p><p>In natural language processing, concepts like learned embeddings such as Word2Vec are well-established. We can apply a similar approach here. Let’s delve into this further.</p><h4>1. Encoder — Decoder Approach</h4><p>The primary objective of this approach is to generate node-level embeddings in such a way that nodes with similar characteristics are represented by embeddings that are closer to each other, while embeddings of dissimilar nodes are farther apart.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pGoOf14dFKSpUZUSyx8JCA.png" /><figcaption>Perozzi et al. DeepWalk: Online Learning of Social Representations. KDD 2014</figcaption></figure><p>Let’s explore the process to achieve this. Assume we have a graph G with:</p><ul><li>V as the vertex set,</li><li>A as the adjacency matrix.</li></ul><p>The objective is to encode nodes in such a way that their similarity in the embedding space (e.g., measured by dot product) reflects their similarity in the graph structure.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OqFJb1SzHnNDezRILq4nJw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Process:</p><ol><li>The encoder ENC maps nodes to embeddings.</li><li>Define a node similarity function, which measures similarity in the original network.</li><li>The decoder DEC​ maps embeddings to similarity scores.</li><li>Optimize the parameters of the encoder to ensure embeddings reflect the node similarities captured by the decoder.</li></ol><blockquote><em>similarity</em>(u, v) ≈ <em>zᵤᵀzᵥ</em></blockquote><p>Let’s clarify the roles of the encoder and the similarity function mentioned earlier:</p><p><strong>Encoder</strong>: The encoder’s role is to transform each node into a low-dimensional vector representation, known as an embedding.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/966/1*rKRwsgD_v2_G2u97QnQB6Q.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Similarity Function: Specifies how the relationships in the vector space correspond to those in the original network. 
It plays a key role in the decoder section of the approach.</p><p>As for initializing these embeddings, a common approach is:</p><p><strong>Shallow Encodings:</strong> A straightforward approach is to treat the encoder as an embedding lookup.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-9V-rCOYdzneKBXR-CjhiQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VFBT8FDeyH_Aa54rv5lBqQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Each node is assigned a unique embedding vector, meaning we directly optimize the embedding for each individual node.</p><h4>2. Random Walk Approach</h4><p>Given a graph and a starting point, we randomly select a neighbor of the starting point and move to this neighbor. We continue this process by selecting random neighbors of the current point and moving accordingly. This sequence of points visited in this manner constitutes a random walk on the graph.</p><blockquote><em>zᵤᵀzᵥ </em>≈ probability that u and v co-occur on a random walk over the graph</blockquote><p><strong>Steps:</strong></p><ul><li>Estimate the probability of visiting node <strong><em>v</em></strong><em> </em>on a random walk starting from node <strong><em>u </em></strong>using a specific random walk strategy <strong>R</strong>.</li><li>Optimize embeddings to encode these statistics derived from random walks.</li></ul><p>We will explore how to optimize the embeddings. Before that, let’s delve into why we employ the Random Walk approach.</p><ol><li><strong>Expressivity:</strong> This method offers a flexible stochastic definition of node similarity, integrating both local and higher-order neighborhood details. The idea is that if a random walk originating from node 𝒖 frequently visits 𝒗, then 𝒖 and 𝒗 are deemed similar, leveraging high-order multi-hop information.</li><li><strong>Efficiency:</strong> By focusing solely on pairs of nodes that co-occur during random walks, we eliminate the need to consider all possible node pairs during training.</li></ol><p>Next, let’s explore the process of learning the embeddings. 
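Since everything that follows is driven by sampled walks, here is a minimal sketch of how uniform, fixed-length random walks can be generated with NetworkX; the helper name and the choice of the karate club graph are purely illustrative.</p><pre>import random<br>import networkx as nx<br><br>def random_walk(G, start, walk_length=10):<br>    # uniform random walk: at every step, move to a uniformly chosen neighbor<br>    walk = [start]<br>    for _ in range(walk_length - 1):<br>        neighbors = list(G.neighbors(walk[-1]))<br>        if not neighbors:  # dead end (possible in directed graphs)<br>            break<br>        walk.append(random.choice(neighbors))<br>    return walk<br><br>G = nx.karate_club_graph()<br>walks = [random_walk(G, node) for node in G.nodes() for _ in range(5)]</pre><p>These sampled walks define the random walk neighborhoods used in the objective below.</p><p>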
Clearly, from its description, this process is an unsupervised feature learning procedure.</p><ol><li><strong>Intuition:</strong> The goal is to find embeddings of nodes in a d-dimensional space that preserve their similarities.</li><li><strong>Idea:</strong> The objective is to learn node embeddings such that nodes that are close in the network are also close in the embedding space.</li><li>For a given node 𝑢, the notion of nearby nodes is defined by 𝑁ᵣ(u), the neighborhood of 𝑢 obtained through some random walk strategy 𝑟.</li><li>Given a graph 𝐺 = (𝑉, 𝐸), our aim is to learn a mapping 𝑓: 𝑢 → ℝᴰ, where 𝑓₍ᵤ₎ = 𝐳ᵤ, representing the embedding vector of node 𝑢.</li></ol><p>The objective function is structured as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eZe9CuzL3c7uFWmVe98AQA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Given node 𝑢, our aim is to learn feature representations that predict nodes within its random walk neighborhood 𝑁ᵣ(u).</p><p><strong>Optimization Steps:</strong></p><ol><li>Conduct short fixed-length random walks starting from each node 𝑢 in the graph using a specified random walk strategy R.</li><li>For each node <strong>𝑢,</strong> gather <strong>𝑁ᵣ(𝑢)</strong>, which represents the multiset* of nodes visited during random walks initiated from <strong>𝑢</strong>.</li><li>Optimize embeddings based on the following principle: Given a node <strong>𝑢</strong>, predict its neighbors <strong>𝑁ᵣ(u)</strong>.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uJMf0pZYxF_lDQbhtdK4Fw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>which is equivalent to:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Dh5JvcPVQp9geInZxdyicQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Intuition: We optimize embeddings 𝑧ᵤ to maximize the likelihood of co-occurrences in random walks.</p><p>We parameterize 𝑃(𝑣|𝐳<strong><em>ᵤ</em></strong>) using softmax.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xXZps3_DinVNQLNkYBp34Q.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Why softmax? We aim for node 𝑣 to be most similar to node 𝑢 among all nodes 𝑛.</p><p>Intuition: ∑ᵢ exp(xᵢ) ≈ maxᵢ exp(𝑥ᵢ)</p><p>Now, let’s combine all these insights. The overall equation looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RolMieOVIFS9XGkvEBFQcw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Therefore, we can conclude that:</p><blockquote>Optimizing random walk embeddings involves finding embeddings 𝐳ᵤ that minimize L.</blockquote><p>However, when dealing with graphs of immense size, optimizing with softmax, which includes a costly normalizing term, becomes impractical. So, what is the solution for this?</p><p>This is where Negative Sampling comes into play. 
First, let’s explain what negative sampling is, and then we’ll explore how to apply it for our purposes.</p><p><strong>Negative Sampling</strong>: Negative sampling is a method in machine learning, often used in neural networks like word embeddings, where instead of learning from all possible examples, the model trains on a subset of “negative” examples alongside positive ones to improve efficiency and performance.</p><p>Now, let’s discuss, how to use it in our case.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aF08BDwdCf6qrS0gWt_H-w.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Sample 𝑘 negative nodes, each selected with a probability proportional to its degree. There are two considerations regarding 𝑘 (# of negative samples):</p><ul><li>A higher 𝑘 provides more robust estimates.</li><li>However, a higher 𝑘 also introduces a higher bias towards negative events. In practice, 𝑘 typically ranges from 5 to 20.</li></ul><p>After formulating the objective function, the next step is to optimize (minimize) it. This is usually accomplished using iterative optimization methods such as stochastic gradient descent (SGD) or its variants, which adjust the embeddings to gradually reduce the objective function’s value until convergence.</p><p>Gradient Descent: a straightforward method to minimize ℒ:</p><ol><li>Initialize 𝑧ᵢ to random values for all 𝑖.</li><li>Iterate until convergence.</li></ol><p>a. Compute the derivative ∂ℒ/∂zᵢ for all 𝑖.</p><p>b. Update each 𝑧ᵢ by taking a step in the direction of its derivative:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/370/1*R9fnPxW9qKk1TsxusdRQ8w.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>That’s the essence of how we perform random walks over graphs to learn embeddings for them. The DeepWalk concept is built upon this intuition. Let’s delve into it a bit.</p><h4>3. DeepWalk</h4><p>DeepWalk is a method for learning representations of nodes in a graph by leveraging techniques from natural language processing. It uses the above explained random walk technique to generate sequences of nodes in the graph and then applies a Skip-gram model (similar to word2vec) to learn embeddings that capture the structural context of nodes.</p><p>By treating nodes as words and sequences of nodes as sentences, DeepWalk learns distributed representations that encode the graph’s topology and connectivity patterns. These embeddings can be used for various tasks such as node classification, link prediction, and community detection in complex networks.</p><h4>4. Node2Vec</h4><p>Node2vec draws inspiration from the Random Walk approach. The key distinction lies in its use of biased walks rather than random walks to generate and learn embeddings. This approach aims to employ flexible biased random walks that balance between local and global perspectives of the network.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-mtVxBRN3F9RaJhaNKXdsw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>In the image above, two strategies are employed to calculate walks of length 3. The first strategy, Breadth First Search (BFS), provides a local microscopic view by traversing neighboring nodes. 
The second strategy, Depth First Search (DFS), offers a global macroscopic view by exploring distant nodes within the graph.</p><p>A biased fixed-length random walk 𝑹, given a node 𝒖, generates a neighborhood 𝑁ᵣ(𝑢). Two key parameters are involved:</p><ul><li>Return parameter 𝒑: Determines the likelihood of returning to the previous node.</li><li>In-out parameter 𝒒: Balances moving outward, away from the previous node (DFS-like behavior), versus staying close to it (BFS-like behavior). Intuitively, 𝒒 represents the “ratio” of BFS to DFS behaviors.</li></ul><p>Biased 2nd-order random walks explore network neighborhoods in the following manner:</p><ol><li>After traversing the edge (𝑠₁, 𝑤), the random walk is now at node 𝑤.</li><li>Insight: The neighbors of 𝑤 can only be:</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6etr_nRzVpDdrt2CSCJy6g.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><blockquote>Make sure to remember the origin of the walk.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TXdYkpgM45G0OHZwYTUgKg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ol><li>𝑝, 𝑞 model transition probabilities<br>- 𝑝 … return parameter<br>- 𝑞 … “walk away” parameter</li></ol><p>The walker came over edge (𝐬𝟏, 𝐰) and is at 𝐰. Where to go next?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yS0n7pNiTClWryDJT4mKgA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><ul><li>BFS-like walk: Low value of 𝑝</li><li>DFS-like walk: Low value of 𝑞</li></ul><p>𝑁ᵣ(𝑢) are the nodes visited by the biased walk.</p><p>Algorithm:</p><ol><li>Compute random walk probabilities.</li><li>Simulate 𝑟 random walks of length 𝑙 starting from each node 𝑢.</li><li>Optimize the node2vec objective using Stochastic Gradient Descent.</li></ol><p><strong>Advantages:</strong></p><ul><li>Linear-time complexity</li><li>All 3 steps are individually parallelizable</li></ul><blockquote><strong>Note</strong>: It has been observed in different works that node2vec performs better on node classification while alternative methods perform better on link prediction.</blockquote><h3>Graph Embeddings</h3><p>The first two approaches are fundamental, so I’ll provide a brief overview without diving into details. Then, we’ll discuss popular graph embedding algorithms such as Anonymous Walk Embeddings and Graph2Vec.</p><h4>1. First Approach:</h4><ul><li>Apply a standard node embedding technique to the (sub)graph 𝐺.</li><li>Then simply aggregate the node embeddings in the (sub)graph 𝐺 by summing or averaging them.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*XcSNvzTBIlbC-TR8Js6IbQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><h4>2. Second Approach</h4><ul><li>Introduce a “virtual node” to represent the (sub)graph and apply a standard node embedding technique to it.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iFPf8Csqyf8ZPgAEV89wBg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure>
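<p>As a quick illustration of the first approach (aggregating node embeddings), here is a rough sketch that averages per-node vectors into a single graph-level embedding. The random 64-dimensional vectors simply stand in for embeddings learned by DeepWalk, node2vec, or any other node embedding method.</p><pre>import numpy as np<br>import networkx as nx<br><br>def graph_embedding(G, node_embeddings):<br>    # Approach 1: aggregate (here, average) the node embeddings of the (sub)graph<br>    vectors = np.array([node_embeddings[v] for v in G.nodes()])<br>    return vectors.mean(axis=0)  # summing instead of averaging is the other common choice<br><br># toy usage with random vectors standing in for learned node embeddings<br>G = nx.karate_club_graph()<br>node_embeddings = {v: np.random.rand(64) for v in G.nodes()}<br>z_G = graph_embedding(G, node_embeddings)</pre>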
<h4>3. Anonymous Walk Embeddings</h4><p>States in anonymous walks correspond to the index of the first visit to each node during a random walk.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2wC9WZW8giDYQuUr83Tggw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5pEzaBR-rAUu7bnlS9ImlQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p><strong>Steps:</strong></p><ul><li>Simulate anonymous walks of length 𝑙 and record their frequencies.</li><li>Represent the graph using a probability distribution based on these walks.</li></ul><p>For example:</p><ul><li>Let 𝑙 = 3</li><li>We represent the graph with a 5-dimensional vector corresponding to the 5 anonymous walks 𝑤₁, …, 𝑤₅ of length 3: 111, 112, 121, 122, 123.</li><li>𝒁𝓰[𝑖] = probability of anonymous walk 𝑤ᵢ in 𝐺</li></ul><p>Now, let’s discuss how to sample the anonymous walks:</p><ol><li>Generate a set of 𝑚 independent random walks.</li><li>Represent the graph using a probability distribution based on these walks.</li><li>How many random walks 𝑚 do we need?<br>- We require 𝑚 such that the distribution’s error is less than 𝜀 with a probability greater than 𝛿.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*n8jwnkxe4g2sGAc7JM6NUg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>The previous approach represents each walk by its frequency of occurrence. What if, instead, we could learn embeddings zᵢ for each anonymous walk wᵢ?</p><ol><li>We aim to learn a graph embedding 𝒁𝓰 along with embeddings 𝒛ᵢ for all anonymous walks:<br>- 𝒁 = {𝒛ᵢ : 𝑖 = 1 ... 𝜂}, where 𝜂 is the number of sampled anonymous walks.</li><li>How do we embed walks?<br>- Embed walks so that the subsequent walk can be predicted.</li><li>We use a vector parameter 𝒁𝓰 for the input graph.<br>- The entire graph&#39;s embedding is to be learned.</li><li>Starting from node 1: Sample anonymous random walks. For example:</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/780/1*KxmQj0WK-Nnu1tYjo4a4_g.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>5. Learn to predict walks that co-occur within a Δ-size window (e.g., predict 𝑤₂ given 𝑤₁ and 𝑤₃ if Δ = 1).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iQGFb6BoWy1xtKS8KyPvCw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>6. Sum the objective across all nodes in the graph.</p><p>7. Conduct 𝑻 different random walks from 𝒖, each of length 𝒍.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/756/1*Vrhjkb_tO4N184EzzdtTUg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>8. Learn to predict walks that co-occur within a Δ-size window.</p><p>9. Estimate the embedding 𝑧ᵢ of the anonymous walk 𝑤ᵢ.
Let 𝜂 be the number of all possible walk embeddings.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*euT-5bbkAVW_2Son3IvZWQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>10. After optimization, we obtain the graph embedding 𝒛𝑮 (a learnable parameter).</p><p>11. Utilize 𝒁𝓰 to make predictions, such as graph classification: <br> — Option 1: Inner product kernel 𝒁ᵀ𝓰₁ 𝒁𝓰₂.<br> — Option 2: Employ a neural network that takes 𝒁𝓰 as input for classification.</p><h4>4. Graph2Vec</h4><p>Graph2Vec is a method for learning fixed-size embeddings (vector representations) of entire graphs, which encapsulate the structural and topological properties of the graphs. Unlike traditional node embeddings that represent individual nodes in isolation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/824/1*RU-aJI691kpQEAvWENuUdw.png" /><figcaption><a href="https://arxiv.org/pdf/1707.05005">https://arxiv.org/pdf/1707.05005</a></figcaption></figure><p><strong>Steps to calculate Graph2Vec embeddings:</strong></p><ol><li><strong>Graph Representation:<br>- Input:</strong> Start with a dataset consisting of multiple graphs G={G₁, G₂, …, Gₙ}<br>- Each graph Gᵢ is represented as a structured entity with nodes, edges, and optionally node attributes and edge weights.</li><li><strong>Feature Extraction:<br>- </strong>Extract meaningful features from each graph <em>Gᵢ</em>. This typically involves capturing global and local structural characteristics of the graph.<br>- Example features could include node degree distribution, graph centrality measures, subgraph statistics, or any other relevant graph properties.</li><li><strong>Subgraph Sampling (Neighborhood Sampling): </strong>For each graph <em>Gᵢ</em><br>- <strong>Random Walks or Neighborhood Sampling:</strong> Perform random walks or other neighborhood sampling techniques to extract subgraphs (neighborhoods) around each node within <em>Gᵢ</em>.<br>- <strong>Context Extraction:</strong> Collect these sampled subgraphs to use as local contexts for learning embeddings.</li><li><strong>Graph Embedding Learning:<br>- Embedding Model:</strong> Utilize a model inspired by Skip-gram (used in word2vec)<br>- <strong>Objective:</strong> Train the model to predict the local contexts (sampled subgraphs) given a central subgraph (target) within each graph <em>Gᵢ</em><br>- <strong>Loss Function:</strong> Define a loss function (typically cross-entropy or negative sampling based) to optimize the model parameters.</li><li><strong>Training:<br>- Optimization:</strong> Use stochastic gradient descent (SGD) or other optimization techniques to minimize the loss function.<br>- <strong>Iterations:</strong> Iterate over the dataset multiple times (epochs) to improve the quality of the embeddings.<br>- <strong>Hyperparameters:</strong> Adjust hyperparameters such as embedding dimensions, window size (for context sampling), learning rate, and number of iterations based on validation performance.</li><li><strong>Graph2Vec Embeddings:<br>- Output:</strong> After training, obtain fixed-size vector representations (embeddings) for each graph <em>Gᵢ</em>.<br>- Each embedding <em>vᵢ</em> captures the structural and topological properties of the corresponding graph <em>Gᵢ</em>.</li></ol><p>So, that it for graph representational learning. There are a couple of other algorithms which I haven’t discussed. Feel free to explore more or comment about them. 
<p>So, that&#39;s it for graph representational learning. There are a couple of other algorithms which I haven&#39;t discussed, such as LINE and NetMF. Feel free to explore them or comment about them.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=25915b5a5aca" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/graph-representational-learning-creating-node-and-graph-embeddings-25915b5a5aca">Graph Representational Learning: Creating node and graph embeddings</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Traditional ML for Graphs]]></title>
            <link>https://medium.com/analytics-vidhya/traditional-ml-for-graphs-ca7bdbef7544?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/ca7bdbef7544</guid>
            <category><![CDATA[graph-neural-networks]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[graph]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Sat, 08 Jun 2024 07:18:12 GMT</pubDate>
            <atom:updated>2024-06-08T07:18:12.222Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*U27D2bd_baOHstrVYiccBw.png" /></figure><p>In the last <a href="https://medium.com/@taunkdhaval08/graph-neural-networks-an-introduction-9be46691d67c"><em>blog post</em></a>, I briefly discussed different topics in the context of Graph Neural Networks. In this blog post, I will discuss the traditional feature extraction approaches that were popular in the past and were used along with classical ML algorithms to solve different graph-based problems. So let’s get started…</p><h3>Hand Crafted Features</h3><p>Traditional ML pipelines used hand-crafted features to perform the modelling. To achieve good performance, effective feature creation is a crucial step. Overall, we can divide the feature extraction process into 3 categories:</p><ol><li>Node Level</li><li>Edge/Link Level</li><li>Graph Level</li></ol><p>We will discuss different approaches for all three categories one by one. So tighten your seat belt and let’s begin…</p><h3>Node Level Features</h3><h4>1. Node Degree</h4><p>Node Degree refers to simply counting the number of edges of each node and treating that count as a feature.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0ymMPeTrGcbHt_Sw83ffXQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs https://cs224w.stanford.edu</figcaption></figure><p>Although this approach seems pretty straightforward, it has a drawback: it does not take the importance of a node into account. Therefore, we need a better strategy that takes node importance into consideration as well.</p><h4>2. Node Centrality</h4><p>It is a feature extraction method which takes node centrality into account. There are different ways by which we can calculate node centrality; let’s discuss them one by one.</p><ol><li><strong>Eigenvector Centrality</strong>: The idea behind this is that a node is important if it is surrounded by important neighboring nodes. It is calculated as the sum of the centralities of the neighboring nodes.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/796/1*dHMEPID3eAVub2hAIb4qLg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>If we rewrite the above equation in matrix form, we can observe that eigenvector centrality is nothing but an eigenvector of the adjacency matrix. The leading eigenvector is used to calculate the centrality.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*h0_ZqvwlLJ5B0n2cqKSyEA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>2. <strong>Betweenness Centrality</strong>: According to betweenness centrality, a node is important if it lies on many shortest paths between other nodes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vN9T0ExgOLThWdNWuYtMTw.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>3. 
<strong>Closeness Centrality:</strong> It says that a node is important if it has small shortest-path lengths to all other nodes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2uCtKoIsHs2Novnl6u53bQ.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><h4><strong>3. Clustering Coefficient</strong></h4><p>It is based on the idea of how well a node’s neighboring nodes are connected to each other. It can be calculated by the equation below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*i9GIh2iBL7NEM_sGkgrkcA.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><blockquote>The above network, where there is a central node surrounded by other nodes, is called an <strong>ego network.</strong></blockquote><h4><strong>4. Graphlet Degree Vector (GDV)</strong></h4><p>In GDV, we count the number of graphlets rooted at a given node. So the question comes, what is a graphlet? It is basically a <em>rooted connected non-isomorphic sub-graph</em> of a given graph. The image below shows the possible graphlets up to 5-node sub-graphs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nhVw019nJGLjKEPR1mbM6g.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>So how to calculate the GDV? The image below shows how to calculate it, followed by its description:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FkBV6Fh-1RH3stREeQXwZg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>In the above graph, you can see how to calculate the GDV for the sub-graph rooted at <strong><em>v. </em></strong>We will calculate the 2- and 3-node graphlets for this sub-graph.</p><blockquote><strong>Important observation</strong></blockquote><blockquote>Node degree counts the number of edges that a node touches</blockquote><blockquote>Clustering Coefficient counts the number of triangles that a node touches</blockquote><blockquote>GDV counts the number of graphlets that a node touches</blockquote><p>Overall, the above methods can be divided into two categories:</p><pre>1. Importance-based Features<br>      - Node Degree<br>      - Node Centrality<br><br>2. Structure-based Features<br>      - Node Degree<br>      - Clustering Coefficient<br>      - Graphlet Degree Vector</pre><h3>Edge/Link Level Features</h3><h4>1. Distance based features</h4><ol><li><strong>Shortest Path distance between two nodes: </strong>In this approach, we simply calculate the shortest path distance between two nodes. However, this approach does not capture the degree of neighborhood overlap. Therefore, it is not a very good metric for edge features, although it is a good starting point.</li></ol><h4>2. Local Neighborhood Overlap</h4><ol><li><strong>Common Neighbors</strong>: Counts the number of common neighbors two nodes share.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/472/1*MEZo4hPixPbvDKxDwXWEQQ.png" /></figure><p>2.<strong> Jaccard’s Coefficient</strong>: Calculates the intersection over union of the neighbor sets of the two nodes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/338/1*SHlwC7htEYxsIN_Gn5GyyA.png" /></figure><p>3. 
<strong>Adamic-Adar Index</strong>: It sums, over the common neighbors of the two nodes, one over the log of each common neighbor’s degree.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/506/1*aa5p1euK0KXL3tz6o-DLlQ.png" /></figure><blockquote>The main issue with local neighborhood overlap is that the value will be <strong>zero </strong>if there are no common neighbors between two nodes.</blockquote><h4>3. Global Neighborhood Overlap</h4><p>To overcome the limitation of <em>local neighborhood overlap</em>, global neighborhood overlap comes into the picture.</p><ol><li>Katz Index: It counts the number of paths of all lengths between all pairs of nodes. To do so, it utilizes the adjacency matrix of the input graph. Let’s now discuss how to calculate it:</li></ol><blockquote><em>Let Pᵤᵥᵏ = #</em>paths of length k between u and v</blockquote><blockquote>===&gt; Pᵤᵥ¹ = #paths of length 1 between u and v, which is nothing but the adjacency matrix A of the input graph</blockquote><p>Now, how to compute #paths of length 2?</p><p><strong>Step 1</strong>: Compute #paths of length 1 between each of node u’s neighbors and v.</p><p><strong>Step 2</strong>: Sum up these #paths across u’s neighbors using the equation below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wF4mHjR8WJDD96qkSKSb5Q.png" /></figure><p>You can see that the #paths of length 2 are nothing but the adjacency matrix multiplied by itself (A²). Therefore, in this way, you can calculate the #paths for any arbitrary length.</p><p>To finally calculate the Katz index, sum the #paths over all lengths between any two nodes <em>v1</em> and <em>v2</em>, each length discounted by a factor β, as shown in the figure below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*57jYfb1AJaJtHZAHMfpCgw.png" /></figure><p>Finally, to calculate the entire matrix, you can use the equation below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jmF9DOq4RmVnKw3AuTP5AQ.png" /></figure><pre>Summary<br><br>1. Distance based features<br>      - Uses shortest path length b/w two nodes but does not capture<br>neighborhood overlaps.<br><br>2. Local Neighborhood Overlap<br>      - Captures number of common neighbors but can become zero if no <br>common neighbors.<br><br>3. Global Neighborhood Overlap<br>      - Uses entire graph structure to calculate Katz Index.</pre>
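<p>Most of the node- and edge-level features above can be computed in a few lines with NetworkX and NumPy. The snippet below is a quick illustrative sketch (not from the original post); the Katz matrix uses the closed form (I − βA)⁻¹ − I with a placeholder discount factor β.</p><pre>import networkx as nx<br>import numpy as np<br><br>G = nx.karate_club_graph()<br><br># node-level features<br>degree      = dict(G.degree())<br>eigenvector = nx.eigenvector_centrality(G)<br>betweenness = nx.betweenness_centrality(G)<br>closeness   = nx.closeness_centrality(G)<br>clustering  = nx.clustering(G)<br><br># edge/link-level features for a candidate node pair (u, v)<br>u, v = 0, 33<br>common  = len(list(nx.common_neighbors(G, u, v)))<br>jaccard = next(nx.jaccard_coefficient(G, [(u, v)]))[2]<br>adamic  = next(nx.adamic_adar_index(G, [(u, v)]))[2]<br><br># Katz index matrix: sum over all path lengths k of beta^k * A^k<br>A = nx.to_numpy_array(G)<br>beta = 0.05   # must be smaller than 1 / largest eigenvalue of A for the sum to converge<br>S = np.linalg.inv(np.eye(len(G)) - beta * A) - np.eye(len(G))<br>katz_uv = S[u, v]</pre>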
<h3>Graph Level Features</h3><p>For creating graph-level features, kernel methods are a popular approach. They are inspired by classical machine learning, where kernel methods are quite popular. We will discuss two popular methods, known as <strong><em>Graphlet Kernels</em></strong> and the <strong><em>Weisfeiler-Lehman Kernel.</em></strong></p><h4>1. Graphlet Kernels</h4><p>The idea of the graphlet kernel is to count the number of graphlets in a graph, like we did in the <strong><em>Graphlet Degree Vector. </em></strong>Although the idea of a graphlet here is slightly different: graphlets need not be connected, and they are not rooted. The example below shows how to count graphlets for 3-node subgraphs:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CJ4yJYl1mFcrWEe2cYF8OQ.png" /></figure><p>So, how do we create a graphlet kernel with this approach? Assume you have a graph <strong><em>G</em></strong>; then the graphlet count vector <strong><em>f𝓰</em></strong> can be defined as:</p><blockquote><em>(f𝓰)ᵢ = #(gᵢ ⊆ G) for i = 1,2,….,nₖ</em>, where nₖ is the number of graphlets of size k</blockquote><p>The image below shows how to calculate the graphlet vector for k = 3 for the input graph G.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tup2b4rmv937WsNn8JcRIA.png" /></figure><p>Then, if you have two graphs <strong><em>G</em></strong> and <strong><em>G’, </em></strong>you can calculate the graphlet kernel with the formula below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/746/1*_X8fv9oIrDeW7Bd12iiKrg.png" /></figure><p>One thing you should keep in mind is that the sizes of graphs <strong><em>G</em></strong> and <strong><em>G’</em></strong> can be different. In that case, you can normalize the vectors to bring them to the same scale.</p><blockquote>There is one issue with the above approach: it is expensive. Counting size-k graphlets for a graph of size n takes time on the order of nᵏ.</blockquote><h4>2. Weisfeiler-Lehman Kernel</h4><p>To overcome the complexity of <em>Graphlet Kernels, </em>WL kernels were introduced. The WL kernels are based on the <em>color refinement</em> algorithm, which is defined below:</p><p>Given: A graph 𝐺 with a set of nodes 𝑉.<br> 1. Assign an initial color 𝑐⁽⁰⁾(𝑣) to each node 𝑣. <br> 2. Iteratively refine node colors by</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*giavRoej8ws-HWLnfAmeYw.png" /></figure><p>where HASH maps different inputs to different colors.<br>3. After 𝐾 steps of color refinement, 𝑐⁽ᵏ⁾(𝑣) summarizes the structure of the 𝐾-hop neighborhood.</p><p>The set of images below shows an example of how the WL kernel works:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*03ruYykFLB7ZHZyLFBqpvA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0gzeYzfGV-Vw4X-UIJ1gcA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qqi0896zGlXgP_bPxkkmEg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*U2OZdSr0VB7zVNuRfEYcVg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-2x7xPyTPzjXr7EGvWthdg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>Finally, the WL kernel value can be calculated by taking the dot product of the color-count vectors obtained above.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rx32Q6i-Ku29pBXDAzRMHg.png" /><figcaption>Stanford CS224W: Machine Learning for Graphs <a href="https://cs224w.stanford.edu">https://cs224w.stanford.edu</a></figcaption></figure><p>That’s all for traditional ML for graphs. In the next post, I will discuss the different embedding algorithms. So stay tuned….</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ca7bdbef7544" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/traditional-ml-for-graphs-ca7bdbef7544">Traditional ML for Graphs</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Graph Neural Networks: An Introduction]]></title>
            <link>https://medium.com/analytics-vidhya/graph-neural-networks-an-introduction-9be46691d67c?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/9be46691d67c</guid>
            <category><![CDATA[graph-neural-networks]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[gnn]]></category>
            <category><![CDATA[graph]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Wed, 29 May 2024 17:48:34 GMT</pubDate>
            <atom:updated>2024-05-29T17:48:34.489Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/697/1*VvP1WjNzJhFKEoET0IsjYQ.png" /><figcaption><em>Credits: </em><a href="https://bdtechtalks.com/2021/10/11/what-is-graph-neural-network/"><em>https://bdtechtalks.com/2021/10/11/what-is-graph-neural-network/</em></a></figcaption></figure><p>In this series of blog posts on graph neural networks, we will discuss the basics of graph neural networks, their use cases, why they are required, and their advantages over conventional fully connected neural networks.</p><p><em>So let’s get started…..</em></p><h3>Why Graph Neural Networks?</h3><p>Many data sources are in a graph-based format where they can be represented as nodes (vertices) connected by edges, capturing relationships and dependencies between entities. Due to their complex structure, traditional fully connected networks are not effective at interpreting these kinds of datasets.</p><p>Such datasets have a rich relational structure that can be better represented as a relational graph. Therefore, graph neural networks offer a promising avenue for modeling these types of datasets.</p><h3>Use Cases of Graph Neural Networks (GNN’s)</h3><p>GNNs offer solutions to a wide range of use cases involving graph-structured data. One of the most popular examples is social media analysis. With the vast user bases of platforms like Facebook and Instagram, each user can be treated as an individual node in a graph, with their profile features serving as attributes. By leveraging their friends lists, we can construct a large graph. This is a classical example where graph neural networks outperform fully connected neural networks.</p><p>Other use cases include communication networks, citation networks, economic networks, knowledge graphs, scene graphs, and more. Graph neural networks can effectively tackle various problems across these domains.</p><h3>Graph Neural Network Applications</h3><p>GNNs can be used to solve a variety of problems. Below, I am listing the applications. We will discuss them in detail in the upcoming blogs.</p><pre>1. Graph Level Prediction:<br>    - Graph classification (Drug Discovery, Molecule Generation etc.)<br><br>2. Sub-Graph Level:<br>    - Traffic Prediction (Google Maps etc.)<br><br>3. Node Level:<br>    - Node Classification (Categorize online users/items etc.)<br><br>4. Edge Level:<br>    - Link Prediction (Recommender Systems, Drug Side effects etc.)</pre><h3>Popular Tools</h3><p>There are a variety of tools available for graph-based data analysis and modeling. Below, I am listing some of them. In the next set of blogs, I will discuss them in detail, providing hands-on examples as well.</p><pre>1. NetworkX: <br>   - It is a Python library for the graph creation, manipulation, and study of <br>complex networks.<br><br>2. Pytorch Geometric: <br>   - It is a library for deep learning on irregularly structured data such as <br>graphs and it is  built on top of PyTorch.<br><br>3. DeepSnap:<br>   - It is a Python library that facilitates deep learning on graphs by <br>providing easy-to-use data structures and tools for graph manipulation and <br>model training.<br><br>4. 
GraphGym:<br>   - It is a research platform built on PyTorch Geometric that offers modular <br>and flexible tools for designing, training, and evaluating graph neural <br>networks.</pre><h3>Machine Learning for Graph</h3><p>Before the advent of graph neural networks, people used to follow traditional ML pipelines, where they extracted features from the input graph and fed them into classical ML models to perform various tasks. These features can also be divided based on the graph structure in the following manner:</p><pre>1. Node Level Features:<br>      a. Node Degree<br>      b. Node Centrality<br>      c. Clustering Coefficient<br>      d. Graphlets<br><br>2. Edge Level Features:<br>      a. Distance Based Features: <br>            i. Shortest-path distance between two nodes<br>      b. Local Neighborhood Overlap:<br>            i.  Common Neighbors<br>            ii. Jaccard&#39;s Coefficient<br>            iii. Adamic-Adar Index<br>      c. Global Neighborhood Overlap:<br>            i. Katz Index <br><br>3. Graph Level Features:<br>      a. Graph Kernels<br>            i. Graphlet Kernel<br>            ii. Weisfeiler-Lehman Kernel</pre><p>I will be discussing them in detail in the upcoming blogs. Stay tuned for that!</p><h3>Types of Graph Neural Network</h3><p>There are numerous graph neural networks available. Discussing them in detail will require another set of blogs, which I will cover in the next series. However, for the sake of discussion and to provide an overview, I am listing some of the popularly used networks.</p><pre>1. Vanilla GNN<br>2. Graph Convolutional Networks (GCN)<br>3. Graph Attention Networks (GAT)<br>4. GraphSage<br>5. Relational Graph Convolutional Networks (R-GCNs)<br>6. Graph Recurrent Neural Networks (GRNNs) </pre><p>The above list is not exhaustive, but it includes the most prominent networks in use. Details will follow in the next set of blogs.</p><p>For now, that’s all from my side. In summary, in this blog, I briefly discussed the utility of graph neural networks, their applications, types, etc. In the next blog, we will discuss classical ML for graphs and continue to delve into graph neural networks in subsequent blogs.</p><p>So stay tuned and happy reading……..</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9be46691d67c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/graph-neural-networks-an-introduction-9be46691d67c">Graph Neural Networks: An Introduction</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[BERT — Pre-training + Fine-tuning]]></title>
            <link>https://medium.com/analytics-vidhya/bert-pre-training-fine-tuning-eb574be614f6?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/eb574be614f6</guid>
            <category><![CDATA[naturallanguageprocessing]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[pytorch]]></category>
            <category><![CDATA[transformers]]></category>
            <category><![CDATA[bert]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Sun, 26 Dec 2021 06:50:23 GMT</pubDate>
            <atom:updated>2023-07-28T07:22:12.916Z</atom:updated>
            <content:encoded><![CDATA[<h3>BERT — Pre-training + Fine-tuning</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*hzPw6ngLghgnVdwy.png" /><figcaption>Source — <a href="https://ruder.io/content/images/2021/02/fine-tuning_methods.png">https://ruder.io/content/images/2021/02/fine-tuning_methods.png</a></figcaption></figure><p>Huggingface.co has made using the transformers-based model convenient with their Transformers API. But a lot of time, only fine-tuning does not work. Pre-training on the unlabelled data and then fine-tuning helps the model achieve the desired results. Huggingface API provides the pre-training functionality as well. In this blog post, I will be explaining how to perform pre-training and then fine-tuning a transformers based model. For this purpose, I will be using BERT as a reference model.</p><h3>Data Formatting</h3><p>To perform pre-training, the data must be in a specific format. It should be in a text file (.txt format) with one sentence per line. The purpose of this text file is first to tokenize the data using Word Piece tokenizer and then perform pre-training on the data.</p><h3>Pre-training model</h3><h4>Train tokenizer on the text</h4><p>After converting the data in the required format, the next step is to train the tokenizer on input data. This step is helpful to create the vocabulary of the data. The below code gist shows how to tokenize the text using Word Piece Tokenizer. To read more about Word Piece Tokenizer, you can refer to section 4.1 from the below link:-</p><blockquote><a href="https://arxiv.org/pdf/1609.08144v2.pdf">https://arxiv.org/pdf/1609.08144v2.pdf</a></blockquote><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/fb819a089f9fb06bfd318c549ffe3854/href">https://medium.com/media/fb819a089f9fb06bfd318c549ffe3854/href</a></iframe><h4>Train BERT for MLM task</h4><p>The next step will be to pre-train BERT for the masked language modelling task. For this purpose, we will be using the same dataset we used to train the tokenizer for this purpose. For the MLM task, 15% of tokens are randomly masked, and then the model is trained to predict those tokens. This functionality is present in the Huggingface API, which is given in the below code:-</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/309b82465488ab4e4be0d714106f4b0a/href">https://medium.com/media/309b82465488ab4e4be0d714106f4b0a/href</a></iframe><p>Till now, we are done with the pre-training part. Let&#39;s move to fine-tuning part.</p><h3>Finetuning Model</h3><h4>Data Preparation</h4><p>For the fine-tuning section, the data must be in a different format from what we used in the pre-training part. BERT takes three inputs viz. — input_ids, attention_mask, token_type_ids. I won&#39;t be going into the details of what are they. You can refer to them from the BERT paper. Here, I will be explaining how to calculate them from the Huggingface API. I will be using the BERT model for classification purposes here. 
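<p>As a reference, the sketch below shows one minimal way to wire up both steps with the Tokenizers and Transformers APIs. It is an illustrative outline rather than my exact gist above; the file paths, vocabulary size and training arguments are placeholders.</p><pre>from tokenizers import BertWordPieceTokenizer<br>from transformers import (BertTokenizerFast, BertForMaskedLM, LineByLineTextDataset,<br>                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)<br><br># 1. train a WordPiece tokenizer on the raw text file (one sentence per line)<br>wp_tokenizer = BertWordPieceTokenizer(lowercase=True)<br>wp_tokenizer.train(files=[&#39;corpus.txt&#39;], vocab_size=30522, min_frequency=2)<br>wp_tokenizer.save_model(&#39;pretrained_tokenizer&#39;)   # the output directory must already exist<br><br># 2. pre-train BERT with masked language modelling (15% of tokens masked)<br>tokenizer = BertTokenizerFast.from_pretrained(&#39;bert-base-uncased&#39;)<br>model = BertForMaskedLM.from_pretrained(&#39;bert-base-uncased&#39;)   # continue pre-training from the public checkpoint<br><br>dataset  = LineByLineTextDataset(tokenizer=tokenizer, file_path=&#39;corpus.txt&#39;, block_size=128)<br>collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)<br><br>args = TrainingArguments(output_dir=&#39;bert-mlm&#39;, num_train_epochs=1, per_device_train_batch_size=8)<br>trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset)<br>trainer.train()<br>trainer.save_model(&#39;bert-mlm&#39;)</pre>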
<p>Till now, we are done with the pre-training part. Let&#39;s move to the fine-tuning part.</p><h3>Finetuning Model</h3><h4>Data Preparation</h4><p>For the fine-tuning section, the data must be in a different format from what we used in the pre-training part. BERT takes three inputs viz. — input_ids, attention_mask, token_type_ids. I won&#39;t be going into the details of what they are; you can refer to the BERT paper for that. Here, I will explain how to calculate them with the Huggingface API. I will be using the BERT model for classification purposes here. One can make changes in the code according to their convenience.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/2ef375b6bd5a4d846d8ae2d68035dbe7/href">https://medium.com/media/2ef375b6bd5a4d846d8ae2d68035dbe7/href</a></iframe><p>In the above code, I have used the <strong>Dataset</strong> class from <strong>torch.utils.data</strong> and <strong>BERT&#39;s</strong> <strong>tokenizer</strong> to convert the data into the required format. Then, in the next step, I create DataLoaders for training and testing purposes.</p><h4>Model Defining</h4><p>Let&#39;s start with the model-building part for the fine-tuning purpose. I will be adding two linear layers on top of BERT for classification, with <strong>dropout = 0.1</strong> and <strong>ReLU</strong> as the activation function. One can try different configurations as well. I have defined a PyTorch class to build the model, which is shown in the code below:-</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/092cc0e60b046808ad2c16cc9d872587/href">https://medium.com/media/092cc0e60b046808ad2c16cc9d872587/href</a></iframe><h4>Train and validation function</h4><p>The last step is to define the training and validation functions to perform fine-tuning. These are the usual training and evaluation loops used in PyTorch. The code below depicts this:-</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/36e57569d15b46f91effdf52c6a68669/href">https://medium.com/media/36e57569d15b46f91effdf52c6a68669/href</a></iframe><p>Voila, now you are done with all the required steps to achieve the goal. But one can try different configurations as mentioned above. You can also try a task other than classification. If you want the complete code, you can visit the link below:-</p><p><a href="https://github.com/DhavalTaunk08/NLP_scripts">GitHub - DhavalTaunk08/NLP_scripts: Contains notebooks related to various transformers based models for different nlp based tasks</a></p><p>This is all from my side this time. If you want to read more related to ML/DL, visit the below link and if you like, do give it a clap.</p><p><a href="https://medium.com/@taunkdhaval08">Dhaval Taunk - Medium</a></p><p>If you liked my article:</p><figure><a href="https://buymeacoffee.com/taunkdhaval"><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Nh1owl6a3HdaRB_Y0y04pw.png" /></a></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=eb574be614f6" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/bert-pre-training-fine-tuning-eb574be614f6">BERT — Pre-training + Fine-tuning</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Search Engine in Python from scratch]]></title>
            <link>https://medium.com/analytics-vidhya/search-engine-in-python-from-scratch-c3f7cc453250?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/c3f7cc453250</guid>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[information-retrieval]]></category>
            <category><![CDATA[search-engines]]></category>
            <category><![CDATA[information-extraction]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Sun, 10 Oct 2021 12:21:24 GMT</pubDate>
            <atom:updated>2023-07-28T07:24:47.186Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/320/1*mY70Qzx_dO-p5Hmi32vJKQ.gif" /><figcaption>Source:- <a href="https://www.pinterest.com/pin/847450854860596454/?amp_client_id=CLIENT_ID(_)&amp;mweb_unauth_id=&amp;simplified=true">https://www.pinterest.com/pin/847450854860596454/?amp_client_id=CLIENT_ID(_)&amp;mweb_unauth_id=&amp;simplified=true</a></figcaption></figure><p>In this post, I will be going through all the details of building a search engine from scratch using the Wikipedia dump (approximately 84GB in size). I will walk through the step-by-step process of creating a primary index and a secondary index, and how to implement search functionality that returns results in minimal time. I will divide the post into four parts: Requirements, Wiki dump details, Indexing, and Searching. So tighten your seat belts, and let&#39;s get started.</p><h3>1. Requirements</h3><blockquote>Python3</blockquote><blockquote>NLTK stopwords</blockquote><blockquote>PyStemmer</blockquote><h3>2. Wiki dump details</h3><p>For creating the search engine, in this post, I will be referring to the <strong>Wikipedia</strong> dump for the English language, which is approximately <strong>84GB</strong> in size. One can download the data from the below-given link:-</p><blockquote><a href="https://dumps.wikimedia.org/enwiki/20210720/enwiki-20210720-pages-articles-multistream.xml.bz2">https://dumps.wikimedia.org/enwiki/20210720/enwiki-20210720-pages-articles-multistream.xml.bz2</a></blockquote><p>You will need to download and extract the dump. Alternatively, you can work on the zipped data as well. I will be explaining everything on the extracted dump.</p><h3>3. Creating the Index for Wiki dump</h3><h4>(i) Parsing the XML dump</h4><p>First of all, you will be required to parse the XML and get the necessary data. For this, there are a couple of parsers available in Python. Some of them are:-</p><blockquote>SAX</blockquote><blockquote>Etree</blockquote><blockquote>DOM</blockquote><p>Here I will be using the SAX parser to parse the XML. You can try out other parsers as well. In the below gist, I have shown how to use the SAX parser to parse the data.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5ab880ea001ad802dc96887ffda11108/href">https://medium.com/media/5ab880ea001ad802dc96887ffda11108/href</a></iframe><p>I am using only two fields, i.e. title and text, from the XML in the above code. I am assigning my own IDs using the variable num_pages. How I use the title and the text is explained in different sections below.</p><h4>(ii) How to preprocess text</h4><p>It is an essential task, as this step ensures we are not adding unnecessary terms to the Index. Otherwise, it will blow up the index size. Mainly, I remove stopwords, tokenise the text, remove HTML tags, remove non-ASCII characters, etc. It is shown in the code below.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/eb7f8412c4742f45d75bcf0fbea11bc6/href">https://medium.com/media/eb7f8412c4742f45d75bcf0fbea11bc6/href</a></iframe>
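<p>As a condensed, illustrative sketch of the same idea (the gist above has the full version): strip HTML tags and non-ASCII characters with regular expressions, drop NLTK stopwords, and stem the remaining tokens with PyStemmer. It assumes the NLTK stopword list has already been downloaded.</p><pre>import re<br>import Stemmer                       # PyStemmer<br>from nltk.corpus import stopwords    # run nltk.download(&#39;stopwords&#39;) once beforehand<br><br>STOP_WORDS = set(stopwords.words(&#39;english&#39;))<br>stemmer = Stemmer.Stemmer(&#39;english&#39;)<br><br>def preprocess(text):<br>    text = re.sub(r&#39;&lt;[^&gt;]+&gt;&#39;, &#39; &#39;, text)                # strip HTML tags<br>    text = re.sub(r&#39;[^a-z0-9 ]&#39;, &#39; &#39;, text.lower())      # keep only ASCII alphanumerics<br>    tokens = [t for t in text.split() if t not in STOP_WORDS]<br>    return stemmer.stemWords(tokens)                     # stemmed tokens go into the index</pre>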
<h4>(iii) Extract different fields</h4><p>There can be different possible fields on which we can do query searching. I will be using six fields: title, body, category, infobox, links, and references. One can search generic queries or field-specific queries using these fields.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f400109357ce59ec03f5917cb8da3bb6/href">https://medium.com/media/f400109357ce59ec03f5917cb8da3bb6/href</a></iframe><h4>(iv) Creating an intermediate index</h4><p>Creating the final Index directly would be a heavy task and can blow up memory as well. Therefore, we will first create intermediate indexes on sections of the data and then perform a final merge to make the final index. We will be using the SPIMI approach. You can read more about it using the link below.</p><p><a href="https://nlp.stanford.edu/IR-book/html/htmledition/single-pass-in-memory-indexing-1.html">Single-pass in-memory indexing</a></p><p>The code below shows how to create an intermediate index.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ee94ebd92c26c703b7774b47914fe09f/href">https://medium.com/media/ee94ebd92c26c703b7774b47914fe09f/href</a></iframe><p>Here I am processing the text token by token, adding it to a dictionary, and then writing it to intermediate files. The format of the index files now looks like this:-</p><pre>apple-2314:t3b6i2r1;6432:t5c8b3i1;</pre><p>In the above example, &quot;<em>apple</em>&quot; is a token. Then, after the hyphen, every pair is separated by &quot;<strong>;</strong>&quot;. The first value before &quot;<strong>:</strong>&quot; represents the docID of the document, and the string after it encodes the per-field frequencies.</p><blockquote>Ex.:- t3b6i2r1 means the token appears 3 times in the title, 6 times in the body, 2 times in the infobox and 1 time in the references.</blockquote><h4>(v) Merging the intermediate index</h4><p>Now that we are done with writing the intermediate indexes, we need to merge them, because there will be instances where a token has its information split across multiple files. We need to merge it to create the final Index. The code below does this:-</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3e79537cafcd7ae03b7b2903ebe4a1f5/href">https://medium.com/media/3e79537cafcd7ae03b7b2903ebe4a1f5/href</a></iframe><p>The format of the final Index looks like the example below:-</p><pre>apple-2141:5;1232:1;5432:78;</pre><p>In the above example, &quot;<em>apple</em>&quot; is a token. Then, after the hyphen, every pair is separated by &quot;<strong>;</strong>&quot;. The first value before &quot;<strong>:</strong>&quot; represents the docID of the document, and the second value represents the frequency of the token in that document for the given field. The final index is split into separate field-wise files, unlike the intermediate indexes.</p><p>In this way, the final Index will look similar to the example shown above. If you want to see the entire code, you can visit the GitHub link given at the end of this blog, and also try out different approaches.</p><h4>(vi) Secondary Index</h4><p>You can create a secondary index as and when required. This helps speed up searching by keeping summary information about each token in the secondary Index. I will be using the format of the secondary Index shown below. One can try out other possible forms as well.</p><pre>apple-563-1-3-4--2-4-6-</pre><p>According to the above format, &#39;apple&#39; will be the token. Then the value 563 will denote the frequency of apple in the entire Wikipedia dump. 
After that, the number 1 indicates in which file number the token is present. The token can appear in any field. The number 1 says that if the token is present in a field, it will only be there in that particular file number for that field. After that, all other values are optional. I will go over the details of the field values below:-</p><p>As mentioned above, I will be using the fields title, body, category, infobox, link, reference. So the values between &#39;-&#39; will denote the line number in the final index files. The example below shows it:-</p><pre>&#39;apple-563-1-3-4--2-4-6-&#39;.split(&#39;-&#39;) ---&gt; [&#39;apple&#39;, &#39;563&#39;, &#39;1&#39;, &#39;3&#39;, &#39;4&#39;, &#39;&#39;, &#39;2&#39;, &#39;4&#39;, &#39;6&#39;, &#39;&#39;]</pre><p>Here the fourth element (&#39;3&#39;) indicates that &#39;apple&#39; appears at line number 3 of file number 1 for the title file. Similarly, &#39;4&#39; will show that it will be present at line number 4 of the body file; it does not appear in the category file as it is an empty string (&#39;&#39;), it appears at line number &#39;2&#39; for the infobox file, it appears at line number &#39;4&#39; of the link file, and at line number &#39;6&#39; in the reference file. One thing that will be common for all the files is that if the token is present for any field, it will be present in that particular file number only, nowhere else.</p><h3>4. Implementing Search functionality</h3><p>One of the crucial things required for search functionality is implementing the ranking functionality to rank the documents according to their relevance. But before that, a few other things are also needed, which I will explain below:-</p><h4>(i) Preprocessing the query</h4><p>The preprocessing is the same as what we did during the indexing phase. After applying the same steps, we get the final preprocessed query.</p><h4>(ii) Identify query type</h4><p>This is one of the most useful steps, as it tells us whether we need to run a simple query, a field query, or both. So basically a query can be of 3 types:-</p><pre>Type 1:- world war II<br>Type 2:- t:world cup i:2012<br>Type 3:- Sachin Tendulkar t:world cup i:2012</pre><p>We can identify the type of query using the below code-</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1d3339a6f210147cdd25c2005252bcd9/href">https://medium.com/media/1d3339a6f210147cdd25c2005252bcd9/href</a></iframe><h4>(iii) Ranking functionality</h4><p>Ranking functionality is used to order the results so that the most relevant ones are on top. Usually, tf-idf is the metric used to rank documents.</p><p>So what is tf-idf? It is composed of two terms, <strong>TF </strong>and<strong> IDF.</strong></p><p><strong>TF (Term Frequency):- </strong>It tells us the frequency of a term in a document.</p><p><strong>IDF (Inverse Document Frequency):- </strong>Many words occur with a very high frequency across documents, and this high frequency makes them less useful for distinguishing one document from another. To reduce the effect of such high-frequency words, IDF is used. It is basically the log of the total number of documents divided by the number of documents containing the word.</p><p>So the final tf-idf value is the product of the tf and idf values.</p><pre>tf-idf = tf * idf</pre>
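<p>As a small illustrative sketch (using one common log-weighted variant; the table below lists others), the score of a document for a query can be computed like this:</p><pre>import math<br><br>def tf_idf(term_count, doc_freq, total_docs):<br>    # term_count: occurrences of the term in this document (read from the index)<br>    # doc_freq:   number of documents that contain the term<br>    if term_count == 0 or doc_freq == 0:<br>        return 0.0<br>    tf = 1 + math.log10(term_count)<br>    idf = math.log10(total_docs / doc_freq)<br>    return tf * idf<br><br>def score(query_tokens, doc_term_counts, doc_freqs, total_docs):<br>    # rank a document by summing tf-idf over the preprocessed query tokens<br>    return sum(tf_idf(doc_term_counts.get(t, 0), doc_freqs.get(t, 0), total_docs)<br>               for t in query_tokens)</pre>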
<h4>(iv) Variants of tf-idf</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/671/1*-MfyZN9X_9CkD0Dn3ijjGQ.png" /><figcaption>Source:- <a href="https://nlp.stanford.edu/IR-book/pdf/06vect.pdf">https://nlp.stanford.edu/IR-book/pdf/06vect.pdf</a></figcaption></figure><p>One can use any of the tf-idf variants given above; it is a matter of implementation and choice.</p><p>So, that is all for now from my side. If you want to see the full code, you can visit the link below:-</p><p><a href="https://github.com/DhavalTaunk08/Wiki-Search-Engine">GitHub - DhavalTaunk08/Wiki-Search-Engine</a></p><p>If you want to read more about machine learning and deep learning, do visit the link below:-</p><p><a href="https://medium.com/@taunkdhaval08">Dhaval Taunk - Medium</a></p><p>Happy reading…….</p><p>If you like my article:</p><figure><a href="https://buymeacoffee.com/taunkdhaval"><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Nh1owl6a3HdaRB_Y0y04pw.png" /></a></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c3f7cc453250" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/search-engine-in-python-from-scratch-c3f7cc453250">Search Engine in Python from scratch</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Finetune DistilBERT for multi-label text classsification task]]></title>
            <link>https://medium.com/analytics-vidhya/finetune-distilbert-for-multi-label-text-classsification-task-994eb448f94c?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/994eb448f94c</guid>
            <category><![CDATA[bert]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[multilabel-classifier]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Thu, 17 Sep 2020 05:14:02 GMT</pubDate>
            <atom:updated>2023-07-28T07:25:25.910Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1023/0*06fQisdQnb_BPajl.png" /><figcaption>Source — <a href="https://developer.nvidia.com/blog/efficient-bert-finding-your-optimal-model-with-multimetric-bayesian-optimization-part-1/">https://developer.nvidia.com/blog/efficient-bert-finding-your-optimal-model-with-multimetric-bayesian-optimization-part-1/</a></figcaption></figure><p>In one of my last blog post, <a href="https://medium.com/analytics-vidhya/how-to-fine-tune-bert-on-text-classification-task-723f82786f61"><em>How to fine-tune bert on text classification task</em></a><em>, </em>I had explained fine-tuning <strong><em>BERT</em></strong> for a <strong>multi-class </strong>text classification task. In this post, I will be explaining how to fine-tune <strong><em>DistilBERT</em></strong> for a <strong>multi-label</strong> text classification task. I have made a GitHub repo as well containing the complete code which is explained below. You can visit the below link to see it and can fork it and use it.</p><blockquote><a href="https://github.com/DhavalTaunk08/Transformers_scripts">https://github.com/DhavalTaunk08/Transformers_scripts</a></blockquote><h3>Introduction</h3><p>The DistilBERT model (<a href="https://arxiv.org/pdf/1910.01108.pdf"><strong><em>https://arxiv.org/pdf/1910.01108.pdf</em></strong></a>) was released by Huggingface.co which is a distilled version of BERT released by Google (<a href="https://arxiv.org/pdf/1810.04805.pdf"><strong><em>https://arxiv.org/pdf/1810.04805.pdf</em></strong></a>).</p><p>According to the authors:-</p><blockquote>They leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster.</blockquote><p>So let’s start with the details and the process to fine-tune the model.</p><h3>Multi-Class v/s Multi-Label classification</h3><p>First of all, it is important to understand the difference between multi-class and multi-label classification. <strong>Multi-class classification</strong> means classifying the samples into one of the three or more available classes. While in <strong>multi-label classification</strong>, one sample can belong to more than one class. Let me explain it more clearly by an example:-</p><p><strong>Multiclass classification</strong> — Let say we have 10 fruits. They can belong to one of the three classes — ‘apple’, ‘mango’ and ‘banana’. If we are asked to classify the fruits in these given classes, they can belong to only one of these classes. Therefore, it is a multi-class classification problem.</p><p><strong>Multi-label classification</strong> — Let say we have few movie names and our task is to classify these movies into the genres to which they belong to like ‘action’, ‘comedy’, ‘horror’, ‘sci-fi’, ‘drama’ etc. These movies can belong to more than one genre. For example — ‘The Matrix movie series belongs to the ‘action’ as well as ‘sci-fi’ category. Thus it is called multi-label classification.</p><h3><strong>Data Formatting</strong></h3><p>First of all, there is a need to format the data. The required data can contain 2 columns. One column containing text to be classified. Another column containing labels related to that sample. 
The below image is an example of the data frame:-</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/398/1*-PYxaeUbp_RVuR7eiKwhrg.png" /></figure><p>The above example shows that we have six different classes and the sample can belong to any number of classes.</p><p>But the question is how to convert the labels into this format? Here, scikit-learn comes to the rescue!!!</p><p>Below is an example of how to convert these labels to the required format.</p><pre><strong>&gt;&gt;&gt; from</strong> <strong>sklearn.preprocessing</strong> <strong>import</strong> MultiLabelBinarizer<br><strong>&gt;&gt;&gt; </strong>mlb = MultiLabelBinarizer()<br><strong>&gt;&gt;&gt; </strong>mlb.fit_transform([{&#39;sci-fi&#39;, &#39;thriller&#39;}, {&#39;comedy&#39;}])<br>array([[0, 1, 1],<br>       [1, 0, 0]])<br><strong>&gt;&gt;&gt; </strong>list(mlb.classes_)<br>[&#39;comedy&#39;, &#39;sci-fi&#39;, &#39;thriller&#39;]</pre><p>Also, you can refer to the below link to get more details about it.</p><blockquote><a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html"><em>https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html</em></a></blockquote><h3>Code</h3><p>Now let’s get to the code part about the required libraries, how to write DataLoader, and model class for this task.</p><h4>Required libraries</h4><blockquote>transformers==3.0.2</blockquote><blockquote>torch</blockquote><blockquote>scikit-learn</blockquote><blockquote>numpy</blockquote><blockquote>pandas</blockquote><blockquote>tqdm</blockquote><p>These can be installed with the <strong>‘<em>pip install’</em></strong> command.</p><h4>Importing libraries</h4><pre>import numpy as np<br>import pandas as pd<br>import transformers<br>import torch<br>from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler<br>from transformers import DistilBertModel, DistilBertTokenizer<br>from tqdm import tqdm<br>from sklearn.preprocessing import MultiLabelBinarizer</pre><pre><strong>from</strong> <strong>torch</strong> <strong>import</strong> cuda<br>device = &#39;cuda&#39; <strong>if</strong> cuda.is_available() <strong>else</strong> &#39;cpu&#39;</pre><p>The above step is to set up the device for GPU.</p><h4>Training parameters</h4><pre>MAX_LEN = 256<br>TRAIN_BATCH_SIZE = 8<br>VALID_BATCH_SIZE = 4<br>EPOCHS = 1<br>LEARNING_RATE = 1e-05</pre><p>These parameters can be tuned according to one’s needs. But there is one important point to be noted here:-</p><blockquote>DistilBERT accepts a max_sequence_length of 512 tokens.</blockquote><p>We cannot give max_sequence_length more than this. 
If you want to give a sequence length of size more than 512 tokens, you can try the longformer model (<a href="https://arxiv.org/pdf/2004.05150"><strong><em>https://arxiv.org/pdf/2004.05150</em></strong></a>)</p><h4>DataLoader</h4><pre>tokenizer = DistilBertTokenizer.from_pretrained(&#39;distilbert-base-uncased&#39;)</pre><pre><strong>class</strong> MultiLabelDataset(Dataset):<br>    <strong>def</strong> __init__(self, dataframe, tokenizer, max_len):<br>        self.tokenizer = tokenizer<br>        self.data = dataframe<br>        self.text = dataframe.text<br>        self.targets = self.data.labels<br>        self.max_len = max_len<br><br>    <strong>def</strong> __len__(self):<br>        <strong>return</strong> len(self.text)<br><br>    <strong>def</strong> __getitem__(self, index):<br>        text = str(self.text[index])<br>        text = &quot; &quot;.join(text.split())<br><br>        inputs = self.tokenizer.encode_plus(<br>            text,<br>            <strong>None</strong>,<br>            add_special_tokens=<strong>True</strong>,<br>            max_length=self.max_len,<br>            pad_to_max_length=<strong>True</strong>,<br>            return_token_type_ids=<strong>True</strong><br>        )<br>        ids = inputs[&#39;input_ids&#39;]<br>        mask = inputs[&#39;attention_mask&#39;]<br><br>        <strong>return</strong> {<br>            &#39;ids&#39;: torch.tensor(ids, dtype=torch.long),<br>            &#39;mask&#39;: torch.tensor(mask, dtype=torch.long),<br>            &#39;targets&#39;: torch.tensor(self.targets[index], dtype=torch.float)<br>        }</pre><p>Calling the tokenizer and loading the dataset. Here, train_dataset and val_dataset will be training and validation datasets in pandas data frame format with column names as [‘text’, ‘labels’].</p><pre>training_set = MultiLabelDataset(train_dataset, tokenizer, MAX_LEN)<br>testing_set = MultiLabelDataset(test_dataset, tokenizer, MAX_LEN)</pre><pre>train_params = {&#39;batch_size&#39;: TRAIN_BATCH_SIZE,<br>                &#39;shuffle&#39;: <strong>True</strong>,<br>                &#39;num_workers&#39;: 0<br>                }<br><br>test_params = {&#39;batch_size&#39;: VALID_BATCH_SIZE,<br>                &#39;shuffle&#39;: <strong>True</strong>,<br>                &#39;num_workers&#39;: 0<br>                }<br><br>training_loader = DataLoader(training_set, **train_params)<br>testing_loader = DataLoader(testing_set, **test_params)</pre><p>The above step converts the data into the required format using the MultiLabelDataset class and PyTorch&#39;s DataLoader. 
You can read more about DataLoader by visiting the below-given link:-</p><blockquote><a href="https://pytorch.org/tutorials/beginner/data_loading_tutorial.html">https://pytorch.org/tutorials/beginner/data_loading_tutorial.html</a></blockquote><h4>Model Class</h4><pre><strong>class</strong> <strong>DistilBERTClass</strong>(torch.nn.Module):<br>    <strong>def</strong> __init__(self):<br>        super(DistilBERTClass, self).__init__()<br>        self.l1 = DistilBertModel.from_pretrained(&quot;distilbert-base-uncased&quot;)<br>        self.pre_classifier = torch.nn.Linear(768, 768)<br>        self.dropout = torch.nn.Dropout(0.3)<br>        self.classifier = torch.nn.Linear(768, <strong>num_classes</strong>)<br><br>    <strong>def</strong> forward(self, input_ids, attention_mask):<br>        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)<br>        hidden_state = output_1[0]<br>        pooler = hidden_state[:, 0]<br>        pooler = self.pre_classifier(pooler)<br>        pooler = torch.nn.ReLU()(pooler)<br>        pooler = self.dropout(pooler)<br>        output = self.classifier(pooler)<br>        <strong>return</strong> output</pre><p>Here, I have used 2 linear layers on top of the DistilBERT model with a dropout unit and ReLu as an activation function. <strong><em>num_classes </em></strong>will be the number of classes available in your dataset. The model will return the logit scores for each class. The class can be called by the below method:-</p><pre>model = DistilBERTClass()<br>model.to(device)</pre><h4>Loss function and optimizer</h4><pre><strong>def</strong> loss_fn(outputs, targets):<br>    <strong>return</strong> torch.nn.BCEWithLogitsLoss()(outputs, targets)</pre><pre>optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)</pre><p>Here, BCEWithLogitsLoss is used which is used generally for multi-label classification. 
One can read more by visiting the below link:-</p><blockquote><a href="https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html">https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html</a></blockquote><h4>Training function</h4><pre><strong>def</strong> train_model(epoch):<br>    model.train()<br>    <strong>for</strong> _, data <strong>in</strong> enumerate(training_loader, 0):<br>        ids = data[&#39;ids&#39;].to(device, dtype = torch.long)<br>        mask = data[&#39;mask&#39;].to(device, dtype = torch.long)<br>        targets = data[&#39;targets&#39;].to(device, dtype = torch.float)<br><br>        outputs = model(ids, mask)<br>        loss = loss_fn(outputs, targets)<br>        <strong>if</strong> _%1000==0:<br>            print(f&#39;Epoch: <strong>{epoch}</strong>, Loss:  {loss.item()}&#39;)<br><br>        optimizer.zero_grad()<br>        loss.backward()<br>        optimizer.step()</pre><pre><strong>for</strong> epoch <strong>in</strong> range(EPOCHS):<br>    train_model(epoch)</pre><p>The above function is used for training the model for the specified number of epochs.</p><h4>Validation</h4><pre><strong>def</strong> validation(testing_loader):<br>    model.eval()<br>    fin_targets=[]<br>    fin_outputs=[]<br>    <strong>with</strong> torch.no_grad():<br>        <strong>for</strong> _, data <strong>in</strong> enumerate(testing_loader, 0):<br>            ids = data[&#39;ids&#39;].to(device, dtype = torch.long)<br>            mask = data[&#39;mask&#39;].to(device, dtype = torch.long)<br>            targets = data[&#39;targets&#39;].to(device, dtype = torch.float)<br>            # DistilBERT does not use token_type_ids, so only ids and mask are passed<br>            outputs = model(ids, mask)<br>            fin_targets.extend(targets.cpu().detach().numpy().tolist())<br>            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())<br>    <strong>return</strong> fin_outputs, fin_targets</pre><pre>outputs, targets = validation(testing_loader)<br>outputs = np.array(outputs) &gt;= 0.5</pre><pre><strong>from</strong> <strong>sklearn</strong> <strong>import</strong> metrics<br><br>accuracy = metrics.accuracy_score(targets, outputs)<br>f1_score_micro = metrics.f1_score(targets, outputs, average=&#39;micro&#39;)<br>f1_score_macro = metrics.f1_score(targets, outputs, average=&#39;macro&#39;)</pre><pre>print(f&quot;Accuracy Score = <strong>{accuracy}</strong>&quot;)<br>print(f&quot;F1 Score (Micro) = <strong>{f1_score_micro}</strong>&quot;)<br>print(f&quot;F1 Score (Macro) = <strong>{f1_score_macro}</strong>&quot;)</pre><p>Here I have used accuracy and f1_score for now. But usually, the hamming loss and hamming score are better metrics for evaluating multi-label classification tasks. I will be discussing that in my next post.</p><p>So this is it for now. Stay tuned for the next post for more details on hamming loss, hamming score, and other things. 
If you want to read more, you can visit my profile for more posts.</p><p><a href="https://medium.com/@taunkdhaval08">Dhaval Taunk - Medium</a></p><p>If you liked my article:</p><figure><a href="https://buymeacoffee.com/taunkdhaval"><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Nh1owl6a3HdaRB_Y0y04pw.png" /></a></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=994eb448f94c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/finetune-distilbert-for-multi-label-text-classsification-task-994eb448f94c">Finetune DistilBERT for multi-label text classsification task</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[NLP Preprocessing:- A useful and important step]]></title>
            <link>https://medium.com/analytics-vidhya/nlp-preprocessing-a-useful-and-important-step-e79895c65a89?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/e79895c65a89</guid>
            <category><![CDATA[nltk]]></category>
            <category><![CDATA[text-preprocessing]]></category>
            <category><![CDATA[naturallanguageprocessing]]></category>
            <category><![CDATA[spacy]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Sun, 26 Jul 2020 06:19:01 GMT</pubDate>
            <atom:updated>2023-07-28T07:26:50.166Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0ayXED-vmUU7UTyy" /><figcaption>Source — <a href="https://s3.amazonaws.com/re-work-production/post_images/435/NLP/original.jpg?1506438363">https://s3.amuction/post_images/435/NLP/original.jpg?1506438363</a></figcaption></figure><h3>Introduction</h3><p>GPT-3 model has, for now, became a hot topic in the natural language processing field due to its performance. It has nearly 175 billion parameters in comparison to GPT-2 which had around 1.5 billion parameters. It&#39;s a major breakthrough in the field of NLP. But the preprocessing steps that are required before training any model is of utmost importance. Therefore in this article, I will be explaining all the major steps that are used and are required in preprocessing the data before training any NLP model.</p><p>First I will list out the preprocessing steps and then will explain them in detail:-</p><ol><li>Removing HTML tags</li><li>Removing stopwords</li><li>Removing extra spaces</li><li>Converting numbers to their textual representations</li><li>Lowercasing the text</li><li>Tokenization</li><li>Stemming</li><li>Lemmatization</li><li>Spell-checking</li></ol><p>Now let’s start with their explanation one by one.</p><h3>Removing HTML tags</h3><p>Sometimes the text data could contain the HTML tags along with the normal text if the data has been web-scraped from the internet. This could be removed by using python’s <strong>BeautfulSoup</strong> library because these tags will not be of any use or these tags can be removed using <strong>regex</strong> as well. The code is explained below:-</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3fffe2b4f5e9da8354dedfff346437ba/href">https://medium.com/media/3fffe2b4f5e9da8354dedfff346437ba/href</a></iframe><h3>Removing stop-words</h3><p>Many times the data contains a large number of stop-words. These might not be useful because they won’t be making any significant impact on the data. These can be removed by using <strong><em>nltk</em></strong> or <strong><em>spacy</em></strong> library. The code is shown below:-</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0d91a4938b125d5e87ee32e20fc3aef5/href">https://medium.com/media/0d91a4938b125d5e87ee32e20fc3aef5/href</a></iframe><h3>Removing extra-spaces</h3><p>There might be certain situations where the data might contain extra spaces within the sentences. These can be easily removed by python’s <strong>split() </strong>and <strong>join() </strong>functions.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/21e6d344233cd18c8dc000c155bce22a/href">https://medium.com/media/21e6d344233cd18c8dc000c155bce22a/href</a></iframe><h3>Converting numbers to their textual representations</h3><p>Converting numbers to their textual form is also much useful in NLP preprocessing steps. For this purpose, the <strong>num2words</strong> library can be used.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f60bd061ca6f74066275615cdfdcf7c7/href">https://medium.com/media/f60bd061ca6f74066275615cdfdcf7c7/href</a></iframe><h3>Lowercasing the text</h3><p>Converting all the words in data into lowercase is a good practice to remove redundancies. There might be a possibility that words may appear more than one time in the text. 
<h3>Tokenization</h3><p>Tokenization involves converting sentences into tokens, i.e., splitting the sentences into words. It is also useful to separate punctuation from the words, because the embedding layer of a model may not have an embedding for a word with punctuation attached. For example, ‘thanks.’ is a word with a full stop; tokenization will split it into [‘thanks’, ‘.’]. The code for doing this with NLTK’s <strong>word_tokenize</strong> is shown below:-</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/dfe63ed5f2641999f6131d2a7d67d112/href">https://medium.com/media/dfe63ed5f2641999f6131d2a7d67d112/href</a></iframe><h3>Stemming</h3><p>Stemming is the process of converting a word to its root form. For example:- ‘sitting’ will be converted to ‘sit’, ‘thinking’ will be converted to ‘think’, etc. NLTK’s <strong>PorterStemmer</strong> can be used for this purpose.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1cf23406481321503e4ac0471f66008a/href">https://medium.com/media/1cf23406481321503e4ac0471f66008a/href</a></iframe><h3>Lemmatization</h3><p>Many people consider lemmatization similar to stemming, but they are actually different: lemmatization performs a <strong>morphological analysis</strong> of words, which stemming does not. NLTK has an implementation of lemmatization (<strong>WordNetLemmatizer</strong>) which can be used.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3ccc642d4c058abc49e32cc194d36bbc/href">https://medium.com/media/3ccc642d4c058abc49e32cc194d36bbc/href</a></iframe><h3>Spell-checking</h3><p>The data being used may well contain spelling mistakes, so spell-checking becomes an important step in NLP preprocessing. I will be using the <strong><em>TextBlob</em></strong> library for this purpose.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/152d80fc741ca6eadacc6fa2c2da60e2/href">https://medium.com/media/152d80fc741ca6eadacc6fa2c2da60e2/href</a></iframe><p>Although the above spell-checker may not be perfect, it will still be of good use.</p>
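<p>Again, as a rough sketch of these last four steps, assuming <strong>nltk</strong> (with the punkt and wordnet data downloaded) and <strong>TextBlob</strong> are installed, and with a made-up sample sentence:-</p><pre>from nltk.tokenize import word_tokenize<br>from nltk.stem import PorterStemmer, WordNetLemmatizer<br>from textblob import TextBlob<br><br>text = &#39;She was runing late, thanks.&#39;<br><br># Spell-checking with TextBlob (e.g. &#39;runing&#39; may be corrected to &#39;running&#39;)<br>text = str(TextBlob(text).correct())<br><br># Tokenization (also separates punctuation from the words)<br>tokens = word_tokenize(text)<br><br># Stemming with PorterStemmer<br>stems = [PorterStemmer().stem(token) for token in tokens]<br><br># Lemmatization with WordNetLemmatizer (treating every token as a verb here)<br>lemmas = [WordNetLemmatizer().lemmatize(token, pos=&#39;v&#39;) for token in tokens]<br><br>print(tokens)<br>print(stems)<br>print(lemmas)</pre>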
<p>All the methods depicted above are only some of the possible techniques for these steps; other methods are available as well.</p><p>I have also created a GitHub repo collecting all the above methods in one file. You can check it by going through the below link:-</p><p><a href="https://github.com/DhavalTaunk08/NLP_Preprocessing">DhavalTaunk08/NLP_Preprocessing</a></p><p>That’s all from my side this time. Keep reading. If you want to read more, you can check out the below stories of mine:-</p><ul><li><a href="https://medium.com/analytics-vidhya/how-to-fine-tune-bert-on-text-classification-task-723f82786f61">How to fine-tune BERT on text classification task?</a></li><li><a href="https://medium.com/analytics-vidhya/l1-vs-l2-regularization-which-is-better-d01068e6658c">L1 vs L2 Regularization: Which is better</a></li><li><a href="https://medium.com/analytics-vidhya/demystifying-generative-adversarial-networks-real-vs-fake-discriminator-bb5383ce89f9">Demystifying Generative Adversarial Networks: Real vs Fake discriminator</a></li></ul><p>If you liked my article:</p><figure><a href="https://buymeacoffee.com/taunkdhaval"><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Nh1owl6a3HdaRB_Y0y04pw.png" /></a></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e79895c65a89" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/nlp-preprocessing-a-useful-and-important-step-e79895c65a89">NLP Preprocessing:- A useful and important step</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to fine-tune BERT on text classification task?]]></title>
            <link>https://medium.com/analytics-vidhya/how-to-fine-tune-bert-on-text-classification-task-723f82786f61?source=rss-a8e9132e2784------2</link>
            <guid isPermaLink="false">https://medium.com/p/723f82786f61</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[text-classification]]></category>
            <category><![CDATA[bert]]></category>
            <category><![CDATA[nlp]]></category>
            <dc:creator><![CDATA[Dhaval Taunk]]></dc:creator>
            <pubDate>Sun, 07 Jun 2020 16:38:15 GMT</pubDate>
            <atom:updated>2023-07-28T07:27:34.066Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*x3vhaoJdGndvZqmL.png" /><figcaption>Source:- <a href="https://pytorch.org/tutorials/_images/bert.png">https://pytorch.org/tutorials/_images/bert.png</a></figcaption></figure><p>BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based architecture released in the paper <strong><em>“</em></strong><a href="https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf"><strong><em>Attention Is All You Need</em></strong></a><strong><em>” </em></strong>in the year 2016 by Google. The BERT model got published in the year 2019 in the paper — “<a href="https://arxiv.org/pdf/1810.04805.pdf"><strong><em>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</em></strong></a><strong><em>”. </em></strong>When it was released, it showed the state of the art results on <a href="https://gluebenchmark.com/"><strong><em>GLUE benchmark</em></strong></a>.</p><h3>Introduction</h3><p>First, I will tell a little bit about the Bert architecture, and then will move on to the code on how to use is for the text classification task.</p><p>The BERT architecture is a multi-layer bidirectional transformer’s encoder described in the paper <a href="https://arxiv.org/pdf/1810.04805.pdf"><strong><em>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</em></strong></a>.</p><p>There are two different architecture’s proposed in the paper. <strong>BERT_base </strong>and <strong>BERT_large. </strong>The <strong>BERT base</strong> architecture has <strong>L=12, H=768, A=12</strong> and a total of around 110M parameters. Here <strong>L</strong> refers to the number of transformer blocks, <strong>H</strong> refers to the hidden size, <strong>A</strong> refers to the number of self-attention head. For <strong>BERT large</strong>, <strong>L=24, H=1024, A=16.</strong></p><figure><img alt="BERT: State of the Art NLP Model, Explained" src="https://cdn-images-1.medium.com/proxy/0*m_kXt3uqZH9e7H4w.png" /><figcaption>Source:- <a href="https://www.kdnuggets.com/2018/12/bert-sota-nlp-model-explained.html">https://www.kdnuggets.com/2018/12/bert-sota-nlp-model-explained.html</a></figcaption></figure><p>The input format of the BERT is given in the above image. I won’t get into much detail into this. You can refer the above link for a more detailed explanation.</p><h3>Source Code</h3><p>The code which I will be following can be cloned from the following <strong>HuggingFace’s</strong> GitHub repo -</p><blockquote><a href="https://github.com/huggingface/transformers/">https://github.com/huggingface/transformers/</a></blockquote><h3>Scripts to be used</h3><p>Majorly we will be modifying and using two scripts for our text classification task. One is <strong><em>glue.py, </em></strong>and the other will be <strong><em>run_glue.py. </em></strong>The file glue.py path is “<em>transformers/data/processors/” </em>and the file run_glue.py can be found in the location “<em>examples/text-classification/”.</em></p><h3>Format of data</h3><p>The format of the data is something like this. The first column is supposed to be the id column. The second column is supposed to be the column containing the labels. 
<h3>Source Code</h3><p>The code I will be following can be cloned from <strong>HuggingFace’s</strong> GitHub repo -</p><blockquote><a href="https://github.com/huggingface/transformers/">https://github.com/huggingface/transformers/</a></blockquote><h3>Scripts to be used</h3><p>Mainly, we will be modifying and using two scripts for our text classification task. One is <strong><em>glue.py, </em></strong>and the other is <strong><em>run_glue.py. </em></strong>The file glue.py is located at “<em>transformers/data/processors/”, </em>and the file run_glue.py can be found at “<em>examples/text-classification/”.</em></p><h3>Format of data</h3><p>The format of the data is as follows. The first column is supposed to be the id column, the second column is supposed to contain the labels, and the third column should contain the text that is to be classified.</p><pre>data = pd.DataFrame()<br>data[&#39;id&#39;] = [i for i in range(num_text)]<br>data[&#39;label&#39;] = labels<br>data[&#39;text&#39;] = text</pre><p>Here, <em>num_text</em> is the number of data points, <em>text</em> is the list of texts to be classified, and <em>labels</em> is the list of labels associated with the corresponding texts. You should save your data in <em>tsv</em> format without headers.</p><pre>#if data is your training file <br>data.to_csv(&#39;train.tsv&#39;, sep=&#39;\t&#39;, index=False, header=False)</pre><pre><br>#if data is your validation file<br>data.to_csv(&#39;dev.tsv&#39;, sep=&#39;\t&#39;, index=False, header=False)</pre><pre><br>#if data is your test file<br>data.to_csv(&#39;test.tsv&#39;, sep=&#39;\t&#39;, index=False, header=False)</pre><p>In your test file, you can omit the labels column if you want. I kept it because it can be used to check the model’s performance after prediction. Also, the file names can be chosen according to your convenience, but the corresponding file names then need to be changed in <em>glue.py</em>.</p><h3>Changes to be made in the script</h3><h4>glue.py</h4><blockquote>path — transformers/data/processors/glue.py</blockquote><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/636bf5c4ee379e428d4074ea8a871675/href">https://medium.com/media/636bf5c4ee379e428d4074ea8a871675/href</a></iframe><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3f6fbb75dbae28c9ff3ee5ce2017b37a/href">https://medium.com/media/3f6fbb75dbae28c9ff3ee5ce2017b37a/href</a></iframe><p>For classification purposes, one of these tasks can be selected: CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.</p><p>I will continue with the SST-2 task; the same changes can be made for the other tasks as well. The class to be modified is the SST-2 processor class in glue.py.</p><p>The following changes need to be made -</p><ol><li>Change the function <strong><em>get_labels()</em></strong>’s return list from [‘0’, ‘1’] to the list of labels present in your data.</li><li>In the <strong><em>_create_examples()</em></strong> function, change -</li></ol><pre>text_a = line[text_index]</pre><pre>⬇ ⬇ ⬇(to)</pre><pre>text_a = line[-1]</pre><p>3. In the dictionary defined as <strong><em>glue_task_num_labels, </em></strong>change the value of the key <strong><em>‘sst-2’ </em></strong>to the number of labels present in your data.</p><h4>run_glue.py</h4><blockquote>path — examples/text-classification/run_glue.py</blockquote><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5fadf9cf97b84095554ff2368cb5b95b/href">https://medium.com/media/5fadf9cf97b84095554ff2368cb5b95b/href</a></iframe><p>Changing this file is optional. Only make changes in it if you want to save the probabilities along with the predictions. Predictions can be saved by writing the predictions array shown in the above code to a text file.</p>
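<p>As a rough, hypothetical sketch of that idea (the names below are illustrative and are not taken from run_glue.py), assuming the predictions array contains one row of logits per example:-</p><pre>import numpy as np<br><br>def save_probabilities(predictions, path=&#39;probabilities.txt&#39;):<br>    # softmax over the label dimension turns the logits into class probabilities<br>    shifted = predictions - predictions.max(axis=1, keepdims=True)<br>    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)<br>    np.savetxt(path, probs, fmt=&#39;%.6f&#39;)<br><br>save_probabilities(np.array([[1.2, -0.3], [0.1, 2.5]]))</pre>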
<h3>How to run the scripts</h3><p>The script can be run with the following command -</p><pre>python ./examples/text-classification/run_glue.py \<br>    --model_name_or_path bert-base-uncased \<br>    --task_name $TASK_NAME \<br>    --do_train \<br>    --do_eval \<br>    --data_dir $GLUE_DIR/$TASK_NAME \<br>    --max_seq_length 128 \<br>    --per_device_eval_batch_size=8   \<br>    --per_device_train_batch_size=8   \<br>    --learning_rate 2e-5 \<br>    --num_train_epochs 3.0 \<br>    --output_dir /tmp/$TASK_NAME/</pre><p>I will explain the main parameters one by one -</p><ol><li>--model_name_or_path: specifies the BERT model type which you want to use.</li><li>--task_name: specifies the GLUE task to use, one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE or WNLI.</li><li>--do_train: specifies whether you want to fine-tune the model or not. If you only want to use the pre-trained model, remove this parameter.</li><li>--do_eval: pass this if you want to perform validation.</li><li>--do_predict: pass this if you want to generate predictions on the test data.</li><li>--data_dir: directory path where the training, validation and test data are saved.</li><li>--output_dir: directory path where you want to save the predictions and other generated files.</li></ol><p>All other parameters should be self-explanatory from their names. Another setting which can be used is -</p><ul><li>--save_total_limit: specifies how many checkpoint directories you want to save.</li></ul><p>All this is sufficient to fine-tune BERT on the text classification task.</p><p>That’s all from my side this time. I hope this blog post helps you in completing the specified task.</p><p>If you liked my article:</p><figure><a href="https://buymeacoffee.com/taunkdhaval"><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Nh1owl6a3HdaRB_Y0y04pw.png" /></a></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=723f82786f61" width="1" height="1" alt=""><hr><p><a href="https://medium.com/analytics-vidhya/how-to-fine-tune-bert-on-text-classification-task-723f82786f61">How to fine-tune BERT on text classification task?</a> was originally published in <a href="https://medium.com/analytics-vidhya">Analytics Vidhya</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>