A nostalgic trip down Transformer Lane — Sentiment/Text Classification using Transformers
A tour of the basics of attention, positional encodings and multi-head attention, as used in a simple transformer-based text classification project.
Table of Contents
- Averaged Attention
- Self Attention
- Multi Head Self Attention
- Absolute Positional Encoding
- Relative Positional Encoding
- Transformer Architecture for Classification
- Sentiment Analysis on IMDB Movie Reviews
Averaged Attention:
Attention is a concept in NLP for deriving the relative importance of the sequence of words in an input sentence.
It is a weighted average of the inputs, where the weights come from applying a score function to each input and normalizing the scores with softmax.
Note: Averaged attention as described here extrapolates from the generalized concepts outlined in Bahdanau et al. (2014).
Steps:
- Output from a neural network, hᵢ
- Apply a scoring function
- Apply softmax to get a probability distribution
- Multiply the output of softmax with the corresponding inputs to get a weighted average.
The α is calculated as a softmax over the output of the score function, and the score function is simply the scaled dot product between hᵢ and the context.
Applying a softmax function over the timesteps provides a probability distribution which sums to 1.
This allows us to filter out the timesteps with low probabilities as they would have a relative importance of zero or close to zero.
From an intuition standpoint, we are choosing to pick the words that are pertinent to the current context.
Note: When computing the score, the dot product is scaled in order to mitigate the vanishing gradients issue.
The context is calculated as the average of the features from the output of the neural network layer (hᵢ).
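The steps above can be sketched in plain Python. This is a minimal illustration rather than the project's actual code: the function name `averaged_attention`, the toy input `h`, and the choice of `sqrt(D)` as the scaling factor are assumptions made for the example.

```python
import math


def averaged_attention(h):
    """h: list of T feature vectors (each a list of D floats)."""
    T, D = len(h), len(h[0])

    # Context: average of the features over all timesteps
    context = [sum(h[t][d] for t in range(T)) / T for d in range(D)]

    # Score: dot product between each h_t and the context,
    # scaled by sqrt(D) (assumed scaling factor)
    scores = [
        sum(a * b for a, b in zip(h_t, context)) / math.sqrt(D) for h_t in h
    ]

    # Softmax over the timesteps (max subtracted for numerical stability)
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]

    # Weighted average of the inputs using the attention weights
    output = [sum(alpha[t] * h[t][d] for t in range(T)) for d in range(D)]
    return alpha, output


# Toy input: 3 timesteps, 2 features each
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, output = averaged_attention(h)
```

In this toy input the third timestep is the one most aligned with the average context, so it receives the largest weight.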
Self Attention:
Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. — Attention is All You Need, 2017.
Self-attention is based on queries (Q), keys (K) and values (V), which are partial linear representations derived from the input vectors/embeddings.
Note: This relates to the basic concepts outlined in Vaswani et al. (2017).
Every input vector is used in 3 ways for the self-attention operation.
- queries ask about specific inputs across all vectors
- keys provide the similarity between themselves and the queries
- values provide output in proportion to the corresponding similarity score between queries and keys; in essence, the score is the relative importance of the value
All of these linear representations have learnable weights, e.g. an nn.Linear layer in PyTorch.
Another way to look at this abstraction is through the lens of CNN filters. In a very broad sense, Q, K and V are analogous to 3 CNN filters applied with padding, which work on portions of an image and have learnable weights.
Ignoring the mechanics of how convolutions work and their specific properties, and focusing only on the structure, 3 CNN filters are roughly similar to Q, K and V in that each is a partial linear transformation with learnable weights.
# A very broad high level analogy
# Example of 3 x 1D convolution filters.
# Each filter has its own learnable weights
import torch
import torch.nn as nn

inputs = torch.randn(20, 8, 96) # (B,T,D)
conv1d_filter_1 = nn.Conv1d(kernel_size=4, in_channels=96, out_channels=32, stride=1, padding='same')
conv1d_filter_2 = nn.Conv1d(kernel_size=4, in_channels=96, out_channels=32, stride=1, padding='same')
conv1d_filter_3 = nn.Conv1d(kernel_size=4, in_channels=96, out_channels=32, stride=1, padding='same')
# nn.Conv1d expects (B, D, T), hence the permute before and after each filter
output_1 = conv1d_filter_1(inputs.permute(0, 2, 1)).permute(0, 2, 1) # (B, T, D/3)
output_2 = conv1d_filter_2(inputs.permute(0, 2, 1)).permute(0, 2, 1) # (B, T, D/3)
output_3 = conv1d_filter_3(inputs.permute(0, 2, 1)).permute(0, 2, 1) # (B, T, D/3)
final_out = torch.cat([output_1, output_2, output_3], dim=2) # (B, T, D)
# It's always good to fact check that I am not totally off base here
# Found these comments under the HuggingFace github repo located at
# https://github.com/huggingface/transformers/blob/main/src/transformers/pytorch_utils.py#L93
#
# "1D-convolutional layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2)."
# "Basically works like a linear layer but the weights are transposed."
# ☺☺☺☺☺☺
These Q, K, V linear representations allow us to focus on the important aspects of the embeddings generated from the input sequence.
One more way to think about the Q, K, V function: these provide contextualized embeddings.
The dot product operations on (Q,K,V) show how each item relates to the other items in the sequence.
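A minimal single-head sketch of the Q, K, V flow in PyTorch follows. The layer names (`to_query`, `to_key`, `to_value`), the tensor sizes, the bias-free projections and the `sqrt(D)` scaling are illustrative assumptions, not the project's exact implementation.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

B, T, D = 2, 5, 16  # batch, timesteps, embedding size (illustrative values)
x = torch.randn(B, T, D)

# Three independent learnable projections of the same input
to_query = nn.Linear(D, D, bias=False)
to_key = nn.Linear(D, D, bias=False)
to_value = nn.Linear(D, D, bias=False)

q, k, v = to_query(x), to_key(x), to_value(x)  # each (B, T, D)

# Similarity of every query with every key, scaled by sqrt(D)
scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(D)  # (B, T, T)

# Softmax over the key axis: each row is a distribution over the sequence
weights = torch.softmax(scores, dim=-1)  # (B, T, T)

# Each output position is a weighted mix of all value vectors
contextualized = torch.matmul(weights, v)  # (B, T, D)
```

The `weights` tensor is the kind of (T, T) matrix that an attention heat-map visualizes: row `i` shows how much position `i` attends to every other position.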
Sample heat-map output of attention on a trained model illustrated below.
Multi Head Self Attention:
Instead of using a single (Q,K,V) linear projection, the (Q,K,V) is projected ‘h’ times. As we are working with the same dimensions as the input embeddings, the embedding dimension should be fully divisible by the number of heads ‘h’.
For example, for an input embedding size of 512, we can use 8 heads, and each (Q,K,V) linear representation will have 64 dimensions.
Multi-head attention allows us to compute these individual head (Q,K,V) representations in parallel. Additionally, the heads provide information about the important aspects from different representation subspaces at different positions, i.e. each head assigns different weights to each of the words in the input sequence.
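The 512 / 8 = 64 split can be sketched as a reshape in PyTorch. The tensor `q` below is a random stand-in for an already projected query, and the batch and sequence sizes are arbitrary; only the 512 and 8 come from the example above.

```python
import torch

B, T, D, H = 2, 10, 512, 8  # batch, timesteps, embedding size, heads

# The embedding size must be divisible by the number of heads
assert D % H == 0
dims_per_head = D // H  # 64

q = torch.randn(B, T, D)  # stand-in for a projected query tensor

# Split the last dimension into H heads and move the head axis forward,
# so attention can be computed for all heads in parallel
q_heads = q.view(B, T, H, dims_per_head).transpose(1, 2)  # (B, H, T, 64)

# After attention, the heads are merged back into a single D-sized vector
merged = q_heads.transpose(1, 2).contiguous().view(B, T, D)  # (B, T, D)
```

The same reshape would be applied to the keys and values, so each head works on its own 64-dimensional slice.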
Note: Self-attention ignores the sequential nature of the input, hence we supplement the input with absolute or relative positional encodings.
Absolute Positional Encoding:
As attention does not capture the order of the timesteps in a sequence, we need some way to inject the relative or absolute position of each timestep.
One way to add absolute positioning is to use a combination of sine and cosine frequencies that can represent a single point in time for each timestep.
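One common formulation of this idea is the sinusoidal encoding from Vaswani et al. (2017), where even feature indexes use sine and odd indexes use cosine at geometrically spaced frequencies. The sketch below assumes that formulation; the function name `sinusoidal_positions` and the sizes are illustrative.

```python
import math


def sinusoidal_positions(num_positions, dim):
    """Return a (num_positions x dim) table of sinusoidal encodings."""
    assert dim % 2 == 0, "dim is assumed to be even in this sketch"
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            # Each sine/cosine pair shares one frequency
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))  # even feature index
            row.append(math.cos(angle))  # odd feature index
        table.append(row)
    return table


pe = sinusoidal_positions(num_positions=4, dim=6)
```

Each row of `pe` is added to the token embedding at that position, so identical words at different timesteps end up with different inputs.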
Relative Positional Encoding:
We have seen how absolute positional encodings add order information to the attention for the input sequence.
However, for long sequences we would have to encode the entire sequence at once. This restricts the length of the input sequence: models can only process so much due to system / GPU memory constraints.
Most models generally process sequence lengths between 512 and 1024, although there are models that can process longer sequences.
Chunking is one way of solving the issue of processing long sequences, though it's not without its own nuances.
For example:
“Employ your time in improving yourself by other men’s writings so that you shall come easily by what others have labored hard for.”
― Socrates
If we split it by 10 words, we get:
Chunk 1 — Employ your time in improving yourself by other men’s writings
Chunk 2 — so that you shall come easily by what others have
Chunk 3 — labored hard for.
On processing these chunks, the positional encodings for chunk 1, chunk 2 and chunk 3 will be the same, i.e. ‘Employ’, ‘so’, ‘labored’ will have position 0. This is commonly referred to as context fragmentation.
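The 10-word split above can be reproduced with a few lines of Python, which also makes the position reset visible. The variable names are illustrative.

```python
quote = (
    "Employ your time in improving yourself by other men's writings "
    "so that you shall come easily by what others have labored hard for."
)

words = quote.split()
chunk_size = 10

# Split the word list into consecutive chunks of at most 10 words
chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

# When each chunk is processed on its own, positions restart from 0
local_positions = [list(range(len(chunk))) for chunk in chunks]

# The first word of every chunk ends up at position 0
first_words = [chunk[0] for chunk in chunks]
```

Here `first_words` holds 'Employ', 'so' and 'labored', all of which are assigned position 0 within their own chunk.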
To preserve this context and provide a means to process long sequences, one can make use of relative positional representations, which can be generated by using positional embeddings.
Positional embeddings are learnable positional encoding vectors whose input is the position of the token within the sequence, as illustrated in the code snippet below.
Note: Updated the logic for relative positional embeddings.
# MultiHeadAttention class
# By the time we get down to the Multi Head Attention class,
# we should have done the following
# 1. Transformer Encoder - Added absolute positions to the input
# 2. Passed down the input to this MHSA class
# -----
# ==== Constructor Section ====
# Add the learnable relative position embeddings to the input embeddings
# Relative Positional Embeddings adapted from huggingface -> Bert transformer
# https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py#L343
self.distance_embedding = nn.Embedding((2 * max_seq_len) - 1, embedding_dim=self.dims_per_head) # ( ( 2 * T) - 1, D)
# ==== forward Function Section ====
batch_size = input.size(0)
timesteps = input.size(1)
# Compute the query, key and value vectors
# Compute the attention scores
attention_scores = torch.matmul(attn_query_slice, attn_key_slice.transpose(-2, -1)) / (self.dims_per_head ** 0.5) # (B, H, T, T), scaled by sqrt(dims_per_head)
# Compute the range of position values for the query vector
position_ids_l = torch.arange(timesteps, dtype=torch.long, device=input.device).view(-1, 1) # column vector (T, 1)
# Compute the range of position values for the key vector
position_ids_r = torch.arange(timesteps, dtype=torch.long, device=input.device).view(1, -1) # row vector (1, T)
# Compute the relative distance between the positions of query and key vectors
distance = position_ids_l - position_ids_r # matrix of (T,T)
# Example : with max_seq_len = 8
# compute the positional embedding with the relative distance matrix
# each row of the matrix is the current position of the attention item
# each index on that row, gives the corresponding distance b/t the current attention
# position and the other tokens at the indexes
# see matrix illustration for a max_seq length = 8
# distance -> (T, T)
#
# [[ 0, -1, -2, -3, -4, -5, -6, -7],
# [ 1, 0, -1, -2, -3, -4, -5, -6],
# [ 2, 1, 0, -1, -2, -3, -4, -5],
# [ 3, 2, 1, 0, -1, -2, -3, -4],
# [ 4, 3, 2, 1, 0, -1, -2, -3],
# [ 5, 4, 3, 2, 1, 0, -1, -2],
# [ 6, 5, 4, 3, 2, 1, 0, -1],
# [ 7, 6, 5, 4, 3, 2, 1, 0]]
#
positional_embedding = self.distance_embedding(distance + (self.max_seq_len - 1)) # (T,T, s)
# Einsum Note: This function uses opt_einsum (https://optimized-einsum.readthedocs.io/en/stable/)
# to speed up computation or to consume less memory by optimizing contraction order.
# Generate the relative position scores b/t the query and relative positional embedding
relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", attn_query_slice, positional_embedding) # (B,H,T,s) @ (T,T,s) -> (B,H,T,T)
# Generate the relative position scores b/t the key and the relative positional embeddings
relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", attn_key_slice, positional_embedding) # (B,H,T,s) @ (T,T,s) -> (B,H,T,T)
# Add the relative query, key relative embeddings to the already computed attention scores
attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key # (B, H, T, T)
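To see why the embedding table needs (2 * max_seq_len) - 1 rows, the index arithmetic from the snippet can be replayed in plain Python. This is only a sketch of the indexing step, using the same max_seq_len = 8 as the matrix in the comments; it does not involve any learnable weights.

```python
max_seq_len = 8
timesteps = 8  # length of the current sequence (here equal to max_seq_len)

# Relative distance between every query position l and key position r
distance = [[l - r for r in range(timesteps)] for l in range(timesteps)]

# Distances range from -(max_seq_len - 1) to +(max_seq_len - 1), so they are
# shifted by (max_seq_len - 1) to become valid, non-negative embedding indexes
indices = [[d + (max_seq_len - 1) for d in row] for row in distance]

# One embedding row is needed for every distinct relative distance
num_embeddings = (2 * max_seq_len) - 1

smallest = min(min(row) for row in indices)
largest = max(max(row) for row in indices)
```

With these values the shifted indexes run from 0 to 14, which is exactly the range covered by a table with 15 rows.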
Note: There is another way of adding relative positional encodings to the attention (Q, K, V) vectors, as outlined in the context of machine translation (Shaw et al., 2018), music generation (Huang et al., 2018) and Transformer-XL (Dai et al., 2019).
Transformer Architecture for Classification:
The architecture proposed by Vaswani et al. (2017) has an encoder block and a decoder block, which are used together for applications such as machine translation.
For the purpose of this tutorial, as we are exploring sentiment analysis using transformers, we will focus on the transformer encoder block as illustrated below.
Transformer Blocks:
Encoder Block: This can be used in lieu of other sequence-based models for classification tasks like sentiment analysis.
Decoder Block: In the original architecture, this is used together with an encoder block, like a Seq2Seq model.
Machine translation would be a prime example of using Transformer Encoder/Decoder blocks together.
The output of the encoder block is used as keys(K) and values(V) for the 2nd multi-head-attention present in the decoder block.
Data Loading:
For this exercise, we load the training data in chunks. As the IMDB reviews can get quite lengthy, we break them down into overlapping chunks.
For a text length of 2500 and a chunk size of 1024, the text is split at the following indexes, and every chunk is assigned the same label as the full review.
- 0–1024
- 512–1536
- 1024–2048
- 1536–2500 (the next chunk end, 2560, or <end of text>, whichever is smaller)
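The index list above corresponds to a sliding window with a chunk size of 1024 and a step of 512 (half the chunk size). The sketch below reproduces it; the function name `overlapping_chunks` and the stopping rule (stop once a chunk reaches the end of the text) are assumptions made to match the listed indexes, and the unit of length (tokens or characters) is left as in the text.

```python
def overlapping_chunks(text_len, chunk_size=1024, stride=512):
    """Return (start, end) index pairs of overlapping chunks."""
    spans = []
    start = 0
    while True:
        # The last chunk is clipped at the end of the text
        end = min(start + chunk_size, text_len)
        spans.append((start, end))
        if end >= text_len:
            break
        start += stride
    return spans


spans = overlapping_chunks(text_len=2500)

# Every chunk inherits the label of the review it came from
label = "positive"  # illustrative label
labelled_chunks = [(span, label) for span in spans]
```

Each chunk shares half of its content with the previous one, so text cut at one chunk boundary still appears intact in the neighbouring chunk.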
Sentiment Analysis on IMDB Movie Reviews:
Using a vanilla transformer encoder with relative positional embeddings, along with overlapping chunked inputs, we were able to get over 85% accuracy while training it only on 50k IMDB reviews.
Hyper parameters:
I spent a lot of time tuning the model so that it would infer decent results with the small IMDB movie reviews dataset.
Eventually, after many model training cycles, I used the Optuna framework to get a list of optimal hyper parameters.
All the hyper parameters listed below are generated from Optuna.
Updated: Charts / Hyper parameters updated based on the relative positional embedding logic which was revised later.
- Model dimension —196
- Number of attention heads — 2
- Number of Encoder Block layers — 2
- Position-wise Feed-Forward Network dimension —1752
- Optimizer (RAdam) Learning Rate — 1.9625100055440772e-05
- Dropout Layer Probability
* Input Layer Dropout — 0.22734722827397966
* MHA Output Layer Dropout — 0.2555056871254693
* Point-wise Feed-Forward Layer Dropout — 0.2842864429236998
* Attention Score Dropout — 0.361423285117754
* MHA Combined Projection Dropout — 0.23531291320588407
Sentiment Analysis on Sample Fictional Reviews:
_________________________________________________________________
Sentiment: positive | tensor([[-0.4685, -1.3289]])
“It’s distinguished itself just enough to satiate action film fans, entertain future streaming audiences and warrant further merging into the DC universe.”
____________________________________________________________
Sentiment: negative | tensor([[-3.5497, 0.9918]])
“I debated as to whether or not I should tick the spoiler box. Since 99% of this show has probably already been seen by any follower of Scrubs it probably doesn’t come under the category of a spoiler.Not the best of the films to be watched nowadays.”
______________________________________________________________
Sentiment: positive | tensor([[ 21.9607, -17.2190]])
”This film is not great, it’s awesome !”
_________________________________________________________________
Sentiment: negative | tensor([[-6.6202, 4.3013]])
”This film is not great, it’s terrible !”
_________________________________________________________________
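The sample outputs above pair a label with a two-element score tensor. Judging only from those four examples, the larger score at index 0 goes with "positive" and the larger score at index 1 with "negative". The sketch below encodes that inferred ordering; the `LABELS` tuple and the `predict` helper are assumptions for illustration, not the project's code.

```python
import math

# Assumed class ordering, inferred from the sample outputs above
LABELS = ("positive", "negative")


def predict(scores):
    """Map a pair of raw model scores to a label and its softmax probability."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the class with the highest score
    best = max(range(len(scores)), key=lambda i: scores[i])
    return LABELS[best], probs[best]


label, confidence = predict([-0.4685, -1.3289])
```

A large gap between the two scores, as in the third sample, corresponds to a probability very close to 1 for the predicted class.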
Takeaways:
One of the challenges with transformers is trying to use them with little or sparse data.
Overfitting is a common problem when using transformer-based models with a small dataset.
Sample overfitting curve illustrated below.
For reducing overfitting, one can try a combination of various methods such as:
- Use augmented synthetic datasets
- Reduce the dimensions for input embeddings
- Reduce the number of attention heads
- Reduce the number of encoder layers
- Use dropout layers, e.g. the nn.Dropout layer in PyTorch
- Early stopping
- Use a hyper-parameter tuning framework, e.g. Optuna (supports PyTorch, TensorFlow, XGBoost, etc.)
- Use a pre-trained transformer with a fine-tuned FC layer
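As an illustration of one item in the list, early stopping can be sketched as a small helper that watches the validation loss. The class name `EarlyStopping`, its parameters and the loss values below are made up for the example and are not taken from the project.

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the counter
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improving at first, then overfitting
val_losses = [0.70, 0.55, 0.50, 0.52, 0.53, 0.56, 0.60]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
```

In a real training loop the model weights from the best epoch would typically be saved and restored once the loop stops.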
References:
- The Annotated Transformer
- Transformers from scratch
- Relative Positional Encoding
- How Self-Attention with Relative Position Representations works
- Understanding einsum for Deep learning: implement a transformer with multi-head self-attention from scratch
- Manning — Inside Deep Learning, Math, Algorithms, Models
- Positional Encoding
- Position Wise Embedding
- Transformer Positional Embeddings and Encodings
- Getting meaning from text: self-attention step-by-step video
- Peter Shaw, Jakob Uszkoreit, Ashish Vaswani. Self-Attention with Relative Position Representations arXiv:1803.02155.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need arXiv:1706.03762.
- Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, Douglas Eck. Music Transformer arXiv:1809.04281.
- Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context arXiv:1901.02860.
- Ali Hassani, Steven Walton, Nikhil Shah, Abulikemu Abuduweili, Jiachen Li, Humphrey Shi. Escaping the Big Data Paradigm with Compact Transformers arXiv:2104.05704.
- Analyzing and Controlling Inter-Head Diversity in Multi-Head Attention
- Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate arXiv:1409.0473.
- Wonpyo Park, Woonggi Chang, Donggeon Lee, Juntae Kim, Seung-won Hwang. GRPE: Relative Positional Encoding for Graph Transformer arXiv:2201.12787.