A nostalgic trip down Transformer Lane — Sentiment/Text Classification using Transformers

Kaustav Mandal · Published in exemplifyML.ai
12 min read · Dec 15, 2022

A tour of the basics of attention, positional encodings and multi-head attention, as used in a simple transformer-based text classification project.

Table of Contents

  • Averaged Attention
  • Self Attention
  • Multi Head Self Attention
  • Absolute Positional Encoding
  • Relative Positional Encoding
  • Transformer Architecture for Classification
  • Sentiment Analysis on IMDB Movie Reviews

Averaged Attention:

Attention is a concept in NLP for deriving the relative importance of each word in an input sequence.

It is a weighted average of the inputs, where the weights come from applying a score function to those inputs.
Note: The generalized concepts outlined in Bahdanau et al. (2014) are extrapolated here in regard to averaged attention.

Steps:

  • Output of a neural network layer, hᵢ
  • Apply a scoring function to each hᵢ
  • Apply softmax over the scores to get a probability distribution
  • Multiply the softmax output with the corresponding inputs to get a weighted average

Figure 1. Architecture for averaged attention context (Image by Author)

The α is calculated as a softmax over the output of the score function, and the score function is simply the scaled dot product between hᵢ and the context.

Applying a softmax function over the timesteps provides a probability distribution which sums up to 1.
This allows us to filter out the timesteps with low probabilities as they would have a relative importance of zero or close to zero.
From an intuition standpoint, we are choosing to pick the words that are pertinent to the current context.

Note: When computing the score, the dot product is scaled in order to keep the softmax from saturating into regions with extremely small gradients.

Figure 2. Equations for weights, score computation (Image by Author)

The context is calculated as the average of the features from the output of the neural network layer (hᵢ).

Figure 3. Averaged features from neural network output (Image by Author)
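Putting the steps above together, here is a minimal sketch of averaged attention in PyTorch; the shapes and variable names are illustrative and not the article's exact implementation.

import torch
import torch.nn.functional as F

B, T, D = 4, 10, 64                      # batch, timesteps, feature dims
h = torch.randn(B, T, D)                 # output of a neural network layer, h_i

context = h.mean(dim=1)                  # (B, D) averaged features over the timesteps
scores = torch.einsum('btd,bd->bt', h, context) / (D ** 0.5)  # scaled dot product score
alpha = F.softmax(scores, dim=1)         # (B, T) probability distribution over timesteps
weighted_avg = torch.einsum('bt,btd->bd', alpha, h)           # (B, D) weighted average / attention context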

Self Attention:

Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. — Attention is All You Need, 2017.

Self Attention is based on queries (Q), keys (K) and values (V), which are partial linear representations derived from the input vectors/embeddings.
Note: This is in relation to the basic concepts outlined in Vaswani et al. (2017).

Figure 4. Scaled Dot Product Attention (Image by Author)

Every input vector is used in 3 ways for the self-attention operation.

  • queries look up specific inputs across all vectors
  • keys provide a similarity between themselves and the queries
  • values provide an output proportional to the corresponding similarity score between queries and keys; in essence, this is the relative importance of the value

All of these linear representations have learnable weights, i.e. an nn.Linear layer in PyTorch.

Figure 5. Self-attention — input represented by partial linear representations — Q, K, V (Image by Author) — Adapted from here

Another way to look at this abstraction is through the lens of CNN filters. In a broad sense, Q, K, V are analogous to 3 CNN filters applied with padding, which work on portions of an image and have learnable weights.
Ignoring the mechanics of how convolutions work and their specific properties, and focusing only on the structure, the 3 CNN filters are roughly similar to Q, K, V in that they are partial linear transformations with learnable weights.

Figure 6. (Source — https://twitter.com/martin_gorner)
# A very broad, high-level analogy
# Example of 3 x 1D convolution filters.
# Each filter has its own learnable weights
import torch
import torch.nn as nn

input = torch.randn(20, 8, 96) # (B, T, D)

conv1d_filter_1 = nn.Conv1d(kernel_size=4, in_channels=96, out_channels=32, stride=1, padding='same')
conv1d_filter_2 = nn.Conv1d(kernel_size=4, in_channels=96, out_channels=32, stride=1, padding='same')
conv1d_filter_3 = nn.Conv1d(kernel_size=4, in_channels=96, out_channels=32, stride=1, padding='same')

# Conv1d expects (B, D, T), hence the permutes
output_1 = conv1d_filter_1(input.permute(0, 2, 1)).permute(0, 2, 1) # (B, T, D/3)
output_2 = conv1d_filter_2(input.permute(0, 2, 1)).permute(0, 2, 1) # (B, T, D/3)
output_3 = conv1d_filter_3(input.permute(0, 2, 1)).permute(0, 2, 1) # (B, T, D/3)

final_out = torch.cat([output_1, output_2, output_3], dim=2) # (B, T, D)

# It is always good to fact check that I am not totally off base here.
# Found these comments in the HuggingFace github repo located at
# https://github.com/huggingface/transformers/blob/main/src/transformers/pytorch_utils.py#L93
#
# "1D-convolutional layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2)."
# "Basically works like a linear layer but the weights are transposed."

These Q, K, V linear representations allow us to focus on the important aspects of the embeddings generated from the input sequence.
Another way to think about the Q, K, V functions: they provide contextualized embeddings.

Figure 7. Transformation of input embeddings into Q,K,V weighted linear representations (Image by Author) — Adapted from here

The dot product operations between the queries and keys provide a measure of how each item in the sequence relates to the other items.
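As a concrete illustration, here is a minimal sketch of single-head scaled dot-product self-attention, assuming an input of shape (B, T, D); the Q, K, V projections are plain nn.Linear layers and the names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, D = 4, 10, 64
x = torch.randn(B, T, D)                     # input embeddings

# Q, K, V are linear representations of the input with learnable weights
to_q, to_k, to_v = nn.Linear(D, D), nn.Linear(D, D), nn.Linear(D, D)
q, k, v = to_q(x), to_k(x), to_v(x)          # each (B, T, D)

scores = torch.matmul(q, k.transpose(-2, -1)) / (D ** 0.5)   # (B, T, T) how each item relates to the others
attn = F.softmax(scores, dim=-1)             # (B, T, T) attention weights
out = torch.matmul(attn, v)                  # (B, T, D) weighted values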

A sample heat-map of attention scores from a trained model is illustrated below.

Figure 8. Self Attention with (Additive Attention) generated by the model for a sentence. Notice the word pair (‘great’, ‘awesome’) has relatively higher scores. Conversely, (‘great’,’effects’) has a lower score, probably due to the training set not having a lot of examples with these together. (Image by Author)

Multi Head Self Attention:

Instead of using a single (Q, K, V) linear projection, the (Q, K, V) is projected ‘h’ times. As we are working with the same dimensions as the input embeddings, the embedding dimension should be fully divisible by the number of heads ‘h’.

Figure 9. Multi Head Self Attention (Source — https://arxiv.org/abs/1706.03762)

For example, with an input embedding size of 512, we can use 8 heads, and each (Q, K, V) linear representation will have 64 dimensions per head.

Dimensions of Q, K, V based on input embedding dims=512 and attention heads = 8

Multi-head attention allows us to compute these individual per-head (Q, K, V) representations in parallel. Additionally, the heads provide information about important aspects of representation subspaces at different positions, i.e. each head assigns different weights to each of the words in the input sequence.
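A minimal sketch of the head splitting, assuming the example above (embedding size 512, 8 heads, 64 dims per head); the names are illustrative and this is not the article's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, D, H = 4, 10, 512, 8         # D must be divisible by H
d_head = D // H                    # 64 dims per head

x = torch.randn(B, T, D)
to_q, to_k, to_v = nn.Linear(D, D), nn.Linear(D, D), nn.Linear(D, D)

def split_heads(t):                # (B, T, D) -> (B, H, T, d_head)
    return t.view(B, T, H, d_head).transpose(1, 2)

q, k, v = split_heads(to_q(x)), split_heads(to_k(x)), split_heads(to_v(x))

scores = torch.matmul(q, k.transpose(-2, -1)) / (d_head ** 0.5)    # (B, H, T, T) computed for all heads in parallel
attn = F.softmax(scores, dim=-1)
out = torch.matmul(attn, v)                                        # (B, H, T, d_head)
out = out.transpose(1, 2).contiguous().view(B, T, D)               # concatenate heads back to (B, T, D)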

Figure 10. Aspects of an object from different view points (Source — https://pixy.org/4537282/)
Figure 11. Visualization of attention weights in single-head attention and multi-head attention. Each head in multi-head attention assigns different weights to each word. (Source — https://www.mdpi.com/2076-3417/11/4/1548)

Note: Self-attention ignores the sequential nature of the input, hence we supplement the input with absolute or relative positional encodings.

Absolute Positional Encoding:

As attention does not preserve the order of the timesteps, we need some way to inject the relative or absolute position of each timestep.

One way to add absolute positioning is to use a combination of sine and cosine frequencies that represent a single point in time for each timestep.

Figure 12. Sine and Cosine frequencies generation for positional encodings (Image by Author)
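For reference, here is a minimal sketch of the sinusoidal encodings along the lines of Vaswani et al. (2017): even dimensions use sine, odd dimensions use cosine, and the result is added to the input embeddings. The function name is illustrative and it assumes an even embedding dimension.

import math
import torch

def sinusoidal_positional_encoding(max_seq_len: int, dims: int) -> torch.Tensor:
    """Absolute positional encodings of shape (max_seq_len, dims); assumes dims is even."""
    position = torch.arange(max_seq_len).unsqueeze(1)                              # (T, 1)
    div_term = torch.exp(torch.arange(0, dims, 2) * (-math.log(10000.0) / dims))   # (dims/2,)
    pe = torch.zeros(max_seq_len, dims)
    pe[:, 0::2] = torch.sin(position * div_term)     # even indices -> sine
    pe[:, 1::2] = torch.cos(position * div_term)     # odd indices  -> cosine
    return pe

# Added (not multiplied) to the input embeddings: x = x + pe[:T].unsqueeze(0)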

Relative Positional Encoding:

We have seen how adding absolute positional encodings to the inputs injects order information into the attention over the input sequence.

However, for long sequences we would have to encode the entire sequence at once. This restricts the length of the input sequence, as models can only process so much due to system / GPU memory constraints.

Most models generally process sequence lengths between 512 and 1024 tokens, although there are models that can process longer sequences.

Chunking is one way of solving the issue of processing long sequences, though it's not without its own nuances.

For example:

“Employ your time in improving yourself by other men’s writings so that you shall come easily by what others have labored hard for.”
― Socrates

If we split it by 10 words, we get:

Chunk 1 — Employ your time in improving yourself by other men’s writings
Chunk 2 — so that you shall come easily by what others have
Chunk 3 — labored hard for.

On processing these chunks, the positional encodings for chunk 1, chunk 2 and chunk 3 will be the same, i.e. ‘Employ’, ‘so’, ‘labored’ will have position 0. This is commonly referred to as context fragmentation.

Figure 13. Example of a chunked sentence, with same absolute positional encoding, with max sequence length 10

To preserve this context and provide a means to process long sequences, one can make use of relative positional representations, which can be generated using positional embeddings.

Positional embeddings are learnable positional encoding vectors whose input is the position of the token within the sequence, as illustrated in the code snippet below.

Note: Updated the logic for relative positional embeddings.

# MultiHeadAttention class
# By the time we get down to the Multi Head Attention class,
# we should have done the following
# 1. Transformer Encoder - Added absolute positions to the input
# 2. Passed down the input to this MHSA class
# -----

# ==== Constructor Section ====
# Add the learnable relative position embeddings to the input embeddings
# Relative Positional Embeddings adapted from huggingface -> Bert transformer
# https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py#L343
self.distance_embedding = nn.Embedding((2 * max_seq_len) - 1, embedding_dim=self.dims_per_head) # ( ( 2 * T) - 1, D)

# ==== forward Function Section ====
batch_size = input.size(0)
timesteps = input.size(1)

# Compute the query, key and value vectors
# Compute the attention scores
attention_scores = torch.matmul(attn_query_slice, attn_key_slice.transpose(-2, -1)) / (self.dims_per_head ** (0.25)) # ( B, H, T, T)

# Compute the range of position values for the query vector
position_ids_l = torch.arange(query_length, dtype=torch.long, device=input.device).view(-1, 1) # column vector (T, 1)
# Compute the range of position values for the key vector
position_ids_r = torch.arange(key_length, dtype=torch.long, device=input.device).view(1, -1) # row vector (1, T)

# Compute the relative distance between the positions of query and key vectors
distance = position_ids_l - position_ids_r # matrix of (T,T)

# Example : with max_seq_len = 8
# compute the positional embedding with the relative distance matrix
# each row of the matrix is the current position of the attention item
# each index on that row, gives the corresponding distance b/t the current attention
# position and the other tokens at the indexes
# see matrix illustration for a max_seq length = 8
# distance -> (T, T)
#
# [[ 0, -1, -2, -3, -4, -5, -6, -7],
# [ 1, 0, -1, -2, -3, -4, -5, -6],
# [ 2, 1, 0, -1, -2, -3, -4, -5],
# [ 3, 2, 1, 0, -1, -2, -3, -4],
# [ 4, 3, 2, 1, 0, -1, -2, -3],
# [ 5, 4, 3, 2, 1, 0, -1, -2],
# [ 6, 5, 4, 3, 2, 1, 0, -1],
# [ 7, 6, 5, 4, 3, 2, 1, 0]]
#
positional_embedding = self.distance_embedding(distance + (self.max_seq_len - 1)) # (T,T, s)

# Einsum Note: This function uses opt_einsum (https://optimized-einsum.readthedocs.io/en/stable/)
# to speed up computation or to consume less memory by optimizing contraction order.

# Generate the relative position scores b/t the query and relative positional embedding
relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", attn_query_slice, positional_embedding) # (B,H,T,s) @ (T,T,s) -> (B,H,T,T)

# Generate the relative position scores b/t the key and the relative positional embeddings
relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", attn_key_slice, positional_embedding) # (B,H,T,s) @ (T,T,s) -> (B,H,T,T)

# Add the relative query, key relative embeddings to the already computed attention scores
attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key # (B, H, T, T)

Note: There is another way of adding relative positional encodings to the attention (Q, K, V) vectors, as outlined in the context of machine translation (Shaw et al., 2018), music generation (Huang et al., 2018) and the Transformer-XL paper (Dai et al., 2019).

Transformer Architecture for Classification:

The architecture proposed by Vaswani et al. (2017) has an encoder and a decoder block, and is used for applications such as machine language translation.

For the purpose of this tutorial, as we are exploring sentiment analysis using transformers, we will focus on the transformer encoder block, as illustrated below.

Figure 14. Transformer encoder block based architecture for classification (Image by Author)

Transformer Blocks:

Encoder Block — This can be used in lieu of other sequence based models for classification tasks like sentiment analysis.

Decoder Block — In the original architecture, this is used together with an encoder block, as in a Seq2Seq model.
Machine translation is a prime example of using the Transformer encoder and decoder blocks together.
The output of the encoder block is used as the keys (K) and values (V) for the 2nd multi-head attention present in the decoder block.
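As a rough sketch of the encoder-only classifier in Figure 14, the snippet below uses PyTorch's built-in nn.TransformerEncoder with learnable absolute positions for brevity, instead of the article's custom encoder blocks with relative positional embeddings; names and defaults are illustrative.

import torch
import torch.nn as nn

class TransformerClassifier(nn.Module):
    # Embedding -> positional encoding -> encoder blocks -> pooling -> FC classifier
    def __init__(self, vocab_size, dims=196, heads=2, layers=2, ff_dims=1752, num_classes=2, max_seq_len=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dims)
        self.pos_embed = nn.Embedding(max_seq_len, dims)          # learnable absolute positions, for brevity
        encoder_layer = nn.TransformerEncoderLayer(d_model=dims, nhead=heads,
                                                   dim_feedforward=ff_dims, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.classifier = nn.Linear(dims, num_classes)

    def forward(self, tokens):                                    # tokens: (B, T) of token ids
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos_embed(positions)        # (B, T, D)
        x = self.encoder(x)                                       # (B, T, D)
        return self.classifier(x.mean(dim=1))                     # mean-pool over timesteps -> (B, num_classes)

# model = TransformerClassifier(vocab_size=30000)
# logits = model(torch.randint(0, 30000, (8, 256)))               # (8, 2)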

Data Loading:

For this exercise, we are loading the training data in chunks. As IMDB reviews can get quite lengthy, we break them down into overlapping chunks, as sketched in the code after the list below.

For a text length of 2500 and a chunk size of 1024, the chunks are broken at the following indexes, each assigned the same label.

  • 0–1024
  • 512–1536
  • 1024–2048
  • 1536 — next chunk end or <end of text> (whichever is smaller)
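A minimal sketch of this overlapping chunking, assuming a chunk size of 1024 and a stride of 512; the function name is illustrative.

def overlapping_chunks(tokens, chunk_size=1024, stride=512):
    """Split a long token sequence into overlapping chunks that share the same label."""
    chunks = []
    for start in range(0, max(len(tokens) - stride, 1), stride):
        chunks.append(tokens[start:start + chunk_size])
    return chunks

# For a text length of 2500: chunk starts fall at 0, 512, 1024, 1536,
# and the last chunk is cut short at <end of text>.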

Sentiment Analysis on IMDB Movie Reviews:

Using a vanilla transformer encoder with relative positional embeddings, along with overlapping chunked inputs, we were able to get over 85% accuracy while training only on the 50k IMDB reviews.

Hyper parameters:

I spent a lot of time tuning the model in order for it to produce decent results with the small IMDB movie reviews dataset.
Eventually, after many model training cycles, I went ahead and used the Optuna framework to obtain a set of optimal hyper parameters.

All the hyper parameters listed below were generated by Optuna (a minimal sketch of the search setup follows the charts below).

Updated: Charts / Hyper parameters updated based on the relative positional embedding logic which was revised later.

  • Model dimension — 196
  • Number of attention heads — 2
  • Number of Encoder Block layers — 2
  • Position-wise Feed-Forward Network dimension — 1752
  • Optimizer (RAdam) Learning Rate — 1.9625100055440772e-05
  • Dropout Layer Probability
    * Input Layer Dropout — 0.22734722827397966
    * MHA Output Layer Dropout — 0.2555056871254693
    * Point-wise Feed-Forward Layer Dropout — 0.2842864429236998
    * Attention Score Dropout — 0.361423285117754
    * MHA Combined Projection Dropout — 0.23531291320588407

Figure 15. Training vs Validation Accuracy
Figure 16. Training vs Validation Loss
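For reference, here is a minimal sketch of how such a search can be set up with Optuna; the search space is illustrative and build_and_train is a hypothetical helper that trains the model and returns the validation accuracy.

import optuna

def objective(trial):
    # Illustrative search space, loosely mirroring the hyper-parameters listed above
    params = {
        "dims": trial.suggest_categorical("dims", [128, 196, 256]),
        "heads": trial.suggest_categorical("heads", [2, 4]),
        "layers": trial.suggest_int("layers", 1, 4),
        "ff_dims": trial.suggest_int("ff_dims", 512, 2048),
        "lr": trial.suggest_float("lr", 1e-6, 1e-3, log=True),
        "dropout": trial.suggest_float("dropout", 0.1, 0.5),
    }
    return build_and_train(**params)          # hypothetical helper: returns validation accuracy

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)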

Sentiment Analysis on Sample Fictional Reviews:

_________________________________________________________________

Sentiment: positive | tensor([[-0.4685, -1.3289]])

“It’s distinguished itself just enough to satiate action film fans, entertain future streaming audiences and warrant further merging into the DC universe.”

____________________________________________________________

Sentiment: negative | tensor([[-3.5497, 0.9918]])

“I debated as to whether or not I should tick the spoiler box. Since 99% of this show has probably already been seen by any follower of Scrubs it probably doesn’t come under the category of a spoiler. Not the best of the films to be watched nowadays.”

______________________________________________________________

Sentiment: positive | tensor([[ 21.9607, -17.2190]])

”This film is not great, it’s awesome !”

_________________________________________________________________

Sentiment: negative | tensor([[-6.6202, 4.3013]])

”This film is not great, it’s terrible !”

_________________________________________________________________

Takeaways:

One of the challenges with transformers is trying to use them with little or sparse data.
Overfitting is a common problem encountered when using transformer based models with a small dataset.
A sample overfitting curve is illustrated below.

Figure 17. Example of the model overfitting on one of the training cycles when plotting training vs validation loss

For reducing overfitting, one can try a combination of various methods such as:

  • Use augmented synthetic datasets
  • Reduce the dimensions of the input embeddings
  • Reduce the number of attention heads
  • Reduce the number of encoder layers
  • Use layer dropout, i.e. an nn.Dropout layer in PyTorch
  • Early stopping (see the sketch after this list)
  • Use a hyper-parameter tuning framework, e.g. Optuna (supports PyTorch, TensorFlow, XGBoost etc.)
  • Use a pre-trained transformer with a fine-tuned FC layer
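A minimal sketch of the early-stopping idea from the list above, tracking the best validation loss with a patience counter; the names, thresholds and the validate helper are illustrative.

class EarlyStopping:
    """Stop training when the validation loss has not improved for `patience` epochs."""
    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best_loss, self.bad_epochs = float("inf"), 0

    def step(self, val_loss: float) -> bool:
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss, self.bad_epochs = val_loss, 0      # improvement: reset counter
        else:
            self.bad_epochs += 1                               # no improvement this epoch
        return self.bad_epochs >= self.patience                # True -> stop training

# stopper = EarlyStopping(patience=3)
# for epoch in range(max_epochs):
#     val_loss = validate(model)          # hypothetical validation step
#     if stopper.step(val_loss):
#         break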

References:

  • Bahdanau, D., Cho, K., Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473
  • Vaswani, A., et al. (2017). Attention Is All You Need. arXiv:1706.03762
  • Shaw, P., Uszkoreit, J., Vaswani, A. (2018). Self-Attention with Relative Position Representations. arXiv:1803.02155
  • Huang, C.-Z. A., et al. (2018). Music Transformer: Generating Music with Long-Term Structure. arXiv:1809.04281
  • Dai, Z., et al. (2019). Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. arXiv:1901.02860
