Transformer Architecture (NLP)

Anmol Talwar
4 min readOct 11, 2023


From a Natural Language Processing (NLP) standpoint, the traditional RNN structure

  • did not work appropriately for lengthy sentences.
  • prevented parallelization due to its sequential nature.

To overcome these issues, the attention-based Transformer model came into existence.

Transformer models are essentially attention-based models: instead of relying on a single context vector, self-attention compares all members of the input sequence with each other and modifies the corresponding output sequence positions accordingly.

Transformer Architecture

The Transformer model is based on an encoder-decoder architecture. Both the encoder and the decoder consist of sub-layers, described below:

Input Embedding

  • To start the NLP process, the words in a sentence are converted into continuous vectors through word embeddings.
  • These word embeddings are trainable parameters that capture the semantic meaning of words: words with similar meanings are mapped to nearby points in the embedding space (a minimal sketch follows this list).
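As a concrete illustration, here is a minimal sketch of such a trainable embedding lookup. PyTorch and the specific sizes (a vocabulary of 10,000 tokens, 512-dimensional vectors) are assumptions for illustration, not something this article prescribes.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
vocab_size = 10_000   # number of distinct tokens
d_model = 512         # embedding dimension used throughout the model

# A trainable lookup table: each token id maps to a d_model-dimensional vector.
embedding = nn.Embedding(vocab_size, d_model)

# A toy "sentence" of 6 token ids (batch of 1).
token_ids = torch.tensor([[12, 47, 3051, 7, 98, 2]])
word_vectors = embedding(token_ids)
print(word_vectors.shape)  # torch.Size([1, 6, 512])
```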

Positional Encoding

  • Since the Transformer architecture doesn’t inherently understand the order of words, positional encodings are added to the word embeddings to convey information about the position of each word in the sequence.
  • These positional encodings are vectors with the same dimension as the input embeddings and are calculated using trigonometric (sine and cosine) functions, as sketched below.
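A minimal sketch of this sinusoidal encoding, again assuming PyTorch and a 512-dimensional model: even dimensions use sine and odd dimensions use cosine, so every position receives a unique pattern of values.

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) matrix of sine/cosine positional encodings."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    dims = torch.arange(0, d_model, 2, dtype=torch.float32)              # even indices 2i
    div_term = torch.pow(10000.0, dims / d_model)                        # 10000^(2i / d_model)

    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions / div_term)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(positions / div_term)   # odd dimensions use cosine
    return pe

# Same shape as the embedded sentence, so it can simply be added to the word vectors.
pe = sinusoidal_positional_encoding(seq_len=6, d_model=512)
print(pe.shape)  # torch.Size([6, 512])
```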

Self-attention

One of the problems of recurrent models is that long-range dependencies (within a sequence or across several sequences) are often lost. That is, if a word at the beginning of a sequence carries importance for a word at the end, the model may have forgotten the first word by the time it reaches the last one. Self-attention creates exactly these connections within a single sentence. Look at the following example:

"She found her lost keys in the kitchen, and it made her day."
it => finding keys
"She found her lost keys in the kitchen, and it was a mess."
it => kitchen

By changing the emotion from “made her day” → “was a mess”, the reference object for “it” changed.

Mechanism:

Self-attention is like a spotlight that moves across each word, weighing every other word in the sentence to figure out how important it is to the word in focus.

Attention weights are generated during self-attention and can be visualized as “attention maps.” These maps show how much attention each word pays to every other word in the sequence, providing insights into what the model is focusing on when processing a given word.
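As a rough sketch of the mechanism behind those maps (PyTorch assumed; the projection sizes and random weights below are purely illustrative, since a real model learns them), each position is projected into a query, a key, and a value, and the softmax of the scaled query-key similarities gives the attention weights:

```python
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a (batch, seq_len, d_model) tensor.

    Returns the new representations and the attention weights,
    which can be visualized as an "attention map".
    """
    q = x @ w_q          # queries
    k = x @ w_k          # keys
    v = x @ w_v          # values
    d_k = q.size(-1)

    # Similarity of every position with every other position, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v, weights

# Toy example with random projection matrices (sizes are illustrative).
d_model = 512
x = torch.randn(1, 6, d_model)
w_q, w_k, w_v = (torch.randn(d_model, 64) for _ in range(3))
out, attn_map = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn_map.shape)  # torch.Size([1, 6, 64]) torch.Size([1, 6, 6])
```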

Multi-head attention is an extension of self-attention in which multiple self-attention mechanisms operate in parallel. The input sequence is processed by several separate self-attention heads, each with its own learned parameters. These multiple heads allow the model to capture different types of dependencies and relationships in the data; each head can focus on different aspects of the input, enhancing the model’s ability to understand and process complex sequences.
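For instance, a minimal sketch using PyTorch’s built-in multi-head attention module (the choice of 8 heads and a 512-dimensional model is an assumption for illustration):

```python
import torch
import torch.nn as nn

d_model, num_heads, seq_len = 512, 8, 6

# Each of the 8 heads attends over its own 512 / 8 = 64-dimensional slice.
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)   # (batch, seq_len, d_model)
out, attn_weights = mha(x, x, x)       # self-attention: query = key = value = x
print(out.shape)                       # torch.Size([1, 6, 512])
print(attn_weights.shape)              # torch.Size([1, 6, 6]), averaged over heads by default
```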

Benefits and Capabilities:

  • Self-attention allows the model to give more attention to words that are contextually relevant and less attention to irrelevant words, crucial for understanding the context of a word within a sentence.
  • Captures long-range dependencies and context, as words that are distant from each other in the sequence can be assigned significant attention weights when their meanings are related.
  • All of the attention calculations happen in parallel, i.e., over all inputs at once. This makes these models much faster than recurrent ones and allows them to be trained on huge amounts of data.

Feed-Forward Neural Network

These layers further process the outputs of the self-attention layer before passing them on to the next encoder or decoder block. They help the model capture different levels of abstraction, non-linear interactions between words in the sequence, and complex dependencies within the data.
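A minimal sketch of this position-wise feed-forward block (PyTorch assumed; the 512 → 2048 → 512 sizes follow a common convention and are an assumption here, not something stated in this article). The same two-layer network is applied independently to every position in the sequence:

```python
import torch
import torch.nn as nn

feed_forward = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),              # the non-linearity lets the block model non-linear interactions
    nn.Linear(2048, 512),
)

x = torch.randn(1, 6, 512)       # output of the self-attention sub-layer
print(feed_forward(x).shape)     # torch.Size([1, 6, 512])
```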

Layer Normalization & Residual Connections

  • These are applied before and after the self-attention and feed-forward layers (in the encoder and decoder), and also to the input embeddings.
  • Normalizing the embeddings helps ensure that the inputs to the model are centered and have a consistent scale, which is important for stable training.
  • In the encoder/decoder, it helps ensure that activations neither become too large (leading to saturation and gradients close to zero) nor too small (leading to slow convergence) during training, mitigating the vanishing gradient problem and improving model convergence (see the sketch after this list).
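One possible sketch of how these pieces fit together in a single encoder layer (PyTorch assumed; this follows the "add, then normalize" arrangement, and all sizes are illustrative rather than taken from this article):

```python
import torch
import torch.nn as nn

d_model = 512
attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
norm1, norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

def encoder_layer(x: torch.Tensor) -> torch.Tensor:
    # Sub-layer 1: self-attention with a residual (skip) connection, then LayerNorm.
    attn_out, _ = attn(x, x, x)
    x = norm1(x + attn_out)
    # Sub-layer 2: feed-forward with a residual connection, then LayerNorm.
    x = norm2(x + ffn(x))
    return x

x = torch.randn(1, 6, d_model)
print(encoder_layer(x).shape)   # torch.Size([1, 6, 512])
```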

Limitations of Transformers Architecture

The Transformer is undoubtedly a huge improvement over RNN-based seq2seq models, but it comes with its own share of limitations:

  • Attention can only deal with fixed-length text strings. The text has to be split into a certain number of chunks before being fed as input.
  • This chunking of text causes context fragmentation. If a sentence is split from the middle, then a significant amount of context and meaning is lost.
  • Training and using large Transformer models require significant computational resources, including powerful GPUs or TPUs. This can be expensive and may limit accessibility to smaller organizations or individuals.
  • Large-scale Transformer models have a substantial memory footprint, which makes them impractical to deploy on resource-constrained devices.
  • Fine-tuning can be tricky, and it may require expertise in hyperparameter tuning and domain-specific knowledge to achieve optimal performance.

If you want to learn more about the 3 Types of Transformer Architecture, check out my blog.
