Published in Geek Culture

Transformer: State-of-the-art Natural Language Processing

Natural Language Processing tasks such as question-answering, machine translation, reading comprehension and summarization are typically approached with supervised learning on task-specific datasets.

So, let's talk about Transformers. My favorite Transformer is Bumblebee. Who is yours?

I'm kidding; don't worry, we will only talk about Transformers in NLP. We have seen many kinds of transformers, such as electrical transformers and the robots in the movies, and we will soon see Transformers in NLP. One thing is common to all of them: they convert an input into a particular output.

A Transformer is a deep learning model that adopts the mechanism of attention, differentially weighing the significance of each part of the input data. It is used primarily in the field of natural language processing (NLP) and in computer vision (CV).

In other words, it is a deep learning model in which every output element is connected to every input element, and the weights between them are calculated dynamically based on their connections.

First, why do we use Transformers in NLP?

As the picture above shows in its comparison of CNNs, RNNs and Transformers, Transformers are the best suited of the three for processing long sequences in NLP. RNNs and CNNs can process short sequences, but they are not efficient at processing long ones.

Attention is all you need.

Before moving on to the architecture of the Transformer, we should understand self-attention.

Attention: the Transformer learns to weight the relationship between each input item and each output item.

Self-Attention: the Transformer learns to weight the relationship between each item in the input sequence and all the other items in the same sequence. (One-Many relation)

Multi-head Self-Attention: the Transformer learns multiple ways to weight the relationship of each item in the input sequence to all other items in the input, one per attention head. (Many-Many relation)
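
To make these definitions concrete, here is a minimal NumPy sketch of scaled dot-product self-attention and a multi-head version of it. It is only an illustration under stated assumptions: the function names, the random weight matrices and the tiny dimensions are made up for this example, and a real Transformer learns these weights during training.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head).
    Returns (output, weights); weights[i, j] is how much item i attends to item j.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def multi_head_self_attention(x, heads, w_o):
    """Run independent heads, concatenate their outputs, and project back.

    heads: list of (w_q, w_k, w_v) tuples; w_o: (num_heads * d_head, d_model).
    """
    outputs = [self_attention(x, *head)[0] for head in heads]
    return np.concatenate(outputs, axis=-1) @ w_o

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_head, num_heads = 4, 8, 4, 2
x = rng.normal(size=(seq_len, d_model))
heads = [
    tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
    for _ in range(num_heads)
]
w_o = rng.normal(size=(num_heads * d_head, d_model))

_, weights = self_attention(x, *heads[0])
out = multi_head_self_attention(x, heads, w_o)
print(weights.shape, out.shape)  # (4, 4) (4, 8)
```

Each row of `weights` sums to 1, so it can be read as how one item in the sequence distributes its attention over all the items. Every head has its own projection matrices, which is what lets the model learn several different weightings at once.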

High Level Picture of Transformer

Let's look at the high-level picture of the Transformer, treating it as a black box. This black box takes a sentence in Hindi as input, processes it, and translates it into English. This is an example of machine translation.

Architecture of Transformer

In this section, we look at a single Transformer block. A model such as GPT-2 stacks a large number of Transformer blocks, but for a better understanding we will look at the architecture of just one.

A Transformer consists of 3 main parts:

1. Encoder: The encoders are all identical in structure. Each one consists of a multi-head self-attention layer and a feed-forward network. The encoder's inputs first flow through the self-attention layer, which helps the encoder look at the other words in the input sentence as it encodes a specific word. The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is applied independently to each position.

2. Decoder: The decoders are also identical in structure. Each one consists of a self-attention layer, an encoder-decoder attention layer and a feed-forward layer. The output of the top encoder is transformed into a set of attention vectors (keys and values). These are used by each decoder in its "encoder-decoder attention" layer, which helps the decoder focus on the appropriate places in the input sequence.

3. Embedding: Embeddings are numerical representations of words, usually in the shape of a vector. Each word is first identified by a unique index in the vocabulary, which is equivalent to a one-hot vector that is all zeroes except at that index. This index selects the word's row in an embedding matrix, giving a dense vector of learned values.
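
The embedding lookup and the position-wise feed-forward network described above can be sketched in a few lines of NumPy. This is a toy illustration, not the layout of any particular model: the three-word vocabulary, the dimensions and the random weights are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2}  # hypothetical toy vocabulary
d_model, d_ff = 4, 8                    # toy sizes, chosen for illustration

# One row of dense values per word; random here, learned in a real model.
embedding_table = rng.normal(size=(len(vocab), d_model))

def one_hot(index, size):
    # All zeroes except a single 1 at the word's index.
    v = np.zeros(size)
    v[index] = 1.0
    return v

token_ids = [vocab[w] for w in ["the", "cat", "sat"]]
one_hots = np.stack([one_hot(i, len(vocab)) for i in token_ids])

# Multiplying a one-hot vector by the table selects that word's row,
# so it gives the same result as indexing the table directly.
embedded = one_hots @ embedding_table
same_as_lookup = np.allclose(embedded, embedding_table[token_ids])

# Position-wise feed-forward network: two linear layers with a ReLU between.
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(h):
    return np.maximum(0.0, h @ w1 + b1) @ w2 + b2

# Applying the network to the whole sequence at once matches applying it
# to each position separately: no information moves between positions.
whole = feed_forward(embedded)
per_position = np.stack([feed_forward(row) for row in embedded])
print(same_as_lookup, np.allclose(whole, per_position))  # True True
```

The second check is the point of "applied independently to each position": mixing information across positions is the job of the attention layer, not of the feed-forward network.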

Classification of Transformer Language Model

1. Autoregressive Models: These models rely on the decoder part of the original transformer and use an attention mask so that, at each position, the model can only look at the preceding tokens in the attention heads.

2. Autoencoding Models: These models rely on the encoder part of the original transformer and use no mask, so the model can look at all the tokens in the attention heads. For pretraining, the targets are the original sentences and the inputs are their corrupted versions.

3. Sequence-to-sequence Models: These models keep both the encoder and the decoder of the original transformer.

4. Multimodal Models: These models mix text inputs with other kinds of input, such as images. There is one multimodal model in the Hugging Face Transformers library, and it has not been pretrained in the self-supervised fashion like the others.

5. Retrieval-based Models: Some models use document retrieval during (pre)training and inference, for example for open-domain question answering.
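
The difference between the masked attention of autoregressive models and the unmasked attention of autoencoding models can be shown with a small NumPy sketch. It is a simplified illustration: the `attention_weights` helper and the random scores are assumptions made for this example, and the mask here lets each position see itself as well as the earlier positions, which is the usual convention for causal masks.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(scores, causal):
    """Turn raw attention scores (seq_len, seq_len) into weights.

    With causal=True, position i may only attend to positions 0..i;
    later positions get a score of -inf and therefore a weight of 0.
    """
    if causal:
        seq_len = scores.shape[0]
        # True strictly above the diagonal, i.e. at "future" positions.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # random stand-in for query-key scores

autoregressive = attention_weights(scores, causal=True)   # decoder-style
autoencoding = attention_weights(scores, causal=False)    # encoder-style

print(np.round(autoregressive, 2))
print(np.round(autoencoding, 2))
```

The autoregressive matrix is lower-triangular: the first token attends only to itself, while the last token attends to the whole sequence. The autoencoding matrix has non-zero weights everywhere, because every token can see every other token.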


1. Hugging Face Transformers
2. "The Illustrated Transformer" by Jay Alammar Github
3. RASA Developer Rachael Tatman


