Transformers

Ishika Garg
May 11, 2024


We’re exploring the realm of Deep Learning, focusing on the pivotal role that “transformers” play in driving advances in AI (not the fictional robots of cinema fame).

The Transformer was first proposed in the 2017 paper “Attention Is All You Need” by researchers at Google and the University of Toronto.

Transformers employ semi-supervised learning: they are pre-trained in an unsupervised manner on large, unlabeled datasets, and then fine-tuned with supervised training to improve performance on specific tasks. Furthermore, Transformers process entire sequences in parallel, significantly speeding up training.

Example applications include language translation, document summarization, and auto-completion.

What sets transformers apart from other models?

  1. Attention Mechanism — In transformers, the attention mechanism computes attention scores between each pair of tokens in the input sequence. These scores determine how much focus should be given to each token when processing a particular token. For example, in the sentence “the animal didn’t cross the street because it was too tired,” the attention mechanism assigns a higher weight to “animal” when processing “it,” since “it” refers to the animal.
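To make this concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation described in the paper. This is an illustrative toy (random embeddings, self-attention where queries, keys, and values all come from the same input), not a full multi-head implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # score for every token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V, weights

# Toy example: 3 tokens, each a 4-dimensional embedding
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
output, weights = scaled_dot_product_attention(x, x, x)  # self-attention
print(weights.shape)  # (3, 3): one attention weight per token pair
```

Each row of `weights` sums to 1, so every output token is a weighted average of all value vectors.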

2. Positional Encoding — Positional encoding is a crucial component of transformers that provides information about the position of words or tokens within a sequence. Since transformers process input sequences in parallel, they lack the inherent understanding of the sequential order of tokens that RNNs possess. Positional encoding addresses this limitation by injecting positional information into the input embeddings. This allows the transformer model to differentiate between tokens based on their positions within the sequence.
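The original paper uses fixed sinusoidal positional encodings, which can be sketched in a few lines of NumPy. The resulting matrix is simply added to the token embeddings before the first layer.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dims get sine
    pe[:, 1::2] = np.cos(angles)                     # odd dims get cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
# In a model: x = token_embeddings + pe
```

Because each position gets a unique pattern of sines and cosines, the model can distinguish token positions even though it processes them all at once.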

3. Parallel Processing — Transformers process the entire input sequence in parallel rather than token by token, enabling faster training and inference, especially for long sequences.

The Transformer architecture consists of two parts:

  1. Encoder — The encoder captures the input data and transforms it into a fixed-dimensional representation called the context vector. According to the paper, the encoder is composed of a stack of N = 6 identical layers, each with two sub-layers: self-attention and a feed-forward network.
  2. Decoder — The decoder takes the context representation produced by the encoder and generates the output sequence one element at a time. It is also composed of a stack of N = 6 identical layers, each with three sub-layers: self-attention, encoder-decoder attention, and a feed-forward network.
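The encoder side can be sketched as follows: each layer applies self-attention and then a feed-forward network, each wrapped in a residual connection and layer normalization, and six such layers are stacked. This is a heavily simplified single-head version with random, untrained weights, meant only to show the layer structure.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(x, Wq, Wk, Wv, W1, W2):
    # Sub-layer 1: self-attention, with residual connection + layer norm
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
    x = layer_norm(x + attn)
    # Sub-layer 2: position-wise feed-forward network (ReLU), with residual
    ff = np.maximum(0, x @ W1) @ W2
    return layer_norm(x + ff)

# Stack of N = 6 layers (identical architecture, separate weights)
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))  # 5 tokens, 8-dim embeddings
for _ in range(6):
    Ws = [rng.normal(size=s) * 0.1
          for s in [(d, d)] * 3 + [(d, 4 * d), (4 * d, d)]]
    x = encoder_layer(x, *Ws)
print(x.shape)  # (5, 8): same shape in and out, so layers stack cleanly
```

A decoder layer would look similar but insert an encoder-decoder attention sub-layer, where queries come from the decoder and keys/values come from the encoder output.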

Popular pre-trained Transformer models include:

  1. Bidirectional Encoder Representations from Transformers (BERT) — BERT uses only the encoder part of the Transformer, processing text bidirectionally (as its name suggests).
  2. Generative Pre-trained Transformer (GPT) — GPT uses only the decoder part of the Transformer, processing text unidirectionally from left to right; it is used for language-generation tasks.
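GPT-style decoders enforce the left-to-right constraint with a causal mask: before the softmax, every score from a token to a future token is set to negative infinity, so it receives zero attention weight. A small NumPy illustration:

```python
import numpy as np

seq_len = 5
# Lower-triangular mask: position i may only attend to positions <= i
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))
scores = np.where(mask, scores, -np.inf)   # block attention to future tokens
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.triu(weights, k=1).sum())  # 0.0: no weight falls on future positions
```

This is what makes decoder-only models suitable for generation: the prediction for each position depends only on the tokens before it.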

References —

  1. https://www.youtube.com/watch?v=SMZQrJ_L1vo
  2. https://jalammar.github.io/illustrated-transformer/

Finally

Hopefully, you enjoyed reading this. I enjoyed writing and coding it.

For any queries, contact me on LinkedIn.
