Transformers (Attention Is All You Need) In Depth
Aug 1, 2024
Transformers, in the context of machine learning and artificial intelligence, refer to a type of deep learning model architecture designed primarily for natural language processing (NLP) tasks. They have revolutionized the field by enabling more effective training and performance on a variety of tasks such as translation, summarization, and question-answering.
1. Introduction and History
- Origins: The transformer model was introduced in the 2017 paper “Attention Is All You Need” by Vaswani et al., researchers at Google.
- Impact: Transformers have replaced recurrent neural networks (RNNs) and long short-term memory (LSTM) networks in many NLP tasks due to their efficiency and performance.
2. Architecture Overview
- Attention Mechanism: The core innovation of transformers is the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence.
- Encoder-Decoder Structure: The original transformer model has an encoder-decoder structure. The encoder processes the input sequence, and the decoder generates the output sequence.
- Parallelization: Transformers can process all words in a sentence simultaneously, unlike RNNs, which process them one step at a time; this makes training much faster.
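To make the parallelization point concrete, here is a minimal NumPy illustration (the shapes and the single weight matrix are simplifying assumptions, not part of either architecture): an RNN-style update must visit positions one after another because each hidden state depends on the previous one, while a transformer-style projection transforms every position in a single batched matrix multiply.

```python
import numpy as np

seq_len, d = 6, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d))   # one toy "sentence": 6 positions, 8 features each
W = rng.normal(size=(d, d))

# RNN-style: sequential. Each step needs the previous hidden state,
# so the loop cannot be parallelized across positions.
h = np.zeros(d)
rnn_states = []
for t in range(seq_len):
    h = np.tanh(x[t] @ W + h)
    rnn_states.append(h)

# Transformer-style: parallel. All positions are transformed at once
# in one matrix multiply, with no step-to-step dependency.
projected = x @ W
```

On parallel hardware such as GPUs, the single matrix multiply is what allows the whole sequence to be processed in one pass during training.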
3. Components of a Transformer
- Input Embeddings: Convert words into dense vectors of real numbers.
- Positional Encoding: Adds information about the position of each word in the sentence, since transformers do not inherently understand order.
- Multi-Head Self-Attention: Allows the model to focus on different parts of the sentence simultaneously by using multiple attention mechanisms in parallel.
- Feed-forward Neural Networks: Applied independently to each position to introduce non-linearity.
- Layer Normalization: Normalizes inputs to each layer to stabilize and accelerate training.
- Residual Connections: Helps with gradient flow and allows for deeper models by adding the input of each layer to its output.
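The positional-encoding component above can be sketched concretely. The sinusoidal scheme from the original paper defines PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); a minimal NumPy version (function name and shapes are illustrative choices):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding from "Attention Is All You Need":
    #   PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    #   PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1) positions
    i = np.arange(d_model)[None, :]            # (1, d_model) dimension indices
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    # Even dimensions get sine, odd dimensions get cosine.
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = positional_encoding(10, 16)   # one row per position, added to the embeddings
```

Because each position gets a distinct, deterministic pattern, adding `pe` to the input embeddings injects word order without any learned parameters.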
4. Self-Attention Mechanism
- Calculation: Self-attention involves computing three vectors for each word — Query (Q), Key (K), and Value (V).
- Scaled Dot-Product: The dot product of the query and key vectors determines the attention scores, which are then scaled and passed through a softmax function to get attention weights.
- Weighted Sum: The attention weights are used to compute a weighted sum of the value vectors, producing the final output.
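The three steps above can be sketched as a single-head scaled dot-product attention function in NumPy (the shapes and random inputs are illustrative assumptions; a real model computes Q, K, V with learned projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # scaled dot-product scores
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the exponentials
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
    return weights @ V, weights                   # weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 query positions, d_k = 4
K = rng.normal(size=(5, 4))   # 5 key positions
V = rng.normal(size=(5, 4))   # one value vector per key
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` sums to 1, so every output row is a convex combination of the value vectors, weighted by how strongly that query attends to each key.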
5. Transformer Practical Explanation
- Our goal is to translate French sentences into English.
For a step-by-step visual walkthrough of this translation example, continue with Jay Alammar’s “The Illustrated Transformer”.