In the previous articles, we looked at how the field arrived at Transformers and what made this revolution possible. Now let's take a brief look at the main building blocks of the Transformer architecture:
- Input embedding: Converts the input tokens into continuous vectors that the model can work with.
- Encoder: Consists of multiple layers, each containing a self-attention mechanism and a feed-forward neural network, which together process the input text.
- Decoder: Also has multiple layers, each containing a self-attention mechanism and a feed-forward neural network, which together generate the output text.
- Self-attention mechanism: Lets each token weigh every other token in the sequence, which helps the model capture relationships between words even when they are far apart.
- Positional encoding: Added to the input embedding to give the model a sense of word order in the sequence.
- Layer Normalization: A technique used within the encoder and decoder layers to help stabilize the training process.
- Residual connection: Used in the encoder and decoder layers to help with gradient flow during training and mitigate the vanishing gradient problem.
- Encoder-decoder attention mechanism: Used in the decoder to help it focus on the relevant parts of the input when generating the output.
- Output linear layer: Converts the decoder's output into logits for each token in the target vocabulary.
- Softmax layer: Applied to the logits to produce probabilities for each token in the target vocabulary.
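
To make a few of these components more concrete, here is a minimal sketch in plain Python of how positional encoding, single-head self-attention, a residual connection with layer normalization, and the output linear and softmax layers could fit together. It is an illustration under simplifying assumptions, not a full Transformer: the toy embeddings, the identity projection matrices, the `w_out` weights and the 5-token vocabulary are all made up for the example, and multi-head attention, masking, the feed-forward sublayer and learned parameters are left out.

```python
import math

def softmax(xs):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))  # similarity of every token pair
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def layer_norm(row, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return [layer_norm([a + b for a, b in zip(xr, sr)])
            for xr, sr in zip(x, sublayer_out)]

# Toy input: 3 tokens, each already embedded as a 4-dimensional vector (made-up values).
embeddings = [[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.1, 0.0, 0.2],
              [0.3, 0.3, 0.1, 0.0]]
d_model = 4

# Assumption: identity matrices stand in for the learned Q, K, V projections.
identity = [[1.0 if i == j else 0.0 for j in range(d_model)] for i in range(d_model)]

# Input embedding + positional encoding.
pe = positional_encoding(len(embeddings), d_model)
x = [[e + p for e, p in zip(er, pr)] for er, pr in zip(embeddings, pe)]

# Self-attention, then residual connection + layer normalization.
attn_out, attn_weights = self_attention(x, identity, identity, identity)
hidden = add_and_norm(x, attn_out)

# Output linear layer (hypothetical weights, 5-token vocabulary) + softmax.
w_out = [[0.1 * (i + j) for j in range(5)] for i in range(d_model)]
logits = matmul(hidden, w_out)
probs = [softmax(row) for row in logits]
```

Here `attn_weights[i][j]` is how strongly token `i` attends to token `j`, and each row of `probs` is a probability distribution over the toy vocabulary for one position.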