Transformer Architecture

Apr 29, 2024


In the previous articles, we looked at how we arrived at Transformers and what made this revolution possible. Now let’s briefly walk through the main components of the Transformer architecture:

  • Input embedding: Converts the input tokens into continuous vectors that the model can work with.
  • Encoder: Consists of multiple layers, each containing a self-attention mechanism and a feed-forward neural network, which together process the input text.
  • Decoder: Also has multiple layers, each containing a self-attention mechanism (masked so that a position cannot look at later positions) and a feed-forward neural network, which together generate the output text.
  • Self-attention mechanism: Helps the model understand the relationships between words in a sentence, even if they are far apart.
  • Positional encoding: Added to the input embeddings to give the model a sense of word order in the sequence.
  • Layer Normalization: A technique used within the encoder and decoder layers to help stabilize the training process.
  • Residual connection: Used in the encoder and decoder layers to help with gradient flow during training and mitigate the vanishing gradient problem.
  • Encoder-decoder attention mechanism: Used in the decoder to help it focus on relevant parts of the input when generating the output.
  • Output linear layer: Converts the decoder's output into logits for each token in the target vocabulary.
  • Softmax layer: Applied to the logits to produce probabilities for each token in the target vocabulary.
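
To make a few of these pieces concrete, here is a minimal sketch in plain Python of how positional encoding, self-attention, a residual connection, layer normalization, and the final linear + softmax step fit together. It is a toy illustration under some loud assumptions, not a real implementation: the embeddings and the `output_weights` matrix are made-up numbers, self-attention uses the input directly as queries, keys, and values (no learned projections, a single head), and the feed-forward network and decoder are left out entirely.

```python
import math

def softmax(xs):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd ones."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (simplification)."""
    d = len(x[0])
    out = []
    for q in x:
        # How strongly this token attends to every token in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Weighted average of all token vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

def layer_norm(row, eps=1e-5):
    """Normalize one token vector to zero mean and (roughly) unit variance."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]

def attention_sublayer(x):
    """Residual connection around self-attention, then layer normalization."""
    attended = self_attention(x)
    return [layer_norm([a + b for a, b in zip(xr, ar)])
            for xr, ar in zip(x, attended)]

# Toy input: 3 tokens with 4-dimensional embeddings (made-up values).
embeddings = [[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.1, 0.0, 0.2],
              [0.3, 0.3, 0.3, 0.1]]
pe = positional_encoding(seq_len=3, d_model=4)
x = [[e + p for e, p in zip(er, pr)] for er, pr in zip(embeddings, pe)]

hidden = attention_sublayer(x)

# Hypothetical output layer for a 5-token vocabulary (made-up weights).
output_weights = [[0.2, -0.1, 0.0, 0.3],
                  [0.1, 0.4, -0.2, 0.0],
                  [-0.3, 0.2, 0.1, 0.1],
                  [0.0, 0.0, 0.5, -0.4],
                  [0.3, -0.2, 0.2, 0.2]]
logits = [sum(h * w for h, w in zip(hidden[-1], row)) for row in output_weights]
probs = softmax(logits)  # one probability per vocabulary token
```

In a real Transformer, the queries, keys, and values come from learned linear projections, attention is split across multiple heads, and each layer also contains the feed-forward network described above. The sketch only shows the order in which the pieces are applied.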

