Transformer Architecture

Sunke
Apr 29, 2024


In the previous articles, we looked at how we arrived at Transformers and what made this revolution possible. Now let’s briefly walk through the Transformer architecture:

  • Input embedding: Converts the input tokens into continuous vectors that the model can work with.
  • Encoder: Consists of multiple layers, each containing a self-attention mechanism and a feed-forward neural network, which together process the input text.
  • Decoder: Also has multiple layers, each containing a self-attention mechanism and a feed-forward neural network, which generate the output text.
  • Self-attention mechanism: Helps the model understand the relationship between words in a sentence even if they are far apart (a minimal sketch follows this list).
  • Positional encoding: Added to the input embedding to give the model a sense of word order in the sequence.
  • Layer Normalization: A technique used within the encoder and decoder layers to help stabilize the training process.
  • Residual connection: Used in the encoder and decoder layers to help with gradient flow during training and mitigate the vanishing gradient problem (see the encoder-layer sketch after this list).
  • Encoder-Decoder attention mechanism: Used in the decoder to help it focus on relevant parts of the input when generating the output.
  • Output linear layer: Converts the decoder’s output into logits for each token in the target vocabulary.
  • Softmax layer: Applied to the logits to produce probabilities for each token in the target vocabulary.
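
To make the positional-encoding and self-attention bullets concrete, here is a minimal NumPy sketch. The shapes, weight matrices, and random values are purely illustrative assumptions, not code from any real model:

```python
# Minimal sketch: sinusoidal positional encoding + scaled dot-product self-attention.
import numpy as np

def positional_encoding(seq_len, d_model):
    # Each position gets a unique pattern of sines and cosines,
    # which is added to the input embeddings to encode word order.
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                    # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                      # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)               # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    # Project the same sequence into queries, keys and values,
    # then let every token attend to every other token.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)                    # attention weights
    return weights @ V                                    # weighted sum of values

# Tiny usage example with random embeddings (seq_len and d_model are made up).
seq_len, d_model = 5, 16
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))
print(self_attention(x, W_q, W_k, W_v).shape)             # (5, 16)
```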

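A second sketch, under the same illustrative assumptions, shows how one encoder layer wires the residual connections, layer normalization, and feed-forward network together, followed by the output linear + softmax step over a hypothetical vocabulary:

```python
# Minimal sketch: encoder layer (residuals + layer norm + feed-forward)
# and the final linear projection + softmax over the vocabulary.
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token's vector to zero mean and unit variance,
    # which helps stabilize training.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward network: two linear layers with a ReLU in between.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, attn_out, W1, b1, W2, b2):
    # Residual connection around the attention output, then around the feed-forward net.
    x = layer_norm(x + attn_out)
    x = layer_norm(x + feed_forward(x, W1, b1, W2, b2))
    return x

def output_probabilities(decoder_out, W_vocab, b_vocab):
    # Linear projection to logits over the vocabulary, then softmax to probabilities.
    logits = decoder_out @ W_vocab + b_vocab
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

# Tiny usage example with random tensors (d_ff and vocab_size are illustrative).
seq_len, d_model, d_ff, vocab_size = 5, 16, 64, 100
rng = np.random.default_rng(1)
x = rng.normal(size=(seq_len, d_model))
attn_out = rng.normal(size=(seq_len, d_model))   # stand-in for the attention output
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = encoder_layer(x, attn_out, W1, b1, W2, b2)
W_v, b_v = rng.normal(size=(d_model, vocab_size)), np.zeros(vocab_size)
probs = output_probabilities(out, W_v, b_v)
print(out.shape, probs.shape)                     # (5, 16) (5, 100)
```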