Transformers — A Deep Dive. Part 3

Mika.i Chak
4 min read · Mar 29, 2024

Attention Is All You Need

In 2023, the world went crazy over Artificial Intelligence and Generative AI, particularly ChatGPT and Large Language Models. Before we go into the technical details in the upcoming parts of this series, let's start with the core ideas and the ecosystem around them.

The Transformer architecture

The Transformer architecture, proposed in the 2017 Google paper "Attention Is All You Need", brought a major change to the entire landscape of Natural Language Processing (NLP). It moved away from recurrent and convolutional networks and uses the attention mechanism instead. Because attention over a whole sequence is computed with matrix multiplications rather than step-by-step recurrence, the model is far more parallelizable and hence requires significantly less time to train.
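
To make that concrete, here is a minimal NumPy sketch of the scaled dot-product attention the paper defines, Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V. The variable names, toy shapes, and random inputs are purely illustrative, not from this post.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention from the paper:
    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)     # (seq_q, seq_k)
    # Softmax over the key dimension (numerically stabilized).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the values.
    return weights @ V                                  # (seq_q, d_v)

# Toy usage: 4 positions, width 8 (hypothetical sizes).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)      # (4, 8)
```

Notice that all query positions are handled in a single matrix multiplication with no sequential dependency between them; that is exactly where the training-time advantage over recurrent networks comes from.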

Although we will expand on each sub-layer inside the Encoder and Decoder stacks in subsequent posts in this series, I can't wait to provide some brief annotations to the following Transformer architecture diagram already.

Text is first tokenized into individual word tokens, which are then mapped to token embeddings and merged with the tokens' position information (a minimal sketch of this step follows below). Next, the result goes into either the Encoder stack on the left or the Decoder stack on the right, specifically as input to the attention mechanism sub-layer. In both stacks, there are multiple sub-layers where the result coming out…
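
As a rough illustration of that embedding-plus-position step, here is a minimal NumPy sketch using the sinusoidal positional encodings from the paper. The vocabulary size, model width of 512, random embedding table, and token ids are all hypothetical stand-ins, not values from this post.

```python
import numpy as np

d_model, vocab_size = 512, 10_000

# Stand-in for a learned embedding table: one row per vocabulary id.
token_embedding = np.random.default_rng(0).normal(size=(vocab_size, d_model))

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings from the paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

# Hypothetical token ids for a short sentence; "merging" is
# element-wise addition of embedding and positional encoding.
token_ids = np.array([5, 42, 7, 999])
x = token_embedding[token_ids] + positional_encoding(len(token_ids), d_model)
print(x.shape)  # (4, 512) — ready to enter the Encoder or Decoder stack
```

The key design point is that the positional encoding has the same width as the token embedding, so the two can simply be added, giving the attention sub-layers a single input that carries both content and position.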
