Transformers (2017) | one minute summary

This paper Transformed the way we think about attention

Jeffrey Boschman
One Minute Machine Learning
1 min read · May 5, 2021


The 2017 paper “Attention Is All You Need” by Vaswani et al. (Google) introduced the Transformer architecture, which relies entirely on the attention mechanism and does away with recurrent neural networks.

Prerequisite knowledge: Attention mechanism

  1. Why? Typical encoder-decoder recurrent models for sequence-to-sequence tasks process tokens one after another, which prevents parallelization within a sequence and makes it hard to relate information across long sequences
  2. What? The proposed architecture uses multi-head self-attention, so the entire input sequence can be processed at once
  3. How? The input sequence is tokenized and mapped to word embeddings; then several context vectors (a.k.a. heads) are computed in parallel and concatenated to form the encoder output. Each head is a weighted sum of a “Values” matrix, where the weights depend on the “Query” and “Key” matrices; this is self-attention (see the sketch below)

Side note: The output sequence is still generated one token at a time (the decoder is autoregressive)
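
Here is a minimal NumPy sketch of the multi-head self-attention step from point 3: Queries and Keys produce the attention weights, the Values are summed with those weights, and the heads are concatenated. Parameter names and shapes (d_model, num_heads, W_q, etc.) are illustrative assumptions, and the final output projection from the paper is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Weights come from Query–Key dot products, scaled by sqrt(d_k);
    # the output is a weighted sum of the Values.
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    return weights @ V                                  # (heads, seq, d_k)

def multi_head_self_attention(X, num_heads=2, seed=0):
    # X: (seq_len, d_model) embeddings for the whole input sequence at once.
    seq_len, d_model = X.shape
    d_k = d_model // num_heads
    rng = np.random.default_rng(seed)
    # Per-head projections for Queries, Keys, Values
    # (random here; learned in a real model).
    W_q = rng.normal(size=(num_heads, d_model, d_k))
    W_k = rng.normal(size=(num_heads, d_model, d_k))
    W_v = rng.normal(size=(num_heads, d_model, d_k))
    Q = np.einsum('sd,hdk->hsk', X, W_q)
    K = np.einsum('sd,hdk->hsk', X, W_k)
    V = np.einsum('sd,hdk->hsk', X, W_v)
    heads = scaled_dot_product_attention(Q, K, V)       # (heads, seq, d_k)
    # Concatenate the heads back to (seq_len, d_model).
    return heads.transpose(1, 0, 2).reshape(seq_len, d_model)

# Example: 5 tokens, embedding size 8, processed in parallel (no recurrence).
X = np.random.default_rng(1).normal(size=(5, 8))
print(multi_head_self_attention(X).shape)  # (5, 8)
```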
