Transformers (2017) | one minute summary
This paper Transformed the way we think about attention
The 2017 paper “Attention Is All You Need” by Vaswani et al. (Google) introduced the Transformer architecture, which relies on the attention mechanism without any recurrent neural networks.
Prerequisite knowledge: Attention mechanism
- Why? Typical encoder-decoder recurrent models for sequence-to-sequence tasks are inherently sequential, which prevents parallelization within a sequence and makes it harder to carry information across long sequences
- What? The proposed architecture introduces multi-head self-attention, so the entire input sequence can be fed in at once
- How? The input sequence is tokenized and mapped into a word-embedding space (with positional encodings added), and then multiple context vectors (a.k.a. heads) are computed in parallel and concatenated to yield the encoder output; each head is a weighted sum over a “Values” matrix, where the weights depend on the “Query” and “Key” matrices (this is self-attention), as in the sketch after this list
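To make the “How?” concrete, here is a minimal NumPy sketch of one multi-head self-attention layer. The shapes and names (d_model, num_heads, d_k) follow the paper's conventions, but the random projection matrices are placeholders standing in for learned parameters, not a faithful implementation of the full Transformer block.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, num_heads, rng):
    """X: (seq_len, d_model) token embeddings (plus positional encodings)."""
    seq_len, d_model = X.shape
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Per-head projections of the same input into Queries, Keys, Values
        # (random here; learned in the real model).
        W_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_v = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        # Scaled dot-product attention: the weights depend on Q and K,
        # and the head is a weighted sum of the rows of V.
        weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (seq_len, seq_len)
        heads.append(weights @ V)                           # (seq_len, d_k)
    # Concatenate the heads and project back to d_model.
    W_o = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))  # 5 tokens, d_model = 16
out = multi_head_self_attention(X, num_heads=4, rng=rng)
print(out.shape)  # (5, 16)
```

Because every head only involves matrix multiplications over the whole sequence, all positions (and all heads) can be computed in parallel, which is exactly what the recurrent encoder could not do.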
Side note: The output sequence is still generated one element at a time (autoregressive decoding), as sketched below
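The toy loop below illustrates that side note. The `decode_step` function is a hypothetical stand-in (it just fabricates a "next token"), not the real Transformer decoder; the point is only that generation appends one token per step until an end-of-sequence token appears.

```python
def decode_step(memory, generated):
    # Placeholder: a real decoder would attend over `generated` and the
    # encoder output `memory`, then return the most likely next token.
    return (len(generated) + 1) % 10

def greedy_decode(memory, bos=0, eos=9, max_len=20):
    generated = [bos]
    for _ in range(max_len):
        next_token = decode_step(memory, generated)  # one token per step
        generated.append(next_token)
        if next_token == eos:
            break
    return generated

print(greedy_decode(memory=list(range(5))))
```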