Transformers (2017) | one minute summary
This paper Transformed the way we think about attention
The 2017 paper “Attention Is All You Need” by Vaswani et al. (Google) introduced the Transformer architecture, which relies on the attention mechanism without any recurrent neural networks.
Prerequisite knowledge: Attention mechanism
- Why? Typical encoder-decoder recurrent models for sequence-to-sequence tasks are inherently sequential, which prevents parallelization within a sequence and makes it harder to carry information across long sequences
- What? The proposed architecture introduces multi-head self-attention, so the entire input sequence can be fed in at once
- How? The input sequence is tokenized and mapped into a word-embedding space (with positional encodings added), and then multiple context vectors (a.k.a. heads) are computed in parallel and concatenated to yield the encoder output; each head is a weighted sum over a “Values” matrix, where the weights depend on the “Query” and “Key” matrices (this is self-attention), as in the sketch after this list
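To make the “How?” concrete, here is a minimal NumPy sketch of one multi-head self-attention layer. The shapes and names (d_model, num_heads, d_k) follow the paper's conventions, but the random projection matrices are placeholders standing in for learned parameters, not a faithful implementation of the full Transformer block.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, num_heads, rng):
    """X: (seq_len, d_model) token embeddings (plus positional encodings)."""
    seq_len, d_model = X.shape
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Per-head projections of the same input into Queries, Keys, Values
        # (random here; learned in the real model).
        W_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_v = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        # Scaled dot-product attention: the weights depend on Q and K,
        # and the head is a weighted sum of the rows of V.
        weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (seq_len, seq_len)
        heads.append(weights @ V)                           # (seq_len, d_k)
    # Concatenate the heads and project back to d_model.
    W_o = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))  # 5 tokens, d_model = 16
out = multi_head_self_attention(X, num_heads=4, rng=rng)
print(out.shape)  # (5, 16)
```

Because every head only involves matrix multiplications over the whole sequence, all positions (and all heads) can be computed in parallel, which is exactly what the recurrent encoder could not do.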
Side note: The output sequence is still generated one element at a time (autoregressive decoding), as sketched below
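The toy loop below illustrates that side note. The `decode_step` function is a hypothetical stand-in (it just fabricates a "next token"), not the real Transformer decoder; the point is only that generation appends one token per step until an end-of-sequence token appears.

```python
def decode_step(memory, generated):
    # Placeholder: a real decoder would attend over `generated` and the
    # encoder output `memory`, then return the most likely next token.
    return (len(generated) + 1) % 10

def greedy_decode(memory, bos=0, eos=9, max_len=20):
    generated = [bos]
    for _ in range(max_len):
        next_token = decode_step(memory, generated)  # one token per step
        generated.append(next_token)
        if next_token == eos:
            break
    return generated

print(greedy_decode(memory=list(range(5))))
```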