Transformer Architecture: Attention Is All You Need

Aditya Thiruvengadam
Oct 9, 2018 · 11 min read
“You can’t cram the meaning of a whole %&!$# sentence into a single $&!#* vector!”
Attention weights are learned, so the model itself works out which inputs it should attend to
Attention as explained by the Transformer Paper
The Scaled Dot Product Multi-Head Self-Attention Architecture
Scaled Dot-Product Attention Equation
The Internal Architecture of Scaled Dot-Product Attention
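For readers who want to see the computation rather than the diagram, here is a minimal sketch of scaled dot-product attention as defined in the Transformer paper: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. It uses plain Python lists for readability and leaves out masking, batching, and multiple heads. The function names and the toy inputs are my own assumptions for illustration, not code from the paper.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v (lists of lists)."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is a weighted average of the value vectors.
    d_v = len(V[0])
    output = [[sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
              for row in weights]
    return output, weights

# Toy example (hypothetical numbers): 2 queries, 3 keys/values.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
output, weights = scaled_dot_product_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the value vectors. A real implementation would do the same thing with batched matrix multiplications.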
PE(pos, 2i) = sin(pos / 10000^{2i/d})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d})
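The sinusoidal positional encoding puts a sine in every even dimension and a cosine in every odd dimension, with the position divided by 10000 raised to the power 2i/d. Here is a small sketch of that rule in plain Python. The names `positional_encoding`, `max_len`, and `d_model` are assumptions for illustration (the formula just calls the model dimension d), and the sketch assumes d is even.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal encodings.

    Assumes d_model is even (an assumption of this sketch).
    """
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # `even` plays the role of 2i in the formula.
        for even in range(0, d_model, 2):
            angle = pos / (10000 ** (even / d_model))
            pe[pos][even] = math.sin(angle)      # PE(pos, 2i)
            pe[pos][even + 1] = math.cos(angle)  # PE(pos, 2i+1)
    return pe

pe = positional_encoding(max_len=50, d_model=4)
print(pe[0])  # position 0: every sine is 0 and every cosine is 1
```

Because the encoding depends only on the position and the dimension index, the table can be computed once and added to the input embeddings.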
The Transformer Architecture
Google’s Visualization describes it all!

