Transformer Architecture: Attention Is All You Need

Aditya Thiruvengadam
Oct 9, 2018 · 11 min read
“You can’t cram the meaning of a whole %&!$# sentence into a single $&!#* vector!”
The attention weights over the inputs are learned, so the model knows which inputs it should attend to.
Attention as explained in the Transformer paper
The Scaled Dot-Product Multi-Head Self-Attention Architecture
Scaled Dot-Product Attention Equation
The Internal Architecture of Scaled Dot-Product Attention
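The equation behind the figure is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. As a rough illustration (not the paper's reference code), here is a minimal NumPy sketch of scaled dot-product attention; the function names, the toy batch/sequence/head sizes, and the mask convention are assumptions made just for this example:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    # so the dot products do not blow up as the key dimension grows.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)
    if mask is not None:
        # Where mask is False, push the score to a large negative value
        # so softmax assigns that position ~zero attention weight.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)   # attention weights; each row sums to 1
    return weights @ v, weights          # weighted sum of the values

# Toy example: batch of 1, sequence of 4 tokens, d_k = d_v = 8 (assumed sizes).
rng = np.random.default_rng(0)
q = rng.normal(size=(1, 4, 8))
k = rng.normal(size=(1, 4, 8))
v = rng.normal(size=(1, 4, 8))
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # (1, 4, 8) (1, 4, 4)
```

In multi-head attention, this same operation is run h times in parallel on learned linear projections of Q, K, and V, and the h outputs are concatenated and projected once more.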
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
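A minimal sketch of these sinusoidal encodings, again in NumPy; the max_len and d_model values are toy sizes assumed for the example (d_model is taken to be even, as in the paper):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings: even dimensions use sin, odd dimensions use cos."""
    pos = np.arange(max_len)[:, None]                      # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)       # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                            # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                            # PE(pos, 2i+1)
    return pe

pe = positional_encoding(max_len=50, d_model=16)  # assumed toy sizes
print(pe.shape)  # (50, 16)
```

These encodings are simply added to the input embeddings, giving the model information about token order without any recurrence.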
The Transformer Architecture
Google’s Visualization describes it all!
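To tie the pieces together, here is a rough sketch of a single encoder layer: self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection and layer normalization. It reuses the scaled_dot_product_attention and positional_encoding helpers sketched above; it is single-head, omits dropout and the learned layer-norm gain/bias, and the weight shapes are toy assumptions, not the paper's reference implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector (no learned gain/bias here, for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, params):
    """One simplified (single-head) encoder layer: self-attention + feed-forward,
    each followed by a residual connection and layer normalization."""
    # Self-attention sublayer: queries, keys and values all come from x itself.
    q, k, v = x @ params["wq"], x @ params["wk"], x @ params["wv"]
    attn_out, _ = scaled_dot_product_attention(q, k, v)    # from the sketch above
    x = layer_norm(x + attn_out @ params["wo"])            # residual + layer norm
    # Position-wise feed-forward sublayer, applied to every position independently.
    ff = np.maximum(0, x @ params["w1"]) @ params["w2"]    # ReLU(x W1) W2
    return layer_norm(x + ff)                              # residual + layer norm

# Toy parameters (assumed sizes: d_model = 8, d_ff = 32).
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in {
    "wq": (d_model, d_model), "wk": (d_model, d_model),
    "wv": (d_model, d_model), "wo": (d_model, d_model),
    "w1": (d_model, d_ff), "w2": (d_ff, d_model)}.items()}
x = rng.normal(size=(1, 4, d_model)) + positional_encoding(4, d_model)  # embeddings + positions
print(encoder_layer(x, params).shape)  # (1, 4, 8)
```

In the full architecture, several such layers are stacked in the encoder, and the decoder adds a masked self-attention sublayer plus attention over the encoder's output.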
