Published in Analytics Vidhya

Part II: What is Attention?

This is the second part of the “Attention is all you need” But why? series. Feel free to check out the earlier content if you need a refresher before jumping into what attention is. You probably won’t need it, though, so let’s dive in!

Photo by Gertrūda Valasevičiūtė on Unsplash

One Friday evening, since I didn’t have any plans for the night, I decided to talk to my make-believe friend Lexi, who is a simple language model.

Me: “I love rivers. I can spend an eternity just sitting on a river’s bank, listening to the melody of its flow.”

Lexi: “Yeah, I agree. But banks can be tiring if you don’t have money in the account.”

Maybe I should start making Friday night plans instead of talking to Lexi.

The point is, understanding the context of a word or a sentence is still a problem. What’s with the word “bank” anyway?

Let me introduce you to two new problems: Word Sense Disambiguation and Word Sense Discovery. Consider the sentence:

“I hardly ever need to water the plant that grows in my yard because of the leak in the drains.”

The word plant is ambiguous. As a noun, it can mean a botanical lifeform or an industrial building. It also has rarer meanings, such as an actor who pretends to be part of an audience, and various verbal meanings, such as to put in the ground. In the sentence above, the context tells us that plant means a botanical lifeform, but a language model trying to make sense of the sentence might not know which meaning to choose when predicting what comes next. This is word sense ambiguity, and resolving it is the task of Word Sense Disambiguation.

Word Sense Discovery is the task of learning what senses a word may have in different contexts. To handle either task, language models need some mechanism.

(If you are interested, you can read more on the topic in the paper Word Sense Disambiguation: An Overview.)


Attention is an additional layer that lets a model focus on what is important in a given input sequence. To align words correctly, you add this layer to help the decoder understand which words in the input are most crucial in determining its prediction. For this to happen, we assign an attention score to each word vector. A higher attention score for a certain word in the sequence indicates that the word will influence the prediction more strongly. Now, how do we calculate this attention score?

Photo by Jaye Haych on Unsplash

Keys, queries, and values are the three main terms associated with attention. Let’s look at it this way: you forgot your wallet at the house, and you only realised it after stepping out. So you go back and start searching. You know it’s inside the house, and there are probably certain places where you are more likely to have left it. So you look in these places, and in one place or another, voila, you find it! Attention works somewhat like this too. You have a query that is associated with some value. You find the answer to the query by looking at the key-value relationship, that is, the places where you have the best chance of finding the value.

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
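To make that definition concrete, here is a minimal sketch for a single query, written in plain Python so it runs without any libraries. It assumes the compatibility function is a dot product followed by a softmax. The helper names (`softmax`, `dot`, `attend`) and the toy vectors are made up for illustration and do not come from any particular library.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attend(query, keys, values):
    """Return the weighted sum of values, plus the weights themselves."""
    # Compatibility of the query with every key (assumed here: dot product)
    scores = [dot(query, k) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Toy data, invented for this example
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

output, weights = attend(query, keys, values)
print(weights)  # the first key matches the query best, so it gets the largest weight
print(output)
```

Going back to the wallet: the query is what you are looking for, the keys are the places in the house, and the weights say how much each place deserves your attention.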

Attention uses encoded representations of both the input (the encoder hidden states) and the output (the decoder hidden states). The N key-value pairs come from the encoder hidden states, where N is the length of the input sequence, whereas the query comes from the decoder hidden states. The dot product of the query and each key is calculated as a measure of the similarity between them. These scores are turned into probabilities that each key matches the query, and those probabilities weight the sum of the values. This is what happens inside simple dot-product attention, so called because of the dot-product calculation Q x K^T, where Q is the query matrix and K^T is the transpose of the key matrix.
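The same idea in matrix form could look roughly like the sketch below. It is only an illustration under a few assumptions: the encoder hidden states are used directly as both keys and values (real models usually apply learned projections first), the decoder hidden states are used directly as queries, and the numbers are invented. The function names are hypothetical helpers, not a library API.

```python
import math

def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Apply softmax to every row so each row becomes a probability distribution."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def dot_product_attention(Q, K, V):
    """Unscaled dot-product attention: Softmax(Q x K^T) V."""
    scores = matmul(Q, transpose(K))  # shape: (number of queries, N)
    weights = softmax_rows(scores)    # each row sums to 1
    return matmul(weights, V), weights

# N = 3 input words, each encoded as a 2-dimensional hidden state (toy numbers)
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# 2 decoder steps, each with its own 2-dimensional hidden state (toy numbers)
decoder_states = [[1.0, 0.0], [0.5, 0.5]]

# Assumption: keys and values are both the raw encoder states
context, weights = dot_product_attention(decoder_states, encoder_states, encoder_states)
print(weights)  # one row of N weights per decoder step
print(context)  # one weighted sum of encoder states per decoder step
```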

Scaled dot-product Attention

Simply put, scaled dot-product attention is essentially a normalised version of dot-product attention. Generally, when the dot product produces a large result, it shows that the query and the key are highly similar. We need the resulting values to lie between 0 and 1, because we use them as probabilities. This is easily done with the Softmax function, which normalizes all the scores into a probability distribution. This way, good probabilistic attention weights are obtained. But with very large vector dimensions, the dot products grow large in magnitude, which pushes the softmax into regions where its gradients are very small. Therefore we scale the scores beforehand by scale = 1/√d, where d is the dimension of the key vectors:

Z = Softmax((Q x K^T) / √d) V,

where Z is the final attention matrix.
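Here is a rough sketch of scaled dot-product attention in plain Python, followed by a small experiment on why the scaling helps. The function name, the choice of d = 512 and the random vectors are assumptions made for illustration only.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Z = Softmax((Q x K^T) / sqrt(d)) V, with Q, K, V given as lists of row vectors."""
    d = len(K[0])                # dimension of each key vector
    scale = 1.0 / math.sqrt(d)
    Z, all_weights = [], []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        weights = softmax(scores)
        all_weights.append(weights)
        Z.append([sum(w * v[i] for w, v in zip(weights, V)) for i in range(len(V[0]))])
    return Z, all_weights

# Experiment: with a large d, raw dot products get large and softmax saturates
random.seed(0)
d = 512
q = [random.gauss(0, 1) for _ in range(d)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(4)]

raw_scores = [sum(a * b for a, b in zip(q, k)) for k in K]
unscaled_weights = softmax(raw_scores)
scaled_weights = softmax([s / math.sqrt(d) for s in raw_scores])

print(max(unscaled_weights))  # typically very close to 1: almost all weight on one key
print(max(scaled_weights))    # noticeably smaller: the weights are more spread out

# The function gives the same weights as the manual scaling above (V = K here for simplicity)
Z, W = scaled_dot_product_attention([q], K, K)
```

When nearly all the weight sits on a single key, the softmax is in its flat region and its gradients are close to zero. Dividing by √d keeps the scores in a range where the model can still learn.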

Attention can be applied in different ways depending on the task at hand. Here is a simple overview:

(i) Encoder/Decoder Attention: one sentence looks at another sentence for prediction, e.g. translation

(ii) Self Attention (Causal Attention): the words in a single sentence look at the previous words for possible relationships, e.g. summary generation

(iii) Bi-directional Attention: in a single sentence, words look at both previous and future words for making a prediction.
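One common way to express the difference between (ii) and (iii) is a mask that says which positions each word is allowed to look at. The sketch below is a simplified illustration, not how any specific library does it. It assumes that in causal attention a word may look at itself as well as the earlier words, and it gives every pair of words the same score so that only the effect of the mask is visible.

```python
import math

def masked_softmax(scores, allowed):
    """Softmax over one row of scores, giving zero weight to disallowed positions."""
    visible = [s for s, ok in zip(scores, allowed) if ok]
    m = max(visible)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(n):
    """Word i may look at words 0..i (assumption: itself and everything before it)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every word may look at every word, both earlier and later."""
    return [[True] * n for _ in range(n)]

n = 4
scores = [[0.0] * n for _ in range(n)]  # identical scores, so only the mask matters

causal_weights = [masked_softmax(row, mask)
                  for row, mask in zip(scores, causal_mask(n))]
bidir_weights = [masked_softmax(row, mask)
                 for row, mask in zip(scores, bidirectional_mask(n))]

print(causal_weights[0])  # [1.0, 0.0, 0.0, 0.0]: the first word sees only itself
print(causal_weights[1])  # [0.5, 0.5, 0.0, 0.0]: the second word sees the first two
print(bidir_weights[1])   # [0.25, 0.25, 0.25, 0.25]: every word sees all four
```

Encoder/decoder attention would typically need no such mask over the input, since the queries come from one sentence and the keys and values from another.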

Now we have an idea of what attention is. But what prompted us to say that attention is all we need? The next and final article in this series will bring you to the conclusion we have prophesied all along.




Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem

Sreelakshmi V B

NLP | Data Science Enthusiast | Writer | Singer
