Part II: What is Attention?
This is the second part of the “Attention is all you need” But why? series. Feel free to check out the earlier content in the series if you need a refresher before jumping into What is attention? You probably won’t need it, but either way, let’s dive in!
One Friday evening, since I didn’t have any plans for the night, I decided to talk to my make-believe friend Lexi, who is a simple language model.
Me: “I love rivers. I can spend an eternity just sitting on its bank, listening to the melody of its flow”
Lexi: “Yah, I agree. But banks can be tiring if you don’t have money in the account”
Maybe I should start making Friday night plans instead of talking to Lexi.
Point is, the context of a word or a sentence is still a problem. What’s with the word bank anyway?
Let me introduce you to two new problems, Word Sense Disambiguation and Word Sense Discovery. Consider the sentence,
“I hardly ever need to water the plant that grows in my yard because of the leak in the drains.”
The word plant is ambiguous. As a noun, it can mean a botanical lifeform or an industrial building. It also has rarer meanings, such as an actor who pretends to be part of an audience, and various verbal meanings, such as to put in the ground. In the sentence above, the context tells us that it means a botanical lifeform, but a language model trying to make sense of the sentence might not know which meaning to choose when predicting what comes next. Picking the right sense from the context is Word Sense Disambiguation. Word Sense Discovery is the task of learning what senses a word may have in different contexts. To do either, language models need some mechanism for taking context into account.
(If you are interested, you can read more about the topic in the paper Word Sense Disambiguation: An Overview.)
Attention is an additional layer that lets a model focus on what is important in a given input sequence. To align words correctly, we add this layer to help the decoder understand which words in the input are most crucial in determining its prediction. For this to happen, we assign an attention score to each word vector. A higher attention score for a certain word in the sequence indicates that the word will influence the prediction more strongly. Now, how do we calculate this attention score?
Keys, Queries, and Values are the three main terms associated with attention. Let’s look at it this way: you forgot your wallet at the house, and you only realised it after stepping out. So you go back and start searching. You know it’s inside the house, and there are certain places where you are more likely to have left the wallet. So you look in these places, and in one place or another, voila, you find it! Attention works somewhat this way too. You have a query, and you are looking for the value that answers it. You find the answer by checking the keys: the better a key matches the query, the more likely its value is what you are looking for.
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
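To make that definition concrete, here is a minimal sketch for a single query, assuming a plain dot product as the compatibility function and softmax to turn scores into weights. The helper names (`softmax`, `attend`) and the toy numbers are made up for illustration.

```python
import numpy as np

def softmax(x):
    # Subtracting the max keeps exp() numerically stable; the result is unchanged.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query, keys, values):
    # Compatibility function (assumed here): dot product of the query with each key.
    scores = keys @ query                # shape: (num_pairs,)
    weights = softmax(scores)            # non-negative, sums to 1
    return weights @ values, weights     # output is the weighted sum of the values

# Three hypothetical key-value pairs.
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [-1.0, 0.0]])
values = np.array([[10.0, 0.0],
                   [0.0, 10.0],
                   [5.0, 5.0]])
query = np.array([0.0, 5.0])   # points in the same direction as the second key

output, weights = attend(query, keys, values)
print(weights)   # the second weight dominates
print(output)    # so the output lands close to the second value
```

Because the query lines up with the second key, almost all of the weight goes to the second value. That is the "looking in the most likely place" from the wallet analogy, except that every place gets at least a small share of attention.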
Attention uses encoded representations of both the input (the encoder hidden states) and the output (the decoder hidden states). The N key-value pairs come from the encoder hidden states, where N is the length of the input sequence, whereas the query comes from the decoder hidden states. The dot product of a key and the query is calculated as a measure of the similarity between them. These scores are turned into weights, and the output is the sum of the Values, each weighted by how well its key matches the query. This is what happens inside simple dot-product attention, so called because of the dot-product calculation Q x K^T, where Q is the Query matrix and K^T is the transpose of the Key matrix.
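Here is a rough sketch of the shapes involved, assuming random stand-ins for the hidden states and using them directly as Q, K and V. Real models usually pass the hidden states through learned projections first, which this sketch leaves out. The sizes N, M and d are arbitrary toy values.

```python
import numpy as np

def softmax_rows(x):
    # Row-wise softmax with the usual max subtraction for stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dot_product_attention(Q, K, V):
    scores = Q @ K.T                  # (M, N): one score per (query, key) pair
    weights = softmax_rows(scores)    # each row is a distribution over the N inputs
    return weights @ V, weights       # (M, d): one context vector per query

rng = np.random.default_rng(0)
N, M, d = 5, 3, 4                            # input length, output length, vector size
encoder_states = rng.normal(size=(N, d))     # stand-in for encoder hidden states
decoder_states = rng.normal(size=(M, d))     # stand-in for decoder hidden states

K = V = encoder_states    # keys and values come from the encoder
Q = decoder_states        # queries come from the decoder

context, weights = dot_product_attention(Q, K, V)
print(weights.shape)   # (3, 5): each decoder position scores every input word
print(context.shape)   # (3, 4)
```

Each row of `weights` tells you how much one decoder position attends to each of the N input words.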
Scaled dot-product Attention
Simply put, scaled dot-product attention is dot-product attention with the scores scaled down before they are normalised. Generally, when the dot product produces a large result, it shows that the query and key are highly similar. We need the resulting weights to lie between 0 and 1 and to sum to 1, because we use them as probabilities. This is easily done with the Softmax function, which normalizes all scores into a probability distribution. This way, good probabilistic attention weights are obtained. But with very large vector dimensions, the dot products grow large in magnitude, which pushes the softmax into regions where its gradients are very small. Therefore, we scale the scores beforehand by 1/√d, where d is the dimension of the key vectors:
Z = Softmax((Q x K^T) / √d) V,
where Z is the final attention matrix.
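To see what the 1/√d scaling buys us, here is a small sketch that compares the softmax weights with and without it. The inputs are random unit-variance vectors with d = 512, chosen only for illustration, and the function name `scaled_dot_product_attention` is my own.

```python
import numpy as np

def softmax_rows(x):
    # Row-wise softmax with max subtraction for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d = K.shape[-1]                      # dimension of the key vectors
    scores = (Q @ K.T) / np.sqrt(d)      # scale before the softmax
    weights = softmax_rows(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 512
Q = rng.normal(size=(4, d))
K = rng.normal(size=(6, d))
V = rng.normal(size=(6, d))

unscaled = softmax_rows(Q @ K.T)                      # no scaling
Z, scaled = scaled_dot_product_attention(Q, K, V)     # with 1/sqrt(d)

# Largest weight in each row: without scaling it sits much closer to 1.
print(unscaled.max(axis=1))
print(scaled.max(axis=1))
```

Without the scaling, the raw dot products are spread over a range of roughly √d, so the softmax tends to put almost all of its weight on a single key. That saturated regime is exactly where the gradients become tiny. Dividing by √d keeps the weights spread out enough for learning to proceed.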
Attention can be applied in different ways depending on the task at hand. Here is a quick overview:
(i) Encoder/Decoder Attention: One sentence looks at another sentence for prediction, e.g. translation.
(ii) Self Attention (Causal Attention): The words in a single sentence look only at the words before them for possible relationships, e.g. summary generation.
(iii) Bi-directional Attention: In a single sentence, words look at both previous and future words to make a prediction.
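In code, the difference between (ii) and (iii) usually comes down to a mask on the scores. The sketch below assumes a random toy sentence of four word vectors and uses them directly as queries, keys and values, with no learned projections. The function name `self_attention_weights` is hypothetical.

```python
import numpy as np

def softmax_rows(x):
    # Row-wise softmax; exp(-inf) evaluates to 0, so masked positions get zero weight.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention_weights(X, causal):
    n, d = X.shape
    scores = (X @ X.T) / np.sqrt(d)      # every word scored against every word
    if causal:
        # True above the diagonal marks "future" positions, which are blocked.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax_rows(scores)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # 4 toy word vectors of size 8

bidirectional = self_attention_weights(X, causal=False)
causal_w = self_attention_weights(X, causal=True)

print(np.round(bidirectional, 2))   # every entry is non-zero
print(np.round(causal_w, 2))        # zeros above the diagonal
```

In the causal case the first word can only attend to itself, and each later word sees one more position than the one before it. In the bi-directional case every word sees the whole sentence.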
Now we have an idea of what attention is. But what prompted us to say that attention is all we need? The next and final article in this series will bring you to the conclusion we prophesied all along.