In Depth Understanding of Attention Mechanism (Part I) - Origin

FunCry
4 min read · Feb 28, 2023

--

Introduction

In recent years, there have been significant advancements in natural language processing, from Google’s BERT in 2018 to later models like GPT and ChatGPT. These models owe their success not only to massive amounts of data and computing resources but also to the Transformer architecture behind them. This article explores the attention mechanism that underlies the Transformer and has played a crucial role in the success of recent natural language processing models.

Machine Translation

Before we begin, let’s introduce a field of natural language processing: machine translation. Translation is a typical sequence-to-sequence problem, meaning:

  1. The input is a variable-length sequence (e.g., a Chinese sentence: “今天天氣很晴朗”, which literally translates to “It’s sunny today”).
  2. The output is a variable-length sequence (e.g., an English sentence: “It’s sunny today”).

If the Transformer had not yet been invented, how would a computer handle this type of problem?

The intuitive approach would probably look like the figure above, divided into two steps (a minimal code sketch follows the list):

  1. Read the input sequence (the Chinese sentence) one character at a time.
    In the figure above, each x corresponds to a character: the computer reads the first character, stores it, then moves on to the next one (read “今” first, then “天”……), and stores all of this information in the context vector in the middle.
  2. Generate the output sequence (the English sentence).
    When generating, the model outputs one word at a time based on the information stored in the context vector (e.g., “it”, “’s”, “sunny”, …).
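
To make this concrete, here is a minimal sketch of that two-step idea in PyTorch. The class name, layer sizes, and the use of GRUs are my own illustrative assumptions rather than the exact model from any particular paper; the point is only that the whole source sentence must be squeezed into one fixed-size context vector.

    # A minimal sketch of the two steps above (illustrative names and sizes,
    # using GRUs; not the exact model from any particular paper).
    import torch
    import torch.nn as nn

    class Seq2SeqNoAttention(nn.Module):
        def __init__(self, src_vocab, tgt_vocab, dim=64):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.decoder = nn.GRU(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, tgt_vocab)

        def forward(self, src_ids, tgt_ids):
            # Step 1: read the source characters one by one; the final hidden
            # state is the single fixed-size context vector.
            _, context = self.encoder(self.src_emb(src_ids))
            # Step 2: generate the output words conditioned only on that vector.
            dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), context)
            return self.out(dec_out)  # logits over the target vocabulary

    model = Seq2SeqNoAttention(src_vocab=100, tgt_vocab=100)
    logits = model(torch.randint(0, 100, (1, 6)),   # e.g. 6 source characters
                   torch.randint(0, 100, (1, 4)))   # e.g. 4 target words so far
    print(logits.shape)  # torch.Size([1, 4, 100])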

While it is possible to use this type of model for machine translation, the example above is short. If we want the machine to handle longer inputs (such as this article), the following problems arise:

  1. The context vector has a fixed, low dimensionality, so it might not be able to remember earlier words.
  2. Reading the input word by word is time-consuming.

To address these issues, the following model was proposed.

Neural Machine Translation by Jointly Learning to Align and Translate

(For the input sentence, I use “character” since the example is in Chinese. If you are dealing with English, the unit would probably be “word”.)

This paper focuses on addressing the first problem mentioned above. The approach is simple: since the context vector cannot store all the information from the input sentence, the model should look at the entire sentence directly when producing each output. In practice, at every output step the model assigns a weight to each input character and then takes a weighted average of the input characters.

In the proposed model, each input character is first converted into a hidden state (represented by ‘h’ below, which can be thought of as a word embedding), and the model calculates the attention score based on the word being translated and the hidden state of each character.

The function for calculating the score is learned by the model.

Let’s take an example where the language model has already output the phrase “It’s sunny” and is now deciding on the next word to output.

To make this decision, the model needs to determine which Chinese characters are most important for the next output word. It uses a function α to calculate the importance score of each character, such as α(“sunny”, 今) = 0.8 and α(“sunny”, 天) = 0.6, among others.

Since the next word is supposed to be “today,” characters related to “今天 (today)” should have a higher score. As a result, each Chinese character may have the following scores:

  • “今”: 0.8
  • “天”: 0.6
  • “氣”: 0.05
  • “很”: 0.01
  • ……

To calculate the final hidden state, the model multiplies the hidden state of each character by its corresponding importance score and then sums them up. (h(今) * 0.8 + h(天) * 0.6 + …… )
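
As a toy illustration of this weighted sum, here is a small numpy snippet using the made-up scores above and hypothetical 2-dimensional hidden states. (In the actual paper, the raw scores are first normalized with a softmax so the weights sum to 1; that step is included below.)

    # Toy weighted sum with made-up scores and hypothetical 2-d hidden states.
    import numpy as np

    h = {
        "今": np.array([0.9, 0.1]),
        "天": np.array([0.8, 0.3]),
        "氣": np.array([0.2, 0.7]),
        "很": np.array([0.1, 0.5]),
    }
    scores = {"今": 0.8, "天": 0.6, "氣": 0.05, "很": 0.01}

    # Normalize the raw scores with a softmax so the weights sum to 1.
    chars = list(h)
    raw = np.array([scores[c] for c in chars])
    weights = np.exp(raw) / np.exp(raw).sum()

    # The context for this output step is the weighted sum of the hidden states.
    context = sum(w * h[c] for w, c in zip(weights, chars))
    print(dict(zip(chars, weights.round(2))))  # roughly {'今': 0.36, '天': 0.30, '氣': 0.17, '很': 0.17}
    print(context)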

This model is essentially a prototype of the Attention mechanism, which involves two steps:

  1. Assigning a weight to each input character.
  2. Computing a weighted sum of the input sentence.

Today, when discussing the Attention mechanism, three components are often mentioned: Query, Key, and Value. Although this paper does not explicitly use these terms, we can infer from the model architecture above that:

  • Query: the character being translated (“sunny”).
  • Key, Value: the hidden state of each character.
  • Attention score: aₜ,ᵢ, which is calculated from the Query and the Key by the attention function.

One way to grasp the Attention mechanism is this: the model itself decides which Key (which Chinese character) matters most for the current Query (the English word being translated).

In this model, Value and Key are actually the same thing.
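
Putting the pieces together, here is a sketch of this additive-attention idea in PyTorch, using the Query/Key/Value vocabulary. The layer names (W_q, W_k, v) and the sizes are illustrative assumptions rather than the paper’s exact parameterization.

    # A sketch of additive (Bahdanau-style) attention, with Query/Key/Value
    # made explicit. Layer names and sizes are illustrative assumptions.
    import torch
    import torch.nn as nn

    class AdditiveAttention(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            self.W_q = nn.Linear(dim, dim)   # transforms the Query (decoder state)
            self.W_k = nn.Linear(dim, dim)   # transforms the Keys (encoder states)
            self.v = nn.Linear(dim, 1)       # scores each (Query, Key) pair

        def forward(self, query, keys):
            # query: (batch, dim)          -- the word being translated ("sunny")
            # keys:  (batch, src_len, dim) -- hidden state of each input character
            scores = self.v(torch.tanh(self.W_q(query).unsqueeze(1) + self.W_k(keys)))
            weights = torch.softmax(scores, dim=1)   # one weight per character
            # Here Value == Key: the weighted sum runs over the same hidden states.
            context = (weights * keys).sum(dim=1)    # (batch, dim)
            return context, weights.squeeze(-1)

    attn = AdditiveAttention()
    ctx, w = attn(torch.randn(1, 64), torch.randn(1, 6, 64))
    print(ctx.shape, w.shape)  # torch.Size([1, 64]) torch.Size([1, 6])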

Conclusion

This article introduced the origin of the Attention mechanism in detail. Although the model presented here is no longer commonly used in natural language processing, it shows why this mechanism needed to be developed and what the core concept of Attention is.

The next article will introduce the logic behind the Scaled Dot-Product Attention that everyone uses today, along with examples.
