Rethinking Self-Attention in Transformer Models

Jul 17, 2020 · 6 min read

The attention mechanism is commonly used to improve the performance of seq2seq (encoder-decoder) architectures. Its principle is to compute the relevance of a query to each key, which yields a distribution of attention weights. In most NLP tasks, the keys and values are both encodings of the input sequence.

Tay et al. [4] study how much the dot-product self-attention mechanism actually contributes to the performance of the Transformer model. Their experiments show that a randomly initialized alignment matrix performs surprisingly competitively, and that learning attention weights from token-to-token (query-key) interactions is not as important as commonly assumed.

To this end, they propose SYNTHESIZER, a model that learns synthetic attention weights without any token-to-token interaction.

Basic Concepts

Here we focus on how the basic Self-Attention mechanism works; it is the first sub-layer of each Transformer block. Essentially, for each input vector, Self-Attention produces an output vector that is a weighted sum over the vectors of the sequence, where the weights are determined by how strongly the words are related.

At the most basic level, Self-Attention is a process in which one vector sequence x is encoded into another vector sequence z. Each original vector is just a block of numbers representing a word. The direction of each word vector is meaningful: the similarities and differences between the vectors correspond to the similarities and differences between the words themselves. The corresponding z vector represents both the original word and its relationship with the other words around it.

First, we take the dot product of a vector x with every vector in the sequence, including itself. You can think of the dot product of two vectors as a measure of how similar they are.

The dot product of two vectors equals the product of their lengths times the cosine of the angle between them, so for vectors of similar length, the closer they are in direction, the larger the dot product.

These raw scores need to be normalized so that they are easier to use. We use the softmax function for this. It maps the sequence of scores to values between 0 and 1 that sum to 1, where each output is proportional to the exponential of the corresponding input. This makes the weights easier to use and interpret.

Finally, we multiply each input vector x by its normalized weight and sum the products to obtain the output vector z.
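The three steps above (dot products, softmax, weighted sum) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names and the toy input are assumptions made for the example, not code from the paper.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def basic_self_attention(x):
    """x: (seq_len, d) array of word vectors; returns z with the same shape."""
    scores = x @ x.T            # dot product between every pair of vectors
    weights = softmax(scores)   # each row is positive and sums to 1
    return weights @ x          # z_i = sum_j weights[i, j] * x_j

# Toy input (hypothetical): the first two vectors point in similar directions.
x = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
z = basic_self_attention(x)
```

Because the first two toy vectors are nearly parallel, the first word puts more weight on the second word than on the third. Note that this basic form has no learnable parameters at all.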

Transformer Self-Attention

In Transformer self-attention, each word has three different vectors: a Query vector (Q), a Key vector (K) and a Value vector (V).

They are obtained by multiplying the embedding vector x by three different weight matrices W^Q, W^K and W^V.

From the embedding vector we thus get three vectors q, k and v, and then compute a score for each pair of positions as the dot product of a query with a key.

To stabilize the gradients, the Transformer scales the scores by dividing them by the square root of the key dimension, √d_k [3], and then applies the softmax function to them.

Finally, each value vector is multiplied by its softmax weight and the results are summed to give the output for each input vector.
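As a rough sketch of these steps for a single attention head, assuming randomly initialized weight matrices in place of trained ones (the names `w_q`, `w_k`, `w_v` and the sizes are illustrative choices):

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention on x of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v   # project embeddings to queries, keys, values
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)       # token-to-token scores, scaled by sqrt(d_k)
    weights = softmax(scores)             # one attention distribution per token
    return weights @ v, weights           # weighted sum of the value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8           # toy sizes, chosen arbitrarily
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
z, weights = self_attention(x, w_q, w_k, w_v)
```

The `q @ k.T` product is the token-to-token interaction that the rest of the article questions.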

Every input emits a query and a key, and dot-product attention between them yields the attention matrix, which determines the relative importance of each token with respect to all other tokens in the sequence. In effect, queries, keys and values let self-attention simulate a content-based retrieval process, and the core of this process is the pairwise interaction between tokens.


SYNTHESIZER no longer calculates the dot product between two tokens, but learns to synthesize a self-alignment matrix, that is, a synthetic self-attention matrix. The paper proposes a variety of synthesis methods and evaluates them comprehensively. The information sources available to these synthesis functions include single tokens, token-token interactions, and global task information.

Dense Synthesizer

The Dense Synthesizer takes an input X ∈ R^(φ×d) and produces an output Y ∈ R^(φ×d), where φ is the sequence length and d is the dimension of the model. First, a parameterized function F(.) projects each input X_i from d dimensions to φ dimensions:

B_i = F(X_i)

Here F(.) maps R^d to R^φ, and X_i is the i-th token of X. Essentially, each token predicts its own attention weights over every position in the input sequence, which is equivalent to fixing K as a constant matrix. In practice, F(.) is a simple two-layer feed-forward network with a ReLU activation σ_R:

F(X) = W_2 σ_R(W_1 X + b_1) + b_2

The output is then computed as

Y = Softmax(B) G(X)

where G(.) is another parameterized function of X, playing the role of V in the standard Transformer model.

This method eliminates the dot product entirely by replacing QK^T in the standard Transformer with the synthesis function F(.).
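A minimal NumPy sketch of the Dense Synthesizer follows. It assumes that G(.) is a single linear map and uses random, untrained weights; all parameter names are hypothetical.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def dense_synthesizer(x, w1, b1, w2, b2, w_g):
    """x: (seq_len, d). Each token predicts its own row of attention scores."""
    hidden = np.maximum(x @ w1 + b1, 0.0)   # first layer + ReLU
    b = hidden @ w2 + b2                    # F(X): (seq_len, seq_len), no q.k product
    g = x @ w_g                             # G(X), assumed linear here
    return softmax(b) @ g, b

rng = np.random.default_rng(0)
seq_len, d = 5, 8                           # toy sizes, chosen arbitrarily
x = rng.normal(size=(seq_len, d))
w1, b1 = rng.normal(size=(d, d)), np.zeros(d)
w2, b2 = rng.normal(size=(d, seq_len)), np.zeros(seq_len)
w_g = rng.normal(size=(d, d))
y, b = dense_synthesizer(x, w1, b1, w2, b2, w_g)
```

Row i of `b` is computed from token i alone. One consequence of this design is that the shape of `w2` ties the layer to a fixed sequence length.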

Random Synthesizer

The attention weights are initialized to random values; in effect, Q is fixed as a constant matrix as well. The entire B then becomes a constant matrix R that does not depend on the input, that is

Y = Softmax(R) G(X)

The original paper calls this variant Random: R is initialized randomly and can then either be updated during training or kept fixed. Formally, Random is equivalent to a depthwise separable convolution.
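The sketch below illustrates the Random Synthesizer under the same assumptions as before (G(.) taken to be a single linear map, untrained random weights, hypothetical names). It checks that two different inputs receive exactly the same attention weights.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def random_synthesizer(x, r, w_g):
    """x: (seq_len, d); r: (seq_len, seq_len) matrix that never looks at x."""
    weights = softmax(r)              # attention weights shared by every input
    return weights @ (x @ w_g), weights

rng = np.random.default_rng(0)
seq_len, d = 5, 8                     # toy sizes, chosen arbitrarily
r = rng.normal(size=(seq_len, seq_len))   # random init; may be trained or frozen
w_g = rng.normal(size=(d, d))

x_a = rng.normal(size=(seq_len, d))
x_b = rng.normal(size=(seq_len, d))
y_a, weights_a = random_synthesizer(x_a, r, w_g)
y_b, weights_b = random_synthesizer(x_b, r, w_g)
```

Here `weights_a` and `weights_b` are identical even though the inputs differ; only G(X) carries input-specific information into the output.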

These new forms tend to increase the number of parameters, which can be reduced by low-rank factorization. For Dense and Random, the original paper proposes two low-rank forms, called Factorized Dense and Factorized Random respectively.

Factorized Dense:

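Factorizing the output not only slightly reduces the parameter cost of the Synthesizer but also helps prevent overfitting. Following [4], the factorized variant can be expressed as

A, B = F_A(X_i), F_B(X_i)

C_i = H_A(A) * H_B(B)

Y = Softmax(C) G(X)

where F_A(.) projects X_i to a dimensions and F_B(.) projects it to b dimensions, with a × b = φ. H_A(.) and H_B(.) are tiling functions that repeat their inputs b and a times respectively, so that both results have length φ, and * is the element-wise product.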
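One possible reading of Factorized Dense, sketched in NumPy: each token is projected to two short vectors of lengths a and b with a × b equal to the sequence length, both are tiled to full length, and they are multiplied element-wise. The single linear projections, the use of `np.tile` and all names are assumptions made for illustration.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def factorized_dense_synthesizer(x, w_a, w_b, w_g):
    """x: (seq_len, d); w_a: (d, a); w_b: (d, b) with a * b == seq_len."""
    seq_len = x.shape[0]
    proj_a = x @ w_a                       # (seq_len, a)
    proj_b = x @ w_b                       # (seq_len, b)
    a, b = proj_a.shape[1], proj_b.shape[1]
    if a * b != seq_len:
        raise ValueError("a * b must equal the sequence length")
    tiled_a = np.tile(proj_a, (1, b))      # repeat b times -> (seq_len, a * b)
    tiled_b = np.tile(proj_b, (1, a))      # repeat a times -> (seq_len, b * a)
    c = tiled_a * tiled_b                  # element-wise product, (seq_len, seq_len)
    return softmax(c) @ (x @ w_g), c

rng = np.random.default_rng(0)
seq_len, d, a, b = 6, 4, 2, 3              # toy sizes with a * b == seq_len
x = rng.normal(size=(seq_len, d))
w_a = rng.normal(size=(d, a))
w_b = rng.normal(size=(d, b))
w_g = rng.normal(size=(d, d))
y, c = factorized_dense_synthesizer(x, w_a, w_b, w_g)
```

The two projections output a + b values per token instead of the a × b values a full projection to the sequence length would need.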

Factorized Random:

R is factorized into two low-rank matrices R_1, R_2 ∈ R^(φ×k), so that R = R_1 R_2^T and Y = Softmax(R_1 R_2^T) G(X).
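To make the parameter saving concrete, here is a small sketch of Factorized Random. The sizes (sequence length 64, rank 8) and names are arbitrary choices for the example, and G(.) is again assumed to be linear.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def factorized_random_synthesizer(x, r1, r2, w_g):
    """x: (seq_len, d); r1, r2: (seq_len, k) low-rank factors of R."""
    r = r1 @ r2.T                      # (seq_len, seq_len), rank at most k
    return softmax(r) @ (x @ w_g)

rng = np.random.default_rng(0)
seq_len, d, k = 64, 16, 8              # toy sizes, chosen arbitrarily
x = rng.normal(size=(seq_len, d))
r1 = rng.normal(size=(seq_len, k))
r2 = rng.normal(size=(seq_len, k))
w_g = rng.normal(size=(d, d))
y = factorized_random_synthesizer(x, r1, r2, w_g)

full_params = seq_len * seq_len        # parameters of an unfactorized R
factorized_params = r1.size + r2.size  # parameters of R_1 and R_2 together
```

With these sizes the alignment matrix needs 1024 parameters instead of 4096, at the cost of restricting R to rank at most k.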

Mixture of Synthesizers:

Y = Softmax(α_1 S_1(X) + ... + α_N S_N(X)) G(X), where each S_i(.) is a parameterized synthesis function and the α_i are learnable weights with ∑ α_i = 1.
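A sketch of a two-way mixture that combines a Random and a Dense synthesizer. Keeping the mixing weights on the simplex by passing unconstrained logits through a softmax is an assumption of this example, as are the single-layer Dense part and all names.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def mixture_synthesizer(x, r, w_dense, alpha_logits, w_g):
    """Mix a Random (r) and a single-layer Dense (x @ w_dense) synthesizer."""
    alpha = softmax(alpha_logits)                   # mixing weights, sum to 1
    b = alpha[0] * r + alpha[1] * (x @ w_dense)     # weighted sum of the two score matrices
    return softmax(b) @ (x @ w_g), alpha

rng = np.random.default_rng(0)
seq_len, d = 5, 8                                   # toy sizes, chosen arbitrarily
x = rng.normal(size=(seq_len, d))
r = rng.normal(size=(seq_len, seq_len))
w_dense = rng.normal(size=(d, seq_len))
w_g = rng.normal(size=(d, d))
alpha_logits = np.zeros(2)                          # equal mixing before any training
y, alpha = mixture_synthesizer(x, r, w_dense, alpha_logits, w_g)
```

The mixing happens on the score matrices before the softmax, so the result is still a single attention distribution per token.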


Experimental results show that SYNTHESIZER can obtain competitive results even with global attention weights alone, without considering token-token interactions or any instance-level information at all.

A randomly initialized SYNTHESIZER achieves 27.27 BLEU on WMT 2014 English-German. In some cases, the popular and sophisticated content-based dot-product attention can be replaced with a simpler SYNTHESIZER variant without sacrificing much performance.


  1. Alammar J. The Illustrated Transformer.
  2. Bloem P. Transformers from Scratch. (2019)
  3. Vaswani A. et al. Attention Is All You Need. (2017)
  4. Tay Y. et al. Rethinking Self-Attention in Transformer Models.

The Startup

Get smarter at building your thing. Join The Startup’s +800K followers.


Written by

I am currently a Data Science student at Ecole Polytechnique, the leading French institution combining top-level research, academics, and innovation.
