Onikle Paper Summary: An Attention Free Transformer
original paper: https://onikle.com/articles/435319
The Transformer has made significant contributions to NLP since it was proposed in the paper Attention Is All You Need. Beyond NLP, it is also becoming widely used in computer vision. The core of the Transformer is self-attention, which computes an attention matrix relating every position in a sequence to every other position. This time, I would like to introduce a paper on the Attention Free Transformer, a Transformer that requires almost no attention.
The self-attention part of the Transformer is very expensive in both time and memory; for example, BERT, a typical Transformer-based model, takes more than 3 days to train. The Attention Free Transformer is the result of searching for an architecture that trains and predicts quickly while keeping accuracy at the same level as the Transformer or higher.
The Attention Free Transformer replaces the dot-product calculation with another calculation method, because the dot product is the most time-consuming part of attention. The original attention algorithm is as follows:
Attention consists of three vectors: Query, Key, and Value. It calculates an attention score from the Query and Key and uses that score to weight the Value; the result is the word embedding produced by attention.
The Query and Key have the same dimension, while the Value may have a different dimension.
The degree of relevance is calculated as the dot product of Query and Key, scaled down by the square root of the key dimension, and passed through the softmax function. The result is then used to weight the Value.
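The steps described above can be sketched in a few lines of NumPy. This is a minimal single-sequence illustration, not the code from either paper; the function names, the toy sizes, and the random inputs are assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dot_product_attention(Q, K, V):
    """Scaled dot-product attention for a single sequence.

    Q, K: arrays of shape (T, d_k); V: array of shape (T, d_v).
    Returns an array of shape (T, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T, T): relevance of every pair of positions
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted average of the values


rng = np.random.default_rng(0)
T, d_k, d_v = 5, 8, 4                    # toy sizes, chosen only for illustration
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))
out = dot_product_attention(Q, K, V)
print(out.shape)  # (5, 4)
```

Note the intermediate `scores` array of shape (T, T): it is the part whose size grows with the square of the sequence length.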
The original Attention has the following time complexity and space complexity.
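For reference, the commonly quoted costs of scaled dot-product attention can be written as below. The symbols are assumptions introduced here: T is the sequence length and d is the feature dimension. The T² term comes from the T × T score matrix.

```latex
\text{time: } O(T^{2} d), \qquad \text{space: } O(T^{2} + T d)
```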
Attention Free Transformer
Attention Free Transformer reduces the amount of computation by replacing the dot-product calculation with the element-wise product calculation. The following is the algorithm:
For each target position, a weighted average of the values is computed, and the result is combined with the query by an element-wise product. The time and space complexity of this algorithm are as follows:
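The description above can be sketched in NumPy as follows. This is a rough illustration under stated assumptions, not the authors' implementation: it assumes the weights for the average come from the exponentiated keys plus an optional pairwise position bias, and that the query is passed through a sigmoid before the element-wise product. The names `aft` and `w` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def aft(Q, K, V, w=None):
    """Sketch of an attention-free layer for a single sequence.

    Q, K, V: arrays of shape (T, d), all with the same shape.
    w: optional (T, T) array of pairwise position biases (illustrative name);
       None means no position biases.
    """
    T = K.shape[0]
    if w is None:
        w = np.zeros((T, T))
    # Subtracting the maxima keeps exp() from overflowing; the shifts cancel
    # in the ratio below.
    exp_w = np.exp(w - w.max(axis=1, keepdims=True))   # (T, T)
    exp_k = np.exp(K - K.max(axis=0, keepdims=True))   # (T, d)
    num = exp_w @ (exp_k * V)    # (T, d): weighted sum of values per target position
    den = exp_w @ exp_k          # (T, d): matching normalizer
    return sigmoid(Q) * (num / den)   # element-wise gate by the query


rng = np.random.default_rng(0)
T, d = 5, 4                      # toy sizes, chosen only for illustration
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
Y = aft(Q, K, V)
print(Y.shape)  # (5, 4)
```

In this sketch, no dot product between a query and a key is ever taken. When `w` is omitted, the weighted average is the same for every target position, so it could be computed once at a cost proportional to T · d; with a full `w`, the matrix product over positions still scales with T².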
The element-wise product eliminates the term that is quadratic in the sequence length T. With this very simple change, the amount of calculation can be reduced by a factor of T.
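A back-of-the-envelope count shows where the factor of T comes from. The two helper functions and the sizes below are illustrative assumptions; they count only the leading-order multiplications and ignore constants, the softmax, and any position-bias terms.

```python
def dot_product_cost(T, d):
    """Leading-order multiply count for forming the T x T score matrix."""
    return T * T * d


def elementwise_cost(T, d):
    """Leading-order multiply count for one element-wise pass over a T x d array."""
    return T * d


T, d = 1024, 64   # illustrative sizes, not taken from the paper
ratio = dot_product_cost(T, d) // elementwise_cost(T, d)
print(ratio)  # 1024
```

Under these assumptions the ratio equals T, so the saving grows with the sequence length.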
The table, taken from the Attention Free Transformer (AFT) paper, compares AFT with typical methods. It shows that AFT can iterate faster while maintaining a low test loss. In addition, because the space complexity is lower, GPU memory usage is also dramatically reduced.
Although the Attention Free Transformer comes from a paper in the computer vision field, the algorithm can be applied back to NLP. I implemented it for an NLP sentiment-analysis task. As a result, the training speed and the convergence speed both doubled, for a total speedup of 4x, and the accuracy was equal to that of the original Transformer. I will share the code later via a Google Colab URL.
Check out Onikle for summaries of other papers you should read next. You can also find conference papers and paper collections compiled by other researchers here:
To advanced attention
This chronicle is a collection of papers that aim for less computation, higher accuracy, and more stable learning…
If you are interested in our service, please register your email address at the following link to get early access and test our all-new preprint platform, which provides a stress-free search experience with AI engines.
The preprint search platform for aspiring researchers in Computer Science provides a simple way to find papers with…
Summary made by Kengo Shikama
Translated by Wanonno Iqtyider