What are self-attention models?

Rahul Meka
5 min read · Mar 3, 2019


In the early days of NLP, models suffered from the vanishing gradient problem whenever long-term dependencies were involved, even with RNNs and LSTMs. These models process the input sequence one word at a time, which leaves no room for parallelization. The Transformer achieves parallelization by replacing recurrence with attention and by encoding each symbol's position in the sequence.

Introduction to attention:

An attention function gives more importance to the input states that are most contextually relevant. The weights over the inputs of the attention function are learned, so the model discovers which inputs it should attend to.

For example, for input hidden states x1, x2, …, xk, we learn a set of weights w1 to wk that measure how much each input helps to answer the query, and these produce an output:

out = ∑ᵢ wᵢxᵢ
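
As a quick sketch in NumPy, that weighted sum looks like this; the weights here are random stand-ins for learned attention weights:

```python
import numpy as np

# k input hidden states x1..xk of dimension d
k, d = 4, 8
x = np.random.randn(k, d)

# w1..wk would normally be learned; random normalized weights stand in here
w = np.random.rand(k)
w = w / w.sum()

# out = sum_i w_i * x_i  -> a single vector of dimension d
out = (w[:, None] * x).sum(axis=0)
print(out.shape)  # (8,)
```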

With attention, we build a context vector that carries global information about all of the inputs: we compute the similarity of each input token with every other input token, which captures the overall context.

Every attention function involves three main objects: Query, Key and Value. The query is an input word; the keys and values are also input tokens. We take the query and each key and compute the similarity between the two to obtain a weight; commonly used similarity functions are the dot product and additive attention. The next step is to normalize these weights with a softmax function. Finally, the weights are combined with the corresponding values to obtain the attention output.

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

The scaling factor is √dₖ, the square root of the key dimension, and it is applied after the dot product. The queries, keys and values are packed into matrices, so the dot products and weighted sums become matrix multiplications. To keep the architecture simple, all model dimensions are 512.
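
Here is a minimal NumPy sketch of the scaled dot-product attention formula above; the function name and the toy shapes are illustrative, not taken from the paper's code:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # normalized attention weights
    return weights @ V                   # weighted sum of the values

# Toy example: 3 queries, 5 key/value pairs, 64-dimensional vectors
Q = np.random.randn(3, 64)
K = np.random.randn(5, 64)
V = np.random.randn(5, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 64)
```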

Self-attention:

Self-attention sits inside an encoder-decoder architecture that can look complicated at first. The architecture described below comes from the paper “Attention Is All You Need”, submitted to arXiv in 2017 by the Google machine translation team and published at NIPS 2017. The encoder is on the left and the decoder on the right; each is a stack of N = 6 layers (so the gray boxes are actually stacked six high), and each layer has several sublayers.

Architecture has three main components:

  1. Positional Encoding
  2. Multi-Head Attention
  3. Feed-forward (position-wise feed forward)

We will discuss each component in detail.

Positional encoding:

It injects positional information into the model, i.e., it captures the order of the sequence. Positional encoding uses fixed sinusoids of different frequencies that are added directly to the input embeddings. This makes it easy to model relative positions with linear functions. The positional encoding has the same number of dimensions as the input embedding, so the two can simply be summed.
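
A possible NumPy sketch of the sinusoidal encoding, following the sin/cos formulation from the paper; max_len and the usage comment at the end are illustrative assumptions:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2), assumes even d_model
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)               # odd dimensions get cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
# Same dimensionality as the embeddings, so they are simply added:
# x = token_embeddings + pe[:seq_len]
print(pe.shape)  # (50, 512)
```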

What is multi-head attention?

In multi-head attention, the query, key and value go through scaled dot-product attention, but this process is done h times in parallel to produce h outputs, which is why it is called multi-head. Each head has its own projection matrices W applied to the same Q, K and V. The output of multi-head attention is a linear transformation of the concatenated outputs of the heads. Multiple heads allow the model to learn different representations, or contexts, of a word.
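
The following NumPy sketch illustrates the idea with a plain loop over heads; the weight-tensor layout (one projection matrix per head) and the toy shapes are assumptions made for clarity, not the paper's implementation:

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h):
    """Run scaled dot-product attention h times with per-head projections,
    then concatenate the heads and apply a final linear transformation W_o."""
    d_k = Q.shape[-1] // h
    heads = []
    for i in range(h):
        q, k, v = Q @ W_q[i], K @ W_k[i], V @ W_v[i]   # project the same Q, K, V per head
        scores = q @ k.T / np.sqrt(d_k)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w = w / w.sum(-1, keepdims=True)               # softmax over the keys
        heads.append(w @ v)                            # (seq_len, d_k)
    return np.concatenate(heads, axis=-1) @ W_o        # (seq_len, d_model)

# Toy shapes: h = 8 heads, d_model = 512, so each head works in d_k = 64
h, d_model, seq_len = 8, 512, 10
X = np.random.randn(seq_len, d_model)
W_q, W_k, W_v = [np.random.randn(h, d_model, d_model // h) for _ in range(3)]
W_o = np.random.randn(d_model, d_model)
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o, h)  # self-attention: Q = K = V = X
print(out.shape)  # (10, 512)
```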

Residual connections (adding a sublayer's input back to its output) around the layers help retain the positional information that was added to the input embeddings as it travels through the network. After the residual connection, layer normalization (Add & Norm) is applied to the output of the multi-head attention.
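
A minimal sketch of the Add & Norm step, assuming a simple layer normalization without the learnable scale and bias parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance
    (learnable scale/bias parameters are omitted in this sketch)."""
    mean = x.mean(-1, keepdims=True)
    std = x.std(-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x)).
    The addition carries the positional information in x straight through."""
    return layer_norm(x + sublayer_out)
```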

Position-wise feed forward network:

They're either a two-layer fully connected network with ReLU applied at each position or (and I like this better) two convolutions with kernel size 1 applied across positions: conv → ReLU → conv. The hidden dimension is 2048.
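
In equation form this is FFN(x) = max(0, xW1 + b1)W2 + b2. A small NumPy sketch, with the paper's dimensions used as toy shapes:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Paper dimensions as toy shapes: d_model = 512, hidden dimension d_ff = 2048
d_model, d_ff, seq_len = 512, 2048, 10
x = np.random.randn(seq_len, d_model)
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```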

Decoder:

The decoder has components similar to the encoder. In the decoder, the input is the output embedding, offset by one position, which ensures that the prediction for position i depends only on positions earlier than i.

In the decoder, the self-attention layer enables each position to attend to all previous positions in the decoder, up to and including the current position. To preserve the auto-regressive property, leftward information flow is prevented inside the scaled dot-product attention by masking out (setting to -∞) the softmax inputs that correspond to these illegal connections.

The multi-head attention is therefore modified to prevent positions from attending to subsequent positions; this is known as masked multi-head attention.
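
A rough NumPy sketch of the masking trick: future positions are set to -∞ before the softmax, so they receive zero weight. The function names and shapes are illustrative:

```python
import numpy as np

def causal_mask(seq_len):
    """-inf above the diagonal: position i may only attend to positions <= i."""
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

def masked_self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    scores = scores - scores.max(-1, keepdims=True)
    weights = np.exp(scores)                       # exp(-inf) = 0: illegal connections vanish
    weights = weights / weights.sum(-1, keepdims=True)
    return weights @ V

X = np.random.randn(5, 64)
out = masked_self_attention(X, X, X)  # row i mixes information only from positions 0..i
print(out.shape)  # (5, 64)
```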

After the masked multi-head attention (and a second multi-head attention over the encoder output), the result is passed through a feed-forward network; the output is normalized and sent to the softmax layer. The decoder also has residual connections.

Advantages of self-attention:

  • It minimizes the total computational complexity per layer.
  • It maximizes the amount of parallelizable computation, measured by the minimum number of sequential operations required.
  • It minimizes the maximum path length between any two input and output positions in networks composed of different layer types. The shorter the path between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies.

Applications of self-attention models:

  • Language translation
  • The classic language-analysis task of syntactic constituency parsing
  • BERT and OpenAI GPT, among the best models in NLP, use self-attention.

Rahul Meka

Currently working as an NLP engineer at Wipro Digital.