## Discussing the Transformer Model | Towards AI

# Attention Is All You Need — Transformer

**Introduction**

Recurrent Neural Networks (RNNs), in particular Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, have been firmly established as state-of-the-art approaches to sequence modeling and transduction problems. Such models typically rely on hidden states to maintain historical information. They are beneficial in that they allow the model to make predictions based on useful historical information distilled in the hidden state. On the other hand, this inherently sequential nature precludes parallelization within a sample, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Furthermore, in these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between the positions, which makes it more difficult to learn dependencies between distant positions.

In this article, we will discuss a model named Transformer, proposed by Vaswani et al. at NIPS 2017, which utilizes self-attention to compute representations of its input and output without using sequence-aligned RNNs. In this way, it reduces the number of operations required to relate signals from two arbitrary positions to a constant number and achieves significantly more parallelization. In the rest of the article, we will focus on the main architecture of the model and the central idea of attention. For other details, please refer to [1] and [2] in References.

One thing worth keeping in mind is that the Transformer we introduce here maintains sequential information in a sample just as RNNs do. This means the input to the network is of the form *[batch size, sequence length, embedding size]*.

**Model Architecture**

The Transformer follows the encoder-decoder structure using stacked self-attention and fully connected layers for both the encoder and decoder, shown in the left and right halves of the following figure, respectively.

**Positional Encoding**

In this work, we use sine and cosine functions of different frequencies to encode the position information:

*PE(pos, 2i) = sin(pos / 10000^{2i/d_{model}})*

*PE(pos, 2i+1) = cos(pos / 10000^{2i/d_{model}})*

where *pos* is the position and *i* is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from *2π* to *10000⋅2π*. The authors chose this function because they hypothesized it would allow the model to easily learn to attend by relative positions since, for any fixed offset *k*, *PE_{pos+k}* can be represented as a linear function of *PE_{pos}*.
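As a quick illustration, the sinusoidal encoding can be sketched in a few lines of NumPy (the function name and shapes are our own, not from the paper):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    """
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe
```

The resulting matrix is simply added to the input embeddings, so the encoding must share the embedding dimension *d_{model}*.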

**Encoder and Decoder Stacks**

**Encoder**

The encoder is composed of a stack of *N=6* identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism (we will come back to it soon), and the second is a simple fully connected feed-forward network. Residual connections are employed around each of the two sub-layers, with layer normalization applied first. That is, the output of each sub-layer is *x + Sublayer(LayerNorm(x))* (this pre-norm variant, adopted by [2], is slightly different from the *LayerNorm(x + Sublayer(x))* used in the paper, but follows the pattern recommended by Kaiming He et al. in [3]), where *Sublayer(x)* is the function implemented by the sub-layer itself.
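The pre-norm residual pattern can be sketched as follows (a simplified layer normalization without the learned gain and bias, purely for illustration):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last (embedding) dimension.
    mean = x.mean(-1, keepdims=True)
    std = x.std(-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # Pre-norm residual connection: x + Sublayer(LayerNorm(x)).
    # `sublayer` is any shape-preserving function (attention or feed-forward).
    return x + sublayer(layer_norm(x))
```

Because the residual branch is an identity map, stacking many such layers keeps gradients well-behaved, which is the motivation borrowed from [3].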

**Decoder**

The decoder is also composed of a stack of *N=6* identical layers. In addition to the two sub-layers of the encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack (i.e., the output of the encoder serves as the keys and values). Sub-layers in the decoder follow the same residual and normalization pattern as those in the encoder.
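The wiring of the three decoder sub-layers can be sketched as follows (a toy single-head attention without projections, masks, or layer normalization, and a placeholder feed-forward step, purely to show where the encoder output enters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention without masking.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def decoder_layer(x, memory):
    # 1) self-attention over the decoder input
    x = x + attention(x, x, x)
    # 2) encoder-decoder attention: queries come from the decoder,
    #    keys and values come from the encoder output (`memory`)
    x = x + attention(x, memory, memory)
    # 3) position-wise feed-forward network (a ReLU stand-in here)
    x = x + np.maximum(x, 0)
    return x
```

Note how only the second sub-layer touches `memory`; the decoder's own sequence flows through all three.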

**Masking**

Masks are used before the softmax in the self-attention layers of both the encoder and decoder to prevent unwanted attention to out-of-sequence (e.g., padding) positions. Furthermore, in conjunction with this general mask, an additional mask is used in the self-attention sub-layers of the decoder stack to prevent positions from attending to subsequent positions. Such a mask is a lower-triangular matrix whose entry *(i, j)* is 1 if *j ≤ i* and 0 otherwise, so position *i* can only attend to positions up to and including *i*.

In practice, the two masks in the decoder can be combined via a bitwise AND operation.
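A minimal NumPy sketch of the two masks and their combination (function names are illustrative):

```python
import numpy as np

def subsequent_mask(size):
    # Lower-triangular boolean matrix: position i may attend to positions <= i.
    return np.tril(np.ones((size, size), dtype=bool))

def padding_mask(lengths, size):
    # True where a key position holds a real token, False where it is padding.
    return np.arange(size)[None, :] < np.asarray(lengths)[:, None]

# Decoder: blend the padding mask and the subsequent mask with a bitwise AND.
combined = padding_mask([2, 3], 3)[:, None, :] & subsequent_mask(3)[None, :, :]
```

Broadcasting gives `combined` one *(size × size)* mask per sequence in the batch.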

**Attention**

**Scaled Dot-Product Attention**

An attention function can be described as a mapping from a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

More formally, the output is computed as

*Attention(Q, K, V) = softmax(QK^T / √dₖ) V*

where *Q, K, V* are the matrices of queries, keys, and values, respectively, and *dₖ* is the dimension of the keys. The compatibility function (the softmax part) computes the weights assigned to each value in a row. The dot product *QK^T* is scaled by *1/√dₖ* because, for large values of *dₖ*, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients.

A takeaway: mathematically, attention simply focuses on the regions of the space where *Q* and *K* are similar (w.r.t. cosine similarity), given that they are of comparable magnitude, since *(QK^T)_{i,j} = |Qᵢ||Kⱼ|cos θ*. An extreme thought experiment is the case where both *Q* and *K* are one-hot encoded.
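The formula above translates directly into a few lines of NumPy; this sketch assumes the optional mask is a boolean matrix that is True at allowed positions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (..., len_q, len_k)
    if mask is not None:
        # Blocked positions get a large negative score, i.e. ~zero weight.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V, weights
```

With near-one-hot *Q* and *K* (the thought experiment above), the weight matrix approaches a permutation and the output simply selects rows of *V*.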

**Multi-Head Attention**

A single attention head averages attention-weighted positions, reducing the effective resolution. To address this issue, multi-head attention is proposed to jointly attend to information from different representation subspaces at different positions.

*MultiHead(Q, K, V) = Concat(head₁, …, head_h) W^O*, where *headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V)*

where the projections are parameter matrices *Wᵢ^Q ∈ ℝ^{d_{model}×dₖ}*, *Wᵢ^K ∈ ℝ^{d_{model}×dₖ}*, *Wᵢ^V ∈ ℝ^{d_{model}×dᵥ}*, and *W^O ∈ ℝ^{hdᵥ×d_{model}}*.

For each head, we first apply a fully connected layer to reduce the dimension, then pass the result to a single attention function. Finally, all heads are concatenated and projected once more, yielding the final values. Since all heads run in parallel and the dimension of each head is reduced beforehand, the total computational cost is similar to that of single-head attention with full dimensionality.

In practice, if we have *hdₖ = hdᵥ = d_{model}*, multi-head attention can be implemented simply with an attention function plus four additional fully connected layers, each of dimension *d_{model}×d_{model}*, as follows.

**TensorFlow Code**

We now provide TensorFlow code for multi-head attention. For simplicity, we further assume *Q*, *K*, and *V* are all *x*.
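A framework-agnostic NumPy sketch of the same computation, with *Q = K = V = x* and the four *d_{model}×d_{model}* projections described above (variable names and shapes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention with Q = K = V = x via four (d_model, d_model) projections."""
    seq_len, d_model = x.shape
    d_k = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return t.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (heads, seq, seq)
    out = softmax(scores) @ v                         # (heads, seq, d_k)
    # Concatenate the heads and apply the final output projection.
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ w_o

# Toy usage: d_model = 8 split across 2 heads of dimension 4.
x = rng.standard_normal((5, 8))
w_q, w_k, w_v, w_o = (rng.standard_normal((8, 8)) for _ in range(4))
out = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads=2)
```

The reshape/transpose pair plays the role of running the heads in parallel: one batched matrix multiply computes all *h* attention functions at once.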

**Conclusion**

I hope you have developed a basic sense of the Transformer. To see a complete example with code, you may further refer to [2].

**References**

1. Ashish Vaswani et al. Attention Is All You Need.
2. Guillaume Klein et al. OpenNMT: Open-Source Toolkit for Neural Machine Translation.
3. Kaiming He et al. Identity Mappings in Deep Residual Networks.