What Exactly Is Happening Inside the Transformer

Huangwei Wieniawska
The Startup

--

The transformer has changed the NLP field with its strong ability to handle long-range dependencies, which comes from its combination of positional encoding, the self-attention mechanism, and an encoder-decoder architecture. In this story, we will take a detailed look at what is happening in each part of the transformer, with the corresponding TensorFlow code.

Transformer Architecture

This story contains seven Sections:

Section 1 explains embedding and positional encoding.

Section 2 contains the full code of Multi-Head Self-Attention.

Section 3 contains a detailed explanation of what is happening in the transformer encoder block, including the concepts of “multi-head”, “self-attention” and “padding mask”, how attention is calculated, and the graph of each layer with corresponding code.

Section 4 explains how the two decoder multi-head attention layers differ from the one in the encoder, as well as what the “look ahead mask” is. It also contains the graph of each layer with corresponding code.

Section 5 is about the final layer of the transformer.

Section 6 puts all the pieces explained in Sections 1 to 5 together, and summarises the function of each major block with code.

Section 7 explains how data preparation, a single training step, and prediction are done, with an example of machine translation.

1. Embedding and Positional Encoding

The transformer converts token indices into vector representations through embedding and positional encoding.

Positional encoding is used instead of convolutional layers or a recurrent mechanism to record the relative position of data points in the sequence. It is a matrix with shape (input sequence length, embedding size), which matches the shape of the embedding matrix, and it can be calculated with the following two equations. In the positional encoding matrix, each value depends on its location in both the sequence dimension and the embedding dimension.

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the position in the sequence, i indexes the embedding dimension, and d_model is the embedding size.

Equation Reference: https://www.tensorflow.org/tutorials/text/transformer

Below is a plot of the value distribution of the positional encoding for maximum position=100 and embedding size=256.

Value distribution of the positional encoding when maximum position=100 and embedding size=256

After adding up the positional encoding and the embedding, the resulting array is ready to be processed by the next layers in the transformer. Below is an example of positional encoding, embedding and their sum.

Add up positional encoding and embedding to get the final features

The embedding layer is trainable, while the positional encoding matrix stays fixed. It is fine to choose a maximum position much longer than the input sequence length: when the input sequence is shorter than the maximum position, only the encoding values up to the input sequence length are used.

When maximum position is longer than input sequence length

Here is the code for calculating positional encoding:
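Below is a minimal sketch of the positional encoding calculation, following the TensorFlow tutorial referenced above; the helper names `get_angles` and `positional_encoding` are assumptions of this sketch.

```python
import numpy as np
import tensorflow as tf

def get_angles(pos, i, d_model):
    # the angle rate shrinks as the embedding index i grows
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    return pos * angle_rates

def positional_encoding(position, d_model):
    # position: maximum position, d_model: embedding size
    angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                            np.arange(d_model)[np.newaxis, :],
                            d_model)
    # apply sin to even embedding indices (2i) and cos to odd indices (2i+1)
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    # add a batch dimension: (1, position, d_model)
    return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)
```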

2. Attention Mechanism

Here is the full code of Multi Head Self Attention. The details will be explained in Section 3 and Section 4.
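Below is a minimal sketch of Multi-Head Self-Attention in TensorFlow, in the style of the tutorial referenced in Section 1; the names `scaled_dot_product_attention` and `MultiHeadAttention` are assumptions of this sketch.

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask):
    # score shape: (..., seq_len_q, seq_len_k)
    matmul_qk = tf.matmul(q, k, transpose_b=True)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
    if mask is not None:
        # masked positions get a large negative value and are ignored by softmax
        scaled_attention_logits += (mask * -1e9)
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth)
    return output, attention_weights

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0  # embedding size must be divisible by number of heads
        self.num_heads = num_heads
        self.d_model = d_model
        self.depth = d_model // num_heads
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]
        q = self.split_heads(self.wq(q), batch_size)
        k = self.split_heads(self.wk(k), batch_size)
        v = self.split_heads(self.wv(v), batch_size)
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)
        # combine heads back: (batch, seq_len_q, d_model)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))
        output = self.dense(concat_attention)
        return output, attention_weights
```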

3. Encoder Block

3.1 Encoder Multi Head Self Attention Block

The encoder takes the transformed input array (embedding + positional encoding) and sends it to the Multi Head Self Attention block. Below is an illustration with code; the values in round brackets represent the shapes of the arrays. (Please refer to the code in Section 2.)

Encoder Multi Head Self Attention Block

(1) Feed three copies of the transformed encoder input to three different fully connected layers to get the query, key, and value; their dimensions remain unchanged.

(2) Split heads along the embedding dimension to get the multi-head query, multi-head key, and multi-head value (here we need to make sure the embedding size is evenly divisible by the number of heads, i.e., embedding size % number of heads == 0). The split head size is also called the depth.

(3) Multiply the multi-head query with the multi-head key, and scale the resulting score.

(4) Add the padding mask to the scaled score.

Remember that the input array is formed by stacking sequences of different lengths, post padded with 0s so that they share the same length (refer to Section 7.1 for more details). We want the neural network to ignore the values in these “empty locations”, and the padding mask does exactly that. It converts the 0s to 1s and the non-zero values to 0s, then multiplies the result by a large negative value. After adding the mask to the scaled score, the originally empty locations now hold large negative values, which are effectively ignored by the softmax activation function. (See the sketch after this list.)

(5) Apply softmax on the scaled score, and multiply it with the multi-head value.

(6) Combine the split heads to recover the embedding size, and apply the final fully connected layer to get the output array.
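Here is a minimal sketch of how such a padding mask can be built; the function name `create_padding_mask` is an assumption of this sketch, and the multiplication by a large negative value (e.g. -1e9) happens inside the attention code from Section 2.

```python
import tensorflow as tf

def create_padding_mask(seq):
    # 1.0 where the token index is 0 (padding), 0.0 everywhere else
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    # add extra dimensions so the mask broadcasts over heads and query positions:
    # (batch_size, 1, 1, seq_len)
    return mask[:, tf.newaxis, tf.newaxis, :]

# two post-padded sequences of token indices
seq = tf.constant([[7, 6, 0, 0], [1, 2, 3, 0]])
print(create_padding_mask(seq))  # the padded positions of each sequence are marked with 1
```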

3.2 The Rest of the Encoder Block

Here is what is happening in the Encoder Block:

(1) Feed three copies of the transformed encoder input (embedding + positional encoding) to the Multi Head Self Attention block (explained in Section 3.1).

(2) Apply a dropout layer.

(3) Form a skip connection: add transformed encoder input to the output of (2) to get skip_conn_1.

(4) Apply layer normalisation to skip_conn_1.

(5) Apply two fully connected layers followed by a dropout layer on skip_conn_1.

(6) Form a skip connection: add skip_conn_1 to the output of (5) to get skip_conn_2.

(7) Apply one more layer normalisation to get the Encoder Output.

Transformer Encoder

Here is the code for the encoder block (please refer to Section 6 for the full code of Transformer):
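Below is a minimal sketch of a single encoder block, assuming the `MultiHeadAttention` class sketched in Section 2; the class name `EncoderLayer` and the feed-forward size `dff` are assumptions of this sketch.

```python
import tensorflow as tf

class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        super().__init__()
        self.mha = MultiHeadAttention(d_model, num_heads)
        # two fully connected layers (point-wise feed-forward network)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model),
        ])
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x, training, mask):
        # (1)-(4): self-attention, dropout, skip connection, layer normalisation
        attn_output, _ = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        skip_conn_1 = self.layernorm1(x + attn_output)
        # (5)-(7): feed-forward, dropout, skip connection, layer normalisation
        ffn_output = self.dropout2(self.ffn(skip_conn_1), training=training)
        return self.layernorm2(skip_conn_1 + ffn_output)
```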

4. Decoder Block

4.1 Decoder Multi Head Self Attention Block 1

The Decoder Multi Head Self Attention Block 1 is almost the same as that in the encoder (refer to Section 3.1), except for the input sequence length (to be explained in Section 7.2 and Section 7.3) and the mask. (Refer to the code in Section 2.)

Decoder Multi Head Self Attention Block 1

A look ahead mask is used here instead of the padding mask. It makes the neural network ignore the future values in the decoder input (the target input).
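Here is a minimal sketch of the look ahead mask; the function name `create_look_ahead_mask` is an assumption of this sketch, and as with the padding mask it is multiplied by a large negative value inside the attention code from Section 2.

```python
import tensorflow as tf

def create_look_ahead_mask(size):
    # strictly upper-triangular matrix of 1s:
    # position i is only allowed to attend to positions 0..i
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

print(create_look_ahead_mask(4))  # row i has 1s (masked) in every column j > i
```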

4.2 Decoder Multi Head Self Attention Block 2

In the Decoder Multi Head Self Attention Block 2, we use the output from Decoder Multi Head Self Attention Block 1 as the query, and the encoder output as the key and value. The padding mask is used here again. (Refer to the code in Section 2.)

Decoder Multi Head Self Attention Block 2

4.3 The Rest of the Decoder Block

Here is what is happening in the Decoder Block:

(1) Feed the transformed decoder input (embedding + positional encoding) to Decoder Multi Head Self Attention Block 1 (explained in Section 4.1).

(2) Apply a dropout layer.

(3) Form a skip connection: add the transformed decoder input to the output of (2).

(4) Apply layer normalisation to get skip_conn_1.

(5) Use skip_conn_1 as the query, and the encoder output as the key and value, in Decoder Multi Head Self Attention Block 2 (explained in Section 4.2).

(6) Apply a dropout layer.

(7) Form a skip connection: add skip_conn_1 to the output of (6).

(8) Apply layer normalisation to get skip_conn_2.

(9) Apply two fully connected layers followed by a dropout layer on skip_conn_2.

(10) Form a skip connection: add skip_conn_2 to the output of (9).

(11) Apply one more layer normalisation to get the Decoder Output.

Transformer Decoder

Here is the code for the decoder block (please refer to Section 6 for the full code of Transformer):
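As with the encoder, below is only a minimal sketch of a single decoder block, assuming the `MultiHeadAttention` class from Section 2; the class name `DecoderLayer` is an assumption of this sketch, and the steps follow the list above.

```python
import tensorflow as tf

class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        super().__init__()
        self.mha1 = MultiHeadAttention(d_model, num_heads)  # self-attention block 1
        self.mha2 = MultiHeadAttention(d_model, num_heads)  # encoder-decoder attention block 2
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model),
        ])
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout3 = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        # (1)-(4): masked self-attention, dropout, skip connection, layer normalisation
        attn1, _ = self.mha1(x, x, x, look_ahead_mask)
        skip_conn_1 = self.layernorm1(x + self.dropout1(attn1, training=training))
        # (5)-(8): attention over the encoder output, dropout, skip connection, layer normalisation
        attn2, _ = self.mha2(enc_output, enc_output, skip_conn_1, padding_mask)
        skip_conn_2 = self.layernorm2(skip_conn_1 + self.dropout2(attn2, training=training))
        # (9)-(11): feed-forward, dropout, skip connection, layer normalisation
        ffn_output = self.dropout3(self.ffn(skip_conn_2), training=training)
        return self.layernorm3(skip_conn_2 + ffn_output)
```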

5. Final Layer

The final layer of the transformer is a fully connected layer that projects the decoder output from the embedding space to the vocabulary probability space. To extract the predicted token, we select the index corresponding to the maximum probability, and convert the selected index back to a token (refer to Section 7.3 for more details).

Transformer Final Layer
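Here is a minimal sketch of the final projection and the argmax step, using hypothetical shapes and a random stand-in for the real decoder output, purely for illustration.

```python
import tensorflow as tf

# hypothetical sizes for illustration
batch_size, target_seq_len, d_model, target_vocab_size = 2, 10, 256, 8000

# stand-in for the real decoder output
decoder_output = tf.random.uniform((batch_size, target_seq_len, d_model))

final_layer = tf.keras.layers.Dense(target_vocab_size)
logits = final_layer(decoder_output)            # (batch, target_seq_len, target_vocab_size)
predicted_indices = tf.argmax(logits, axis=-1)  # index of the maximum probability per position
```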

6. Put All Parts Together

As a summary, a transformer contains five major parts:

(1) Input Transformation for Encoder and (2) Input Transformation for Decoder: they get the data ready by mapping token indices to vector representations through embedding and positional encoding. The transformation blocks for the encoder and the decoder can be different, because the source text and the target text can come from different domains with different vocabulary sizes.

(3) Encoder: it extracts contextual features from the input sequence through the multi-head self-attention mechanism.

(4) Decoder: it takes in the encoder output as well as the previous predictions (the decoder input) to generate the next prediction in the embedding space. (Refer to Section 7.3)

(5) Final Layer: it maps decoder output from embedding space to vocabulary probability space.

Transformer Architecture
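Below is a minimal sketch that stacks the pieces sketched above into an Encoder, a Decoder, and a full Transformer model, reusing `positional_encoding`, `EncoderLayer`, `DecoderLayer`, `create_padding_mask`, and `create_look_ahead_mask` from the earlier sections; the class names, the mask handling inside `call`, and hyperparameter names such as `num_layers` and `dff` are assumptions of this sketch.

```python
import tensorflow as tf

class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff,
                 input_vocab_size, maximum_position, rate=0.1):
        super().__init__()
        self.d_model = d_model
        # (1) input transformation for the encoder: embedding + positional encoding
        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position, d_model)
        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate)
                           for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        seq_len = tf.shape(x)[1]
        x = self.embedding(x) * tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x = x + self.pos_encoding[:, :seq_len, :]
        x = self.dropout(x, training=training)
        for layer in self.enc_layers:          # (3) encoder stack
            x = layer(x, training, mask)
        return x                               # (batch, input_seq_len, d_model)


class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff,
                 target_vocab_size, maximum_position, rate=0.1):
        super().__init__()
        self.d_model = d_model
        # (2) input transformation for the decoder: embedding + positional encoding
        self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position, d_model)
        self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate)
                           for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        seq_len = tf.shape(x)[1]
        x = self.embedding(x) * tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x = x + self.pos_encoding[:, :seq_len, :]
        x = self.dropout(x, training=training)
        for layer in self.dec_layers:          # (4) decoder stack
            x = layer(x, enc_output, training, look_ahead_mask, padding_mask)
        return x                               # (batch, target_seq_len, d_model)


class Transformer(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff,
                 input_vocab_size, target_vocab_size, maximum_position, rate=0.1):
        super().__init__()
        self.encoder = Encoder(num_layers, d_model, num_heads, dff,
                               input_vocab_size, maximum_position, rate)
        self.decoder = Decoder(num_layers, d_model, num_heads, dff,
                               target_vocab_size, maximum_position, rate)
        # (5) final layer: embedding space -> vocabulary probability space (logits)
        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, encoder_input, decoder_input, training):
        # build the masks from the raw token indices
        enc_padding_mask = create_padding_mask(encoder_input)
        dec_padding_mask = create_padding_mask(encoder_input)
        look_ahead = create_look_ahead_mask(tf.shape(decoder_input)[1])
        dec_target_padding_mask = create_padding_mask(decoder_input)
        combined_mask = tf.maximum(dec_target_padding_mask, look_ahead)

        enc_output = self.encoder(encoder_input, training, enc_padding_mask)
        dec_output = self.decoder(decoder_input, enc_output, training,
                                  combined_mask, dec_padding_mask)
        return self.final_layer(dec_output)    # (batch, target_seq_len, target_vocab_size)
```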

7. How Transformer Works

Now let’s explore how the transformer works, using an example of machine translation from English to Ukrainian.

7.1 Data Preparation

First of all, we prepare the input and target arrays from the two language domains, with English as the source and Ukrainian as the target. Both input and output sentences are tokenised, and the tokens are indexed. Note that there is only one encoder input array, but two arrays for the decoder: the target output array is one time step ahead of the target input array. This can be achieved by adding a start token at the beginning of the target sequence to get the target input, and an end token at the end of the target sequence to get the target output. To make the sequences the same length, post padding with 0s is applied.

Depending on the specific problem and application, the raw text sometimes needs more cleaning before going through the data preparation process described above, such as converting letters to lowercase and removing special characters and stop words.

Data preparation
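Here is a minimal sketch of the shift between the decoder's target input and target output, with hypothetical token indices; the English/Ukrainian indices and the START/END values are made up for illustration, and in practice they come from the fitted tokenizers.

```python
import tensorflow as tf

# hypothetical token indices (the real values come from the fitted tokenizers)
START, END = 1, 2
encoder_input = [[5, 12, 48]]       # tokenised English source sentence
target_sequence = [7, 33, 91]       # tokenised Ukrainian target sentence

target_input = [[START] + target_sequence]    # start token prepended
target_output = [target_sequence + [END]]     # end token appended: one step ahead

# post pad with 0s so every sequence in the batch has the same length
pad = tf.keras.preprocessing.sequence.pad_sequences
encoder_input = pad(encoder_input, maxlen=6, padding='post')
target_input = pad(target_input, maxlen=6, padding='post')
target_output = pad(target_output, maxlen=6, padding='post')
```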

7.2 Training

The training process is pretty straightforward.

(1) Input the entire encoder input array and the entire target input array to the transformer to get the vocabulary probabilities.

(2) Compare the vocabulary probability array with the target output array, and calculate the sparse categorical cross-entropy loss.

(3) Backpropagate the loss.

This process repeats for each mini-batch of the training dataset in every epoch, until the model is sufficiently trained.

Below is the code for a mini-batch training step. To train on the full dataset for multiple epochs, simply run the “train_step” function in a loop over all batches for multiple epochs, and record the loss per batch.
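Here is a minimal sketch of a single training step, assuming the `Transformer` model sketched in Section 6 (which builds its masks internally) and a masked sparse categorical cross-entropy loss; the hyperparameter values, the plain Adam optimizer, and the helper `loss_function` are assumptions of this sketch.

```python
import tensorflow as tf

# assumed hyperparameters, for illustration only
transformer = Transformer(num_layers=4, d_model=128, num_heads=8, dff=512,
                          input_vocab_size=8500, target_vocab_size=8000,
                          maximum_position=10000)

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')
optimizer = tf.keras.optimizers.Adam()

def loss_function(real, pred):
    # ignore the post-padded positions (index 0) when averaging the loss
    mask = tf.cast(tf.math.not_equal(real, 0), pred.dtype)
    loss_ = loss_object(real, pred) * mask
    return tf.reduce_sum(loss_) / tf.reduce_sum(mask)

@tf.function
def train_step(encoder_input, target_input, target_output):
    with tf.GradientTape() as tape:
        # (1) forward pass over the whole mini-batch
        predictions = transformer(encoder_input, target_input, training=True)
        # (2) compare with the target output
        loss = loss_function(target_output, predictions)
    # (3) backpropagate the loss
    gradients = tape.gradient(loss, transformer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
    return loss
```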

7.3 Prediction

Prediction is a bit trickier:

(1) Feed the entire encoder input array and the decoder input array (initialised with the start token) into the transformer, and get the full vocabulary probability array.

(2) Extract only the last prediction from the full prediction array.

(3) Apply argmax to get the index corresponding to the maximum probability.

(4) Convert the predicted index to the corresponding token.

(5) Append the predicted token to the prediction result.

(6) Append the predicted index to the end of the decoder input array.

This process repeats until it reaches the maximum output sequence length or the predicted token is the end token.
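Here is a minimal sketch of this prediction loop, again assuming the `Transformer` model from Section 6; the start/end indices and `max_length` are assumptions of this sketch, and converting the predicted indices back to tokens depends on your tokenizer.

```python
import tensorflow as tf

def predict(encoder_input, start_index, end_index, max_length=40):
    # (1) the decoder input is initialised with only the start token
    decoder_input = tf.constant([[start_index]], dtype=tf.int32)   # (1, 1)
    predicted_indices = []
    for _ in range(max_length):
        predictions = transformer(encoder_input, decoder_input, training=False)
        # (2)-(3) take only the last position and apply argmax
        predicted_id = tf.cast(tf.argmax(predictions[:, -1:, :], axis=-1), tf.int32)  # (1, 1)
        next_index = int(predicted_id.numpy()[0, 0])
        if next_index == end_index:
            break
        # (4)-(5) record the prediction (index-to-token conversion is tokenizer-specific)
        predicted_indices.append(next_index)
        # (6) append the predicted index to the decoder input for the next step
        decoder_input = tf.concat([decoder_input, predicted_id], axis=-1)
    return predicted_indices
```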
