Transformers made easy: architecture and data flow

Maâli Mnasri · Opla · Oct 29, 2019

Dear Transformers fans, sorry, but here we’re not talking about the cartoon series or the movies. However, the transformers we’re dealing with are also heroes, but in the Artificial Intelligence world.

A transformer is a Deep Learning model introduced by Google Brain’s team in 2017 in their paper Attention is All You Need [1]. It is an evolution of the famous sequence-to-sequence models, used mostly as transduction models that map a sequence of data of type A into another sequence of type B depending on the final task. In NLP it can be used for translation, summarization, dialog, etc. If you’re new to Deep Learning, you can find a short yet clear introduction to sequence-to-sequence learning in our previous article Chatbots approaches: Sequence-to-Sequence VS Reinforcement Learning.

But what’s wrong with sequence-to-sequence?

Sequence-to-sequence (seq2seq) models were introduced in 2014 and have shown great success because, unlike previous neural networks, they take as input a sequence of numbers, words or any other type of data. Much of this success is due to the fact that most real-world data comes in sequences, the most salient case being text data.

Sequence-to-sequence model architecture

Seq2seq neural networks are composed mainly of two elements: an encoder and a decoder. The encoder is fed with the input data and encodes it into a hidden state called the context vector. Then comes the turn of the decoder, which takes that context vector and decodes it into the desired output.
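To make that data flow concrete, here is a minimal sketch of such an encoder/decoder pair in PyTorch. The GRU cells, the sizes and the assumed <SOS> index are illustrative choices for this sketch, not taken from a specific model.

```python
# A minimal sketch of the seq2seq flow described above, using PyTorch GRUs.
# Sizes and the <SOS> index (0) are illustrative assumptions.
import torch
import torch.nn as nn

vocab_size, emb_size, hidden_size = 1000, 64, 128

embedding = nn.Embedding(vocab_size, emb_size)
encoder = nn.GRU(emb_size, hidden_size, batch_first=True)
decoder = nn.GRU(emb_size, hidden_size, batch_first=True)
to_vocab = nn.Linear(hidden_size, vocab_size)

src = torch.randint(0, vocab_size, (1, 7))       # a batch with one 7-token input sequence
_, context = encoder(embedding(src))             # the final hidden state is the context vector

# The decoder starts from the context vector and produces one token per step.
token = torch.zeros(1, 1, dtype=torch.long)      # assumed <SOS> index
state = context
for _ in range(5):
    out, state = decoder(embedding(token), state)
    token = to_vocab(out).argmax(dim=-1)         # greedy choice of the next word
    print(token.item())
```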

A major drawback of this architecture is that a single context vector cannot embed all the important information as well as the dependencies between the words, especially when the input sequence is too long.

In 2015, seq2seq models were improved with the now-famous attention mechanism. The global architecture of the model remained the same. However, instead of feeding the decoder only the final context vector, we feed it all the hidden states (one per input word) produced by the encoder RNN. The resulting architecture is shown in the graph below:

Sequence-to-sequence model with an attention mechanism
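Below is a rough sketch of the attention idea: at each step the decoder scores every encoder hidden state against its current state and takes a weighted sum as its context. The dot-product scoring and the tensor shapes are assumptions chosen for brevity; other score functions exist in the attention literature.

```python
# A sketch of attention over the encoder hidden states: one relevance score per input word,
# normalized with a softmax, then used to build a weighted-sum context vector.
import torch

seq_len, hidden_size = 7, 128
encoder_states = torch.randn(seq_len, hidden_size)   # one hidden state per input word
decoder_state = torch.randn(hidden_size)             # decoder state at the current step

scores = encoder_states @ decoder_state              # dot-product score per input word
weights = torch.softmax(scores, dim=0)               # normalized attention weights
context = weights @ encoder_states                   # weighted sum: the context for this step
```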

Another drawback of seq2seq models is their efficiency during training and inference. Both the learning process and the inference are very time-consuming, as we cannot encode any word in the sequence before we have encoded all of its previous words. This means there is no hope of parallelizing computation in such models.

Attention is All You Need [1]

Here come the Transformer models introduced in the Attention is All You Need paper. The title says it all: let’s remove the RNNs and just keep attention.

Just like seq2seq models, transformers transform sequences of type A into sequences of type B. The difference is that they do not use any recurrent networks (GRU, LSTM, etc.).

“The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence aligned RNNs or convolution.” [1]

The other difference is that, unlike in seq2seq, in the transformer model the words can be encoded in parallel and independently. Each word goes through a preprocessing step where it is represented by a word embedding (a word vector). Since words are no longer processed sequentially, word order must be saved somewhere. To this end, during the preprocessing step, the position of each word is encoded into its embedding vector. The architecture of the encoder part is shown in the graph below; we will explain the decoder part later.

The architecture of the Encoders part of the transformer
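As a rough illustration of that preprocessing step, here is a sketch of the sinusoidal positional encoding from the paper being added to stand-in word embeddings (the embeddings are random placeholders and the sizes are illustrative).

```python
# A minimal sketch of the preprocessing step: word embeddings plus a positional signal.
# The sinusoidal formula follows the original paper; the dimensions are illustrative.
import torch

def positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)            # even dimensions
    angles = pos / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

seq_len, d_model = 7, 512
embeddings = torch.randn(seq_len, d_model)                        # stand-in word embeddings
inputs = embeddings + positional_encoding(seq_len, d_model)       # word order is now encoded
```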

We notice that each encoder is a stack of layers, each mainly composed of a self-attention sublayer and a simple fully connected feed-forward network. Through its self-attention sublayer, each word also receives information about the other elements of the sentence, which allows capturing the relationships between words. The stack size is a parameter, set to 6 in the original paper and to 2 in the graph above for simplicity.
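Here is a simplified sketch of one such encoder layer, using PyTorch’s nn.MultiheadAttention as a stand-in for the paper’s multi-head self-attention. The residual connections and layer normalization of the real architecture are omitted, and the dimensions are illustrative.

```python
# A sketch of one encoder layer: a self-attention sublayer followed by a small
# feed-forward network. Residuals and layer norm are omitted for brevity.
import torch
import torch.nn as nn

d_model, n_heads, d_ff, seq_len = 512, 8, 2048, 7
self_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
feed_forward = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(1, seq_len, d_model)       # output of the embedding + positional step
attended, _ = self_attention(x, x, x)      # every word attends to every other word
encoded = feed_forward(attended)           # the layer's output, same shape as x

# A full encoder simply stacks 6 such layers (2 in the simplified graph above).
```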

The Transformer global architecture

Now let’s move to the remaining part of the architecture to cover the whole flow. The global model is illustrated in the graph above. The decoder is also composed of a stack of layers, set to 2 in the graph to keep it simple. The decoder layers are similar to those of the encoder: a self-attention layer, which attends over the outputs of the preceding decoding steps and is the layer the paper calls masked multi-head attention, and a feed-forward network. They also have an additional sublayer, the encoder-decoder attention, which receives as input the output of the encoder stack.
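Here is a simplified sketch of one decoder layer, under the same assumptions as the encoder sketch above (residuals and layer normalization omitted, illustrative dimensions).

```python
# A sketch of one decoder layer: masked self-attention over the decoder's own inputs,
# encoder-decoder attention over the encoder output, then a feed-forward network.
import torch
import torch.nn as nn

d_model, n_heads, d_ff = 512, 8, 2048
self_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
feed_forward = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

encoder_output = torch.randn(1, 7, d_model)    # from the encoder stack
decoder_input = torch.randn(1, 5, d_model)     # embedded target words produced so far
causal_mask = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)  # hide future positions

x, _ = self_attention(decoder_input, decoder_input, decoder_input, attn_mask=causal_mask)
x, _ = cross_attention(x, encoder_output, encoder_output)  # queries from the decoder, keys/values from the encoder
out = feed_forward(x)
```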

The decoder is designed to produce the output scores (later converted into words) sequentially, until the token <EOS> (End Of Sentence) is obtained. At each step, we need the output values of the preceding decoding steps. However, during training we already have the target values: Output 1, Output 2, … This allows all the steps from 0 until <EOS> to be performed in parallel, but only during training. We just need to mask, at each step, the future output values, for example by setting their attention scores to −∞ before the softmax so that they receive zero weight. This is why this layer is called masked attention.
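A tiny numeric illustration of that masking, assuming the −∞-before-softmax convention mentioned above (the scores themselves are random stand-ins):

```python
# Masking future decoding steps: each row keeps non-zero attention weights only
# for the current and earlier steps, so nothing peeks at future outputs.
import torch

scores = torch.randn(4, 4)                                           # attention scores between 4 output steps
mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)    # True above the diagonal = future
masked_scores = scores.masked_fill(mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(weights)   # row i has non-zero weights only for steps 0..i
```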

How do we turn the output scores into words?

The output of the decoder stack is a vector of the same dimension as the word embeddings used as input. At each step of the inference, this output vector must be converted into a word. Two additional layers, a linear layer and a softmax layer, do this job. The output flow is illustrated in the figure below.

Transformer inference: the linear and softmax layers

The linear layer, which is a fully connected network, projects the output vector onto a high-dimensional vector space where each dimension stands for a word in the vocabulary known by the model and each value stands for the score of the corresponding word. This is reminiscent of the famous bag-of-words language models. Here, the vocabulary holds all the words that the model has seen during training. The resulting vectors of the linear layer are called logits vectors.

These logits vectors are then passed to the softmax layer, which converts the values at each index into probabilities. The probability at index x tells us what chance the word with index x has of appearing as the next output word. As you can guess by now, we just need to perform an argmax to find the index with the highest probability. Once we have it, we go back to the vocabulary to find out which word corresponds to the chosen index.
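Putting the last few paragraphs together, here is a small sketch of that output flow with a toy, made-up vocabulary.

```python
# A sketch of the output flow: decoder vector -> linear layer -> logits -> softmax ->
# argmax -> word. The tiny vocabulary and its words are purely illustrative.
import torch
import torch.nn as nn

vocab = ["<EOS>", "hello", "world", "transformers", "attention"]
d_model = 512

linear = nn.Linear(d_model, len(vocab))    # projects onto a vocabulary-sized vector space
decoder_output = torch.randn(d_model)      # output vector of the decoder stack at this step

logits = linear(decoder_output)            # one score per vocabulary word
probs = torch.softmax(logits, dim=-1)      # scores turned into probabilities
next_word = vocab[probs.argmax().item()]   # pick the most probable word
print(next_word)
```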

So what?

A legitimate question right now is: OK, so what? How would this new model change our NLP framework, or why should it work better?

Actually, the main strength of Transformer models is, as we said, their lower training time compared to their predecessors. Another important breakthrough that you have probably heard about is the BERT model, which stands for Bidirectional Encoder Representations from Transformers. BERT is pre-trained on very large text data and is able to perform different NLP tasks; of course, it is possible to retrain it to make it perform better on the target task. BERT uses Transformers with a bidirectional attention mechanism. The ability to use a pre-trained model in NLP is a very important step forward that can help a lot with the lack of training data and computing servers.

References

[1] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, et al. Attention is all you need. In: Advances in Neural Information Processing Systems. 2017. pp. 5998–6008.
