Attention to Transformer, please!

The Transformer architecture is considered a breakthrough and has been widely employed in NLP and, more recently, in Computer Vision. So, in this article, we are going to focus on the Transformer architecture and hack it.

Gabriel Oliveira
Semantix
6 min read · Nov 11, 2021


Photo by Alina Grubnyak on Unsplash

Natural Language Processing (NLP) has been gaining attention in both academia and industry thanks to the success of recent language models. In particular, BERT [1] and GPT [2, 3] have become the new celebrities in the NLP field, presenting state-of-the-art results in various Natural Language Understanding and Natural Language Generation tasks, respectively. And what do these models have in common? Both are Transformer-based [4] language models that can be pre-trained in a self-supervised manner.

The Transformer architecture was proposed in 2017 by Vaswani et al., aiming to do away with traditional recurrent neural networks (RNNs) and rely entirely on attention mechanisms to process sequential data. Attention allows this architecture to capture the relationship between elements in a sequence regardless of their distance. Also, since there is no recurrence, the Transformer can process all positions of a sequence in parallel, reducing the training cost significantly compared to traditional RNNs.

Figure 1 shows the overall Transformer architecture. It is composed of Encoder and Decoder modules. The Encoder is a stack of N blocks, each one comprising a Multi-Head Attention layer followed by a layer normalization and residual connection (Add & Norm layer), a Feed-Forward layer and another Add & Norm layer. The output of each Encoder block is passed as input to the next block, and the output of the last Encoder block is fed to the Decoder blocks, where it serves as keys and values for their second attention layer. Apart from that, and from the extra Masked Multi-Head Attention and Add & Norm layers at its start, the Decoder block is the same as the Encoder block. On top of the Decoder, there is a linear layer followed by a softmax, and both are used according to the task.

Figure 1: The Transformer — model architecture. Diagram copied from “Attention Is All You Need”, Vaswani et al.

Self-Attention

In Figure 2, we show the equation of Self-Attention. As can be seen, the attention scores are computed based on a set of queries, keys and values packed together into matrices Q, K and V, respectively. Note that this block doesn’t have any learnable weights. Considering NLP tasks, in general, the queries, keys and values are the embeddings of the words that compose a sentence (or a sequence of tokens). We compute the attention score for each pair of words, which gives the Transformer quadratic complexity in the sequence length.

Figure 2: Equation of self-attention.
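
For concreteness, the equation in Figure 2, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, translates almost directly into code. Below is a minimal PyTorch sketch of scaled dot-product attention; the function name and tensor shapes are our own assumptions, not code taken from the original paper.

```python
import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v, mask=None):
    """Computes softmax(Q Kᵀ / sqrt(d_k)) V for batched inputs.

    q, k, v: tensors of shape (batch, seq_len, d_k)
    mask:    optional boolean tensor broadcastable to (batch, seq_len, seq_len);
             positions where mask is False are ignored.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # one score per pair of tokens
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))  # masked pairs get zero weight
    weights = F.softmax(scores, dim=-1)
    return weights @ v                                     # weighted sum of the values
```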

The goal of Self-Attention is to capture long-range information and dependencies between words, as illustrated in Figure 3. We can see that, in this case, the word “it” is most strongly related to “street”, but the other words also contribute to the context of the word “it”.

Figure 3: Attention scores for the word “it”.

Multi-Head Attention is a way to compute several Self-Attention operations in parallel. In Figure 4, we show the Self-Attention structure on the left and the Multi-Head Attention on the right. As we can see, Multi-Head Attention splits the queries, keys and values into h parts so that each head computes the attention scores for only one part. Note that each head applies linear layers to Q, K and V before passing these embeddings to self-attention; this way, each head learns its own representation of queries, keys and values. The heads can work in parallel to complete the computation faster and, in the end, all head output matrices are concatenated and a final linear projection is applied.

Figure 4: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel. Diagram copied from “Attention Is All You Need”, Vaswani et al.

In order to illustrate how Multi-Head Attention can be implemented, we present a short code snippet below. Note that the attention method takes a mask parameter, which is used for Masked Attention as in the Decoder. In short, it simply ignores the attention scores at the masked positions.
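
The sketch below shows one way Multi-Head Attention with such a mask parameter could look in PyTorch; class and argument names such as d_model and n_heads are our own choices rather than the exact code referenced above.

```python
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Multi-Head Attention as in Figure 4 (a sketch; names are illustrative)."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_head = d_model // n_heads
        self.n_heads = n_heads
        # Linear layers applied to Q, K and V before splitting them into heads.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # final projection after concatenating the heads

    def attention(self, q, k, v, mask=None):
        # Scaled dot-product attention; masked positions are set to -inf before the softmax,
        # so their attention scores are effectively ignored.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

    def forward(self, q, k, v, mask=None):
        batch = q.size(0)

        def split_heads(x, proj):
            # (batch, seq_len, d_model) -> (batch, n_heads, seq_len, d_head)
            return proj(x).view(batch, -1, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split_heads(q, self.w_q), split_heads(k, self.w_k), split_heads(v, self.w_v)
        if mask is not None:
            mask = mask.unsqueeze(1)  # broadcast the same mask over all heads
        out = self.attention(q, k, v, mask)                # (batch, n_heads, seq_len, d_head)
        out = out.transpose(1, 2).contiguous().view(batch, -1, self.n_heads * self.d_head)
        return self.w_o(out)                               # concatenate heads and project back
```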

Encoder & Decoder

Self-Attention operates over every pair of elements within a sequence, so it is order-invariant, which means that the attention score between the word “I” and the word “popcorn” is the same in the sentences “I eat popcorn at the cinema” and “cinema eat I at the popcorn”. Thus, in order to take word positions into account, the Transformer implements a positional encoding. Originally, it was proposed as a sinusoidal function, but recent works have shown that learning this encoding representation is a better approach. The positional encoding is a matrix of weights (learned or not) that is added to the input embeddings.
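
To make the original formulation concrete, here is a small sketch of the sinusoidal positional encoding from Vaswani et al.; the function name and the assumption of an even d_model are ours. The learned variant used below simply replaces this fixed table with a trainable one.

```python
import math

import torch


def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed positional encoding from Vaswani et al. (assumes an even d_model):
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    Returns a (max_len, d_model) matrix to be added to the input embeddings.
    """
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)          # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))                    # (d_model / 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe
```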

Having implemented Multi-Head Attention and understood the idea of Positional Encoding, we can move on and finally implement the Transformer Encoder and Decoder modules.

In our implementation, we consider positional encoding as a linear layer whose weights will be learned. Also, the Feed-Forward layer consists of two linear layers with a ReLU activation between them. Note that every batch must contain input sequences of the same length to construct the input embeddings, so we need to add [PAD] tokens to complete shorter sentences and crop those that exceed the maximum length. However, we don’t want the [PAD] tokens to interfere with the model’s performance, so we mask them out. The remaining code implements the model as illustrated in Figure 1.
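
A possible sketch of the Encoder described above is shown below, reusing the MultiHeadAttention class from the earlier snippet. As an assumption of this sketch, the learned positional encoding is modeled with an nn.Embedding lookup table, and names such as EncoderBlock and pad_idx are our own.

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """One Encoder block: Multi-Head Attention + Add & Norm, Feed-Forward + Add & Norm."""

    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)  # class from the earlier snippet
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        x = self.norm1(x + self.dropout(self.attn(x, x, x, mask)))  # Add & Norm
        return self.norm2(x + self.dropout(self.ff(x)))             # Add & Norm


class Encoder(nn.Module):
    """Stack of N Encoder blocks with learned positional encoding and [PAD] masking."""

    def __init__(self, vocab_size, max_len, d_model, n_heads, d_ff, n_blocks, pad_idx=0):
        super().__init__()
        self.pad_idx = pad_idx
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned positional encoding
        self.blocks = nn.ModuleList(
            [EncoderBlock(d_model, n_heads, d_ff) for _ in range(n_blocks)])

    def forward(self, tokens):
        # Boolean mask that is False at [PAD] positions, so they receive no attention.
        pad_mask = (tokens != self.pad_idx).unsqueeze(1)    # (batch, 1, seq_len)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(positions)  # sum token and position embeddings
        for block in self.blocks:
            x = block(x, pad_mask)
        return x, pad_mask
```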

The Decoder architecture is quite similar to the Encoder except for the Masked Multi-Head Attention module. The mask used in the Decoder depends on the task: in some tasks the model can attend to all elements in the Decoder input sequence, while in others it can only attend to a part of them. In Figure 5, we illustrate the masking process for the machine translation task. Note that the model can attend to all elements in the Encoder input, but it can only attend to the elements in the left context to predict the next word. Thus, we opt to pass the mask as a parameter to the model. It is worth noting that the mask is applied only to the first Multi-Head Attention module of the Decoder block; the following layers already attend to the correct elements, so there is no need to apply the mask again.

Figure 5: Encoder-Decoder architecture for a machine translation model.
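
Below is a corresponding sketch of a Decoder block, together with a small helper that builds the left-context (causal) mask illustrated in Figure 5. Again, the names are our own assumptions, and the MultiHeadAttention class comes from the earlier snippet.

```python
import torch
import torch.nn as nn


def causal_mask(seq_len, device=None):
    """Lower-triangular mask: position i can only attend to positions <= i (the left context)."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device))


class DecoderBlock(nn.Module):
    """Masked self-attention, cross-attention over the Encoder output, then Feed-Forward."""

    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads)   # class from the earlier snippet
        self.cross_attn = MultiHeadAttention(d_model, n_heads)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, tgt_mask=None, src_mask=None):
        # Masked self-attention: tgt_mask hides the right context (and [PAD] tokens).
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x, tgt_mask)))
        # Cross-attention: queries come from the Decoder, keys and values from the Encoder output.
        x = self.norm2(x + self.dropout(self.cross_attn(x, enc_out, enc_out, src_mask)))
        return self.norm3(x + self.dropout(self.ff(x)))
```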

Finally, we can construct the whole Transformer architecture. We can now train this model to translate text with good performance. Also, we can leverage the Encoder and Decoder modules to construct more powerful models such as BERT, T5 [5], GPT, etc.
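
Putting it all together, a possible sketch that reuses the Encoder, DecoderBlock and causal_mask pieces from above could look like this; the default hyperparameters simply follow the base model of the original paper.

```python
import torch
import torch.nn as nn


class Transformer(nn.Module):
    """Encoder-Decoder Transformer with a final linear layer over the target vocabulary."""

    def __init__(self, src_vocab, tgt_vocab, max_len=512, d_model=512,
                 n_heads=8, d_ff=2048, n_blocks=6, pad_idx=0):
        super().__init__()
        self.pad_idx = pad_idx
        self.encoder = Encoder(src_vocab, max_len, d_model, n_heads, d_ff, n_blocks, pad_idx)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
        self.tgt_pos = nn.Embedding(max_len, d_model)
        self.decoder = nn.ModuleList(
            [DecoderBlock(d_model, n_heads, d_ff) for _ in range(n_blocks)])
        self.generator = nn.Linear(d_model, tgt_vocab)  # followed by a softmax at prediction time

    def forward(self, src_tokens, tgt_tokens):
        enc_out, src_mask = self.encoder(src_tokens)
        tgt_len = tgt_tokens.size(1)
        # Combine the [PAD] mask with the causal (left-context) mask from Figure 5.
        tgt_pad = (tgt_tokens != self.pad_idx).unsqueeze(1)                  # (batch, 1, tgt_len)
        tgt_mask = tgt_pad & causal_mask(tgt_len, device=tgt_tokens.device)  # (batch, tgt_len, tgt_len)
        positions = torch.arange(tgt_len, device=tgt_tokens.device)
        x = self.tgt_emb(tgt_tokens) + self.tgt_pos(positions)
        for block in self.decoder:
            x = block(x, enc_out, tgt_mask, src_mask)
        return self.generator(x)  # logits; apply a softmax to obtain word probabilities
```

During training, the logits at position i are compared against the target word at position i + 1, which is exactly the setup the left-context mask supports.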

References

[1] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, https://arxiv.org/pdf/1810.04805.pdf

[2] Language Models are Unsupervised Multitask Learners, https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

[3] Language Models are Few-Shot Learners, https://arxiv.org/pdf/2005.14165.pdf

[4] Attention Is All You Need, https://arxiv.org/pdf/1706.03762.pdf

[5] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, https://arxiv.org/pdf/1910.10683.pdf


Gabriel Oliveira
Semantix

I am a PhD student in Computer Science at University of Campinas 💻 My research interests are NLP and Computer Vision.