Decoding the Encoder, Decoder of Transformers

Karteek Menda
9 min read · Sep 19, 2024


Hello Aliens

In this blog, I will unravel each of the components of the Encoder and Decoder sub-components of the Transformer architecture. To understand Transformers, we first need an intuitive understanding of the attention mechanism. Prime a language model with an arbitrary input, and the model generates the rest. The model itself is a bit of a black box, but what is interesting is how it works: as the model generates text word by word, it can reference, or attend to, the words that are relevant to the word being generated. Which words to attend to is learned during training through backpropagation. RNNs are also capable of looking at previous inputs, but the power of the attention mechanism is that it does not suffer from short-term memory. The same limitation applies to GRUs and LSTMs, although they have a larger capacity for longer-term memory and therefore a longer window to reference from. This power was demonstrated in the paper “Attention Is All You Need”, where the authors introduced a novel neural network called the Transformer, an attention-based encoder-decoder architecture.

The Transformer — Model Architecture [Image Source]

ENCODER

The initial stage is feeding the input into a word embedding layer, which is essentially a lookup table that returns a learned vector representation of each word. Because neural networks operate on numbers, every word is mapped to a vector of continuous values. The next step is to inject positional information into these embeddings, because a Transformer encoder has no recurrence like an RNN. This is done using positional encoding. The authors came up with a clever trick using sine and cosine functions: for every even index of the encoding vector, use the sine function, and for every odd index, use the cosine function, each evaluated as a function of the word’s position, and then add the resulting vectors to their corresponding embedding vectors. This successfully gives the network information about the position of each vector. Sine and cosine were chosen because, for any fixed offset, the encoding of a shifted position is a linear function of the encoding of the original position, a property the model can easily learn to exploit when attending by relative positions.
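As a rough illustration of this positional encoding trick, here is a minimal PyTorch sketch (the tensor shapes and variable names are my own, not the paper’s code):

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    # One row per position, one column per embedding dimension.
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)        # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))                  # (d_model/2,)
    pe[:, 0::2] = torch.sin(position * div_term)   # sine on even indices
    pe[:, 1::2] = torch.cos(position * div_term)   # cosine on odd indices
    return pe

# Add the positional information to the word embeddings.
# `embeddings` stands in for the (seq_len, d_model) output of the embedding layer.
seq_len, d_model = 10, 512
embeddings = torch.randn(seq_len, d_model)
x = embeddings + positional_encoding(seq_len, d_model)
```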

The encoder layer’s job is to map each input sequence into a continuous representation that holds all the information it has learned for that sequence. It contains two submodules: multi-headed attention followed by a fully connected feed-forward network. Each of the two submodules has a residual connection around it, followed by a layer normalization.

(Left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel [Image Source]

To break this down, the multi-headed attention module applies a specific attention mechanism called self-attention. Self-attention allows the model to associate each word in the input with the other words. For example, the model can learn to associate related words with one another, and it can also learn that words arranged in a certain pattern (typically a question) call for an appropriate response. To achieve self-attention, we feed the input into three distinct fully connected layers to create the query, key, and value vectors. The query, key, and value concepts come from retrieval systems. When you type a query into the YouTube search box to look for a video, for instance, the search engine maps your query against a set of keys (title, description, and so on) tagged to the videos, then shows you the best-matching videos.

The self-attention calculation in matrix form [Image Source]

To create a score matrix, the queries and keys undergo a dot-product matrix multiplication. The score matrix determines how much importance each word should place on every other word, so every word at each time step gets a score with respect to every other word; the higher the score, the stronger the focus. The scores are then divided by the square root of the dimension of the key vectors. This allows for more stable gradients, since multiplying values together can make them explode. Next, we take the softmax of the scaled scores to get the attention weights, which are probability values between 0 and 1. The softmax pushes the larger scores towards 1 and the smaller ones towards 0, which allows the model to be more confident about which words to attend to. You then multiply the attention weights by the value vectors to get an output vector. The higher softmax scores preserve the words the model has learned are more important, while the lower scores drown out the irrelevant ones.
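Here is a minimal, illustrative sketch of that scaled dot-product attention computation in PyTorch (the function name and shapes are my own):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    # Dot product of queries and keys gives the score matrix, scaled by sqrt(d_k).
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        scores = scores + mask            # mask holds 0 or -inf entries
    weights = F.softmax(scores, dim=-1)   # attention weights between 0 and 1
    return torch.matmul(weights, v)       # weighted sum of the value vectors
```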

The resulting output vector is then passed into a linear layer for further processing. To turn this into a multi-headed attention computation, you split the query, key, and value into N vectors before applying self-attention; each of the split vectors goes through the same self-attention procedure separately. Each self-attention process is called a head. The resulting vectors from each head are concatenated into a single vector before entering the final linear layer. In theory, each head learns something different, increasing the representation power of the encoder model.
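A compact sketch of how the splitting into heads, the per-head attention, and the final concatenation might look, reusing the scaled_dot_product_attention function sketched above (the module layout is illustrative, not a reference implementation):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Fully connected layers that produce the query, key, and value vectors.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # final linear layer

    def forward(self, q, k, v, mask=None):
        batch, _, d_model = q.shape

        def split(x):
            # Split d_model into (num_heads, d_head) so each head attends separately.
            return x.view(batch, -1, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        out = scaled_dot_product_attention(q, k, v, mask)   # defined in the sketch above
        # Concatenate the heads back into a single vector.
        out = out.transpose(1, 2).contiguous().view(batch, -1, d_model)
        return self.w_o(out)
```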

Within a Transformer network, the multi-headed attention module computes the attention weights for the input and produces an output vector that encodes how each word in the sequence should attend to every other word. In the next phase, the multi-headed attention output vector is added to the original input; this is called a residual connection. The residual connection’s output then goes through layer normalization. The normalized residual output is fed into a point-wise feed-forward network, which consists of two linear layers with a ReLU activation in between. The output of the feed-forward network is again added to its own input and then normalized once more.

The layer normalizations are used to stabilize the network, which substantially reduces the training time required, and the point-wise feed-forward layer further processes the attention output, potentially giving it a richer representation. All these operations encode the input into a continuous representation with attention information. This helps the decoder focus on the relevant words in the input as it decodes. To further encode the information, we can stack the encoder N times, giving each layer a chance to learn different attention representations and therefore boosting the Transformer network’s ability to make predictions.
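Putting the residual connections, layer normalizations, and point-wise feed-forward network together, a single encoder layer could be sketched roughly as follows (illustrative only, reusing the MultiHeadAttention sketch above; the stack depth of 6 matches the paper):

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)   # from the sketch above
        self.ff = nn.Sequential(                             # point-wise feed-forward
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Residual connection around multi-headed attention, then layer norm.
        x = self.norm1(x + self.attn(x, x, x))
        # Residual connection around the feed-forward network, then layer norm.
        return self.norm2(x + self.ff(x))

# Stacking the encoder N times, e.g. N = 6 as in the paper.
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
```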

DECODER

The decoder’s job is to generate text sequences. The decoder has similar sublayers to the encoder: two multi-headed attention layers, a point-wise feed-forward layer, and layer normalizations, capped off with a linear layer that acts as a classifier and a softmax to produce the word probabilities. These sublayers behave similarly to the layers in the encoder, but each multi-headed attention layer has a different job. The decoder is autoregressive: it takes the list of previous outputs as input, together with the encoder outputs that contain the attention information from the input. The decoder stops predicting the next token once it generates the “end” token.

The decoder’s input passes through an embedding layer and a positional encoding layer, and the resulting positional embeddings are fed into the first multi-headed attention layer, which computes the attention scores for the decoder’s input. This multi-headed attention layer operates slightly differently. Since the decoder is autoregressive and generates the sequence word by word, you need to prevent it from looking at future tokens, so a way of blocking attention scores for future words is required. This is called masking: a look-ahead mask is added to stop the decoder from seeing tokens that come later in the sequence.

The mask is applied after the scores have been scaled and before the softmax is computed. It is a matrix of the same size as the attention scores, filled with 0s and negative infinities. When you add the mask to the scaled attention scores, you get a score matrix whose upper-right triangle is filled with negative infinities. The reason is that once you take the softmax of the masked scores, the negative infinities become zeros, leaving zero attention for future tokens and essentially telling the model not to focus on those words. This masking is the only difference between the two attention mechanisms in the decoder architecture. The layer still has multiple heads that the mask is applied to, before their outputs are concatenated and fed through a linear layer for further processing.
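A small sketch of how such a look-ahead mask might be built in PyTorch (the helper name is mine):

```python
import torch

def look_ahead_mask(seq_len):
    # Upper triangle (future positions) gets -inf, everything else 0.
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

print(look_ahead_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])
```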

The output of the first multi-headed attention layer is a masked output vector carrying information on how the model should attend to the decoder’s inputs. In the second multi-headed attention layer, the encoder’s outputs serve as the keys and values, while the outputs of the first multi-headed attention layer serve as the queries. This process matches the encoder’s input to the decoder’s input, allowing the decoder to decide which parts of the encoder input are relevant to focus on.
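To make the wiring of the two decoder attention layers concrete, here is an illustrative decoder-layer sketch that reuses the MultiHeadAttention and mask helpers sketched above; note that the second attention layer takes its keys and values from the encoder output:

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)   # masked self-attention
        self.cross_attn = MultiHeadAttention(d_model, num_heads)  # encoder-decoder attention
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, look_ahead):
        # First attention layer: masked self-attention over the decoder's own inputs.
        x = self.norm1(x + self.self_attn(x, x, x, mask=look_ahead))
        # Second attention layer: queries come from the decoder,
        # keys and values come from the encoder's output.
        x = self.norm2(x + self.cross_attn(x, enc_out, enc_out))
        return self.norm3(x + self.ff(x))
```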

The output of the second multi-headed attention layer goes through a point-wise feed-forward layer for further processing, and the output of that feed-forward layer goes through a final linear layer that acts as a classifier. The classifier is as big as the number of classes you have: for example, if you have 15,000 classes for 15,000 words, the output of the classifier will be of size 15,000. A softmax layer receives the classifier’s output and produces probability scores between 0 and 1 for each class. We take the index with the highest probability as the predicted word. The decoder then takes that output, appends it to the list of decoder inputs, and continues decoding until the “end” token is predicted.
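A hedged sketch of that final classifier-plus-softmax step for a single greedy decoding step (the 15,000-word vocabulary is the example figure from above, and the helper function is hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 15000, 512                 # 15,000 classes for 15,000 words
classifier = nn.Linear(d_model, vocab_size)      # final linear layer

def greedy_decode_step(decoder_output):
    # decoder_output: (seq_len, d_model) tensor from the last decoder layer.
    logits = classifier(decoder_output[-1])      # classify the latest position
    probs = F.softmax(logits, dim=-1)            # probability scores between 0 and 1
    return torch.argmax(probs).item()            # index with the highest probability

# The predicted token is appended to the decoder inputs and decoding repeats
# until the "end" token is produced.
```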

In our scenario, the highest-probability prediction is the final class, which is assigned to the “end” token. This is how the decoder produces its output. The decoder can also be stacked N layers high, with each layer taking inputs from the encoder and from the layers before it. By stacking layers, the model can learn to extract and focus on different combinations of attention from its attention heads, which can increase its predictive capacity. And that is how Transformers work.

Example

Take the case of machine translation from English to French. In the encoder, we pass all the English words of the sentence simultaneously and obtain their word embeddings. In the decoder, the output sentence in French is the input: we obtain its input embeddings, which yield the word vectors, and add the positional vectors so that each word carries a sense of its position in the sentence. We then feed the resulting vectors into the decoder block, which consists of three primary parts (a sketch tying these pieces together follows the list).

a. To show how closely related each word is to every other word in the same sentence, the self-attention block creates attention vectors for each word in the sentence.

b. The encoder-decoder attention block receives these attention vectors as well as the vectors from the encoder. This attention block determines how each word vector relates to the others, and this is the point at which language understanding takes place. Its output is an attention vector for every word in the input and output sentences.

c. Each attention vector is then passed to the feed-forward unit, followed by a linear layer. Finally, the softmax layer converts the output into a probability distribution that humans can interpret, and the word corresponding to the highest probability is the final predicted word.
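Tying the illustrative sketches above together, a hypothetical end-to-end forward pass for this translation example might look like this (all token ids, vocabulary sizes, and layer counts are made up for illustration):

```python
import torch
import torch.nn as nn

src_vocab, tgt_vocab, d_model = 12000, 15000, 512
src_embed = nn.Embedding(src_vocab, d_model)     # English embeddings
tgt_embed = nn.Embedding(tgt_vocab, d_model)     # French embeddings

encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
decoder_layers = nn.ModuleList([DecoderLayer() for _ in range(6)])

src = torch.randint(0, src_vocab, (1, 7))        # English token ids
tgt = torch.randint(0, tgt_vocab, (1, 5))        # French tokens generated so far

# Encoder: embeddings + positional encoding, passed through the stacked layers.
enc_x = src_embed(src) + positional_encoding(src.size(1), d_model)
enc_out = encoder(enc_x)

# Decoder: same embedding + positional encoding step, then the stacked decoder layers.
mask = look_ahead_mask(tgt.size(1))
dec_x = tgt_embed(tgt) + positional_encoding(tgt.size(1), d_model)
for layer in decoder_layers:
    dec_x = layer(dec_x, enc_out, mask)

next_token = greedy_decode_step(dec_x[0])        # index of the predicted French word
```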

Thanks for reading! If you liked this article, do 👏 it. If you want to connect with me on LinkedIn, please click here.

I plan to share additional blog posts covering topics such as robotics, drive-by-wire vehicles, Machine Learning, Deep Learning, Computer Vision, NLP, User Interface Development, and more.

Stay tuned.

This is Karteek Menda.

Signing Off
