Understanding Transformers — Encoder

Sathya Krishnan Suresh, Shunmugapriya P

Oct 12, 2022

Transformers were introduced in 2017, and since then there has been an explosion in the number of advanced models built on the transformer architecture; the uses of these models have raised the importance of Natural Language Processing even further. Transformers have been able to solve most of the common NLP tasks with very high efficiency, and researchers keep finding more and more problems to which they can be applied. With this introduction, let’s talk about the transformer architecture, focusing mainly on the encoder part. The code for the article is given here.

Transformer Architecture:
A transformer is made up of an encoder and a decoder, and it was originally designed for sequence-to-sequence tasks such as machine translation and question answering. The architecture of the original transformer model is shown below.

Transformer architecture

The grey rounded box on the left is the encoder, and the one on the right is the decoder. The encoder takes in embeddings, which are the sum of the regular token embeddings and the positional embeddings, and outputs tensors of the same shape. Those output tensors have a lot of information encoded in them (contextual meaning, part of speech, position of the word, and so on) and are often called the hidden state. They are fed to the decoder, which decodes the hidden state depending on the task it has to solve and its pre-training objective.

The architecture might look complicated, but let’s break it down layer by layer and build it from the bottom up.

Embeddings:
The raw text data that we have cannot be used to train transformer models directly, as they understand only numbers. Standard operations like tokenization and one-hot encoding were used earlier, but none of them were able to fully encode the information in the text into numbers. Embeddings were the first technique to encode meaningful information into those numbers, or more precisely, vectors. So rather than passing the tokens to the encoder, their embeddings are passed. But one major problem remained: encoding information about the position of the words. A standard embedding layer just maps each token to an n-dimensional vector and never encodes the position information.

To encode the position information of the tokens, another embedding, called a positional embedding, can be used. In a positional embedding layer, instead of feeding the word tokens as inputs, we feed the position ids (indices) of the tokens in each sentence. This enables the positional embedding layer to learn a useful representation of the positions. The final output of the embeddings layer is generated by adding the token embeddings and position embeddings and applying a normalization to the result so that there are no huge values. The code for the embeddings layer is given below.
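A minimal PyTorch sketch of such an embeddings layer is shown below. The class and parameter names (vocab_size, embed_dim, max_position_embeddings, dropout) and the use of learned positional embeddings are assumptions for illustration, not necessarily the exact code linked from the article.

```python
import torch
from torch import nn

class Embeddings(nn.Module):
    """Token embeddings plus learned positional embeddings, then layer norm and dropout."""
    def __init__(self, vocab_size, embed_dim, max_position_embeddings, dropout=0.1):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, embed_dim)
        self.position_embeddings = nn.Embedding(max_position_embeddings, embed_dim)
        self.layer_norm = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids):
        # input_ids: (batch_size, seq_len)
        seq_len = input_ids.size(1)
        position_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)
        # Add token and position embeddings, then normalize to keep the values well scaled
        x = self.token_embeddings(input_ids) + self.position_embeddings(position_ids)
        x = self.layer_norm(x)
        return self.dropout(x)
```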

Self-Attention:
The embeddings that are fed to the encoder are static and have some amount of information encoded in them. The self-attention layer additionally takes into account the context of the words or tokens in a sentence (or a set of sentences) and encodes that information into their respective embeddings as well.

Encoding the context into the embedding is an important task because homonyms in a passage would otherwise be mapped to the same embedding. For example, consider a passage that contains the term ‘bat’. ‘Bat’ in the passage can be used in the sense of a mammal or in the sense of a stick with which players play, or it may be used in both senses. The embeddings fed into the encoder will have the same value for both senses, but we do not want that. When the self-attention layer is done with the embeddings passed to it, the resulting embeddings will have their context encoded in them. For example, ‘bat’ will carry more of the mammal sense than the stick sense if it has been used in the passage as in ‘Bats live in caves’.

Self-attention encodes the context by using three tensors (Query, Key and Value) and some mathematical operations on those tensors. The tensors are generated by applying learnable linear projections to the embeddings passed to the layer. The meaning of these tensors can be understood with a simple example. When you search for ‘Infinity war’ in Google, the search text that you enter is called a query. Google’s servers hold hash keys for similar queries, and those keys are matched with the query to find the similarity scores; those hash keys are the keys in discussion. Finally, the content returned to the user is the value.

In the standard attention mechanism the query comes from the decoder and the key-value pair is provided by the encoder, but in the self-attention mechanism the query also comes from the encoder. The matrix multiplication between the query and key tensors gives the similarity score of each token (word) with all the other tokens (words). Since the values resulting from this matrix multiplication can be large, the scores are scaled down and a softmax is applied to the score matrix, which is finally multiplied with the value matrix to generate the output of this layer. Given below is the code for the attention layer in PyTorch.

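A minimal sketch of scaled dot-product attention and a single attention head follows. The helper function, the class name AttentionHead and the head_dim parameter are assumptions for illustration rather than the article’s exact code.

```python
import torch
import torch.nn.functional as F
from torch import nn

def scaled_dot_product_attention(query, key, value):
    """Similarity scores from query and key, softmax, then a weighted sum of the values."""
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / dim_k ** 0.5  # scaled similarity scores
    weights = F.softmax(scores, dim=-1)                            # attention weights
    return torch.bmm(weights, value)

class AttentionHead(nn.Module):
    """A single self-attention head with learnable query, key and value projections."""
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)

    def forward(self, hidden_state):
        # hidden_state: (batch_size, seq_len, embed_dim)
        return scaled_dot_product_attention(
            self.q(hidden_state), self.k(hidden_state), self.v(hidden_state)
        )
```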

Multi-Head Attention:
Basically, Multi-Head Attention is made up of a number of self-attention heads. Using several heads lets each one focus on a particular feature of the text: one head may focus on the subject-verb relationship, another on the tense of the text, and so on. This is similar to using multiple filters in a single convolutional layer, and as we know from ensembling, combining multiple models more often than not leads to better results.

Usually in Multi-Head Attention the last dimension (the embedding dimension) is split equally among the attention heads for scalability; the outputs of the attention heads are concatenated and a linear transformation is applied to them to produce the final output. The linear transformation is applied so that the output generated will be suitable for the feed-forward layer to which it is passed later. The code for this layer is given below. The important point here is to take care of the head dimension and to concatenate the heads back together.
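Continuing the AttentionHead sketch above, one way to write the multi-head attention layer is shown below; the names embed_dim and num_heads are again assumptions.

```python
class MultiHeadAttention(nn.Module):
    """Runs several attention heads in parallel and mixes their concatenated outputs."""
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        head_dim = embed_dim // num_heads  # split the embedding dimension across the heads
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
        )
        self.output_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, hidden_state):
        # Concatenate the per-head outputs back to embed_dim, then apply the final linear layer
        x = torch.cat([head(hidden_state) for head in self.heads], dim=-1)
        return self.output_linear(x)
```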

Feed-Forward Layer:
The Feed-Forward layer is a simple layer composed of two linear layers, a GELU activation and a dropout layer. In this layer the embeddings are processed independently of each other, and it is thought that this is where most of the memorization of information takes place. Hence, whenever transformer models are scaled up, the feed-forward layer is usually the one that is scaled up the most.
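A sketch of this layer follows; the intermediate_dim parameter (often around four times embed_dim) is an assumption for illustration.

```python
class FeedForward(nn.Module):
    """Position-wise feed-forward network: expand, apply GELU, project back, then dropout."""
    def __init__(self, embed_dim, intermediate_dim, dropout=0.1):
        super().__init__()
        self.linear_1 = nn.Linear(embed_dim, intermediate_dim)
        self.linear_2 = nn.Linear(intermediate_dim, embed_dim)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Each position (token) is transformed independently of the others
        x = self.gelu(self.linear_1(x))
        x = self.linear_2(x)
        return self.dropout(x)
```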

Transformer Encoder Layer:
All the layers needed for an encoder layer have now been developed, and it is just a matter of putting them together into a single module. But before blindly following the architecture diagram given above, we should think about the positioning of the skip connections and the normalization layers.

Depending on the position of the normalization layer, there are two arrangements: post-layer normalization and pre-layer normalization. In the former, layer normalization is applied between the skip connections, and the original architecture given above follows this arrangement. In the latter, layer normalization is applied within the span of the skip connection (this will be clear when you look at the code). Most of the architectures being used now follow the latter, because with post-layer normalization the weights and gradients tend to diverge and training becomes very difficult. The code given below also follows pre-layer normalization.
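A sketch of an encoder layer using pre-layer normalization, built from the MultiHeadAttention and FeedForward sketches above:

```python
class TransformerEncoderLayer(nn.Module):
    """One encoder block with pre-layer normalization: normalize, apply sublayer, add skip."""
    def __init__(self, embed_dim, num_heads, intermediate_dim, dropout=0.1):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(embed_dim)
        self.layer_norm_2 = nn.LayerNorm(embed_dim)
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.feed_forward = FeedForward(embed_dim, intermediate_dim, dropout)

    def forward(self, x):
        # Layer norm is applied inside the span of each skip connection
        x = x + self.attention(self.layer_norm_1(x))
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x
```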

Transformer Encoder:
We now have everything needed to implement the encoder part of the architecture. First we generate embeddings from the input tokens, and then we pass those embeddings through a stack of the encoder layers discussed above. The code again is simple and is given below.
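A sketch of the full encoder, chaining the Embeddings layer and a stack of encoder layers; the hyperparameter values in the usage example at the end are illustrative assumptions.

```python
class TransformerEncoder(nn.Module):
    """Embeddings followed by a stack of encoder layers."""
    def __init__(self, vocab_size, embed_dim, num_heads, intermediate_dim,
                 max_position_embeddings, num_layers, dropout=0.1):
        super().__init__()
        self.embeddings = Embeddings(vocab_size, embed_dim, max_position_embeddings, dropout)
        self.layers = nn.ModuleList(
            [TransformerEncoderLayer(embed_dim, num_heads, intermediate_dim, dropout)
             for _ in range(num_layers)]
        )

    def forward(self, input_ids):
        x = self.embeddings(input_ids)
        for layer in self.layers:
            x = layer(x)
        return x  # hidden state: (batch_size, seq_len, embed_dim)

# Example usage with made-up hyperparameters
encoder = TransformerEncoder(vocab_size=30522, embed_dim=768, num_heads=12,
                             intermediate_dim=3072, max_position_embeddings=512,
                             num_layers=6)
input_ids = torch.randint(0, 30522, (1, 16))
print(encoder(input_ids).shape)  # torch.Size([1, 16, 768])
```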

This encoder alone can be used as a stand-alone model for a lot of tasks, such as text classification and masked language modelling, by just adding a suitable task-dependent body on top of the encoder block.
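For example, here is a minimal sketch of a text-classification body on top of the encoder; the class name, the choice of the first token’s hidden state and the num_classes parameter are assumptions for illustration.

```python
class TransformerForSequenceClassification(nn.Module):
    """Encoder plus a simple classification head on the first token's hidden state."""
    def __init__(self, encoder, embed_dim, num_classes, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, input_ids):
        hidden_state = self.encoder(input_ids)[:, 0, :]  # first token's hidden state
        return self.classifier(self.dropout(hidden_state))
```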

Conclusion:
The complete structure of the encoder has been discussed in this article. The decoder will be discussed in the next article. I hope you had as much fun reading this article as I had writing it.
