What are Transformer models? - Part 2

Vivek Muraleedharan · Published in Nerd For Tech · Jul 12, 2021

In the previous article we talked about Transformer models and their applications in different use cases. In this article we are going to dive deep into the architecture of the encoder block, one of the main building blocks of a Transformer model.

Encoder Blocks

The encoding component in a Transformer model is basically a stack of encoders that are identical in structure, each having two sub-layers: a self-attention layer and a feed-forward neural network. The encoder converts each word in the sentence into a sequence of numbers (one sequence per word), which is called the feature vector or feature tensor, and the dimension of this feature vector is defined by the architecture of the model. The base BERT model has 768 dimensions, i.e. for each word the encoder will output a vector of 768 dimensions.
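As a quick illustration, here is a minimal sketch using the Hugging Face transformers library (my choice for the example, not something required by the architecture itself) that feeds a sentence through bert-base-uncased and prints the shape of the resulting feature tensor:

```python
# pip install transformers torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transformers are building blocks of NLP", return_tensors="pt")
outputs = model(**inputs)

# One 768-dimensional feature vector per input token (including special tokens)
print(outputs.last_hidden_state.shape)  # torch.Size([1, <number of tokens>, 768])
```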

Encoder Model representation

The feature vector representation of each word contains the context of the sentence: each feature vector takes into account the previous and the next words in the sequence. This is achieved by the self-attention layer, a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. The output of the self-attention layer is fed to a feed-forward neural network, which then sends its output upwards to the next encoder.

The word at each position passes through a self-attention process. Then, each passes through a feed-forward neural network, the exact same network with each vector flowing through it separately. (Courtesy: Jay Alammar's blog)

The word embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors, each of size 512 in the original Transformer. After embedding the words in our input sequence, each of them flows through the two layers of the encoder.

A Little About Self-Attention

The concept of self-attention is similar in spirit to the working principle of an RNN: in an RNN, the hidden state and gates help keep track of the words in a sentence that are relevant for predicting the next word. Self-attention is the mechanism Transformers use to understand the relevance of the words in a sentence to each other. The calculation of self-attention involves computing the Query (Q), Key (K) and Value (V) matrices from the word embeddings of the input, passing the scaled dot products of the queries and keys through a softmax layer to get attention weights for each word in the sentence, and finally using those weights to take a weighted sum of the values. You can check out more about it here.
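Here is a minimal sketch of that calculation in PyTorch (the projection matrices are random for illustration; in a real model they are learned parameters):

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Minimal single-head self-attention for one sentence.

    x:             (seq_len, d_model) word embeddings
    w_q, w_k, w_v: (d_model, d_k) projection matrices
    """
    q = x @ w_q                                      # queries
    k = x @ w_k                                      # keys
    v = x @ w_v                                      # values
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5    # scaled dot products
    weights = F.softmax(scores, dim=-1)              # attention weights per word
    return weights @ v                               # weighted sum of values

# Toy example: 4 words with 8-dimensional embeddings
x = torch.randn(4, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([4, 8])
```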

Each sub-layer (self-attention, feed-forward neural network) in each encoder has a residual connection around it, and is followed by a layer-normalization step. The visualization looks like this:
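To make that structure concrete, here is a simplified sketch of one encoder layer in PyTorch. The hyperparameters and the use of nn.MultiheadAttention are my assumptions for illustration, not the exact layers of any particular pretrained model:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Simplified encoder layer: self-attention and a feed-forward network,
    each wrapped in a residual connection followed by layer normalization."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)   # sub-layer 1: self-attention
        x = self.norm1(x + attn_out)            # residual + layer norm
        ffn_out = self.ffn(x)                   # sub-layer 2: feed-forward network
        return self.norm2(x + ffn_out)          # residual + layer norm

# A stack of identical encoder layers, as in the encoding component
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
x = torch.randn(1, 10, 512)   # (batch, sequence length, model dimension)
print(encoder(x).shape)       # torch.Size([1, 10, 512])
```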

(Courtesy: Jay Alammar's blog)

Encoder Models

Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. These models are often characterized as having “bi-directional” attention, and are often called auto-encoding models. The pretraining of these models usually revolves around somehow corrupting a given sentence (for instance, by masking random words in it) and tasking the model with finding or reconstructing the initial sentence.
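For instance, using the Hugging Face fill-mask pipeline (a minimal sketch; the exact predictions and scores will vary), we can ask an encoder model to reconstruct a masked word:

```python
from transformers import pipeline

# Mask a word and let the encoder model predict what it should be
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```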

Encoder models are best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition (and more generally word classification), and extractive question answering. BERT is a classic example of an encoder-only model; other examples include ALBERT, DistilBERT, ELECTRA and RoBERTa. A sketch of two of these use cases is shown below.
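Here is a hedged sketch of named entity recognition and extractive question answering with the Hugging Face pipeline API; the specific checkpoints downloaded are the library's defaults, not models chosen here:

```python
from transformers import pipeline

# Named entity recognition: classify each word (or span) in the sentence
ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Hugging Face is based in New York City."))

# Extractive question answering: the answer is a span taken from the context
qa = pipeline("question-answering")
print(qa(question="Where is Hugging Face based?",
         context="Hugging Face is based in New York City."))
```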

In the next article we will discuss the decoder block, the next building block of a Transformer model.
