Encoder and Decoder Explained with a Practical Example

Muhammad A
5 min read · Aug 11, 2023


Table of Contents

  1. Encoder
  2. Decoder
  3. Training the encoder-decoder model
  4. Practical example
  5. Summary

Encoder-decoder models are very good at sequence-to-sequence tasks, where the model is given an input sequence of text and is expected to produce an output sequence. This is different from tasks like NER, where each word is classified into a particular category. Here the output can also have a different length from the input, which is especially common in machine translation.

Different architectures can be used for the encoder and the decoder, such as RNN, LSTM, and GRU.

  1. Encoder

As shown below, the encoder is given a sequence of text, and its task is to generate a vector that represents the input sequence entered by the user. That vector is fed into the decoder as input, so it should capture the essence of the entire input sequence.

This vector is called a context vector.

Encoder model

The entire purpose of the encoder is to generate a contextual representation (the context vector) for the input sequence.
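The encoder idea can be sketched in a few lines of numpy. This is an illustrative vanilla-RNN encoder, not the article's exact model; the sizes and random weights are assumptions chosen just for demonstration. The final hidden state is returned as the context vector.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 3
W_xh = rng.standard_normal((hidden_dim, embed_dim)) * 0.1  # input -> hidden
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1  # hidden -> hidden

def encode(embedded_tokens):
    """Run the RNN over the input sequence; the final hidden state
    is the context vector summarising the whole sequence."""
    h = np.zeros(hidden_dim)
    for x in embedded_tokens:
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h  # context vector

sequence = rng.standard_normal((5, embed_dim))  # 5 already-embedded tokens
context = encode(sequence)
```

Notice that no matter how long the input sequence is, the context vector always has the same fixed size (`hidden_dim`).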

Note: If we want to predict image captions, we need to use an advanced convolutional network to extract the context vector from the image.

2. Decoder

As discussed, the final hidden state of the encoder is the context vector. This context vector is fed to the decoder, specifically to each unit/cell of the decoder. In addition, each decoder cell receives the output of the previous unit as input, which means each unit receives two inputs at a time and produces its output in sequence.

Decoder model
Prediction of Decoder model

The final encoder-decoder architecture is as shown.

Encoder-Decoder Model

3. Training the encoder-decoder model

Encoder decoder

The dashed lines represent the flow during test or evaluation time.

The figure above again shows RNN-based encoder-decoder training.

The training data consists of pairs of input sentences and their corresponding output sequences. We use cross-entropy loss at the decoder. Encoder-decoder architectures are trained end-to-end, just like RNN language models. The loss is calculated and then back-propagated to update the weights using gradient descent. The total loss is calculated by averaging the cross-entropy loss per target word.
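The per-target-word average just described can be written directly. The probabilities below are made-up numbers for a hypothetical 3-word target vocabulary, purely to show the calculation.

```python
import numpy as np

def cross_entropy(probs, targets):
    """probs: (T, V) softmax outputs from the decoder, one row per step;
    targets: (T,) ground-truth word ids. Returns mean loss per word."""
    return float(-np.mean(np.log(probs[np.arange(len(targets)), targets])))

probs = np.array([[0.7, 0.2, 0.1],   # step 1: 0.7 on the correct word
                  [0.1, 0.8, 0.1],   # step 2: 0.8 on the correct word
                  [0.2, 0.2, 0.6]])  # step 3: 0.6 on the correct word
targets = np.array([0, 1, 2])
loss = cross_entropy(probs, targets)  # -(ln 0.7 + ln 0.8 + ln 0.6) / 3
```

The loss shrinks toward zero as the decoder puts more probability mass on the correct target word at every step.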

There are a few points worth noting about this encoder-decoder architecture. First, in a seq2seq task we do not need the output of every encoder step, only the final one, which is why the intermediate encoder outputs are discarded. Second, the input to each decoder step is the ground-truth target word: even if the decoder predicts the wrong word, we still feed in the correct word. For example, if the first word predicted by the decoder is anything other than "Yes", we still feed in "Yes"; we do this to make learning better and faster. Finally, if at the fourth LSTM cell the decoder predicts a word other than "END", we still end the output generation there. These two ideas combined are called teacher forcing.

Teacher forcing: forcefully feeding the network the correct input instead of its own prediction from the previous time step, and terminating the output generation at the target length even if the decoder has not predicted the "END" of the sentence.

Note: during testing, we don't have teacher forcing.

Training time: We condition the decoder step t on the ground truth word yt−1 from the previous time step. This is called teacher forcing.

Test time: We don't have the ground truth yt−1, so we condition decoder step t on the predicted word ŷt−1 from the previous time step. We can use different decoding methods, such as greedy decoding or beam search.
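The two regimes can be contrasted side by side. The `step` function below is a stand-in stub for one decoder step (its internals are an assumption, not a real model); what matters is what gets fed back in as `prev` at each step.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab_size = 5

def step(prev_token, h):
    # Stub for one decoder step: returns a new hidden state and
    # a probability distribution over the vocabulary.
    h = np.tanh(h + 0.1 * prev_token)
    logits = rng.standard_normal(vocab_size)
    return h, np.exp(logits) / np.exp(logits).sum()

def run_teacher_forced(targets, h0):
    """Training time: feed the ground-truth word y[t-1] at every step,
    regardless of what the model actually predicted."""
    h, outputs, prev = h0, [], 0  # 0 = assumed <START> id
    for y_true in targets:
        h, probs = step(prev, h)
        outputs.append(int(np.argmax(probs)))
        prev = y_true  # ground truth, even if the prediction was wrong
    return outputs

def run_greedy(h0, end_token, max_len=10):
    """Test time: feed back the model's own prediction (greedy decoding),
    stopping at the END token or at max_len."""
    h, outputs, prev = h0, [], 0
    for _ in range(max_len):
        h, probs = step(prev, h)
        prev = int(np.argmax(probs))  # the model's own previous prediction
        outputs.append(prev)
        if prev == end_token:
            break
    return outputs

forced = run_teacher_forced([1, 2, 3], np.zeros(3))
generated = run_greedy(np.zeros(3), end_token=4)
```

Beam search replaces the single `argmax` in `run_greedy` with keeping the k most probable partial sequences at each step.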

Note: If we want to use an encoder-decoder on image tasks, we need to obtain the context vector of the image using a CNN such as VGG or AlexNet.

4. Practical example

Note: in the example below the calculations are not exact; just assume the results.

Embedding of input text

In the above image, I have assumed my vocabulary consists of only three words in the case of English and five words in the case of Urdu. Secondly, I have simply encoded the words; for a real-world task, encoding alone is not enough and you have to create embeddings of the words.
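A minimal sketch of that last point, mirroring the tiny vocabulary above: each word id is looked up in a learned embedding table rather than used as a raw integer. The three English words and the embedding size here are assumptions for illustration.

```python
import numpy as np

src_vocab = {"yes": 0, "no": 1, "maybe": 2}  # assumed 3-word source vocabulary
rng = np.random.default_rng(3)
embed_dim = 4
# In a real model this table is a learned parameter; here it is random.
embedding_table = rng.standard_normal((len(src_vocab), embed_dim))

def embed(sentence):
    """Map each word to its dense embedding vector via table lookup."""
    return np.stack([embedding_table[src_vocab[w]] for w in sentence.split()])

vectors = embed("yes no maybe")  # shape: (num_words, embed_dim)
```

Unlike plain integer encoding, nearby embedding vectors can come to represent semantically similar words during training.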

5. Summary

To train an encoder-decoder model, we follow these steps:

  1. Convert the input words into embeddings.
  2. Forward-pass the embedded sequence through the encoder to produce the context vector.
  3. Forward-pass the data through the decoder, apply the softmax activation function, and take the argmax of the output over the complete (specified) vocabulary.
  4. Calculate the loss using cross-entropy with teacher forcing, and back-propagate to optimize the model using gradient descent.
  5. During test time or real-time prediction, pass the input through the same preprocessing steps, then use beam search or the greedy method to generate the output words.
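The five steps above can be sketched end-to-end in one small numpy program. All sizes and weights are illustrative assumptions (and the encoder and decoder share weight matrices purely to keep the sketch short).

```python
import numpy as np

rng = np.random.default_rng(4)
V_src, V_tgt, E, H = 3, 5, 4, 3              # vocab sizes, embed dim, hidden dim
E_src = rng.standard_normal((V_src, E))       # step 1: embedding tables
E_tgt = rng.standard_normal((V_tgt, E))
W_xh = rng.standard_normal((H, E)) * 0.1
W_hh = rng.standard_normal((H, H)) * 0.1
W_hy = rng.standard_normal((V_tgt, H)) * 0.1

def encode(ids):
    """Step 2: run the encoder, return the final hidden state (context)."""
    h = np.zeros(H)
    for i in ids:
        h = np.tanh(W_xh @ E_src[i] + W_hh @ h)
    return h

def greedy_decode(context, end_id, max_len=5):
    """Steps 3 and 5: softmax + argmax at each step, feeding back
    the model's own prediction, until END or max_len."""
    h, out, prev = context, [], 0
    for _ in range(max_len):
        h = np.tanh(W_xh @ E_tgt[prev] + W_hh @ h)
        probs = np.exp(W_hy @ h)
        probs /= probs.sum()
        prev = int(np.argmax(probs))
        out.append(prev)
        if prev == end_id:
            break
    return out

out = greedy_decode(encode([0, 1, 2]), end_id=V_tgt - 1)
```

Step 4 (the cross-entropy/teacher-forcing training loop) would wrap this with gradient updates to the weight matrices; here the weights stay random, so the output is arbitrary.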

THE END!!!

Follow me for more such content

Thank you

