Published in Analytics Vidhya

Baby steps in Neural Machine Translation Part 2 (Decoder) — Human is solving the challenge given by the God

  • Walk you through the decoder of a machine translation system
  • Guide you through the code step by step, with brief explanations
  • Provide code and data so you can train a customized machine translation system

This is the follow-up to Part 1, which covers the encoder of the machine translation system. Part 1 gives a brief explanation of the code and of the flow of tensors through the encoder. If you haven’t gone through Part 1, I strongly encourage you to read it first so that you fully understand the encoder. If you have, let’s continue with the decoder.

Decoder part of machine translation system

The decoder is very interesting, and the gist of it is the look-ahead mask. This mask stops the decoder from using future words when it predicts the current word. What does this mean?

During training, we pass the entire target sentence to the decoder as input, with a <start> label prepended, and we use the target sentence without the <start> label as the expected output of the decoder. This also means that we are using the current word to predict the next word.

But there is something fishy here… We pass the entire target sentence with the <start> label to the decoder during training, so the decoder can use future word information to predict the current word. For example, during training the decoder input is (<start>, 我, 去, 学校) and the decoder output is (我, 去, 学校, <end>). Since we pass the entire target sentence, the decoder would be trained to use future words such as “去” and “学校” to predict “我”. This does not make sense!!! At prediction time we do not know the future words. We can only predict “我” from the “<start>” information. This is solved by the look-ahead mask, which removes the future word information during training, and it does so elegantly.
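The shift between decoder input and decoder output described above can be sketched in a few lines of plain Python. The token list is just the example sentence from this article, and the variable names are illustrative, not taken from any library.

```python
# Example target sentence, already tokenised and wrapped in its two labels.
target = ["<start>", "我", "去", "学校", "<end>"]

decoder_input = target[:-1]    # drop <end>:   <start>, 我, 去, 学校
decoder_output = target[1:]    # drop <start>: 我, 去, 学校, <end>

# Each input position is trained to predict the token one step ahead.
pairs = list(zip(decoder_input, decoder_output))
for current, nxt in pairs:
    print(f"{current} -> {nxt}")
```

Both sequences have the same length, which is why the whole sentence can be trained in one pass as long as future positions are masked.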

So now we know the big picture of the decoder. From here onward, we go deeper into the flows and transformations of the tensors. To make the flows clearer, we assume that each target sentence is tokenised, indexed and wrapped with the labels (<start>, <end>), that it has a length of 26, and that we process 64 target sentences in parallel. If a target sentence has fewer than 26 indexed words, it is padded with 0 until it reaches 26.

sentence 1 : "我 去 学校" => [1, 2] => [1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0...]

So the dimension is (64, 26). This tensor then flows through the embedding layer, and position information is added, just as in the encoder explained in Part 1.
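The padding step can be sketched as below. This is a minimal illustration: the token indices are made up (real ones depend on the tokenizer vocabulary), and `pad_sentence` is a hypothetical helper, not a function from the TensorFlow code.

```python
MAX_LEN = 26   # sentence length assumed in this article
PAD_ID = 0     # index used for padding

def pad_sentence(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Right-pad a list of token indices with pad_id up to max_len."""
    if len(token_ids) > max_len:
        raise ValueError("sentence is longer than max_len")
    return token_ids + [pad_id] * (max_len - len(token_ids))

# Made-up indices standing in for <start>, three words and <end>.
sentence = [1, 5, 8, 9, 2]
padded = pad_sentence(sentence)

# 64 sentences processed in parallel give a (64, 26) batch.
batch = [padded for _ in range(64)]
print(len(batch), len(batch[0]))
```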

Positioning Embedding Formula

The decoder input tensor with position information now has the dimension (64, 26, 512). Up until now, everything is similar to the encoder. The real magic happens when the tensor flows into the masked multi-head attention.
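Here is a rough NumPy sketch of how a (64, 26) batch of indices becomes a (64, 26, 512) tensor. It assumes the sinusoidal positional encoding from “Attention Is All You Need” (sine on even feature indices, cosine on odd ones). The vocabulary size and the random embedding table are placeholders for a trained embedding layer.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal position table of shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, np.newaxis]      # (max_len, 1)
    i = np.arange(d_model)[np.newaxis, :]        # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    table = np.zeros((max_len, d_model))
    table[:, 0::2] = np.sin(angles[:, 0::2])     # even feature indices
    table[:, 1::2] = np.cos(angles[:, 1::2])     # odd feature indices
    return table

vocab_size, d_model = 1000, 512                  # vocab_size is made up
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model))   # stand-in for a trained layer
token_ids = rng.integers(0, vocab_size, size=(64, 26))

x = embedding[token_ids]                         # (64, 26, 512)
x = x + positional_encoding(26, d_model)         # broadcast over the batch
print(x.shape)
```

The position table depends only on the position and the feature index, so the same (26, 512) table is added to every sentence in the batch.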

Masked multihead-attention

As we saw earlier, multi-head attention just splits the 512 features of (64, 26, 512) into 8 groups, each holding the 26 words with 64 features, which gives (64, 8, 26, 64), and processes the groups in parallel.

  • softmax(Q*Transpose(K)) = softmax( (64, 8, 26, 64) * (64, 8, 64, 26) ) = softmax((64, 8, 26, 26)) = (64, 8, 26, 26)
  • softmax(Q*Transpose(K)) * v = (64, 8, 26, 26) * (64, 8, 26, 64) = (64, 8, 26, 64)
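The two steps above can be checked with a small NumPy sketch. It is only a shape walk-through under simplifying assumptions: the input is random, the learned projections that normally produce Q, K and V are skipped, and the 1/sqrt(64) scaling from the paper is included even though the bullets leave it out. `split_heads` and `softmax` are illustrative helpers.

```python
import numpy as np

def split_heads(x, num_heads):
    """(batch, seq, d_model) -> (batch, heads, seq, depth)."""
    batch, seq, d_model = x.shape
    x = x.reshape(batch, seq, num_heads, d_model // num_heads)
    return x.transpose(0, 2, 1, 3)

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 26, 512))       # stand-in for the embedded decoder input

# A real layer derives Q, K and V from three learned projections of x;
# the same tensor is reused here to keep the focus on the shapes.
q = k = v = split_heads(x, num_heads=8)  # (64, 8, 26, 64)

scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(64)   # (64, 8, 26, 26)
weights = softmax(scores)                            # (64, 8, 26, 26)
out = weights @ v                                    # (64, 8, 26, 64)
print(q.shape, weights.shape, out.shape)
```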

But this process is altered to include the look-ahead mask, which has the tensor dimension (64, 1, 26, 26). The mask is applied to the scores before the softmax, by subtracting a very large number (such as 1e9) wherever the mask holds a “1”, and it is broadcast across the 8 groups.

  • scores = Q*Transpose(K) = (64, 8, 26, 64) * (64, 8, 64, 26) = (64, 8, 26, 26)
  • mask(scores) = (64, 8, 26, 26) - 1e9 * (64, 1, 26, 26) = (64, 8, 26, 26)
  • softmax(mask(scores)) * v = (64, 8, 26, 26) * (64, 8, 26, 64) = (64, 8, 26, 64)

Let’s look into the details of the look-ahead mask.

padding mask and look-ahead mask

The look-ahead mask is created together with the padding mask. The example shown here is padded to a length of 10 instead of 26 for simplicity, but the idea is the same.

Let’s look at the padding mask first. This mask is very simple: it assigns “1” to every position where we padded the sentence. The purpose of this is to keep the padded words out of the loss calculation. When we train a model, we do not want to take into account the losses coming from the padded words.

The look-ahead mask is a square matrix whose side is the sentence length, here (10, 10). If our sentence length is 26, the look-ahead mask will be (26, 26). It assigns “1” wherever we want to remove information from the sentence. Look at row “w1”: we only want to keep the information of the first word and remove the information of all the other words. This is how we block the decoder from looking at future words.

Furthermore, we want to block the padded words as well. So we compare the two masks and take the maximum value at each position. Please keep in mind that “1” means “remove this particular information”. We will see later how the information is actually removed.

Find the max value between the padding mask and the look-ahead mask
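The two masks and their element-wise maximum can be sketched in NumPy as follows. The sentence is a made-up example with 4 real tokens padded to length 10, and the function names are illustrative.

```python
import numpy as np

def create_padding_mask(token_ids):
    """1 where the token is padding (index 0), shape (batch, 1, 1, seq)."""
    mask = (token_ids == 0).astype(np.float32)
    return mask[:, np.newaxis, np.newaxis, :]

def create_look_ahead_mask(size):
    """1 above the diagonal, i.e. on every future position, shape (size, size)."""
    return np.triu(np.ones((size, size), dtype=np.float32), k=1)

# One hypothetical sentence: 4 real tokens followed by 6 padded positions.
token_ids = np.array([[7, 3, 9, 4, 0, 0, 0, 0, 0, 0]])

padding_mask = create_padding_mask(token_ids)              # (1, 1, 1, 10)
look_ahead_mask = create_look_ahead_mask(10)               # (10, 10)
combined_mask = np.maximum(padding_mask, look_ahead_mask)  # (1, 1, 10, 10)

print(combined_mask[0, 0].astype(int))
```

Row “w1” of the printed matrix keeps only the first position, while the last row keeps the 4 real words and still blocks the 6 padded ones.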

Now that we know the look-ahead mask, let’s look back at the masked multi-head attention steps. The mask is applied to the scores before the softmax, by subtracting a very large number (such as 1e9) wherever the mask holds a “1”.

  • scores = Q*Transpose(K) = (64, 8, 26, 64) * (64, 8, 64, 26) = (64, 8, 26, 26)
  • mask(scores) = (64, 8, 26, 26) - 1e9 * (64, 1, 26, 26) = (64, 8, 26, 26)
  • softmax(mask(scores)) * v = (64, 8, 26, 26) * (64, 8, 26, 64) = (64, 8, 26, 64)

As we discussed in the encoder part, softmax(Q*Transpose(K)) means that we try to represent a word from the perspective of the other words. In the decoder, we try to represent the target word from the perspective of the other target words.

Recap: Word information representation from other words

But we don’t want to represent the target word using future words. For example, we don’t want to represent “我” with information from “去” and “学校”. So, wherever the look-ahead mask (64, 1, 26, 26) holds a “1”, we subtract a very large number from the matching entry of the scores Q*Transpose(K) (64, 8, 26, 26) before taking the softmax. The softmax then turns those entries into weights that are practically zero. Now we know that the target word representation does not contain future word information. This is superb!!!!!
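A small NumPy sketch shows what the mask does to the attention weights. It assumes the common approach of subtracting a large number (1e9 here) from the masked scores before the softmax. The scores are random stand-ins for Q*Transpose(K) and the sequence length is shortened to 4 so the result is easy to read.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(1, 1, seq_len, seq_len))   # stand-in for Q*Transpose(K)
look_ahead_mask = np.triu(np.ones((seq_len, seq_len)), k=1)

# Positions marked "1" are pushed to about -1e9, so softmax gives them ~0 weight.
masked_scores = scores - look_ahead_mask * 1e9
weights = softmax(masked_scores)

print(np.round(weights[0, 0], 2))
```

Every row still sums to 1, but all the weight sits on the current word and the words before it. The first row puts its whole weight on the first position.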

Then the target word representation is multiplied with the language information tensor V (64, 8, 26, 64) to regain the language information. After that, we concatenate the 8 groups back together to form 512 features (64, 26, 512). These steps are the same as in the encoder.
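Concatenating the 8 groups back into 512 features is just a transpose followed by a reshape. The helper below is an illustrative sketch and the input tensor is a placeholder for the attention output. In a full layer, a learned linear projection usually follows this step.

```python
import numpy as np

def merge_heads(x):
    """(batch, heads, seq, depth) -> (batch, seq, heads * depth)."""
    batch, heads, seq, depth = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch, seq, heads * depth)

attention_out = np.zeros((64, 8, 26, 64))   # placeholder for softmax(...) * V
merged = merge_heads(attention_out)
print(merged.shape)
```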

Now we have the output of the masked multi-head attention, which is the target word representation without future word information (64, 26, 512). The tensor then flows through a normalization process, which makes training and prediction faster. This process is explained in the previous article on the encoder.
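As a rough sketch, and assuming that the normalization here is the layer normalization used in “Attention Is All You Need”, each 512-feature vector is rescaled to zero mean and unit variance. The learned scale and shift parameters of a real layer are left out, and the input is random.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalise each feature vector (last axis) to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
attention_out = rng.normal(loc=3.0, scale=5.0, size=(64, 26, 512))
normed = layer_norm(attention_out)
print(normed.shape)
```

The shape stays (64, 26, 512); only the scale of the values changes.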

Up until now we have the output from encoder (64, 62, 512) and output from masked multi-head attention (64, 26, 512). How do we join the tensors for these 2 outputs?

The answer is shown in the red circle: use another multi-head attention module. Now Q = (64, 26, 512), which is the decoder output, and K = V = (64, 62, 512), which is the encoder output. So the following steps repeat.

  • softmax(Q*Transpose(K)) = softmax( (64, 8, 26, 64) * (64, 8, 64, 62) ) = softmax((64, 8, 26, 62)) = (64, 8, 26, 62)
  • softmax(Q*Transpose(K)) * v = (64, 8, 26, 62) * (64, 8, 62, 64) = (64, 8, 26, 64)
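The same shape walk-through works for this second attention module, where Q comes from the decoder and K and V come from the encoder. As before, this is only a sketch: the inputs are random, the learned projections are skipped, the helpers are illustrative, and the 1/sqrt(64) scaling from the paper is included.

```python
import numpy as np

def split_heads(x, num_heads):
    """(batch, seq, d_model) -> (batch, heads, seq, depth)."""
    batch, seq, d_model = x.shape
    x = x.reshape(batch, seq, num_heads, d_model // num_heads)
    return x.transpose(0, 2, 1, 3)

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
decoder_out = rng.normal(size=(64, 26, 512))   # from masked multi-head attention
encoder_out = rng.normal(size=(64, 62, 512))   # from the encoder

q = split_heads(decoder_out, 8)                # (64, 8, 26, 64)
k = split_heads(encoder_out, 8)                # (64, 8, 62, 64)
v = split_heads(encoder_out, 8)                # (64, 8, 62, 64)

scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(64)   # (64, 8, 26, 62)
weights = softmax(scores)                            # one row per target word
out = weights @ v                                    # (64, 8, 26, 64)
print(weights.shape, out.shape)
```

Each of the 26 target positions gets a row of 62 weights, one per source word, which is what lets a target word pick out the most relevant source words.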

As we have seen many times by now, softmax(Q*Transpose(K)) represents the word information from the perspective of other words. This time, we are representing the target word from the perspective of the source words. For example, we are representing “我” from the perspective of “I”, “go”, “to”, “school”. The softmax means that we want to find the most appropriate source word representation for “我”. In this example, the best source word representation would be “I”. After finding the best word representation, we need to get the language information for the source words, which is what the softmax(Q*Transpose(K)) * v step does. Finally, we get a tensor (64, 8, 26, 64) of the target words with the source word representation and language information. Another superb move!!!

Finally, the loss function used to train the entire machine translation system is just a cross-entropy function. We have reached the end of the decoder, which also marks the end of the machine translation system. If you have followed me until here, I hope that you gained a lot from reading my blog.

The code is provided by TensorFlow, and I find it very useful. If you run the code and find it difficult to understand, feel free to revisit this article. You may also refer to the “Attention Is All You Need” paper. I feel that these guys are geniuses.

If you would like to train your own model, you may use the data from Open Subtitles (OPUS). I have trained a Malay to English translation model and I would like to show some results here. Though I only trained for 20 epochs on an 80k dataset, I feel the translation is quite OK.

Input : kenapa kau tidak ikut kami sahaja ?
Predicted translation : why don't you come with us?
Real translation: why don't you just follow us ?
Input : jika aku menjadi orang kaya.
Predicted translation : if i was a kid.
Real translation: if i was a rich guy .

I hope you enjoy your modelling. See you in the next article.




Alex Yeo
