Neural Machine Translation using Simple Seq2Seq Model

Vashist Narayan Singh
Published in Analytics Vidhya
May 4, 2020 · 13 min read

Language translation plays a crucial role in this era of globalization, where countries are connected with each other in countless ways. Even though English is widely accepted as a common language of communication, many countries still prefer to communicate in their native language. And even setting the global perspective aside, most people understand things better in the language they grew up with. Because of these needs, language translation has been part of human civilization for a very long time. Before the advancement of technology, humans acted as interpreters between people; as technology evolved, we built machines to do the translation. Machine translation keeps evolving as researchers work to make it more accurate. With the advancement of deep learning and sequence-based neural networks such as RNNs and LSTMs, building a simple language translator is no longer all that difficult.

So in this tutorial blog, I will show how you can create a simple neural machine translator that translates English sentences into Marathi sentences. I have chosen Marathi because it helps me verify the translations myself, and the Hindi dataset is too small to work with. So without wasting any more time, let's get started.

Table of Contents

  1. Introduction
  2. Data Collection and Cleaning
  3. Input Creation for encoder and decoder
  4. Model building
  5. Inference Model Building
  6. Prediction
  7. Model Performance
  8. Scope of Improvement
  9. Conclusion
  10. References
  11. Connect with me

1. Introduction

In this section, I will give a short introduction to neural machine translation, also called NMT. NMT is built using a neural network, so the "N" stands for "Neural"; since a machine does the translation, the "M" stands for "Machine"; and since the task is translation, the "T" stands for "Translation". I hope you have followed so far. For readers without a machine learning or deep learning background who are wondering what this "Seq2Seq model" is, here is a short explanation. In deep learning we use neural networks to train models that make predictions, and some of these networks, such as RNNs, LSTMs, and GRUs, work on sequences as inputs, like the sequence of words in an English sentence in our case. A Seq2Seq model is built from these sequence networks. It has two parts, an encoder and a decoder, and as the names suggest, the encoder encodes our input English sentence and the decoder decodes this encoded representation into a Marathi sentence. That is a simple introduction to a simple Seq2Seq model; with further research we keep finding new techniques to improve the encoding of the input sequence, but the heart of the Seq2Seq model remains the encoder and the decoder. For more information on the encoder and decoder, you can refer to this link.

2. Data Collection and Cleaning

After this quick introduction to NMT and the Seq2Seq model, it's time to get our hands dirty with the coding part. In this section, we are going to load our raw data and perform some basic text cleaning. To get the dataset, you can refer to this link.

Before loading the data from the link given in the above paragraph, we need to import all the necessary libraries we are going to use in this project.

Import Libraries
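
Roughly, the imports needed for the rest of this post look like the following (a sketch only; the embedded gist is the source of truth and may include more):

# Assumed imports for this project (the original gist is not shown here)
import numpy as np
import pandas as pd

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.utils import plot_model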

Now that we have imported all the libraries, we can move ahead and load our raw text data file.

with open('mar.txt', 'r', encoding='utf-8') as f:  # utf-8 so the Marathi (Devanagari) text loads correctly
    data = f.read()
# len(data)

The above snippet loads the raw text file. If you look at its content, you can see that it needs cleaning, so let's get into the data cleaning process.

In the above code, I am splitting each line of the raw text and storing it in a list named uncleaned_data_list. After that, I separate the English and Marathi sentences and store them in the english_word and marathi_word lists respectively, and in the end I print the total number of English and Marathi sentences.

After that, I created a pandas DataFrame with two columns named English and Marathi and saved it as a CSV file.
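
Roughly, those two steps can be sketched as follows, assuming mar.txt is tab-separated with the English sentence first and the Marathi sentence second (the embedded gist may differ in the details):

# Split the raw text into lines and store them in uncleaned_data_list
uncleaned_data_list = data.split('\n')

# Separate the English and Marathi sentences (tab-separated format is an assumption)
english_word, marathi_word = [], []
for line in uncleaned_data_list:
    parts = line.split('\t')
    if len(parts) >= 2:  # skip empty or malformed lines
        english_word.append(parts[0])
        marathi_word.append(parts[1])

print("Total English sentences:", len(english_word))
print("Total Marathi sentences:", len(marathi_word))

# Put the pairs into a DataFrame with English and Marathi columns and save as CSV
language_data = pd.DataFrame({'English': english_word, 'Marathi': marathi_word})
language_data.to_csv('language_data.csv', index=False)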

language_data.head()
language_data.tail()

You can see the head and tail of our newly formed pandas DataFrame. The reason for representing the data in this format is that it makes handling the data more efficient, and it also lets us take advantage of pandas' tools. Now let's move further into our text cleaning process.

I think the above code section is self-explanatory, but I will still explain it briefly. In it, I am doing text preprocessing, which includes basic steps like lowercasing the text, removing punctuation, removing digits, and stripping extra whitespace.
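
A sketch of this preprocessing, producing the english_text_ and marathi_text_ lists used in the next snippet (the list names are assumptions consistent with the code below), might look like this:

import re
import string

def clean_text(text):
    """Basic cleaning as described above: lowercase, drop punctuation and digits,
    and collapse extra whitespace."""
    text = str(text).lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)             # remove digits
    text = re.sub(r'\s+', ' ', text).strip()    # normalise whitespace
    return text

english_text_ = [clean_text(x) for x in language_data['English']]
marathi_text_ = [clean_text(x) for x in language_data['Marathi']]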

# Putting the start and end words in the Marathi sentences
marathi_text_ = ["start " + x + " end" for x in marathi_text_]

The above code section seems simple, but it is one of the most important steps in neural machine translation with a Seq2Seq model. I will explain the reason for adding the "start" and "end" tags to the Marathi sentences when we build the Seq2Seq model, so please be patient until then. With this, we come to the end of the data collection and cleaning section.

3. Input Creation for encoder and decoder

At this point we have our clean and processed text dataset, and in this section I will show you how to prepare the data for the NMT model.

Before we move into the data preparation step, we will split our data into train and test sets; here I am doing a 90–10 split.
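
A minimal sketch of this 90–10 split using scikit-learn, assuming the cleaned and tagged lists from the previous section, could be:

from sklearn.model_selection import train_test_split

# 90-10 split of the cleaned (and "start"/"end"-tagged) sentence pairs
X_train, X_test, y_train, y_test = train_test_split(
    english_text_, marathi_text_, test_size=0.1, random_state=42)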

The first thing to keep in mind is that the Seq2Seq model wants all input sequences to be of the same length. One way to achieve this is to calculate the length of each sentence in both English and Marathi and then take the maximum length, i.e. the longest sentence, in each language. So in the end we have two max lengths, one for the English sentences and one for the Marathi sentences, as you can see in the code section. I will show how to use these max lengths to make all English and Marathi sequences the same length in later sections. After this, since machines understand only numbers and not text, we need a way to convert the input text sequences into numbers, and one such way is to index the words of the sentences. We can do this indexing using the "Tokenizer" class from Keras. We also need the vocabulary size of both the English and Marathi corpora, which is required when we create the input data for model training. Everything I have just explained is done in the above code block.
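
A sketch of the max-length and tokenizer steps might look like the following; tokenizer_input (for English) matches the name used in the prediction loop later, while tokenizer_output (for Marathi) and the vocabulary-size names are assumptions:

# Longest sentence (in words) in each language
max_length_english = max(len(sent.split()) for sent in english_text_)
max_length_marathi = max(len(sent.split()) for sent in marathi_text_)

# Word-level tokenizers that index every word in each corpus
tokenizer_input = Tokenizer()
tokenizer_input.fit_on_texts(english_text_)
tokenizer_output = Tokenizer()
tokenizer_output.fit_on_texts(marathi_text_)

# Vocabulary sizes (+1 because Keras reserves index 0 for padding)
vocab_size_english = len(tokenizer_input.word_index) + 1
vocab_size_marathi = len(tokenizer_output.word_index) + 1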

The code below is long, but don't be scared; I am going to explain it completely.

As you can see, I have created a function called "generator_batch" which accepts three parameters, "X", "Y", and "batch_size", and you can figure out for yourself what these parameters mean. The first thing in the function is a "while True", i.e. a while loop that is set to run forever. Fine, I can understand up to here, but what is happening inside the loop? The same question came to my mind when I first saw this code in one of the blogs I used as a reference for this project. Actually, it's not that complex to understand. We run a "for loop" from 0 to the length of X (the training data), stepped by batch_size, which is 128 in this case. Inside the loop I have three variables, "encoder_data_input", "decoder_data_input", and "decoder_target_input". Let's look at each in more detail. I create a matrix of all zeros with the number of rows equal to "batch_size" and the number of columns equal to "max_length_english", and we do something similar for the decoder input. But notice that the decoder target data is 3-dimensional, because the decoder output is 3-dimensional. But wait, I told you that the decoder is what gives you the translated sentence as output, so why have we created a decoder input as well? This is because it makes training faster, and the method is called the "teacher forcing" technique. In teacher forcing we pass the target data as the input to the decoder; for example, if the decoder is supposed to predict 'the', then we also pass 'the' to the decoder's input. You can relate it to the actual teacher-student learning process. Hence it makes the learning process faster.

So far we have created the three matrices of zeros. In the nested for loop I take the index, the X_train sentence, and the Y_train sentence one by one; inside it there is another for loop which takes each word of the X_train sentence from the outer loop along with its index. I then fill the encoder_data_input zero matrix with the index of each word in the input sentence. Here the outer-loop index "i" moves row-wise and the inner-loop index "t" moves column-wise, and as you can observe, all the input sequences end up with the same length because I set the number of columns equal to max_length_english; if an input sequence is shorter than max_length_english, the rest of its columns stay padded with zeros. A similar thing is done in the for loop for decoder_input_data, but not for decoder_target_input. Remember that we padded the "start" and "end" tags onto the target Marathi sentences, so at t = 0 the word is "start", and we do not want this word in the target output sequence; hence we start filling decoder_target_input from t > 0, shifted one step ahead.
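
Since the generator gist is embedded rather than shown here, here is a minimal sketch of such a batch generator following the description above (the one-hot target shape and the tokenizer names from the previous sketch are assumptions):

def generator_batch(X, Y, batch_size=128):
    """Yield ([encoder_input, decoder_input], decoder_target) batches forever."""
    while True:
        for j in range(0, len(X), batch_size):
            encoder_data_input = np.zeros((batch_size, max_length_english), dtype='float32')
            decoder_data_input = np.zeros((batch_size, max_length_marathi), dtype='float32')
            decoder_target_input = np.zeros(
                (batch_size, max_length_marathi, vocab_size_marathi), dtype='float32')

            for i, (input_text, target_text) in enumerate(
                    zip(X[j:j + batch_size], Y[j:j + batch_size])):
                # Row i, column t: index of the t-th English word (rest stays zero-padded)
                for t, word in enumerate(input_text.split()):
                    encoder_data_input[i, t] = tokenizer_input.word_index[word]
                for t, word in enumerate(target_text.split()):
                    # Decoder input keeps the "start" token (teacher forcing)
                    decoder_data_input[i, t] = tokenizer_output.word_index[word]
                    if t > 0:
                        # Target is shifted one step ahead, so "start" is excluded;
                        # it is one-hot encoded over the Marathi vocabulary
                        decoder_target_input[i, t - 1, tokenizer_output.word_index[word]] = 1.0

            yield [encoder_data_input, decoder_data_input], decoder_target_input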

So far, I have created the inputs and outputs for the Seq2Seq model, and by using yield I send the data batch-wise instead of all at once. In the following section, I will explain the reason for appending the "start" and "end" tags to the target sentences, describe the working of the encoder and decoder in detail, and show the model code as well.

4. Model building

Encoder and Decoder

The above figure is a representation of a simple encoder and decoder; the horizontal axis is the time step, and 'A', 'B', 'C' are the words of the input sequence. Now suppose we have 10 input sentences stored in our encoder_data_input matrix. Each row in the matrix is one sentence, so we have 10 rows and max_length_english columns, and say X1, X2, …, X10 represent the sequences.

Now,

For X1 say we have 3 words in the sequence then,

at t = 0 we pass w1 into the encoder and take its state

at t = 1 we pass w2 into the encoder along with the state at t = 0

at t = 2 we pass w3 into the encoder along with the state at t = 1

Now, at the end of the sequence, we take the encoder's output and its state at t = 2 and pass them to the decoder; then the decoder's output at t = 0 is passed as the decoder's input at t = 1, and so on, until the end of the sequence is reached.

Now, the reason for appending "start" and "end" to the decoder input is that the "start" tag initiates the decoding, and the "end" tag signals the decoder to stop the decoding process. Without the "start" tag the decoder would never be able to produce the first word, because the decoder predicts the next word from the word it is given as input; and without the "end" tag the prediction would never stop, because the decoder would not know when to stop.

The above code does exactly what I explained before and what is shown in the encoder-decoder image. Try to understand the code with reference to that image.
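
Since the model-building gist is embedded rather than shown here, a minimal sketch of such an encoder-decoder training model might look like this; the embedding size and the latent dimension of 50 are assumptions (the value 50 matches the latent_dim comment in the inference code below):

latent_dim = 50  # LSTM units; assumed from the comment in the inference code below

# Encoder: embed the English word indices and keep only the LSTM's final states
encoder_input = Input(shape=(None,))
encoder_emb = Embedding(vocab_size_english, latent_dim)(encoder_input)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_emb)
encoder_states = [state_h, state_c]

# Decoder: embed the Marathi word indices (teacher forcing) and run an LSTM
# whose initial state is the encoder's final state
decoder_input = Input(shape=(None,))
decoder_emb = Embedding(vocab_size_marathi, latent_dim)(decoder_input)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_output, _, _ = decoder_lstm(decoder_emb, initial_state=encoder_states)

# Dense softmax layer: probability distribution over the Marathi vocabulary
decoder_dense = Dense(vocab_size_marathi, activation='softmax')
decoder_output = decoder_dense(decoder_output)

model = Model([encoder_input, decoder_input], decoder_output)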

plot_model(model, to_file='train_model.png', show_shapes=True)
Training Model
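
A sketch of compiling and fitting the model with the batch generator from earlier might look like this; the optimizer and batch size are assumptions, and the 50 epochs match the figure mentioned in the Model Performance section:

batch_size = 128
epochs = 50

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(
    generator_batch(X_train, y_train, batch_size=batch_size),
    steps_per_epoch=len(X_train) // batch_size,
    validation_data=generator_batch(X_test, y_test, batch_size=batch_size),
    validation_steps=len(X_test) // batch_size,
    epochs=epochs)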

In the above code section, I save the trained model and load it back. Please note one thing: save the model exactly like this and not in any other way, because later we have to create the inference encoder and decoder to make predictions, and those models reuse the trained model's layers.
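
A minimal sketch of the save-and-load step (the file name is an assumption) could be:

# Save the trained model and load it back; the inference models below
# reuse layers from this loaded model, which is why it must be saved this way.
model.save('nmt_model.h5')               # file name is an assumption
model_loaded = load_model('nmt_model.h5')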

5. Inference Model Building

The section below contains the code of the inference model; this model is used to make predictions.

encoder inference model
decoder inference model

The encoder inference model is similar to the training model; the real difference is in the decoder inference model. If you look at the encoder-decoder diagram, you can see that the states of the previous time step are passed to the next time step, so we need a way to preserve the states of the previous time step. The code below in the decoder inference model does exactly that.
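
Before the decoder code, here is a minimal sketch of the encoder inference model, since that gist is embedded rather than shown here; the layer indices are assumptions about the loaded model's layer order, mirroring the indices used for the decoder below:

# Encoder inference model: reuse the trained encoder layers and return the
# final hidden and cell states, which seed the decoder at its first time step
encoder_input_inf = model_loaded.input[0]                    # trained encoder input layer
encoder_emb_inf = model_loaded.layers[2](encoder_input_inf)  # trained encoder embedding (index assumed)
_, encoder_state_h_inf, encoder_state_c_inf = model_loaded.layers[4](encoder_emb_inf)  # trained encoder LSTM (index assumed)
encoder_model = Model(encoder_input_inf, [encoder_state_h_inf, encoder_state_c_inf])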

# inference decoder
# The following tensors hold the states of the previous time step,
# starting with the encoder's final-time-step states
decoder_state_h_input = Input(shape=(latent_dim,))  # because during training we set the LSTM to 50 units
decoder_state_c_input = Input(shape=(latent_dim,))
decoder_state_input = [decoder_state_h_input, decoder_state_c_input]

Then we create the decoder inference model. To create it, we first take the input layer of the trained decoder model. After that, we reuse the same layers as in the training decoder model; the only difference is that the initial state of the decoder LSTM is set to the states of the previous time step, as we can see in the encoder-decoder diagram above. The rest of the inference model is similar to the training decoder model.

# inference decoder input
decoder_input_inf = model_loaded.input[1]  # trained decoder input layer
# decoder_input_inf._name = 'decoder_input'
decoder_emb_inf = model_loaded.layers[3](decoder_input_inf)  # trained decoder embedding layer
decoder_lstm_inf = model_loaded.layers[5]                    # trained decoder LSTM layer
decoder_output_inf, decoder_state_h_inf, decoder_state_c_inf = decoder_lstm_inf(
    decoder_emb_inf, initial_state=decoder_state_input)
decoder_state_inf = [decoder_state_h_inf, decoder_state_c_inf]
# inference dense layer
dense_inf = model_loaded.layers[6]
decoder_output_final = dense_inf(decoder_output_inf)  # dense softmax layer: prob. dist. over the target vocabulary

decoder_model = Model([decoder_input_inf] + decoder_state_input,
                      [decoder_output_final] + decoder_state_inf)

6. Prediction

Now that we have trained the Seq2Seq model and created the inference models from it, it's time to make predictions using the following code.

The above function returns the translated sentence. It first encodes the input sentence and takes the state values of the last time step of the encoder. Then we create a single-value matrix called "target_seq" which holds the index of the next predicted Marathi word; to start the decoding process we need to pass the first word, "start" in this case, so at the beginning we store the index of the word "start" in "target_seq". Since we have to carry out the decoding process until we reach the "end" word, we call the decoder inference model in a loop. In the loop we pass "target_seq" and the encoder's last-time-step "states" as the input to the decoder inference model. The decoder_model returns the next word index and the decoder states; after this, we convert that index back into a word and append it to the string variable "decoder_sentance", then we put the newly generated word index into "target_seq" and update "state_value" with the decoder model's states. This process continues until we get the predicted word "end". Hence, at the end of the while loop, we get the complete translated sentence. The following is some of the output I get.
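
Since the decode_seq gist is embedded rather than shown here, a minimal sketch of such a decoding loop, following the description above, might look like this; the reverse word-index lookup and the length cut-off are assumptions, and the function name matches the call in the loop below:

# Map word indices back to Marathi words for building the output string
reverse_target_word_index = {i: w for w, i in tokenizer_output.word_index.items()}

def decode_seq(input_seq):
    # Encode the input sentence and take the encoder's last-time-step states
    state_value = encoder_model.predict(input_seq)

    # Start decoding with the index of the "start" word
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = tokenizer_output.word_index['start']

    decoder_sentance = ''
    while True:
        output_tokens, h, c = decoder_model.predict([target_seq] + state_value)

        # Pick the most probable next word and append it to the sentence
        sampled_index = int(np.argmax(output_tokens[0, -1, :]))
        sampled_word = reverse_target_word_index.get(sampled_index, '')
        decoder_sentance += ' ' + sampled_word

        # Stop once "end" is predicted or the sentence gets too long
        if sampled_word == 'end' or len(decoder_sentance.split()) > max_length_marathi:
            break

        # Feed the predicted index back in and carry the states forward
        target_seq[0, 0] = sampled_index
        state_value = [h, c]

    return decoder_sentance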

for i in range(30):
    sentance = X_test[i]
    original_target = y_test[i]
    input_seq = tokenizer_input.texts_to_sequences([sentance])
    pad_sequence = pad_sequences(input_seq, maxlen=30, padding='post')
    # print('input_sequence =>', input_seq)
    # print("pad_seq =>", pad_sequence)
    predicted_target = decode_seq(pad_sequence)
    print("Test sentence: ", i + 1)
    print("sentence: ", sentance)
    print("original translation:", original_target[6:-4])
    print("predicted translation:", predicted_target[:-4])
    print("==" * 50)

7. Model Performance

The performance of the model is good considering the amount of data used to train it: I got an accuracy of 54.8% for a model trained on 34,825 sentence pairs for 50 epochs. A limitation of the simple Seq2Seq model is that it cannot translate lengthy sentences very well, and this model shares that limitation.

8. Scope of Improvement

Since this is just a simple model meant to show you how to create a neural machine translator, and since no solution is perfect and everything keeps improving with time, this is not the only solution; you can improve it further as per your own understanding and skills.

Here are a few things I can suggest to improve this model further:

  1. We can train the model on a larger dataset with more variation in it.
  2. To address the lengthy-sentence limitation, we can add an attention mechanism.
  3. We can try replacing the LSTM with a GRU and compare the model's performance.

9. Conclusion

This is the conclusion section of the blog. This blog showed you what neural machine translation is all about, why translation is needed in this world, and how you can create an NMT yourself using deep learning; you have also learned about the Seq2Seq model in detail. In the next blog, I will show you how to deploy this NMT and create a REST API using Flask so that you can access your model from any application. The following is a video of the end product we are going to build in the next blog. So follow me on Medium so that you receive a notification whenever my next blog gets published.

Next blog end product demo

10. References

11. Connect with me

You can connect with me on LinkedIn.

You can find the complete code of this blog on my GitHub.
