A Must-Read NLP Tutorial on Neural Machine Translation — The Technique Powering Google Translate

Published in Analytics Vidhya · 7 min read · Jan 31, 2019

“If you talk to a man in a language he understands, that goes to his head. If you talk to him in his own language, that goes to his heart.” — Nelson Mandela

The beauty of language transcends boundaries and cultures. Learning a language other than our mother tongue is a huge advantage. But the path to bilingualism, or multilingualism, can often be a long, never-ending one.

There are so many little nuances that it is easy to get lost in a sea of words. Online translation services have, however, made things much easier (I’m looking at you, Google Translate!).

I have always wanted to learn a language other than English. I tried my hand at learning German (or Deutsch), back in 2014. It was both fun and challenging. I had to eventually quit but I harboured a desire to start again.

Fast-forward to 2019, I am fortunate to be able to build a language translator for any possible pair of languages. What a boon Natural Language Processing has been!

In this article, we will walk through the steps of building a German-to-English language translation model using Keras. We’ll also take a quick look at the history of machine translation systems with the benefit of hindsight.

This article assumes familiarity with RNNs, LSTMs, and Keras.

Table of Contents

Understanding the Problem Statement
Introduction to Sequence-to-Sequence (Seq2Seq) Modeling
Implementation in Python using Keras

Understanding the Problem Statement

Let’s circle back to where we left off in the introduction section, i.e., learning German. However, this time around I am going to make my machine do this task. The objective is to convert a German sentence to its English counterpart using a Neural Machine Translation (NMT) system.

We will use German-English sentence pairs from http://www.manythings.org/anki/, where the dataset is available for download.

Introduction to Sequence-to-Sequence (Seq2Seq) Modeling

Sequence-to-Sequence (seq2seq) models are used for a variety of tasks, such as text summarization, speech recognition, and DNA sequence modeling. Our aim is to translate given sentences from one language to another.

Here, both the input and the output are sentences. In other words, a sequence of words goes into the model and another sequence of words comes out. This is the basic idea of Sequence-to-Sequence modeling. The figure below tries to explain this method.

A typical seq2seq model has two major components:

a) an encoder
b) a decoder

Both these parts are essentially two different recurrent neural network (RNN) models combined into one giant network:

I’ve listed a few significant use cases of Sequence-to-Sequence modeling below (apart from Machine Translation, of course):

Speech Recognition
Named Entity/Subject Extraction to identify the main subject from a body of text
Relation Classification to tag relationships between the entities identified in the above step
Chatbot skills to have conversational ability and engage with customers
Text Summarization to generate a concise summary of a large amount of text
Question Answering systems

Implementation in Python using Keras

It’s time to get our hands dirty! There is no better feeling than learning a topic by seeing the results first-hand. We’ll fire up our favorite Python environment (Jupyter Notebook for me) and get straight down to business.

Import the Required Libraries

Read the Data into our IDE

Our data is a text file (.txt) of English-German sentence pairs. First, we will read the file using the function defined below.
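
A minimal sketch of such a reader is given below. The name read_text matches the call used later in this article; treating the file as UTF-8 is an assumption made here so that German characters such as ä, ö, ü and ß are read correctly.

```python
def read_text(filename):
    """Return the entire contents of a text file as one string."""
    # Assumption: the file is UTF-8 encoded.
    with open(filename, mode="rt", encoding="utf-8") as file:
        return file.read()
```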

Let’s define another function that splits the text into lines on ‘\n’, one sentence pair per line, and then splits each pair into its English sentence and its German sentence.
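
Here is a rough sketch of that function, assuming the English and German sentences on each line are separated by a tab character. Keeping only the first two fields is my own precaution, in case a download carries an extra trailing column; the sample string is made up for illustration.

```python
def to_lines(text):
    """Split raw text into [english, german] pairs, one pair per line."""
    lines = text.strip().split("\n")
    # Assumption: fields within a line are tab-separated. Only the first
    # two fields are kept, so any extra trailing column is ignored.
    return [line.split("\t")[:2] for line in lines]

# Made-up sample in the assumed format.
sample = "Hi.\tHallo!\nThanks.\tDanke.\n"
pairs = to_lines(sample)
```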

We can now use these functions to read the text into an array in our desired format.

data = read_text("deu.txt") 
deu_eng = to_lines(data) 
deu_eng = array(deu_eng)

The actual data contains over 150,000 sentence-pairs. However, we will use only the first 50,000 sentence pairs to reduce the training time of the model. You can change this number as per your system’s computation power (or if you’re feeling lucky!).

deu_eng = deu_eng[:50000,:]

Text Pre-Processing

Pre-processing is an important step in any project, and especially so in NLP. The data we work with is more often than not unstructured, so there are certain things we need to take care of before jumping to the model building part.

(a) Text Cleaning

Let’s first take a look at our data. This will help us decide which pre-processing steps to adopt.

deu_eng

We will get rid of the punctuation marks and then convert all the text to lower case.
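
The two cleaning steps can be sketched on plain Python strings as follows. The helper name clean_sentence and the sample pair are hypothetical. Note that string.punctuation only covers ASCII punctuation, so typographic quotes would survive this step.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_sentence(sentence):
    """Strip ASCII punctuation, then lower-case the sentence."""
    return sentence.translate(_PUNCT_TABLE).lower()

# Made-up sentence pair, for illustration only.
pairs = [["I'm tired of Boston.", "Ich habe Boston satt."]]
cleaned = [[clean_sentence(s) for s in pair] for pair in pairs]
```

The same function would be applied to every English and German sentence in the array.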

(b) Text to Sequence Conversion

A Seq2Seq model requires that we convert both the input and the output sentences into integer sequences of fixed length.

But before we do that, let’s visualise the length of the sentences. We will capture the lengths of all the sentences in two separate lists for English and German, respectively.
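
One simple way to capture these lengths, assuming a sentence’s length is its number of whitespace-separated words, is sketched below. The sentences are made up; on the real data, the two lists could then be plotted as histograms.

```python
def sentence_lengths(sentences):
    """Number of whitespace-separated words in each sentence."""
    return [len(s.split()) for s in sentences]

# Made-up cleaned sentences, not taken from the dataset.
eng = ["hi", "im tired of boston"]
deu = ["hallo", "ich habe genug von boston"]

eng_l = sentence_lengths(eng)
deu_l = sentence_lengths(deu)
eng_max, deu_max = max(eng_l), max(deu_l)
```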

Quite intuitive — the maximum length of the German sentences is 11 and that of the English phrases is 8.

Next, we will vectorize our text data using Keras’s Tokenizer() class. It will turn our sentences into sequences of integers. We can then pad those sequences with zeros so that all of them have the same length.

Note that we will prepare tokenizers for both the German and English sentences:
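
To make the idea concrete, here is a dependency-free stand-in for the word-to-integer mapping a tokenizer builds. It is not Keras’s implementation, and the name build_word_index is hypothetical. It follows the same convention of giving the most frequent word the index 1 and leaving 0 free for padding.

```python
from collections import Counter

def build_word_index(lines):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for line in lines for word in line.split())
    # Index 0 is left unused so that it can serve as the padding value.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# One index per language would be built; English is shown here.
eng_index = build_word_index(["im tired", "im here"])
eng_vocab_size = len(eng_index) + 1   # +1 accounts for the padding index 0
```

The vocabulary size computed this way is the kind of number the embedding and output layers of the model need later on.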

The below code block contains a function to prepare the sequences. It will also perform sequence padding to a maximum sentence length as mentioned above.
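
A pure-Python sketch of such a function is shown below. It takes a plain word-to-index dictionary rather than a Keras tokenizer, pads at the end of each sequence, and cuts over-long sentences at the end. All three are assumptions made for this illustration.

```python
def encode_sequences(word_index, length, lines):
    """Turn sentences into integer lists of exactly `length` items."""
    encoded = []
    for line in lines:
        # Words missing from the index are skipped in this sketch.
        ids = [word_index[w] for w in line.split() if w in word_index]
        ids = ids[:length]                   # cut sentences that are too long
        ids += [0] * (length - len(ids))     # pad the tail with zeros
        encoded.append(ids)
    return encoded

# Hypothetical word index and sentences.
word_index = {"im": 1, "tired": 2, "of": 3, "boston": 4}
seqs = encode_sequences(word_index, 6, ["im tired of boston", "im tired"])
```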

Model Building

We will now split the data into a train set and a test set, for model training and evaluation respectively.
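
A split of this kind can be sketched with the standard library alone. The 20% test fraction, the seed, and the helper name train_test_split_pairs are assumptions for this example; a library routine such as scikit-learn’s train_test_split does the same job.

```python
import random

def train_test_split_pairs(pairs, test_fraction=0.2, seed=12):
    """Shuffle a copy of `pairs` and split it into (train, test) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Made-up sentence pairs.
pairs = [[f"en {i}", f"de {i}"] for i in range(10)]
train, test = train_test_split_pairs(pairs)
```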

It’s time to encode the sentences. We will encode German sentences as the input sequences and English sentences as the target sequences. This has to be done for both the train and test datasets.

Now comes the exciting part!

We’ll start off by defining our Seq2Seq model architecture:

For the encoder, we will use an embedding layer and an LSTM layer
For the decoder, we will use another LSTM layer followed by a dense layer
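
The architecture described above might look roughly like this in Keras. This is a sketch under assumptions, not necessarily the exact model behind the results in this article: the function name, its parameters, the RepeatVector bridge between encoder and decoder, the softmax activation and mask_zero=True are all my choices.

```python
def build_model(in_vocab, out_vocab, out_timesteps, units):
    """Encoder-decoder sketch: Embedding + LSTM, then LSTM + Dense."""
    # Imported here so that defining the function does not itself need Keras.
    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, RepeatVector, Dense

    model = Sequential()
    # Encoder: embed the German word ids (0 is treated as padding), then
    # compress the whole sentence into a single vector of size `units`.
    model.add(Embedding(in_vocab, units, mask_zero=True))
    model.add(LSTM(units))
    # Assumption: the sentence vector is repeated once per output step
    # so that the decoder LSTM receives an input at every position.
    model.add(RepeatVector(out_timesteps))
    # Decoder: one hidden state per output position, followed by a
    # softmax over the English vocabulary at each position.
    model.add(LSTM(units, return_sequences=True))
    model.add(Dense(out_vocab, activation="softmax"))
    return model
```

Here in_vocab and out_vocab would be the German and English vocabulary sizes, and out_timesteps the padded length of the English sequences.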

We are using the RMSprop optimizer in this model as it’s usually a good choice when working with recurrent neural networks.

Please note that we have used ‘sparse_categorical_crossentropy’ as the loss function. This is because the function allows us to use the target sequence as is, as integer word indices, instead of the one-hot encoded format. One-hot encoding the target sequences using such a huge vocabulary might consume our system’s entire memory.
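
A quick back-of-the-envelope calculation shows why. The sizes below are hypothetical (40,000 training sentences and a 6,000-word English vocabulary are assumptions; 8 is the maximum English length mentioned earlier), and 4 bytes per value corresponds to float32 or int32 storage.

```python
def one_hot_bytes(n_sentences, timesteps, vocab_size, bytes_per_value=4):
    """Memory for targets stored as one-hot vectors."""
    return n_sentences * timesteps * vocab_size * bytes_per_value

def sparse_bytes(n_sentences, timesteps, bytes_per_value=4):
    """Memory for targets stored as plain integer ids."""
    return n_sentences * timesteps * bytes_per_value

# Hypothetical sizes: 40,000 sentences, 8 output steps, 6,000 words.
dense = one_hot_bytes(40_000, 8, 6_000)   # 7,680,000,000 bytes (about 7.7 GB)
sparse = sparse_bytes(40_000, 8)          # 1,280,000 bytes (about 1.3 MB)
ratio = dense // sparse                   # equals the vocabulary size
```

Whatever the exact numbers, the one-hot format is larger by a factor equal to the vocabulary size.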

We are all set to start training our model!

We will train it for 30 epochs with a batch size of 512 and a validation split of 20%: 80% of the training data will be used to fit the model and the remaining 20% to validate it after each epoch. You may change and play around with these hyperparameters.

We will also use the ModelCheckpoint() callback to save the model with the lowest validation loss. I personally prefer this method over early stopping.
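
Putting the optimizer, loss, hyperparameters and checkpointing together, a training routine could be sketched as follows. The function name train_model, the "rmsprop" string shortcut, and the monitor, mode and verbose settings are assumptions. The filename is left as a required argument because, as far as I know, the file extensions Keras accepts differ between versions.

```python
def train_model(model, trainX, trainY, filename):
    """Compile and fit `model`, saving it whenever val_loss reaches a new low."""
    # Imported here so that defining the function does not itself need Keras.
    from keras.callbacks import ModelCheckpoint

    model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")

    # Save only when the validation loss improves on the best value so far.
    checkpoint = ModelCheckpoint(filename, monitor="val_loss",
                                 save_best_only=True, mode="min", verbose=1)

    # trainX and trainY are assumed to be NumPy integer arrays of shape
    # (n_sentences, timesteps); the targets get a trailing axis of size 1.
    targets = trainY.reshape(trainY.shape[0], trainY.shape[1], 1)
    return model.fit(trainX, targets,
                     epochs=30, batch_size=512, validation_split=0.2,
                     callbacks=[checkpoint], verbose=1)
```

The object returned by fit() carries the per-epoch losses in its history attribute, which is what the loss comparison relies on.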

Let’s compare the training loss and the validation loss.

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['train','validation'])
plt.show()

As you can see in the above plot, the validation loss stopped decreasing after 20 epochs.

Finally, we can load the saved model and make predictions on the unseen data — testX.
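
Assuming the model ends in a softmax layer, its raw output for each sentence is one probability distribution per output position. A small sketch of turning those into word ids by picking the most likely id at each position is given below. The helper name argmax_ids and the toy numbers are made up; on real arrays, NumPy’s argmax over the last axis does the same thing.

```python
def argmax_ids(prob_rows):
    """For each output position, pick the index with the highest probability."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob_rows]

# Toy prediction for one sentence: 3 output positions, vocabulary of 4 ids.
probs = [
    [0.1, 0.7, 0.1, 0.1],   # position 0 -> id 1
    [0.2, 0.1, 0.6, 0.1],   # position 1 -> id 2
    [0.9, 0.0, 0.1, 0.0],   # position 2 -> id 0 (padding)
]
ids = argmax_ids(probs)
```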

These predictions are sequences of integers. We need to convert these integers to their corresponding words. Let’s define a function to do this:

Convert predictions into text (English):
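
Both steps, the id-to-word lookup and the conversion of whole predictions, can be sketched with a plain dictionary. The names get_word, ids_to_text and index_word are hypothetical, and the word index is a toy example. Ids without a word, such as the padding id 0, are simply dropped.

```python
def get_word(n, index_word):
    """Look up the word for id `n`; returns None for padding or unknown ids."""
    return index_word.get(n)

def ids_to_text(ids, index_word):
    """Join the words for a sequence of ids, skipping ids without a word."""
    words = (get_word(n, index_word) for n in ids)
    return " ".join(w for w in words if w is not None)

# Toy word index; inverting it gives the id -> word lookup.
word_index = {"im": 1, "tired": 2, "of": 3, "boston": 4}
index_word = {i: w for w, i in word_index.items()}
preds_text = [ids_to_text(seq, index_word) for seq in [[1, 2, 3, 4, 0, 0]]]
```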

Let’s put the original English sentences in the test dataset and the predicted sentences in a dataframe:

pred_df = pd.DataFrame({'actual': test[:, 0], 'predicted': preds_text})

We can randomly print some actual vs predicted instances to see how our model performs:

# print 15 rows randomly 
pred_df.sample(15)

Our Seq2Seq model does a decent job. But there are several instances where it misses out on understanding the key words. For example, it translates “im tired of boston” to “im am boston”.

These are the challenges you will face on a regular basis in NLP. But these aren’t immovable obstacles. We can mitigate such challenges by using more training data and building a better (or more complex) model.

Access the full code here.

End Notes

Even with a very simple Seq2Seq model, the results are pretty encouraging. We can likely improve on this performance by using a more sophisticated encoder-decoder model on a larger dataset.

Another experiment I can think of is trying out the seq2seq approach on a dataset containing longer sentences. The more you experiment, the more you’ll learn about this vast and complex space.

Feel free to reach out to me at prateekjoshi565@gmail.com for 1–1 discussions.

Originally published at www.analyticsvidhya.com on January 31, 2019.