A Must-Read NLP Tutorial on Neural Machine Translation — The Technique Powering Google Translate

Prateek Joshi
Jan 31 · 7 min read

Table of Contents

  1. Understanding the Problem Statement
  2. Introduction to Sequence-to-Sequence Prediction
  3. Implementation in Python using Keras

Understanding the Problem Statement

Let’s circle back to where we left off in the introduction section, i.e., learning German. However, this time around I am going to make my machine do this task. The objective is to convert a German sentence to its English counterpart using a Neural Machine Translation (NMT) system.

Introduction to Sequence-to-Sequence (Seq2Seq) Modeling

Sequence-to-Sequence (seq2seq) models are used for a variety of NLP tasks, such as text summarization, speech recognition, and DNA sequence modeling, among others. A few other applications where these models come in handy:

  • Named Entity/Subject Extraction to identify the main subject from a body of text
  • Relation Classification to tag relationships between the entities identified in the step above
  • Chatbot skills to have conversational ability and engage with customers
  • Text Summarization to generate a concise summary of a large amount of text
  • Question Answering systems

Our aim here is to translate given sentences from one language to another.

Implementation in Python using Keras

It’s time to get our hands dirty! There is no better feeling than learning a topic by seeing the results first-hand. We’ll fire up our favorite Python environment (Jupyter Notebook for me) and get straight down to business.

Import the Required Libraries
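
The import gist from the original post isn't reproduced here; a minimal set of imports consistent with the snippets that follow (assuming standalone Keras — with TensorFlow 2 these modules live under tensorflow.keras):

import string
from numpy import array, argmax
import pandas as pd
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, RepeatVector
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences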

Read the Data into our IDE

Our data is a text file (.txt) of English-German sentence pairs. First, we will read the file using the functions defined below.
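
These helpers were embedded as a gist in the original post; a minimal sketch, assuming a UTF-8 file with one tab-separated sentence pair per line:

# read the raw file into a single string
def read_text(filename):
    with open(filename, mode='rt', encoding='utf-8') as f:
        return f.read()

# split into lines, then split each line on tabs into [English, German] pairs
def to_lines(text):
    sents = text.strip().split('\n')
    return [line.split('\t') for line in sents]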

data = read_text("deu.txt")
deu_eng = to_lines(data)
deu_eng = array(deu_eng)

# keep only the first 50,000 sentence pairs to limit training time
deu_eng = deu_eng[:50000,:]

Text Pre-Processing

This is quite an important step in any project, and especially so in NLP. The data we work with is more often than not unstructured, so there are certain things we need to take care of before jumping to the model-building part.
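
The cleaning gist is not reproduced here; a minimal sketch of the usual steps for this dataset — stripping punctuation and lowercasing both columns (an assumption about what the original code did):

# remove punctuation from the English (column 0) and German (column 1) sentences
deu_eng[:,0] = [s.translate(str.maketrans('', '', string.punctuation)) for s in deu_eng[:,0]]
deu_eng[:,1] = [s.translate(str.maketrans('', '', string.punctuation)) for s in deu_eng[:,1]]

# convert everything to lowercase
for i in range(len(deu_eng)):
    deu_eng[i,0] = deu_eng[i,0].lower()
    deu_eng[i,1] = deu_eng[i,1].lower()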

# take a look at the cleaned sentence pairs
deu_eng

Model Building

We will now split the data into training and test sets for model training and evaluation, respectively.
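
The split itself appeared as a gist in the original post; a minimal sketch using scikit-learn's train_test_split (the 80/20 ratio and the seed are assumptions):

from sklearn.model_selection import train_test_split

# hold out 20% of the sentence pairs for evaluation
train, test = train_test_split(deu_eng, test_size=0.2, random_state=12)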

  • For the encoder, we will use an embedding layer followed by an LSTM layer
  • For the decoder, we will use another LSTM layer followed by a dense layer, as sketched below
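
Putting those pieces together, a sketch of the model. Note that deu_vocab_size, eng_vocab_size, deu_length, eng_length, trainX, and trainY are hypothetical names for the tokenizer vocabulary sizes and the padded, integer-encoded arrays — they are not defined in the excerpts above:

def define_model(in_vocab, out_vocab, in_timesteps, out_timesteps, units):
    model = Sequential()
    # encoder: embedding + LSTM compresses the source sentence into a fixed-size vector
    model.add(Embedding(in_vocab, units, input_length=in_timesteps, mask_zero=True))
    model.add(LSTM(units))
    # feed that vector to the decoder once per output timestep
    model.add(RepeatVector(out_timesteps))
    # decoder: LSTM + dense softmax over the target vocabulary
    model.add(LSTM(units, return_sequences=True))
    model.add(Dense(out_vocab, activation='softmax'))
    # sparse targets save us from one-hot encoding the whole vocabulary
    model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
    return model

model = define_model(deu_vocab_size, eng_vocab_size, deu_length, eng_length, 512)

# train, keeping the history object used for the loss plot below
history = model.fit(trainX, trainY.reshape(trainY.shape[0], trainY.shape[1], 1),
                    epochs=30, batch_size=512, validation_split=0.2)
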
# plot training vs. validation loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['train', 'validation'])
plt.show()
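
The decoding step that produces preds_text was another embedded gist; a sketch, assuming eng_tokenizer and testX are the target-language Tokenizer and the encoded test inputs (hypothetical names):

# greedy decoding: take the most likely word at each timestep
preds = argmax(model.predict(testX), axis=-1)

preds_text = []
for seq in preds:
    # map integers back to words, skipping the padding index 0
    words = [eng_tokenizer.index_word[idx] for idx in seq if idx > 0]
    preds_text.append(' '.join(words))
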
pred_df = pd.DataFrame({'actual': test[:,0], 'predicted': preds_text})

# print 15 rows randomly
pred_df.sample(15)

End Notes

Even with a very simple Seq2Seq model, the results are pretty encouraging. We can improve on this performance easily by using a more sophisticated encoder-decoder model on a larger dataset.

