Neural Machine Translation: Superior Seq2seq Models With OpenNMT

Akash Singh
Jun 5 · 11 min read
Language understanding is essential to erasing human barriers.

Language forms the very basis of communication, and wouldn’t it be great to have one uniform language for everyone to communicate in? It would certainly break down barriers and bring the world together. The movie “Arrival” depicts something similar: a linguist, portrayed by Amy Adams, learns to understand an alien language and overcomes even the barriers of time. It is well worth watching if you haven’t seen it yet.

Since that scenario is not a realistic possibility, we turn to Machine Translation with the help of sequence to sequence (seq2seq) modelling based on Long Short-Term Memory (LSTM) networks, which I shall explain in this article.

A Quick Primer on Machine Translation

Machine Translation is the task of translating text in a source language into a target language. The simplest approach replaces each word in the source language with a similar word in the target language.

However, this method doesn’t keep the context intact. A word and its closest counterpart in another language rarely carry the same meaning in every context.
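
To see the problem concretely, here is a minimal sketch of word-for-word substitution. The tiny English-to-German table WORD_TABLE and the function translate_word_by_word are made up for this illustration and do not come from any library.

```python
# Toy English-to-German lookup table (hypothetical, for illustration only).
WORD_TABLE = {
    "the": "die",
    "bank": "Bank",  # the financial institution; a river bank would be "Ufer"
    "is": "ist",
    "closed": "geschlossen",
    "steep": "steil",
    "river": "Fluss",
}


def translate_word_by_word(sentence):
    """Replace every source word independently, ignoring its neighbours."""
    return " ".join(WORD_TABLE.get(word, word) for word in sentence.lower().split())


financial = translate_word_by_word("the bank is closed")
riverside = translate_word_by_word("the river bank is steep")
print(financial)  # die Bank ist geschlossen
print(riverside)  # die Fluss Bank ist steil
```

Both sentences end up with the same word "Bank", although a fluent German translation of the second one would use "Ufer" (as in "Flussufer"). The lookup has no way of knowing which sense is meant, because it never looks at the surrounding words.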

Therefore, in this article we shall see how Neural Machine Translation (NMT) overcomes that challenge, and understand its underlying mechanics. Further, with the help of an open source NMT framework called OpenNMT-py, we shall implement our own NMT system for English-to-German translation.

Content Overview

This blog is divided into two parts. First, we shall go through an overview of sequence to sequence (seq2seq) models. Then we will look at the OpenNMT toolkit, before getting hands-on and using it to build our own Machine Translation system.

1. Sequence to Sequence models.

2. OpenNMT-py for Neural Machine Translation (NMT)

Schematic view of neural machine translation

Understanding Sequence To Sequence (Seq2Seq) Models

Machine translation models are based on sequence to sequence models, in which an encoder learns the source language and a decoder learns the target language and decodes the encoded source sentence.

So, the encoder and the decoder are the two major components of the translation system. Together they preserve the context of the sentence, as opposed to simple word-for-word translation without context. The encoder and decoder help the model gain a deeper understanding of the two languages.

In a seq2seq model, the source sentence is first transformed into vectors using word embeddings, and a context vector is maintained as the sentence is read. We can set the size of the word vectors when we build the model. The size of the context vector is set by the number of hidden units in the encoder RNN (Recurrent Neural Network).
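
As a rough sketch of the embedding step, the snippet below maps each token to an index and looks up a vector for it. The vocabulary, the vector size EMBEDDING_SIZE, and the random initial values are assumptions chosen for illustration; a real model learns these vectors during training.

```python
import random

random.seed(0)  # fixed seed so the toy vectors are reproducible

EMBEDDING_SIZE = 4  # illustrative; real systems use hundreds of dimensions
vocab = {"<unk>": 0, "hello": 1, "world": 2}  # token -> index

# One randomly initialised vector per vocabulary entry; training would tune these.
embedding_table = [
    [random.uniform(-0.1, 0.1) for _ in range(EMBEDDING_SIZE)]
    for _ in vocab
]


def embed(sentence):
    """Map each token to its vector, falling back to <unk> for unseen words."""
    indices = [vocab.get(token, vocab["<unk>"]) for token in sentence.split()]
    return [embedding_table[i] for i in indices]


vectors = embed("hello world")
print(len(vectors), len(vectors[0]))  # 2 4
```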


A Long Short-Term Memory (LSTM) network is a type of RNN that retains longer context than a plain RNN can. It has a similar chain-like structure, but instead of a single tanh layer, its repeating module has a more elaborate structure.

The core of the LSTM consists of a cell state and gates, which help it keep context. In the LSTM architecture, the cell state runs straight through the chain. Think of it as a pipe carrying water from one end to the other, with other pipes contributing along the way, and these contributions can be regulated.

We can regulate whether to add or remove information to the cells with the help of gates. Gates are composed of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through.”
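
The gating idea can be sketched in a few lines of plain Python. The input numbers here are arbitrary; in a real LSTM the gate inputs come from learned weights applied to the current input and the previous hidden state.

```python
import math


def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_gate(gate_logits, values):
    """Scale each value by its gate activation (pointwise multiplication)."""
    return [sigmoid(g) * v for g, v in zip(gate_logits, values)]


cell_values = [2.0, 2.0, 2.0]
gate_logits = [-10.0, 0.0, 10.0]  # strongly closed, half open, strongly open
gated = apply_gate(gate_logits, cell_values)
print([round(v, 3) for v in gated])  # [0.0, 1.0, 2.0]
```

The first value is almost completely blocked, the second is halved, and the third passes through nearly unchanged.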

There are three gates. The “forget gate” decides what information should be thrown away or kept, the “input gate” decides how the cell state is updated, and the “output gate” decides what the next hidden state should be. For a much deeper understanding, check Colah’s blog on LSTMs.
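
Putting the three gates together, a single LSTM time step can be sketched as follows. To keep it readable, this version uses scalar inputs and states with arbitrary hand-picked weights (the params dictionary is hypothetical); real implementations use weight matrices and vectors.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step with scalar input, hidden state, and cell state.

    params maps each gate name to (input weight, hidden weight, bias).
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    forget = sigmoid(pre_activation("forget"))  # how much old cell state to keep
    write = sigmoid(pre_activation("input"))    # how much new information to admit
    candidate = math.tanh(pre_activation("candidate"))  # proposed new information
    output = sigmoid(pre_activation("output"))  # how much of the cell to expose

    c_new = forget * c_prev + write * candidate
    h_new = output * math.tanh(c_new)
    return h_new, c_new


# Arbitrary illustrative weights; a trained model would learn these.
params = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.6, 0.2, 0.0),
    "candidate": (0.9, 0.3, 0.0),
    "output": (0.4, 0.1, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```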

Apart from LSTMs, there is also the Transformer model, but to keep things short we will not cover it here. The sources at the end of this article will help you understand the Transformer and how it is more efficient at keeping larger context. Please also go through the Transformer paper.

Suppose we have the sentence “hello world”. Every word in this sentence is represented as a vector, giving [w0, w1]. We pass these through an LSTM network and save the output of the last state as the encoded text, while also storing the context along the path.

[e0, e1] are the hidden states, and the final encoded text is represented by e, which is equal to e1.

Encoder Architecture
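
The encoder loop for “hello world” can be sketched like this. To stay short, the sketch swaps the LSTM for a heavily simplified element-wise tanh recurrence; the weights w_in and w_rec and the word vectors are made-up values, not anything from OpenNMT.

```python
import math


def rnn_step(word_vector, hidden, w_in=0.5, w_rec=0.8):
    """Simplified recurrent update: mix the current word into the running state."""
    return [math.tanh(w_in * x + w_rec * h) for x, h in zip(word_vector, hidden)]


def encode(word_vectors):
    """Return all hidden states and the final one, which serves as the encoding."""
    hidden = [0.0] * len(word_vectors[0])
    hidden_states = []
    for vector in word_vectors:
        hidden = rnn_step(vector, hidden)
        hidden_states.append(hidden)
    return hidden_states, hidden_states[-1]


w0, w1 = [0.2, -0.4], [0.7, 0.1]  # made-up vectors for "hello" and "world"
(e0, e1), e = encode([w0, w1])
print(e == e1)  # True
```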

The input to the decoder is the encoded text e, which is passed through another LSTM layer to translate word by word. There is also a special start-of-sentence vector named “sos”. The LSTM is fed e as its hidden state and sos as its input. Decoding stops when the predicted word is the special end-of-sentence token “eos”.

Decoder Architecture
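
The decoding loop itself is simple enough to sketch. Here the trained decoder LSTM is replaced by a hypothetical lookup table (NEXT_WORD) so that the control flow around “sos” and “eos” is easy to follow; a real decoder would update its hidden state and choose the most probable word at every step.

```python
SOS, EOS = "sos", "eos"

# Hypothetical stand-in for a trained decoder LSTM: given the previous word it
# returns the next word. A real decoder would also update the hidden state and
# pick the most probable word from a softmax over the vocabulary.
NEXT_WORD = {SOS: "hallo", "hallo": "welt", "welt": EOS}


def decoder_step(previous_word, hidden):
    return NEXT_WORD.get(previous_word, EOS), hidden


def greedy_decode(encoded, max_length=10):
    """Generate words one at a time until the end-of-sentence token appears."""
    hidden = encoded  # the encoder output initialises the hidden state
    word = SOS        # the first input is the start-of-sentence token
    output = []
    for _ in range(max_length):  # guard against never predicting EOS
        word, hidden = decoder_step(word, hidden)
        if word == EOS:
            break
        output.append(word)
    return output


print(greedy_decode(encoded=[0.1, 0.6]))  # ['hallo', 'welt']
```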

For a deeper and more visual understanding of this concept, look at Jay Alammar’s blog post “Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)”.

OpenNMT-py for Neural Machine Translation


OpenNMT is an open source ecosystem for neural machine translation and neural sequence learning, with a great community of developers. It is designed with code modularity, efficiency, and extensibility in mind. Started in December 2016 by the Harvard NLP group and SYSTRAN, the project has since been used in several research and industry applications. It is currently maintained by SYSTRAN and Ubiqus. The code is production-ready and is used by several companies.

“OpenNMT is a complete library for training and deploying neural machine translation models. The system is successor to seq2seq-attn developed at Harvard, and has been completely rewritten for ease of efficiency, readability, and generalizability. It includes vanilla NMT models along with support for attention, gating, stacking, input feeding, regularization, beam search and all other options necessary for state-of-the-art performance. The main system is implemented in the Lua/Torch mathematical framework, and can be easily be extended using Torch’s internal standard neural network components. It has also been extended by Adam Lerer of Facebook Research to support Python/PyTorch framework, with the same API.” (from the paper)

OpenNMT provides implementations in two popular deep learning frameworks, PyTorch and TensorFlow:

  1. OpenNMT-py: This is implemented using the PyTorch deep learning framework, and it is the one we are going to use. It is extensible and fast, with PyTorch’s ease of use.
  2. OpenNMT-tf: This is based on the TensorFlow deep learning framework.

Each implementation has its own set of unique features but shares similar goals:

  • Highly configurable model architectures and training procedures
  • Efficient model serving capabilities for use in real-world applications
  • Extensions to allow other tasks such as text generation, tagging, summarization, image to text, and speech to text.

At its core, the project supports different techniques for machine translation, from vanilla NMT to attention, gating, stacking, input feeding, regularization, copy models, beam search and all other options necessary for state-of-the-art performance. And all this in under 4k lines of code, thanks to PyTorch. Here, we are going to go through an overview of the paper. The system is built with efficiency and modularity in mind.

OpenNMT-py Architecture

System Efficiency

NMT models can take days to train, so an inefficient architecture leads to several problems. The toolkit is therefore built with system efficiency in mind, and OpenNMT-py has an architecture that handles training efficiently.

Memory Sharing and Sharding

GPU memory restricts the batch size, which in turn hampers the training time of NMT models. To address this, the authors implemented an external memory sharing system that exploits the known time-series control flow of NMT systems and aggressively shares the internal buffers between clones. They also implemented a data sharding mechanism for data loading and for starting training on large datasets.

Sharding is a technique in which we perform horizontal partitioning: a very large database is separated into smaller, faster, more easily managed parts called data shards.
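
As a minimal sketch of the idea (not OpenNMT-py’s actual implementation), a corpus can be cut into fixed-size shards that are processed one at a time. The function name shard_corpus and the shard size are chosen purely for illustration.

```python
def shard_corpus(sentence_pairs, shard_size):
    """Yield consecutive slices of at most shard_size sentence pairs."""
    for start in range(0, len(sentence_pairs), shard_size):
        yield sentence_pairs[start:start + shard_size]


pairs = [(f"source {i}", f"target {i}") for i in range(10)]
shards = list(shard_corpus(pairs, shard_size=4))
print([len(shard) for shard in shards])  # [4, 4, 2]
```

Because the function is a generator, only one shard needs to be held in memory at a time when the pairs are streamed from disk.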

The system also supports multi-GPU training through data parallelism. This multi-GPU setup speeds up training.
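
Data parallelism means that every device holds a copy of the model, receives a slice of the batch, and the resulting gradients are averaged. The toy sketch below shows this with a one-parameter model y = weight * x and a plain Python loop standing in for the devices. It assumes the batch divides evenly among the devices, and none of the names come from OpenNMT-py.

```python
def gradient(weight, batch):
    """Gradient of the mean squared error of y = weight * x over a batch."""
    return sum(2 * (weight * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(weight, batch, num_devices):
    """Split the batch evenly, compute one gradient per device, then average."""
    chunk = len(batch) // num_devices
    chunks = [batch[i * chunk:(i + 1) * chunk] for i in range(num_devices)]
    # On real hardware these per-device gradients are computed concurrently.
    per_device = [gradient(weight, part) for part in chunks]
    return sum(per_device) / num_devices


batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
single = gradient(0.5, batch)
parallel = data_parallel_gradient(0.5, batch, num_devices=2)
print(abs(single - parallel) < 1e-12)  # True
```

The averaged gradient matches the single-device gradient, so the model update is the same while the work per device is halved.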

The toolkit has a modular code base. The authors kept the readability and extensibility of the code in mind so that different parts of the system are easy to modify. The code for training, optimization, inference, and preprocessing is kept separate, and different modules let users try different techniques for creating an NMT system. Each module in the library is highly customizable and configurable, with multiple ready-to-use features.

Now that we have an overview of the OpenNMT-py toolkit, in the next part we will build a small translation system using this framework. Below is its comparison with different modeling techniques; for a deeper understanding you can go through the paper and the docs.

Comparison with different modeling techniques

Translation Experiment Based On OpenNMT-py

This is a short quick-start tutorial on training an NMT model using OpenNMT-py. Here, we will be using a small English-to-German parallel dataset available in the repo.


1. Creating a conda environment with Python 3.6.

conda create -n opennmt python=3.6

2. Activating the environment.

source activate opennmt

3. We are going to install from the source.

git clone
cd opennmt
pip install -r requirements.txt

Now that everything is set, we are ready to move ahead.

Preprocessing Data

Here we have a parallel dataset, split into source (src) and target (tgt) files, which is present in the data folder of the repo. The dataset has one sentence per line, and every word or token is separated by a space. It contains source (src) and target (tgt) files for training, testing, and validation.

  1. Training files have 10k sentences.
  2. Test files have 2737 sentences.
  3. Validation files have 3k sentences.
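
To make the file format concrete, here is a small sketch that reads a source and a target file in this layout and pairs them up line by line. It uses in-memory stand-ins with made-up sentences instead of the real files, and read_parallel is a hypothetical helper, not part of OpenNMT-py.

```python
import io


def read_parallel(src_file, tgt_file):
    """Pair up line i of the source with line i of the target, split into tokens."""
    src_lines = [line.split() for line in src_file.read().splitlines()]
    tgt_lines = [line.split() for line in tgt_file.read().splitlines()]
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    return list(zip(src_lines, tgt_lines))


# In-memory stand-ins for the source and target files (made-up sentences).
src = io.StringIO("hello world\nhow are you ?\n")
tgt = io.StringIO("hallo welt\nwie geht es dir ?\n")
pairs = read_parallel(src, tgt)
print(pairs[0])  # (['hello', 'world'], ['hallo', 'welt'])
```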

Script to run preprocessing:

python -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo

This will yield:

  • serialized PyTorch file containing training data
  • serialized PyTorch file containing validation data
  • serialized PyTorch file containing vocabulary data
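
Conceptually, the vocabulary data is a mapping from tokens to integer ids. The sketch below shows one simple way to build such a mapping from tokenized sentences. The special token names and the max_size default are assumptions for illustration, and the actual serialized format used by OpenNMT-py is different.

```python
from collections import Counter

SPECIAL_TOKENS = ["<unk>", "<pad>", "<s>", "</s>"]  # illustrative names


def build_vocab(tokenized_sentences, max_size=50000):
    """Give every special token and the most frequent words an integer id."""
    counts = Counter(token for sentence in tokenized_sentences for token in sentence)
    words = [word for word, _ in counts.most_common(max_size)]
    return {token: index for index, token in enumerate(SPECIAL_TOKENS + words)}


sentences = [["hello", "world"], ["hello", "there"]]
vocab = build_vocab(sentences)
print(vocab["hello"])  # 4, the first id after the four special tokens
```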

The main training script is simple. By default it runs a model comprising a two-layer LSTM with 500 hidden units for both the encoder and the decoder. To use a GPU for training, just add the parameter “gpu_ranks=0”. We need to pass two parameters for training: first, the output location of the data preprocessing step, and second, the location where the model file will be saved.

python -data data/demo -save_model demo-model

If you want to use a Transformer model, have a look at the OpenNMT FAQ.

To use the trained model for translation, we can use the script below. Since we have only a small dataset, the translation quality will not be high.

We need to provide the path of the trained model and the test file containing the sentences we want to translate.

python -model -src data/src-test.txt -output pred.txt -replace_unk -verbose

We evaluate machine translation models by calculating the BLEU (bilingual evaluation understudy) score. Machine translation quality is considered to be the correspondence between a machine’s output and that of a human: “The closer a machine translation is to a professional human translation, the better it is.” This is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgments of quality, and remains one of the most popular automated and inexpensive metrics. It is fast and language independent.

A BLEU score always lies between 0 and 1, where 0 means a total mismatch and 1 means a perfect match. So, a machine translation model is evaluated on its BLEU score. The better the model, the higher the score.
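
To give a feel for how the score is computed, here is a simplified sentence-level sketch: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It handles a single reference and applies no smoothing, so it is not a replacement for a standard BLEU implementation; the function name simple_bleu is made up for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, without smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # any empty precision drives the geometric mean to zero
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat is on the mat".split()
print(simple_bleu(reference, reference))                     # 1.0
print(simple_bleu("a dog barks loudly".split(), reference))  # 0.0
```

In practice BLEU is usually reported over a whole test set rather than a single sentence, which avoids most of the zero scores that short sentences produce here.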


· Neural machine translation is a sequence to sequence (Seq2Seq) modelling task.

· In this article, we saw the challenges traditional Machine Translation techniques face in understanding context.

· We saw that Neural Machine Translation is able to capture sequences and context by learning representations of both languages.

· We went through the NMT system and its architecture, consisting of an Encoder and a Decoder that use LSTMs to retain context and handle longer sequences.

· We then implemented our own NMT system for English-to-German translation using the open source toolkit OpenNMT-py, which is built on PyTorch. However, this approach can be used for translation between any pair of languages, provided there is a parallel corpus.

While LSTM based sequence to sequence modelling gives good results for Language Modelling and Neural Machine Translation compared to plain RNN based models, the attention-based Transformer architecture yields even better results.

I would certainly advise giving “Why transformers yield better sequence to sequence results” a look as well.


Seq2Seq with Attention and Beam Search



Neural Machine Translation-Tutorial ACL 2016

Understanding LSTMs

