Creating your own Language Translator

Rajat Goel
7 min read · Oct 26, 2018


Well, sometimes I think you truly understand a concept only if you can implement it (might not always be possible, but we can still try). So I recently had the opportunity to work on Neural Machine Translation (NMT). If you haven’t heard of this term, there is a high chance you have still used it: when you type a sentence into Google Translate, see a tweet translated to your language on Twitter, or open a website and Google Chrome offers to translate it, you get the idea. And for all those who prefer a precise explanation, be my guest and visit this wiki page. So, when these are already available, why would you want to create your own 🤔? Well, you may be curious to have your own translator (which is incredible), or you might want a translator for a language pair not served by current applications (this article is perfect for you!), or maybe some other reason. I will detail the step-by-step method for it.

Before diving into technical details, I am assuming that you have an interest in machine learning and are new to NMT. Pardon me for that, but if I did not assume this, the article would get unnecessarily boring. So let’s get our hands dirty.

Neural Machine Translation is supervised sequence-to-sequence modelling: we have an input sequence (of words), called the source in Machine Translation, and an output sequence (of words) as our target. Sequence-to-sequence is a vast area in itself, so without digressing, we will focus on our problem of NMT. To teach (or train) our neural networks, we will need pairs of sentences in the source and target languages. A corpus with text placed alongside its translation is referred to as a parallel corpus. Now, if you want to train a translator for fun, you can find and download a parallel corpus with a few searches, e.g. the European Parliament Proceedings Parallel Corpus, and jump to Step 4 to dive into stitching the neural nets. But if you want to train on a new language, then creating the dataset is the key to the reliability of your translator.

Step 1: Getting text from different languages.

Let’s say you want to create a translator from Language X to Language Y. You collect a “big” text in X and its translation in Y. Good news, the 1st step is complete, and more good news ;) it’s time for the next step.

Step 2: Sentence Separator

Next, split the two text files into sentences; this is a requirement for training. You can either use already available sentence separators, called sentence tokenisers, or create your own. Constructing a sentence tokeniser is easy: it finds the end of each sentence (E-O-S).

Image courtesy: Stanford NLP Course slides

The above decision tree explains the logic for English. You can create a similar decision tree for your language and code the logic using if-else conditions. But to save time, I would suggest you first take a look at the existing sentence tokenisers. For example, for English you can use the sentence tokeniser in Python’s NLTK library. Code block below:

>>> from nltk.tokenize import sent_tokenize
>>> corpus = 'Hi, Sentence 1 is here! Sentence 2 as well. Sentence 3 are you here? Lets just move to sentence 4 then.'
>>> sentences = sent_tokenize(corpus)
>>> sentences
['Hi, Sentence 1 is here!', 'Sentence 2 as well.', 'Sentence 3 are you here?', 'Lets just move to sentence 4 then.']
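If NLTK has no tokeniser for your language, a hand-rolled splitter following the decision-tree logic above can get you started. Here is a minimal sketch; the abbreviation list and the end-of-sentence characters are placeholder assumptions you would adapt to your language:

import re

# Placeholder abbreviation list -- extend it for your language.
ABBREVIATIONS = {'mr.', 'mrs.', 'dr.', 'e.g.', 'i.e.', 'etc.'}

def split_sentences(text):
    """Naive rule-based sentence splitter (E-O-S detection)."""
    sentences, start = [], 0
    # Candidate E-O-S: '.', '!' or '?' followed by whitespace or end of text.
    for match in re.finditer(r'[.!?](?=\s|$)', text):
        end = match.end()
        words = text[start:end].split()
        # A '.' right after a known abbreviation is not an E-O-S.
        if match.group() == '.' and words and words[-1].lower() in ABBREVIATIONS:
            continue
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences('Dr. Smith arrived. He was late!'))
# ['Dr. Smith arrived.', 'He was late!']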

Text broken down into sentences… now comes the hard part.

Step 3: Parallel Corpus

Now, as mentioned earlier, you will need to create pairs of (X, Y) sentences, and you will need to be careful here: if the pairs are garbage, then your translator will also output garbage. The only way I know is to align them manually. That makes datasets significant work in themselves, and hence they deserve more respect. If you come across some better way, do let me know; I will also learn something new.

Once you create the pairs of (X, Y) sentences, make multiple copies: save them to Dropbox, mail them to yourself, keep them on a pen drive. What I mean is, don’t lose them. With this, you can start with your neural networks.
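However you back them up, keep the alignment explicit when you save. Below is a minimal sketch that writes the pairs to two line-aligned text files, the convention most NMT toolkits (including OpenNMT later in this article) expect; the file names are just placeholders:

# src_sentences[i] must be the translation counterpart of tgt_sentences[i].
src_sentences = ['Resumption of the session', 'Please rise for this minute of silence.']
tgt_sentences = ['Reprise de la session', 'Je vous invite à vous lever pour cette minute de silence.']

assert len(src_sentences) == len(tgt_sentences), 'corpora are misaligned!'

# One sentence per line; line i of each file forms one (X, Y) training pair.
with open('train.src', 'w', encoding='utf-8') as f_src, \
     open('train.tgt', 'w', encoding='utf-8') as f_tgt:
    for src, tgt in zip(src_sentences, tgt_sentences):
        f_src.write(src.strip() + '\n')
        f_tgt.write(tgt.strip() + '\n')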

Note: This fascinating paper by the Facebook AI Research lab doesn’t require this step at all. But I haven’t implemented it, so I won’t push you towards it either.

Step 4: Vocabulary

Well, like humans, the machine (or our model) will also have a vocabulary size. This vocab restriction is done to reduce the complexity of training, and also because of the memory requirements when using a trained model. Hence we generally construct a target vocabulary of k words; a good value for k is between 30,000 and 80,000. For words not in our vocabulary, the [UNK] (unknown word) token is predicted. More about this in this arxiv paper. So now you have one more text file, which contains the shortlisted words/vocabulary.
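For the curious, building such a shortlist is just a word-frequency count over the training text. A minimal sketch, assuming k = 30,000 and whitespace-tokenised input (the file names are placeholders):

from collections import Counter

K = 30000  # vocabulary size; 30,000-80,000 is the range suggested above

counts = Counter()
with open('train.tgt', encoding='utf-8') as f:
    for line in f:
        counts.update(line.split())

# Keep the K most frequent words; anything else will map to [UNK].
vocab = ['[UNK]'] + [word for word, _ in counts.most_common(K)]

with open('vocab.tgt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(vocab))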

Note: The library we are using for the model will take care of the vocab.

Step 5: Tokenizer

People, open your editors; we are starting with our script. Read in the dataset and divide the pairs into source sentences and target sentences. Now let’s create a model; I am using the attention mechanism in this one. The attention mechanism was a breakthrough in NMT. An intuitive explanation: it tells the network which source words to pay attention to when generating each word of the target sentence. See the image below to understand better.

This animation is taken from a Google documentation on seq2seq.

In the above animation, the width of the lines (in purple) connecting a source word and a target word is given by the attention mechanism. Earlier techniques in this domain generated a single vector from the source sentence and then generated the target sentence from that vector alone. I quote Ray Mooney (Professor of CS at UT Austin)…

“You can’t cram the meaning of a whole %&!$ing sentence into a single $&!*ing vector!”

-Ray Mooney
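To make the idea concrete, here is a toy numpy sketch of dot-product attention: at each decoding step, the decoder state is scored against every encoder state, and the softmax of those scores gives exactly the line widths you see in the animation (all shapes and values below are illustrative assumptions):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy encoder outputs: one hidden vector per source word (4 words, dim 3).
encoder_states = np.random.randn(4, 3)
# Current decoder hidden state while generating one target word.
decoder_state = np.random.randn(3)

scores = encoder_states @ decoder_state       # one score per source word
attention_weights = softmax(scores)           # sums to 1; the "line widths"
context = attention_weights @ encoder_states  # weighted mix of source info

print(attention_weights)  # how much each source word is attended to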

Now the general question which comes to mind is how to get the words out of a sentence. A naive approach is to split a sentence on white space. This might be a decent approach for languages like English and French, which separate words by spaces, but it is a bad idea for languages like German, where words are combined in writing. To demonstrate the difficulties involved even in a language like English, look at these two sentences. Which do you think is correct (or better word-tokenised)?

Mr. O’Neill thinks that the boys ‘ stories about Chile ‘s capital are n’t amusing

Mr. O’Neill thinks that the boys ‘ stories about Chile ‘s capital aren’t amusing

Well, many languages have a word tokeniser available, and for our project we will use the Moses tokeniser. There are even scripts specific to particular languages, and it’s better to use those where they exist; otherwise, for a quick prototype, use Moses. It ships as a Perl script, which you can run as follows:

Usage ./tokenizer.perl (-l [en|de|...]) (-threads 4) < textfile > tokenizedfile
Options:
-q ... quiet.
-a ... aggressive hyphen splitting.
-b ... disable Perl buffering.
-time ... enable processing time calculation.
-penn ... use Penn treebank-like tokenization.
-protected FILE ... specify file with patterns to be protected in tokenisation.
-no-escape ... don't perform HTML escaping on apostrophe, quotes, etc.
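If you’d rather stay in Python than call the Perl script, the sacremoses package provides a port of the Moses tokeniser (assuming you are happy to pip install sacremoses first):

from sacremoses import MosesTokenizer, MosesDetokenizer

mt = MosesTokenizer(lang='en')
tokens = mt.tokenize("Mr. O'Neill thinks the boys' stories aren't amusing.")
print(tokens)

md = MosesDetokenizer(lang='en')
print(md.detokenize(tokens))  # back to a plain sentence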

As an example, I will be using the French-English dataset from the European Parliament Proceedings Parallel Corpus. There are two files: one has the English sentences, and the other has the French sentences. An example from the dataset:

English — French

Resumption of the session — Reprise de la session

Please rise, then, for this minute’ s silence. — Je vous invite à vous lever pour cette minute de silence.

Step 6: OpenNMT

To make life a bit simpler, I suggest you use the OpenNMT library. The reason is that coding the entire pipeline is time-consuming, and you have to go through a good number of iterations to remove bugs. There are other options available, but I have worked with the OpenNMT library and found it useful. If you have any issue, you can easily find answers either in the documentation or on their Gitter channel. This library will automatically create the vocab file, so you don’t have to worry about that. The code used is explained in the repository, but if you still have questions, feel free to open an issue.
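For reference, the basic pipeline follows the OpenNMT-py quickstart: preprocess the line-aligned files, train, then translate. A sketch (file names are placeholders, and flags change between versions, so check the current docs):

# Build the vocab and the binarised dataset from the line-aligned files.
python preprocess.py -train_src src-train.txt -train_tgt tgt-train.txt \
    -valid_src src-val.txt -valid_tgt tgt-val.txt -save_data data/demo

# Train the default 2-layer LSTM encoder-decoder with attention.
python train.py -data data/demo -save_model demo-model

# Translate new source sentences with a saved checkpoint
# (use whichever checkpoint file train.py wrote for your run).
python translate.py -model demo-model_step_100000.pt -src src-test.txt \
    -output pred.txt -replace_unk -verbose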

Follow the code steps or go to my GitHub Gist:

Congrats, you have created your own translator!

