Textual data augmentation with back-translation

Published in

Riga Data Science Club

2 min read · Sep 11, 2020

You are probably familiar with the many computer vision augmentation techniques, such as randomly flipping, rotating, and cropping images. In the Natural Language Processing domain, data augmentation is a much more challenging task!

Let’s get familiar with one powerful text augmentation technique: back-translation. In this approach, you use machine translation to paraphrase a text while retaining its meaning.

The back-translation process is as follows:

1. Take a sentence and translate it into another language.
2. Translate the output sentence back into the original language.
3. Check whether the new sentence differs from the original one. If it does, use the new sentence as an augmented version of the original text.

Back-translation with a single language: French
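
The three steps above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: `back_translate`, the `translate(text, source, target)` callable, and the toy lookup table are hypothetical names made up for illustration, and in practice you would plug in whatever machine translation model or service you actually use. Comparing the sentences case-insensitively is also an assumption, not something the technique requires.

```python
from typing import Callable, Optional

# Hypothetical translator interface: translate(text, source_lang, target_lang) -> str
Translator = Callable[[str, str, str], str]


def back_translate(
    text: str,
    translate: Translator,
    source: str = "en",
    pivot: str = "fr",
) -> Optional[str]:
    """Return a paraphrase of `text`, or None if the round trip changed nothing."""
    forward = translate(text, source, pivot)      # step 1: source -> pivot
    restored = translate(forward, pivot, source)  # step 2: pivot -> source
    # Step 3: keep the result only if it differs from the original.
    # Ignoring case and surrounding whitespace is an assumption of this sketch.
    if restored.strip().lower() == text.strip().lower():
        return None
    return restored


# Toy stand-in for a real machine translation system (illustrative data only).
_TOY_TABLE = {
    ("en", "fr", "The movie was great"): "Le film était génial",
    ("fr", "en", "Le film était génial"): "The film was awesome",
}


def toy_translate(text: str, source: str, target: str) -> str:
    # Unknown sentences are returned unchanged, which mimics a round trip
    # that produces no new paraphrase.
    return _TOY_TABLE.get((source, target, text), text)


if __name__ == "__main__":
    print(back_translate("The movie was great", toy_translate))
    print(back_translate("Hello", toy_translate))
```

Returning `None` for an unchanged sentence makes it easy for the caller to decide what to do next, for example trying a different intermediate language.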

If the sentence comes back unchanged, you can chain several intermediate languages, for example:

1. English: Riga is a beautiful city near the Baltic Sea

2. Latvian: Rīga ir skaista pilsēta netālu no Baltijas jūras

3. Russian: Рига — красивый город на берегу Балтийского моря

4. Back to English: Riga is a beautiful city on the shores of the Baltic Sea

Back-translation with multiple intermediate languages
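
The chain above (English, Latvian, Russian, back to English) can be sketched the same way. Again, this is a hedged illustration rather than a reference implementation: `back_translate_chain` and the `translate` callable are hypothetical names, and the lookup table simply replays the example sentences from this post instead of calling a real translation system.

```python
from typing import Callable, Sequence

# Hypothetical translator interface: translate(text, source_lang, target_lang) -> str
Translator = Callable[[str, str, str], str]


def back_translate_chain(
    text: str,
    translate: Translator,
    source: str = "en",
    pivots: Sequence[str] = ("lv", "ru"),
) -> str:
    """Translate `text` through each pivot language in order, then back to `source`."""
    current_text, current_lang = text, source
    for pivot in pivots:
        current_text = translate(current_text, current_lang, pivot)
        current_lang = pivot
    # Final hop: from the last pivot back to the original language.
    return translate(current_text, current_lang, source)


# Toy stand-in that replays the example sentences from the post.
_EXAMPLE = {
    ("en", "lv", "Riga is a beautiful city near the Baltic Sea"):
        "Rīga ir skaista pilsēta netālu no Baltijas jūras",
    ("lv", "ru", "Rīga ir skaista pilsēta netālu no Baltijas jūras"):
        "Рига — красивый город на берегу Балтийского моря",
    ("ru", "en", "Рига — красивый город на берегу Балтийского моря"):
        "Riga is a beautiful city on the shores of the Baltic Sea",
}


def toy_translate(text: str, source: str, target: str) -> str:
    # Sentences outside the table are returned unchanged.
    return _EXAMPLE.get((source, target, text), text)


if __name__ == "__main__":
    original = "Riga is a beautiful city near the Baltic Sea"
    print(back_translate_chain(original, toy_translate))
```

Each extra hop gives the wording more room to drift, which helps produce a different sentence but also raises the risk of losing the original meaning, so it is worth spot-checking the outputs.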

Automating this process across your training set may noticeably improve the performance of your NLP model!
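
One way to automate this for a labeled training set is sketched below. It assumes a hypothetical `paraphrase(text)` callable that wraps your back-translation step and returns either a new sentence or `None`; `augment_dataset` and the toy data are likewise made up for illustration. Each paraphrase inherits the label of the sentence it came from, and exact duplicates are skipped.

```python
from typing import Callable, List, Optional, Tuple

Example = Tuple[str, str]  # (text, label)


def augment_dataset(
    examples: List[Example],
    paraphrase: Callable[[str], Optional[str]],
) -> List[Example]:
    """Return the original examples plus one paraphrase per example, when available."""
    augmented = list(examples)
    seen = {text for text, _ in examples}
    for text, label in examples:
        candidate = paraphrase(text)
        # Skip failed round trips and sentences that are already in the set.
        if candidate and candidate not in seen:
            augmented.append((candidate, label))  # the paraphrase keeps the label
            seen.add(candidate)
    return augmented


# Toy paraphraser standing in for a real back-translation step.
_TOY_PARAPHRASES = {
    "The movie was great": "The film was awesome",
}


def toy_paraphrase(text: str) -> Optional[str]:
    return _TOY_PARAPHRASES.get(text)


if __name__ == "__main__":
    train = [("The movie was great", "positive"), ("I fell asleep", "negative")]
    for text, label in augment_dataset(train, toy_paraphrase):
        print(label, "|", text)
```

Augment only the training split, so that validation and test data stay free of paraphrased copies of training sentences.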


Written by Dmitry Yemelyanov