Textual data augmentation with back-translation

Dmitry Yemelyanov
Riga Data Science Club
2 min read · Sep 11, 2020

You are probably familiar with common computer vision augmentation techniques such as randomly flipping, rotating, and cropping images. In natural language processing (NLP), data augmentation is a much more challenging task, because changing even a single word can alter the meaning of a sentence!

Let’s get familiar with one powerful text augmentation technique: back-translation. In this approach, we use machine translation to paraphrase a text while retaining its meaning.

The back-translation process is as follows:

  • Take a sentence and translate it into another language
  • Translate the output sentence back into the original language
  • Check whether the new sentence differs from the original. If it does, use it as an augmented version of the original text.

Figure: Back-translation with a single language: French
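
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `translate(text, source_lang, target_lang)` callable is a hypothetical interface that you would back with whatever machine translation system you use, and `toy_translate` is a hard-coded stand-in (not real translation output) so the example runs without any external service.

```python
from typing import Callable, Optional

# Assumed (hypothetical) translator interface:
# translate(text, source_lang, target_lang) -> translated text
Translator = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translator,
                   source: str = "en", pivot: str = "fr") -> Optional[str]:
    """Return a paraphrase of `text`, or None if the round trip changed nothing."""
    # Step 1: translate into the intermediate (pivot) language.
    forward = translate(text, source, pivot)
    # Step 2: translate back into the original language.
    restored = translate(forward, pivot, source)
    # Step 3: keep the result only if it differs from the input
    # (ignoring case and surrounding whitespace).
    if restored.strip().lower() == text.strip().lower():
        return None
    return restored


# Toy stand-in for a real translation system: a hard-coded lookup table.
_TOY_TABLE = {
    ("en", "fr", "The movie was really good"): "Le film était vraiment bien",
    ("fr", "en", "Le film était vraiment bien"): "The film was really good",
}


def toy_translate(text: str, source: str, target: str) -> str:
    # Unknown sentences are returned unchanged.
    return _TOY_TABLE.get((source, target, text), text)


print(back_translate("The movie was really good", toy_translate))  # The film was really good
print(back_translate("Hello", toy_translate))                      # None
```

Returning `None` when the round trip reproduces the input makes it easy for the caller to skip sentences that gained nothing, or to retry them with a different intermediate language.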

If the back-translated sentence is identical to the original, you can chain several intermediate languages instead, for example:

1. English: Riga is a beautiful city near the Baltic Sea

2. Latvian: Rīga ir skaista pilsēta netālu no Baltijas jūras

3. Russian: Рига — красивый город на берегу Балтийского моря

4. Back to English: Riga is a beautiful city on the shores of the Baltic Sea

Figure: Back-translation with multiple intermediate languages
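
Chaining several intermediate languages could look like the sketch below. As before, the `translate` callable is a hypothetical interface, and the toy lookup table simply replays the Riga example from this post so that the code runs on its own; a real setup would call an actual translation model at each hop.

```python
from typing import Callable, Sequence

# Assumed (hypothetical) translator interface:
# translate(text, source_lang, target_lang) -> translated text
Translator = Callable[[str, str, str], str]


def chain_back_translate(text: str, translate: Translator,
                         pivots: Sequence[str], source: str = "en") -> str:
    """Translate `text` through each pivot language in turn, then back to `source`."""
    # Full route, e.g. en -> lv -> ru -> en.
    route = [source, *pivots, source]
    current = text
    for src, tgt in zip(route, route[1:]):
        current = translate(current, src, tgt)
    return current


# Toy stand-in that replays the example sentences from this post.
EN = "Riga is a beautiful city near the Baltic Sea"
LV = "Rīga ir skaista pilsēta netālu no Baltijas jūras"
RU = "Рига — красивый город на берегу Балтийского моря"
EN_BACK = "Riga is a beautiful city on the shores of the Baltic Sea"

_TOY_TABLE = {
    ("en", "lv", EN): LV,
    ("lv", "ru", LV): RU,
    ("ru", "en", RU): EN_BACK,
}


def toy_translate(text: str, source: str, target: str) -> str:
    # Unknown sentences are returned unchanged.
    return _TOY_TABLE.get((source, target, text), text)


print(chain_back_translate(EN, toy_translate, pivots=["lv", "ru"]))
# Riga is a beautiful city on the shores of the Baltic Sea
```

Each extra hop gives the sentence another chance to drift, so longer chains tend to produce more varied paraphrases but also raise the risk of losing the original meaning.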

Automating this process over your training set gives your NLP model more varied examples to learn from, which may noticeably improve its performance!
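
Here is one way the automation might look for a labeled training set. It is a sketch under assumptions: `paraphrase` is any function that returns a new version of a sentence or `None` (for example, a back-translation routine), and `toy_paraphrase` is a hard-coded placeholder used only to make the example runnable.

```python
from typing import Callable, List, Optional, Tuple

Sample = Tuple[str, str]  # (text, label)


def augment_dataset(samples: List[Sample],
                    paraphrase: Callable[[str], Optional[str]]) -> List[Sample]:
    """Return the original samples plus one paraphrase per sample, where available."""
    augmented = list(samples)
    seen = {text for text, _ in samples}
    for text, label in samples:
        candidate = paraphrase(text)
        # Skip failed paraphrases and texts that are already in the set.
        if candidate is not None and candidate not in seen:
            augmented.append((candidate, label))  # the paraphrase keeps the label
            seen.add(candidate)
    return augmented


# Toy placeholder for a real back-translation routine.
_TOY_PARAPHRASES = {"The movie was really good": "The film was really good"}


def toy_paraphrase(text: str) -> Optional[str]:
    return _TOY_PARAPHRASES.get(text)


train = [("The movie was really good", "positive"), ("Terrible plot", "negative")]
augmented_train = augment_dataset(train, toy_paraphrase)
for text, label in augmented_train:
    print(label, "|", text)
```

Apply this to the training split only: paraphrases of validation or test sentences would make the evaluation look better than it really is.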

Dmitry Yemelyanov
Riga Data Science Club

Founder at Riga Data Science Club | Machine Learning Consultant