Translatotron by Google AI

Published in Voice Tech Podcast, Jun 4, 2019

Google AI launches an End-to-End Speech-to-Speech Translation Model.

People traveling to a foreign country often come across a language they are not familiar with and end up facing a lot of difficulties throughout their stay.

This is where speech-to-speech translation comes in handy.

Speech-to-speech translation systems have been developed over the past several decades with the goal of helping people who speak different languages communicate with each other.

Such systems have usually been broken into three separate components:

Automatic speech recognition to transcribe the source speech into text,
Machine translation to translate the transcribed text into the target language, and
Text-to-speech synthesis (TTS) to generate speech in the target language from the translated text.

Various commercial speech-to-speech translation products (e.g. Google Translate) run on a similar concept, i.e. dividing the task into these three stages and cascading them, and this approach has shown great results.

Now, it's no secret that Google is proactively integrating Artificial Intelligence into its products to make them more efficient and customer-friendly.

Last month, engineers at Google AI launched Translatotron, an end-to-end speech-to-speech translation model.
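To make the cascade idea concrete, here is a minimal Python sketch of the three stages chained together. Everything in it is an assumption made for illustration: the function names (`recognize_speech`, `translate_text`, `synthesize_speech`, `cascade_translate`) and the toy dictionary are hypothetical stand-ins, not the API of Google Translate or any real library. The point is only to show that each stage consumes the previous stage's output, with text as the intermediate representation.

```python
from typing import Dict, List, Tuple

# NOTE: all three stages are hypothetical placeholders, not real models.

def recognize_speech(audio: List[float]) -> str:
    """Stage 1 (ASR): pretend to transcribe the source audio into text."""
    return "hola mundo"

def translate_text(text: str) -> str:
    """Stage 2 (MT): toy word-by-word lookup standing in for a translation model."""
    toy_dictionary: Dict[str, str] = {"hola": "hello", "mundo": "world"}
    return " ".join(toy_dictionary.get(word, word) for word in text.split())

def synthesize_speech(text: str) -> List[float]:
    """Stage 3 (TTS): emit one placeholder audio sample per character."""
    return [ord(ch) / 255.0 for ch in text]

def cascade_translate(audio: List[float]) -> Tuple[List[float], str]:
    """Chain the stages: speech -> text -> translated text -> speech."""
    source_text = recognize_speech(audio)
    target_text = translate_text(source_text)
    return synthesize_speech(target_text), target_text

waveform, text = cascade_translate([0.0, 0.1, 0.2])
print(text, len(waveform))
```

Because every stage depends on the one before it, a mistake made early (say, a wrong transcription) is passed down to the later stages.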


Translatotron: The emergence of end-to-end models for speech translation began in 2016, when researchers demonstrated the feasibility of using a single sequence-to-sequence model for speech-to-text translation.

In 2017, Google AI demonstrated that such end-to-end models can outperform cascade models.

A lot of research has gone into further improving end-to-end speech-to-text translation. As a result, many approaches have been proposed recently, including leveraging weakly supervised data.

Translatotron goes a step further, by showing that a single sequence-to-sequence model can directly translate speech from one language to another, without having to depend on an intermediate text representation in either of the languages, as was required in cascade systems.

The new end-to-end speech translation model makes use of two separately trained components:

Neural vocoder: It converts output spectrograms to time-domain waveforms.
Speaker encoder: It maintains the source speaker’s voice in the synthesized translated speech.

The engineers at Google AI validated Translatotron's translation quality by measuring the BLEU (Bilingual Evaluation Understudy) score, computed on text transcribed from the translated speech by a speech recognition system.

Feel free to check out the audio samples here and here.

Though the results lag behind a conventional cascade system, the demonstration of the feasibility of end-to-end direct speech-to-speech translation is remarkable.
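For readers new to BLEU, here is a rough idea of what the score measures. The sketch below is a simplified, single-reference, sentence-level BLEU written in plain Python. It is an assumption-laden illustration, not the evaluation script Google AI used: the function names (`ngram_counts`, `simple_bleu`) are made up for this example, there is no smoothing, and real evaluations are usually computed over a whole test set. In the Translatotron setting, the `candidate` string would be the speech recognition transcript of the translated speech, and `reference` would be the ground-truth translation text.

```python
import math
from collections import Counter
from typing import List

def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) multiplied by a brevity penalty. No smoothing, so any
    n-gram order with zero matches makes the whole score 0.0."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates that are shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

# Hypothetical ASR transcript of translated speech vs. a reference translation.
print(simple_bleu("the cat sat on a mat", "the cat sat on the mat"))
```

A higher score means the transcript of the translated speech shares more word sequences with the reference translation. Since the transcript comes from a speech recognition system, recognition errors can also lower the score.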

Conclusion: According to my research and various other sources, Translatotron is the first end-to-end model that can directly translate speech from one language into speech in another language.

It is also able to retain the source speaker’s voice in the translated speech.

This work can serve as a starting point for future research on end-to-end speech-to-speech translation systems.

Thank you for sticking around till the end. :)

Do share if you found it useful. Feel free to express your thoughts in the comment section below.

Written by Vaibhav Kumar