Bridging the Language Gap With Neural Translation of Videos, Images and Text

Timur G
Yandex
Jun 22, 2022

Translating images

Offering automated translation of plain text on web pages through a browser is helpful, but it doesn’t cover everything published online. Government organizations in Israel, for example, often share information as images, as do many sites in Korean, Chinese, and Arabic. Technical specifications of products sold in online stores are frequently posted as images, too.

Translating text within images requires a pipeline of three technologies. First, computer vision finds and recognizes the text in the image, a process known as optical character recognition, or OCR. Machine translation then translates the recognized text. Finally, the translation is rendered on top of the original image.
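For illustration, here is a minimal sketch of how the three stages compose. The function bodies and the TextBox structure are hypothetical stand-ins, not Yandex’s actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class TextBox:
    text: str
    bbox: tuple  # (x, y, width, height) in image pixels

def recognize_text(image_bytes: bytes) -> list:
    """Stage 1: OCR. A real system runs text detection + recognition models."""
    raise NotImplementedError  # stand-in for the server-side OCR

def translate(text: str, target_lang: str) -> str:
    """Stage 2: machine translation of each recognized fragment."""
    raise NotImplementedError  # stand-in for the translation model

def render_overlay(image_bytes: bytes, boxes: list) -> bytes:
    """Stage 3: draw the translated text over the original regions."""
    raise NotImplementedError  # stand-in for the rendering step

def translate_image(image_bytes: bytes, target_lang: str) -> bytes:
    boxes = recognize_text(image_bytes)
    for box in boxes:
        box.text = translate(box.text, target_lang)
    return render_overlay(image_bytes, boxes)
```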

A simple implementation of this pipeline would be to take the original image, upload it in its original format from Yandex Browser to a Yandex server, do all the work on the server side, and return the new version of the image with the rendered translation to the user. This would be the easiest option for Yandex, but not so convenient for the user: images can be quite large, and sending them back and forth consumes traffic and time, which ultimately degrades the user experience.

With the users’ interests in mind, we chose a more sophisticated approach. Yandex Browser shrinks the images, converts them to black-and-white, and encodes them in the WebP format, which is on average 15–20% more compact than JPEG. In combination, these measures significantly reduce image size without a considerable loss in the quality of text recognition and translation.
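A minimal sketch of this client-side preprocessing, using the Pillow library; the maximum dimension and WebP quality settings are illustrative assumptions, not Yandex Browser’s actual values:

```python
import io

from PIL import Image

def prepare_for_upload(image_bytes: bytes, max_side: int = 1024) -> bytes:
    img = Image.open(io.BytesIO(image_bytes))
    img.thumbnail((max_side, max_side))  # shrink in place, preserving aspect ratio
    img = img.convert("L")               # grayscale: OCR needs luminance, not color
    buf = io.BytesIO()
    img.save(buf, format="WEBP", quality=80)
    return buf.getvalue()
```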

We also moved the step of overlaying the translation back onto the image to the user’s device. The problem was that the rendered image with translated text had to be laid over the original color image stored in Yandex Browser. The browser cannot distinguish text from the background in the original image, and therefore cannot select the proper color for the translated text. Our server-side OCR can recognize the text, but it cannot tell what color it was before the image was converted to black-and-white.

To solve this problem, we highlighted key background and text points in the image on the OCR side, and then sent their coordinates along with the translation to Yandex Browser. The browser then could use these coordinates to determine the colors and select the right one for the translation to overlay against the background.
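Here is a sketch of how the browser-side color selection might work. It assumes the server sends, for each text box, sample coordinates of text pixels and background pixels; the box layout and helper names are illustrative:

```python
from PIL import Image, ImageDraw

def average_color(img, points):
    """Average the RGB values at the sampled coordinates."""
    pixels = [img.getpixel(p) for p in points]
    return tuple(sum(channel) // len(pixels) for channel in zip(*pixels))

def overlay_translation(img, box, translated, text_points, background_points):
    """Paint over the original text and draw the translation in matching colors."""
    text_color = average_color(img, text_points)      # color of the original letters
    bg_color = average_color(img, background_points)  # color behind them
    draw = ImageDraw.Draw(img)
    draw.rectangle(box, fill=bg_color)                # box = (x0, y0, x1, y1)
    draw.text((box[0], box[1]), translated, fill=text_color)
```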

This is what the end result looks like:

You can check out our image translation technology at work in Yandex Browser on a desktop computer or Android device, as well as in the Yandex App.

Translating videos

Video streaming is becoming more and more popular, with Russians watching ever more educational and popular science videos, as well as interviews, news reports, and other content. Most of these videos are in languages other than Russian, and professional translation is seldom on offer for fresh online content. At best, users get automatically generated subtitles. To meet this growing demand for international video streaming, we’re now working to provide automatic translation and voiceover for videos right in Yandex Browser.

Just as with images, machine translation alone is not enough to break the language barrier for our users. Video translation quality strongly depends on the quality of speech recognition and synthesis. Our work on these technologies for the AI-based conversational assistant Alice helped us to quickly implement them for video translation in Yandex Browser.

Step 1. Speech recognition and text pre-processing

Our input is a video with a voice track. It might be an educational video with one speaker, a two-person interview, or even a discussion with multiple voices. Converting unstructured speech into text yields a structureless sequence of words: no commas, no full stops, no logical grouping of words into sentences or sentences into paragraphs. If we ran such text through a machine translation algorithm, the result would be pure GIGO (garbage in, garbage out). This is why, after converting speech to text, we apply a special neural network that cleans out the junk, groups words into semantic segments, and inserts punctuation marks.
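Conceptually, the cleanup step looks something like the sketch below. The regex filtering and the naive restore_structure stub only approximate what the neural network does; the real model and its interface are not public:

```python
import re

FILLERS = re.compile(r"\b(?:uh|um|er)\b", re.IGNORECASE)

def restore_structure(text: str) -> str:
    """Naive stand-in for the neural model that segments and punctuates."""
    text = text.strip()
    return text[:1].upper() + text[1:] + "."

def clean_asr_output(raw_words: list) -> str:
    text = " ".join(raw_words)
    text = FILLERS.sub("", text)         # drop disfluencies
    text = re.sub(r"\s{2,}", " ", text)  # collapse the resulting gaps
    return restore_structure(text)
```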

When translating videos, we rely not only on the voice, but also on subtitles. If a video already has subtitles written by a person, we don’t apply our speech recognition technology to it: human-written text is usually of higher quality than the output of automatic speech recognition, also known as ASR. If the subtitles were generated automatically, however, we ignore them and use speech recognition instead.

Even subtitles that have been written by a person and added manually must be processed by a neural network to remove anything that could be confusing during voice synthesis, such as descriptions of sounds (applause, sirens, etc.) or speakers’ names before their lines.

Manual subtitles may also be broken into arbitrary segments rather than logical phrases, so we have to make sure the generated speech follows the actual meaning, which may be spread across multiple lines.

The two lines of subtitles in this screenshot have been segmented incorrectly. Automated translation technology would translate these lines as they are, but our neural network reconstructs the actual phrase segments from the context, and the final result for speech synthesis looks like this:

So this is pretty cool.
This is actually a diagnostic technique.
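Conceptually, the subtitle preprocessing combines both steps: stripping sound descriptions and speaker labels, then re-joining arbitrary line breaks so phrases can be segmented by meaning. This rule-based sketch only approximates what the neural network does, and the regex patterns are illustrative assumptions about common subtitle conventions:

```python
import re

SOUND_DESCRIPTION = re.compile(r"[\[(][^\])]*[\])]")  # e.g. (applause), [siren]
SPEAKER_LABEL = re.compile(r"^[A-Z][A-Z .'\-]+:\s*")  # e.g. "JIMMY WALES:"

def preprocess_subtitles(lines: list) -> str:
    cleaned = []
    for line in lines:
        line = SOUND_DESCRIPTION.sub("", line)
        line = SPEAKER_LABEL.sub("", line)
        if line.strip():
            cleaned.append(line.strip())
    # Joining the lines undoes the arbitrary breaks; the real system then
    # re-segments the text into logical phrases with a neural model.
    return " ".join(cleaned)
```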

Step 2. Biometrics

Now that we have correct text segments with timestamps, we determine the speaker’s gender for each segment so we can apply an appropriate synthesized voice. A translated voiceover is easier to follow when different speakers have different voices. Moving forward, the synthesized voice will match not only the speaker’s gender, but also their personal tone, pitch, timbre, and other characteristics. Our speech synthesis technology currently supports two voices, male and female, with more in the works.
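A minimal sketch of how segments might be paired with voices; the Segment structure and the voice names are hypothetical, not Yandex’s actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    start: float  # seconds from the start of the video
    end: float
    gender: str = "unknown"

VOICES = {"male": "ru_male_1", "female": "ru_female_1"}

def assign_voice(segment: Segment, default: str = "ru_female_1") -> str:
    """Pick a synthesized voice matching the detected speaker gender."""
    return VOICES.get(segment.gender, default)
```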

Step 3. Machine translation

This step is quite standard, but with one important distinguishing feature: in addition to many other factors, we feed information about the speaker’s gender into the translation model to ensure correct grammar in the final result. Thanks to this information, here is how the phrase “I found a wonderful spouse,” said by a bride and by a groom respectively, looks in Russian:

Bride:

Я нашла замечательного супруга.

Ya nashla zamechatelnogo supruga.

Groom:

Я нашёл замечательную супругу.

Ya nashol zamechatelnuyu suprugu.

The ending of every word in the sentence except Ya (“I”) changes depending on who is talking (female bride or male groom) and who they are referring to (male groom or female bride). Word forms in many languages vary depending on the gender of the speaker and the person or object they are referring to. Feeding this information to the translation model ensures that the speakers continue to refer to themselves and address others correctly in the translated version.
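One common way to condition a translation model on speaker attributes is to prepend a special tag token to the source sentence. Whether Yandex uses exactly this mechanism is an assumption on our part; the tag names below are illustrative:

```python
TAGS = {"male": "<spk_m>", "female": "<spk_f>"}

def tag_source(sentence: str, speaker_gender: str) -> str:
    """Prepend a speaker-gender tag the model was trained to condition on."""
    return f"{TAGS.get(speaker_gender, '<spk_u>')} {sentence}"

# During training the model learns to produce gender-agreeing inflections:
#   tag_source("I found a wonderful spouse", "female")
#   -> "<spk_f> I found a wonderful spouse" -> "Я нашла замечательного супруга."
```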

Step 4. Speech synthesis

When synthesizing translated speech, the length factor must be taken into account. Russian texts are generally longer than English ones; the difference averages between 10% and 30%, meaning that in longer videos the English speaker and our Russian voice risk getting seriously out of sync. To synchronize the two speech streams, instead of simply accelerating the Russian audio track, we use the timestamps that we create when analyzing the original speech. These timestamps tell us which phrases are pronounced at which points and help us sync the speech streams more accurately.

Speech synthesis is a complex process with two major steps. First, neural networks create a spectrogram, a visual representation of sound frequencies, for each phrase. Second, other neural networks convert the spectrograms into sound. Timestamps help us generate a spectrogram of the right duration at the first step. The spectrogram of the translated speech is then shrunk to match the original, primarily by removing unnecessary pauses between phrases and words, and only when this isn’t sufficient does the algorithm speed up the speech.
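The fitting logic described above can be sketched as follows: shrink the pauses first, and only accelerate the speech itself if that is not enough. The structure and limits are illustrative assumptions, not the production algorithm:

```python
def fit_duration(speech_sec: float, pause_sec: float, target_sec: float,
                 min_pause_sec: float = 0.3, max_rate: float = 1.3):
    """Return (total pause budget, playback rate) so the phrase fits target_sec."""
    overshoot = speech_sec + pause_sec - target_sec
    if overshoot <= 0:
        return pause_sec, 1.0              # already fits, nothing to do
    removable = pause_sec - min_pause_sec
    if overshoot <= removable:
        return pause_sec - overshoot, 1.0  # trimming pauses is enough
    # Otherwise cut pauses to the minimum and speed up the speech itself.
    rate = speech_sec / (target_sec - min_pause_sec)
    return min_pause_sec, min(rate, max_rate)
```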

Step 5. Notifications

Video translation is a cascade of resource-intensive technologies that run in sequence. It takes time to run the huge transformer neural networks, even when they work in parallel across multiple GPUs. Translating an hour-long video initially took us half an hour. We’ve since optimized the whole process and sped up the translation engine significantly, but it still takes minutes rather than seconds.

As we continue to work toward instant translation, in cases when it takes longer than a few seconds we offer to notify users once their translation is ready. After requesting a voiceover translation for a video, they can close the tab and return to the page when Yandex Browser notifies them that the video is available for watching.

Video translation pipeline

Watch a translated sample of Wikipedia co-founder Jimmy Wales’ lecture at Yandex (the original is here). This fragment illustrates both the potential of our technology and the challenges yet to be faced.

Our goal is to help people overcome the language barrier and discover new, interesting content where no professional translation is yet available. We will continue improving our video translation engine to expand opportunities for our users.

Synchronous translation is currently available for English videos on popular platforms, such as YouTube, Vimeo, Facebook, and others. It works in the desktop and Android versions of Yandex Browser, and in the Yandex App for Android and iOS.

The Yandex Translate API is available via the Yandex Cloud platform. It supports glossaries and lets customers improve the quality of machine translation by using their own data to train the model.
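For reference, here is a minimal call to the Yandex Translate API v2 from Python. The request shape reflects the public documentation at the time of writing, and the key and folder ID are placeholders; check the Yandex Cloud docs for current field names:

```python
import requests

API_KEY = "<your-api-key>"      # issued in the Yandex Cloud console
FOLDER_ID = "<your-folder-id>"  # the cloud folder that owns the quota

response = requests.post(
    "https://translate.api.cloud.yandex.net/translate/v2/translate",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={
        "folderId": FOLDER_ID,
        "targetLanguageCode": "ru",
        "texts": ["I found a wonderful spouse"],
    },
)
response.raise_for_status()
for item in response.json()["translations"]:
    print(item["text"])
```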

Yandex Translate continues to innovate, including by adding new language pairs and expanding opportunities for business partners and clients. To join us on this exciting journey, participate in our pilot projects, or simply share your ideas and feedback, contact the Yandex Translate team at videotranslate@yandex-team.com.
