Yandex Browser Live Stream Translation: Principles and Differences to On-Demand Video Dubbing

Sergey Dukanov
Yandex
Aug 5, 2022 · 9 min read

We already talked about how automated translation and dubbing of videos works in Yandex Browser. Users watched 81 million videos with voice-over translation in the first ten months after release. The mechanism works on request: as soon as the user hits the button, the neural network receives the entire audio track, and the dubbed translation to the user’s language appears after a few minutes.

But this method is unsuitable for live broadcasts, where you need to translate practically in real-time. That’s why we just launched a separate, more complex live stream translation mechanism in Yandex Browser. Device announcements, sports competitions, inspiring space launches — all of this and plenty of other content can now be viewed in the target language live. The production version currently only supports translation to Russian, with English coming this fall. For now, voice-over translation is available for a limited set of YouTube streams: you’ll find the full list at the end of this article. In the future, we will of course open this functionality to all YouTube live videos. We had to rebuild the entire architecture from the ground up to adapt the translation mechanism for streams.

How Live Stream Translation Works

From an engineering standpoint, translating and dubbing live streams is an arduous task. Two contradictory requirements collide here. On the one hand, you need to feed the model as much text as possible at a time to ensure that the neural network understands the context of each phrase. On the other hand, it is necessary to minimize the delay; otherwise, the broadcast will no longer feel live. Therefore, we must start translating as soon as we can: not in a proper simultaneous interpreting fashion, but very close to it.

We engineered a new service based on existing algorithms to launch a fast, high-quality live stream translation and dubbing. The new architecture made it possible to reduce latency without losing out on quality too much.

In a nutshell, the live stream translation’s principle of operation boils down to five ML models. One neural network is responsible for speech recognition of an audio track and converts it into text. The second engine identifies the speakers’ genders. The third splits the text into sentences by placing punctuation and determining which parts of the text comprise complete thoughts. The fourth neural network translates the received pieces. Finally, the fifth model synthesizes speech in the target language.
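To make that flow concrete, here is a minimal sketch of how one chunk of audio could pass through such a five-stage pipeline. The model objects and method names are hypothetical placeholders for illustration, not Yandex’s actual APIs:

```python
def translate_chunk(audio_chunk, context, models):
    """Pass one fresh chunk of audio through the five stages.

    `models` is a hypothetical container for the five components; the attribute
    and method names below are illustrative, not taken from the article.
    """
    words = models.asr.recognize(audio_chunk)                      # 1. speech -> words with timings
    gender = models.gender.classify(audio_chunk, context)          # 2. who is speaking: male or female
    sentences = models.punct.split(words, context)                 # 3. words -> complete sentences
    translated = [models.mt.translate(s) for s in sentences]       # 4. source -> target language
    return [models.tts.synthesize(t, gender) for t in translated]  # 5. text -> synthesized speech
```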

It looks simple on paper, but there are a lot of pitfalls once you dig deeper. Let’s explore this process in more detail.

The Building Blocks of Live Stream Translation in Yandex Browser

At the first stage, you need to understand precisely what is being said in the broadcast and determine when the words are pronounced. We don’t just translate speech but also superimpose the result back on the video at the right moments.

Deep learning is a perfect solution to the problem of ASR (Automated Speech Recognition). The neural network architecture should allow for a live stream usage scenario when it’s necessary to process audio as it arrives. Such a limitation may affect prediction accuracy, but we can apply the model with some delay (a few seconds), which gives the model some context.

Videos may contain extraneous noises and music. Besides, people may have varying diction or speak with different accents and speeds. There may be many speakers, and they may shout rather than talk at a moderate volume. And, of course, you need to support a rich vocabulary because there are a lot of possible video topics. Thus, collecting the necessary training data plays a key role.

At the input, the algorithm receives a sequence of audio pieces, takes N of them from the end, extracts acoustic features (a MEL spectrogram), and feeds the result as input to the neural network. It, in turn, gives out a set of sequences of words (so-called hypotheses), from which the language model (a text-specific part of the neural network) selects the most plausible hypothesis. When a new piece of audio arrives, the process repeats.
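As a rough illustration of this sliding-window loop, the sketch below concatenates the last N chunks, extracts a log-mel spectrogram with librosa (our choice for the example; the article doesn’t name a feature-extraction library), and lets a language model pick the best hypothesis:

```python
import numpy as np
import librosa  # assumed here only for feature extraction; not mentioned in the article

SAMPLE_RATE = 16_000
WINDOW_CHUNKS = 8  # "N": how many trailing chunks form the acoustic context (illustrative value)

def recognition_step(audio_chunks, acoustic_model, language_model):
    """One iteration of the streaming recognizer: new chunk in, best hypothesis out."""
    window = np.concatenate(audio_chunks[-WINDOW_CHUNKS:])         # take N pieces from the end
    mel = librosa.feature.melspectrogram(y=window, sr=SAMPLE_RATE, n_mels=80)
    features = librosa.power_to_db(mel)                            # acoustic features (log-mel)
    hypotheses = acoustic_model(features)                          # candidate word sequences
    return max(hypotheses, key=language_model.score)               # the most plausible hypothesis wins
```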

The resulting sequence of words needs to be translated. The quality will suffer if you translate word by word or phrase by phrase. If you wait for a long pause, which signifies the end of the sentence, there will be a significant delay. Therefore, it is necessary to group words into sentences to avoid loss of meaning or sentences that are too lengthy. One way to solve these problems is to use a punctuation recovery model.

With the advent of transformers, neural networks are much more capable of understanding the meaning of the text, the relationship between words, and patterns of language constructions. You only need a large amount of data. For restoring punctuation, it’s enough to take a text corpus, submit text without punctuation to the neural network input, and train the network to fix it back up.

The text is sent to the neural network input in tokenized form; usually, these are BPE tokens. Such a split is coarse enough to keep the sequences from getting too long, yet fine enough to avoid the out-of-vocabulary problem, where a token is missing from the vocabulary. At the model’s output, each word is assigned a label that marks which punctuation symbol, if any, should follow it.
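To illustrate how per-word labels turn back into punctuated sentences, here is a small sketch; the label names are made up for the example and are not Yandex’s actual tag set:

```python
# Map each predicted label to the punctuation symbol it restores.
PUNCT = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}
SENTENCE_END = {"PERIOD", "QUESTION"}

def restore_punctuation(words, labels):
    """Attach predicted punctuation to words and cut the stream into sentences."""
    sentences, current = [], []
    for word, label in zip(words, labels):
        current.append(word + PUNCT[label])
        if label in SENTENCE_END:              # a complete thought: ready for translation
            sentences.append(" ".join(current))
            current = []
    return sentences, current                  # `current` is the unfinished tail we keep waiting on

done, tail = restore_punctuation(
    ["the", "rocket", "lifted", "off", "and", "now"],
    ["O", "O", "O", "PERIOD", "O", "O"],
)
# done == ["the rocket lifted off."], tail == ["and", "now"]
```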

To work properly in live streaming conditions, you must limit the context the model sees. The size of this context should strike a compromise between quality and latency. If we are unsure whether breaking into sentences at this particular point is necessary, we can wait a little longer until new words come in. Then we will either better define the partitioning or exceed the context limit and be forced to split where we are only somewhat sure.
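A toy version of that wait-or-split decision might look like the snippet below; the thresholds are invented for illustration, since the article only describes the trade-off, not concrete numbers:

```python
MAX_CONTEXT_TOKENS = 128   # hard limit: beyond this we must split somewhere (illustrative value)
CONFIDENT_SPLIT = 0.9      # split right away if the model is this sure of a boundary
RELUCTANT_SPLIT = 0.5      # fallback threshold once the context limit is reached

def should_split(boundary_prob, tokens_buffered):
    """Decide whether to cut a sentence now or wait for more words."""
    if boundary_prob >= CONFIDENT_SPLIT:
        return True                               # clearly the end of a sentence: send it on
    if tokens_buffered >= MAX_CONTEXT_TOKENS:
        return boundary_prob >= RELUCTANT_SPLIT   # out of patience: split where we're only somewhat sure
    return False                                  # otherwise wait a little longer
```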

For correct translation and high-quality dubbing, you need to determine the gender of the speaker. If you use the gender classifier at the sentence level, the live streaming scenario is no different from on-demand. But we go further: not only can we determine the gender of a person from a single phrase, we also take into account the classification results for the phrases they uttered previously. To do this, we determine on the fly who each line belongs to and store a history of every speaker’s voice lines. This more precise gender classification reduces the error rate by one and a half times.
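The history-based classification can be sketched as a simple accumulation of per-phrase scores per speaker. How speakers are matched to their previous lines isn’t described in the article, so `speaker_id` is treated as an opaque key here:

```python
from collections import defaultdict

# Accumulated gender evidence for every speaker seen so far in the stream.
history = defaultdict(lambda: {"male": 0.0, "female": 0.0})

def classify_gender(speaker_id, phrase_scores):
    """phrase_scores: the classifier's output for the current phrase, e.g. {"male": 0.4, "female": 0.6}."""
    for gender, score in phrase_scores.items():
        history[speaker_id][gender] += score    # add this phrase's evidence to the speaker's history
    totals = history[speaker_id]
    return max(totals, key=totals.get)          # the decision uses all phrases heard so far, not just one
```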

From machine translation’s point of view, nothing has changed compared to the translation of ready-made videos, so we will not delve into it here. We’ve covered the inner workings of our translation technology in the past.

The basic synthesis technology in Alice, Yandex’s smart assistant, is similar to the one we use in video translation. The difference is in how the application (inference) of these neural networks is carried out. The speaker in the video can utter a remark very quickly, or the translation of the sentence may turn out twice as long as the original. In these cases, you’ll have to compress the synthesized audio to keep up with the timing. This can be achieved in two ways: at the sound wave level, for example, using PSOLA (Pitch Synchronous Overlap and Add), or within the neural network. The second method produces more natural-sounding speech but requires the ability to edit hidden parameters.
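For the waveform-level route, a minimal sketch could use librosa’s phase-vocoder time stretching as a stand-in for PSOLA (it’s a different algorithm, chosen here only because it’s readily available):

```python
import librosa

def fit_to_slot(synth_audio, sample_rate, slot_seconds, max_rate=1.3):
    """Compress a synthesized phrase so it fits the time slot left by the original speech.

    librosa.effects.time_stretch is a phase-vocoder method, used here only as a stand-in
    for a waveform-level approach such as PSOLA; max_rate caps the speed-up, since
    acceleration beyond roughly 30% becomes clearly audible.
    """
    duration = len(synth_audio) / sample_rate
    if duration <= slot_seconds:
        return synth_audio                              # already fits, leave it untouched
    rate = min(duration / slot_seconds, max_rate)       # rate > 1.0 means faster playback
    return librosa.effects.time_stretch(synth_audio, rate=rate)
```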

It’s essential not only to bring the durations of synthesized phrases to the desired length but also to place them at the right moments. It will not always be perfect: you’ll either have to speed up the recording or shift the timings — the stacking algorithm is responsible for this. In live stream broadcasting, you can’t change the past, so you may get a situation where you need to voice a phrase twice as fast as it’s pronounced in the original video. For reference: acceleration by more than 30% significantly affects human perception.

The solution is as follows: we reserve some time in advance. We aren’t in a hurry to stack voice lines and can wait for new ones to account for their duration. We can also let a small time shift accumulate: sooner or later, the video will contain a few seconds of silence, letting the shift fall back to zero.
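The stacking logic with a reserved head start and a self-resetting shift can be sketched roughly like this; the reserve value is an assumption for illustration:

```python
class Stacker:
    """Toy version of the stacking algorithm: a phrase starts no earlier than its original
    timestamp, never overlaps the previous phrase, and pauses in the original audio
    naturally let the accumulated lag shrink back toward zero."""

    def __init__(self, reserve_seconds=3.0):
        self.cursor = reserve_seconds    # head start reserved before we begin stacking (illustrative)

    def place(self, phrase_start, phrase_duration):
        start = max(phrase_start, self.cursor)    # wait for the original moment, or queue after the last phrase
        self.cursor = start + phrase_duration
        lag = start - phrase_start                # how far this phrase lags behind the original timing
        return start, lag
```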

The resulting audio track is cut into fragments and wrapped in an audio stream that will be mixed locally in the Browser client itself.

The Architecture of the Live Stream Translation Service

When you watch a broadcast, the Browser polls the streaming service (for example, YouTube) for new fragments of video and audio; if there are any, it downloads and plays them sequentially.

When the user clicks the live translation button, Yandex Browser requests a link to the stream with translated audio from its backend. The Browser overlays this track on top of the main one while respecting timings.
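Mixing the two tracks on the client can be pictured as a simple overlay. Lowering the original track’s volume (the `duck` factor below) is an assumption for the example, since the article only says the translated audio is superimposed with correct timings:

```python
import numpy as np

def mix_tracks(original, translated, duck=0.3):
    """Overlay the translated audio on the original one (both as float arrays in [-1, 1])."""
    n = min(len(original), len(translated))
    mixed = duck * original[:n] + translated[:n]   # quiet original under the full-volume dub
    return np.clip(mixed, -1.0, 1.0)               # keep samples within the valid range
```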

Unlike ready-made videos, a live stream has to be processed continuously for as long as it’s on the air. Stream Downloader reads the audio stream and sends it to the ML processing pipeline whose components we covered above.

There are several ways to organize interaction between components. We settled on the option with message queues, where each component is designed as a separate service (a minimal sketch of this pattern follows the list):

  • It’s problematic to run all models on the same machine: they may simply not fit in memory or require a very specific hardware configuration.
  • We need to balance the load and be able to scale horizontally. For example, machine translation and voice synthesis services have different throughput capacities, so they may need a different number of running instances.
  • Services sometimes crash (GPU running out of memory, memory leak, or power outage in the data center), and queues provide a retry mechanism.
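The pattern itself is simple: every service reads messages from its input queue, processes them, and publishes results for the next stage, requeueing on failure. The sketch below uses Python’s in-process `queue.Queue` purely to illustrate the shape; in production each service would be a separate process talking to a real message broker:

```python
import queue
import threading

def service(inbox, outbox, handle):
    """A generic pipeline worker: consume a message, process it, pass the result on."""
    while True:
        message = inbox.get()
        try:
            outbox.put(handle(message))
        except Exception:
            inbox.put(message)        # crude retry: put the message back for another attempt
        finally:
            inbox.task_done()

def translate(message):
    return message  # placeholder handler standing in for the real MT service

# Hypothetical wiring of two stages of the pipeline.
asr_to_mt, mt_to_tts = queue.Queue(), queue.Queue()
threading.Thread(target=service, args=(asr_to_mt, mt_to_tts, translate), daemon=True).start()
```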

The stream is not anchored to a single instance, but some context (background) may be required for processing. For example, the synthesis needs to store recordings that it hasn’t yet put on the final audio track. Hence, there’s a need for a global context repository for all streams. In the diagram, it is designated as Global Context — in essence, it’s just in-memory key-value storage.
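A bare-bones version of such a store, keyed by stream and context name, might look like this (the real service is networked and shared across machines; a lock-guarded dictionary only shows the idea):

```python
import threading

class GlobalContext:
    """Minimal in-memory key-value storage shared by otherwise stateless workers."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, stream_id, key, default=None):
        with self._lock:
            return self._data.get((stream_id, key), default)

    def put(self, stream_id, key, value):
        with self._lock:
            self._data[(stream_id, key)] = value

ctx = GlobalContext()
ctx.put("stream-42", "pending_tts", ["phrase_0001.wav"])  # e.g. synthesized audio not yet stacked
```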

Finally, the received audio stream needs to reach the user. Here, the Stream Sender takes over: it wraps audio fragments into a streaming protocol, and the client reads this stream from a link.

What’s Next

Currently, we provide live stream translation with an average delay of 30–50 seconds. Sometimes we fall outside this range, but not by much: the standard deviation is about 5 seconds.

The main difficulty in live stream translation is guaranteeing that the delay doesn’t fluctuate too much. A simple example: you open the live stream and, after 15 seconds, begin receiving the broadcast. If you continue watching, sooner or later, one of the models will end up needing more context — for example, if a speaker utters a long sentence without pauses, the neural engine will try to obtain the entire thing. Then the delay will increase by perhaps ten additional seconds. Naturally, a little more delay at the start is preferred to prevent this from happening.

Our global goal is to reduce the delay to approximately 15 seconds. It’s a little more than in true simultaneous interpretation but enough to cover live streams where the hosts interact with the audience, such as on Twitch.

While translation of all YouTube streams is in the works, here is the list of channels where dubbing is already available:

Apple
Business Insider
CNET Highlights
English Speeches
Freenvesting
thegameawards
Google
Google Developers
IGN
NASA
The Overlap
SpaceX
TechCrunch
TED
TEDx
