Realtime Translated Subtitles

Tarek Madany Mamlouk · Published in Axel Springer Tech · 6 min read · Dec 14, 2020

Written by Saidusmon Oripov and Tarek Madany Mamlouk

This November, Axel Springer held its first fully virtual tech conference. We are an English-speaking company, but with our headquarters in Germany, most of the conference’s participants were German. The smaller sessions were organized in MS Teams, where you can enable real-time subtitles. This can be really helpful if you are struggling with the spoken language, but it would be even better if the live-generated subtitles were directly translated into your preferred language. We don’t have that? Let’s build it!

Choosing the right Technology

There are some critical requirements for this kind of application. We need little to no delay, delivery of intermediate results, and assembly of semantically sound sentences.

There are currently three popular types of machine translation systems on the market: neural, statistical, and rules-based. Over the past few years, big technology companies like Google, Amazon, Microsoft, Facebook, and IBM have been transitioning from old-fashioned phrase-based statistical machine translation to neural machine translation. The main reason is that the newer technology shows better translation accuracy. According to a study by Tilde, a neural machine translation system handles word ordering, morphology, syntax, and agreement up to five times better than a statistical machine translation system.

Visualization of data from tilde.com

Hard-to-translate content like acronyms, jargon, slang, industry terminology, and cultural differences is critical for getting accurate translations, and it remains a big challenge for machine translation. However, rapid advances in machine intelligence have improved speech and image recognition capabilities, continuing to drive up quality. These systems are increasingly being employed in diverse business areas, introducing new applications and enhanced machine-learning models. Large organizations are turning to machine learning to augment their workloads and make their content accessible faster than would be possible without automation.

For now, we decided to implement our solution on the basis of Google’s new Media Translation API. This service is currently in its beta phase and therefore offers limited support. Since this is a prototype, working with the beta release was totally fine.

Google’s approach for Media Translation uses bidirectional streaming RPCs to move data between client and server. Both streams act independently, so the server can decide to answer requests immediately or wait for enough information before sending a consolidated response.
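For illustration, a minimal sketch of opening such a stream with the Node.js SDK could look like this (package, method, and field names follow Google’s published samples for the beta API, so treat the exact shape as an assumption):

```javascript
// Sketch: opening a bidirectional translation stream with Google's
// Media Translation SDK for Node.js (@google-cloud/media-translation).
// Field names follow Google's public samples; verify against the current beta docs.
const mediaTranslation = require('@google-cloud/media-translation');

const client = new mediaTranslation.SpeechTranslationServiceClient();

// Raw 16-bit PCM audio, English in, German subtitles out.
const streamingConfig = {
  audioConfig: {
    audioEncoding: 'linear16',
    sourceLanguageCode: 'en-US',
    targetLanguageCode: 'de-DE',
  },
  singleUtterance: false, // keep the stream open across sentences
};

// The call returns a duplex stream: we write audio chunks into it,
// and the server pushes intermediate and final translations back.
const stream = client
  .streamingTranslateSpeech()
  .on('error', (err) => console.error('stream error:', err))
  .on('data', (response) => {
    const result = response.result;
    if (result && result.textTranslationResult) {
      const { translation, isFinal } = result.textTranslationResult;
      // Intermediate results arrive continuously; isFinal marks a settled sentence.
      console.log(`${isFinal ? 'FINAL' : 'partial'}: ${translation}`);
    }
  });

// Per the streaming protocol, the very first message carries only the
// configuration; the audio content follows in subsequent messages.
stream.write({ streamingConfig, audioContent: null });
```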

In gRPC, the client can set a timeout for the completion of its calls. How timeouts are defined is language-specific and might require setting a duration for the call or a fixed point in time as a deadline. Calls can be terminated independently on each side without any dependency on the outcome of the other stream.
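As a generic illustration (not part of our client), attaching a deadline to a call in Node.js with grpc-js could look roughly like this; `client` and `translateStream` are hypothetical stand-ins for a generated gRPC stub and its streaming method:

```javascript
// Generic sketch: a deadline on a gRPC call in Node.js with @grpc/grpc-js.
// `client` and `translateStream` are hypothetical placeholders for a generated stub.
const grpc = require('@grpc/grpc-js');

// A deadline is an absolute point in time, here 30 seconds from now.
const deadline = new Date(Date.now() + 30 * 1000);

const call = client.translateStream({ deadline });

call.on('error', (err) => {
  if (err.code === grpc.status.DEADLINE_EXCEEDED) {
    // The server did not finish within the allotted time; the call was terminated.
    console.error('call aborted: deadline exceeded');
  }
});
```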

In our case, we implemented a slim client in Node.js based on Google’s Media Translation SDK. The hardest part was actually getting access to the device’s microphone via SoX. While this worked flawlessly on one device, others ran into problems.
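The microphone capture could look roughly like this, assuming the commonly used node-record-lpcm16 package, which shells out to SoX’s rec/sox binaries under the hood (`stream` and `streamingConfig` refer to the sketch above; option names vary between package versions, so check the README):

```javascript
// Sketch: capturing microphone audio through SoX and feeding it into the
// translation stream from the previous sketch. Assumes node-record-lpcm16,
// which wraps SoX's `rec`/`sox` binaries; they must be installed and on the PATH.
const recorder = require('node-record-lpcm16');

const recording = recorder.record({
  sampleRateHertz: 16000, // option names differ between node-record-lpcm16 versions
  threshold: 0,           // start recording immediately, no silence gate
  recordProgram: 'rec',   // the SoX front end; this is where device-specific issues showed up
});

recording
  .stream()
  .on('error', (err) => console.error('microphone error:', err))
  .on('data', (chunk) => {
    // After the initial config-only message, each message carries
    // base64-encoded LINEAR16 audio alongside the streaming config.
    stream.write({ streamingConfig, audioContent: chunk.toString('base64') });
  });
```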

Components of our real-time translation overlay

Building a Prototype for Production

We used this real-time translator at the Axel Springer TechCon 2020, an online conference with hundreds of participants sharing knowledge about tech-related topics. We introduced the tool as a surprise addition to our talks. Even though we are an English-speaking company, the majority of the conference’s participants were German. Having German subtitles in real time was a nice feature.

Displaying German subtitles in real time while speaking English

Integration into our talks was easy because we managed our stream via OBS. For a clean overlay in OBS that always shows the latest state of the subtitles while the speaker is talking, we wrote a small application in React using EventSource. This way, our client subscribes to the updates on the Node.js server and refreshes the display immediately. The chroma-key filter in OBS allows us to generate a transparent overlay on top of our video so that the viewer sees the subtitles while the speaker is on stage.
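A sketch of this wiring, assuming an Express server with a hypothetical /subtitles Server-Sent Events endpoint and a small React component subscribing to it via EventSource (the route name and payload shape are our own illustration):

```javascript
// server.js — sketch of a Server-Sent Events endpoint on the Node.js server.
// Express and the /subtitles route are assumptions for illustration.
const express = require('express');

const app = express();
const clients = new Set();

app.get('/subtitles', (req, res) => {
  res.set({
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
  });
  res.flushHeaders();
  clients.add(res);
  req.on('close', () => clients.delete(res));
});

// Called from the translation stream's 'data' handler with each new translation.
function broadcastSubtitle(text) {
  for (const res of clients) {
    res.write(`data: ${JSON.stringify({ text })}\n\n`);
  }
}

app.listen(3000);
```

```javascript
// Subtitles.jsx — sketch of the React overlay rendered on a chroma-key background.
import { useEffect, useState } from 'react';

export function Subtitles() {
  const [text, setText] = useState('');

  useEffect(() => {
    const source = new EventSource('/subtitles');
    source.onmessage = (event) => setText(JSON.parse(event.data).text);
    return () => source.close();
  }, []);

  // Solid green background so the chroma-key filter in OBS can turn it transparent.
  return <div style={{ background: '#00ff00', color: '#ffffff' }}>{text}</div>;
}
```

Server-Sent Events keep the overlay a simple one-way subscriber, which is all a subtitle display needs.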

Do we need this?

Axel Springer is a globally connected publishing house, and to scale its business, it needs to attract a diverse audience. But keeping up the exchange of information in multiple languages at scale is a growing challenge, especially as more work moves to virtual formats.

As digitalization becomes more widespread across industries, the demand for automated machine translation will increase. We can also expect the ongoing pandemic to have a positive impact on the market. In 2019, the machine translation market was valued at USD 550 million, and it is expected to reach USD 1.5 billion by 2026 (see marketwatch.com and prnewswire.com, which both agree on the estimated market valuation).

Visualization of data from prnewswire.com

It’s easy to presume that the world is becoming more and more fragmented, e.g., trade wars, tariffs, populism. But in spite of the backlash, the corporate world has never been so connected. The largest organizations worldwide are embracing a bigger push for globalization and expanding their services internationally because they see increasing value in delivering their products and content globally. That being said, becoming a genuinely global business brings its own challenges. Companies that fail at digital transformation can’t keep pace with globalization and lose competitiveness.

So the question is not whether we need this but rather whether we can afford NOT to use it.

Where do we go from here?

Reactions to this experiment were mixed. At first, everyone was dazzled and impressed that a talk can be translated in real time. Even though real-time subtitling has been around for a while and everyone knows machine translators, seeing these subtitles generated on the fly was impressive. Then came amusement, because some of the subtitles made no sense at all. When the speaker spoke slowly and clearly, the translations were excellent. But sometimes the speaker stuttered, mumbled, or spoke very fast. Then the translator started guessing, and the results were hilarious. Unfortunately, this made the audience question the value of the tool because information got lost or was misrepresented. It’s not easy to decide whether this particular prototype was a success or a failure. Yes, the translations were not a perfect rendering of the spoken content, but what exactly is our expectation?

I personally see this implementation as successful because it proves that software can provide some kind of real-time translation to help people understand the spoken word. What I question is the way the subtitles are displayed. There was too much text pouring across the screen for people to follow. A next iteration could condense the real-time translations into keywords or abbreviated sentences to give viewers context and partial translations, assuming they have a basic understanding of the given talk. This involves a design element (displaying the text in chunks that are easy to read) and a language-processing element (extracting a semantic excerpt from a sentence). If we can manage that in real time, then we have a unique and stylish solution for a very old problem.
