Empower Subtitles — Generates subtitles for streams (Hackdays Project)

Victor Wu
Zattoo’s Tech Blog
8 min read · Nov 12, 2021

Zattoo Hackdays 2021 successfully ended in early October. Hackdays is a great chance for Zattooies to show our innovative and creative ideas. In the last few years, we’ve seen amazing Hackdays projects like Personalized Playlist, Zattoo Kids and others. While we appreciate all the wonderful projects that get done in just 2 days, there is also the chance to keep working on the best ideas afterwards, especially for the winning team. I’m glad to see all the new projects in this year’s event, and I’m looking forward to seeing even more interesting projects next year.

Wonderful trophies made in Zattoo (Lucia & Jerzy Michał Jurczyk)

But today, instead of the new projects, let’s take a chance to look into one of the Hackdays projects from last year, Empower Subtitles. Let’s go into more detail about what we did and how we did it in this project.

What is Empower Subtitles?

As the name suggests, it is about empowering the subtitles experience when watching our streams. Although subtitles are widely available for movies and other VOD content, they are not always available for live streams. And even when subtitles are available for movies and VOD content, users can’t always benefit from them, because the languages provided are usually limited and might not include the user’s preferred language.

Why are subtitles important?

To answer this question, let me ask you some questions.

  1. Have you ever tried to learn a new language by watching TV and movies, but found it too difficult without subtitles?
  2. Have you ever watched a movie that you couldn’t fully understand but it became a lot easier after switching the subtitles on?
  3. Have you ever found interesting shows or movies but you didn’t know the language?

Crying when looking at “no subtitles available” 😭

If you answered “yes” to any of these questions then you know firsthand the importance of subtitles. In fact, you are probably not alone. Besides those scenarios, there is one even more important use of subtitles: improving accessibility.

According to a 2006 Ofcom subtitles study¹, there are 7.5 million subtitle users in the UK, which is 18% of the population. Among those 7.5 million, 1.5 million are deaf or hard of hearing. According to WHO data, the number of people with disabling hearing loss worldwide is very likely to almost double (from 466 million to 900 million) within 30 years².

But probably the more interesting part of the study is that 80% of subtitle users have no hearing impairment at all. So why do so many users turn subtitles on? Here are some of the reasons and benefits of using subtitles:

  1. Easier to follow for users watching in a second language, like English, German, etc.
  2. Better comprehension when dialogue is spoken very quickly, with accents, mumbling, or background noise, etc.
  3. Clearer for names, full names, brand names, technical terminology, etc.
  4. Allowing users to watch in more places, including sound-sensitive environments like libraries
  5. Better experience for different users: hard of hearing, learning disabilities, attention deficits, etc.

With all these benefits and scenarios, it’s clear that enhancing subtitles can improve the user experience. We can also see that companies like Google are enhancing caption support for audio and video³.

What’s the plan?

Our plan is to increase subtitle coverage by translating existing subtitles and by generating subtitles with speech-to-text. Translation is the easier way to achieve a good experience in most cases, because it is the more mature technology and the existing subtitles provided with shows and movies give it a solid base to work from. However, translation can’t cover all streams, because not all streams have original subtitles. Therefore, we also include speech-to-text to expand the coverage as much as possible. Unfortunately, speech-to-text is still very unstable, especially for content where more than one person is speaking, or even speaking at the same time. Therefore, our primary Hackdays goal was limited to enhancing subtitles using translations, with speech-to-text as a good-to-have part of the prototype (and we made it!).

For the UI, we also want to allow two subtitle tracks to be displayed at the same time, to assist users who want to see their own language next to the new language they’re trying to learn.

How did we achieve that?

To do the translation and speech-to-text, we use Google Cloud APIs: the Translation API and the Speech-to-Text API. In this project, both translation and speech-to-text are done on our server side. Putting the API calls on the server has advantages over implementing them on the client side, because it minimizes Google Cloud API usage while delivering the same subtitles to all clients.

For translation, we send the original subtitles from the source stream in chunks. We send data to the Translation API as soon as we extract it from the source, so multiple chunks can be sent within a single subtitles (video/audio) fragment. With the translated text chunks returned from the Google Translation API, we package them together as one of our subtitle fragments in our streams.
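To make the idea more concrete, here is a minimal sketch of what translating a batch of extracted chunks could look like with the google-cloud-translate client. The function name, language choices and the surrounding packaging step are illustrative assumptions, not our production code.

```python
# Minimal sketch: translate the subtitle text chunks extracted from one fragment.
from google.cloud import translate_v2 as translate

translate_client = translate.Client()

def translate_chunks(chunks, source_lang="de", target_lang="en"):
    """Translate a list of plain-text subtitle chunks in one API call."""
    results = translate_client.translate(
        chunks,
        source_language=source_lang,
        target_language=target_lang,
    )
    # The API returns one result per input string, in the same order.
    return [r["translatedText"] for r in results]

# Example: all chunks extracted from one subtitles fragment.
translated = translate_chunks(["Guten Abend.", "Hier sind die Nachrichten."])
# `translated` is then packaged into a subtitle fragment (packaging not shown).
```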

Speech-to-text is not as easy as translation, due to the difficulties and limitations of the technology. Recognition is a lot more reliable when a full sentence is passed to the speech-to-text service, but because no one would be happy about subtitles showing up after the dialogue has finished, we couldn’t use the offline, one-shot mode for this. Therefore, we use the streaming mode by continuously streaming the channel’s audio data over a gRPC connection to the Google Cloud Speech-to-Text API. Feeding in a continuous audio stream and receiving a continuous stream of subtitle text: perfect!
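As an illustration, here is a rough sketch of how a channel’s audio could be fed into the streaming recognizer with the google-cloud-speech client. The audio source, file name, sample rate and language are assumptions made for the example.

```python
# Sketch: stream a channel's audio into Google Cloud Speech-to-Text (gRPC-based).
from google.cloud import speech

client = speech.SpeechClient()

streaming_config = speech.StreamingRecognitionConfig(
    config=speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="de-DE",
    ),
    interim_results=True,  # needed to keep the subtitle delay low (see below)
)

def channel_audio_chunks():
    """Hypothetical audio source: yields ~100 ms raw PCM chunks of the channel."""
    with open("channel_audio.raw", "rb") as audio:
        while chunk := audio.read(3200):  # 100 ms of 16 kHz, 16-bit mono audio
            yield chunk

def audio_requests():
    """Wrap each raw audio chunk in a streaming recognition request."""
    for chunk in channel_audio_chunks():
        yield speech.StreamingRecognizeRequest(audio_content=chunk)

# The responses arrive as a stream while audio is still being sent.
responses = client.streaming_recognize(streaming_config, audio_requests())
```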

But another problem comes up with this streaming mode: if we wait for the final, stable text, we don’t get any result until the recognition of an utterance is finished, which is far too late for live subtitles. Therefore, to provide subtitles with reasonably low delay, we need to use the interim results from the streaming recognition. With interim results, instead of simply getting a recognized text, we get an array of possible texts together with a stability value that indicates how close the recognized text is to being finalized. This array can change rapidly, and we need to decide which text to take from it. To pick a suitable text for our generated subtitles, we have to balance accuracy against the delay before the text is finalized. In our Hackdays prototype, to keep things simple, we use 0.5 as the stability threshold for picking the recognized text from the API responses. We then package this text as a subtitle fragment into the stream.
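Continuing the sketch above, this is roughly how the interim results could be filtered with the 0.5 stability threshold before being packaged; the packaging function is a hypothetical placeholder.

```python
# Sketch: pick text from interim results once it is stable enough (threshold 0.5).
STABILITY_THRESHOLD = 0.5

for response in responses:
    for result in response.results:
        if not result.alternatives:
            continue
        text = result.alternatives[0].transcript
        # Final results are always taken; interim results only once their
        # stability suggests the text is unlikely to change much.
        if result.is_final or result.stability >= STABILITY_THRESHOLD:
            package_subtitle_fragment(text)  # hypothetical packaging step
```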

Together with generating and packaging the subtitles into the stream on our video backend, we update the player UI to provide a better user experience with 2 simultaneous subtitles. In this project, instead of showing the 2 subtitles line by line, we put them side by side. We also add a language icon next to each set of subtitles to make it clearer for the user.
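The exact fragment format depends on the packaging pipeline, but as a rough sketch, one possible way to carry two simultaneous cues (original and translated) is shown below in WebVTT form; the helper, cue positions, timings and texts are all illustrative assumptions.

```python
# Sketch: render one fragment with two simultaneous cues placed side by side
# via WebVTT cue position settings (illustrative, not our actual format).
def dual_subtitle_fragment(start: str, end: str, original: str, translated: str) -> str:
    return (
        "WEBVTT\n\n"
        f"{start} --> {end} position:25% size:45% align:center\n{original}\n\n"
        f"{start} --> {end} position:75% size:45% align:center\n{translated}\n"
    )

print(dual_subtitle_fragment(
    "00:00:05.000", "00:00:08.000",
    "Guten Abend, hier sind die Nachrichten.",
    "Good evening, here is the news.",
))
```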

What is our final achievement?

Within the Hackdays, we built the prototype and did some studies on what we would need to go through if we wanted to take this feature live.

This is how it looks on the client side, with the generated text produced on the server side.

We achieved our main goals:

  1. Enable non-native speakers to watch with native language subtitles
  2. Make TV a support tool for learning new languages

There are still some challenges we need to tackle and overcome.

For translations, there is still only a limited set of languages we can offer if we use a predefined set. To overcome this limitation, we either need to research which languages we should provide for our users, or we need to allow languages to be added ad hoc on the server when a user requests a new one.

For speech-to-text, accuracy is still the biggest issue we are facing. The subtitles are not reliable, and the quality depends a lot on the content; multiple speakers, background noise and music, for example, occur very often in movies and shows. One possible improvement is to turn on the speaker diarization feature to target the multiple-speaker scenario. Another possible improvement is to use the confidence level from the speech-to-text responses and filter out low-confidence text. With confidence values ranging from 0 (less confident) to 1 (more confident), we could possibly filter out some garbage text by applying a threshold, say 0.8, and provide better subtitles. Unfortunately, due to time constraints during the Hackdays, we didn’t investigate how these would impact performance, but they are directions we could look into.
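As a small sketch of the confidence idea, assuming the same streaming responses as in the earlier example (the threshold value and helper are illustrative):

```python
# Sketch: only keep finalized speech-to-text results that are confident enough.
# Note: in the streaming API, confidence is only populated on final results.
from google.cloud import speech

MIN_CONFIDENCE = 0.8  # illustrative threshold, not a tuned value

def keep_final_text(result: speech.StreamingRecognitionResult) -> bool:
    """Return True if a finalized result should be used for subtitles."""
    if not result.is_final or not result.alternatives:
        return False
    return result.alternatives[0].confidence >= MIN_CONFIDENCE
```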

Another issue is the time delay. It’s not a problem for translations: the delay is less than 100 milliseconds, which is not noticeable for the subtitles use case. For speech-to-text, however, it’s rather unpredictable. As we pick the text according to the stability value, the delay varies a lot. Most of the time it is pretty fast, without noticeable delay, since subtitles usually stay on screen for more than 2 seconds anyway. However, it’s still longer than the translation delay and harder to predict. This is a topic which needs further investigation.

It works, what’s next?

First, we need to evaluate the challenges mentioned above for translation and speech-to-text.

Second, to provide a stable and reliable feature, we still need a system to ensure the quality of the generated subtitles. Here are some ideas to ensure we generate quality subtitles:

  1. Collect user feedback/survey regularly
  2. Compare generated subtitles with existing subtitles (i.e. generate a language for which subtitles already exist, purely for evaluation)
  3. Consult professional translators for evaluation

Summary

We are happy that we developed a working prototype showing the feasibility of empowering subtitles in just 2 days. It was an exciting and fun project! The work gave us a more solid understanding of the topic: what challenges we’d face, how to evaluate the quality of generated subtitles, and how we could proceed if we want to go live. That definitely gives us solid ground for future development of this feature.

I’m glad to have had a great team to work with on this project. Also, it’s a great honor to win second place in Hackdays. Thanks to Lukas, Milos Pesic, Eric, Emilio, John, Sunanda, Egor Skorobogatov and everyone who supported us. Thanks! Happy Hacking!

[1]: Ofcom — The Office of Communications, the regulatory body for UK television broadcasting https://www.3playmedia.com/2020/01/17/who-uses-closed-captions-not-just-the-deaf-or-hard-of-hearing/

[2]: https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss

[3]: https://blog.google/products/chrome/live-caption-chrome/
