DVB bitmap subtitles processing at Zattoo

Milos Pesic
Zattoo’s Tech Blog
8 min read · May 11, 2020

Subtitles are a very important component of delivering a rich multimedia experience, not only because they allow us to consume content in different languages, but also because they help those with hearing impairments (~5% of the audience) to consume media content. According to some figures, more than 80% of Facebook videos are viewed without sound. For the ultimate experience Zattoo is trying to deliver, subtitles are not optional; they are a necessity.

Text-based subtitles

In some cases subtitle streams are given in the form of text with presentation timing metadata. An example of these is the SubRip text format (SRT):

1
00:00:00,498 --> 00:00:02,827
- Here's what I love most
about food and diet.

2
00:00:02,827 --> 00:00:06,383
We all eat several times a day,
and we're totally in charge

3
00:00:06,383 --> 00:00:09,427
of what goes on our plate
and what stays off.

A similar concept is the foundation for other text-based subtitle formats such as WebVTT (Web Video Text Tracks) and TTML (Timed Text Markup Language). Textual content is transferred over the wire, and on the receiver (player) side the text can easily be extracted from the stream. This makes these formats a very popular choice for delivering multimedia content over adaptive HTTP-based streaming protocols (DASH and HLS), a combination that is widely supported on the vast majority of popular devices and set-top boxes.

DVB bitmap subtitles

Not all channels provide subtitles in a textual form, however. Another quite popular format, especially in Europe, is the DVB bitmap subtitling format. Instead of a textual representation, a subtitle entry of this type is represented as a graphical bitmap (image) with timing and screen positioning metadata. Players receiving these images overlay them on top of the video at the time and on the screen position indicated by associated metadata.

The whole chain is text agnostic, which is convenient for a couple of reasons:

  • there is no restriction on the languages that can be supported.
  • there is no need to worry about special characters.
  • text positioning and presentation are predefined, so players don’t have these concerns; they simply follow what the metadata indicates when overlaying subtitle images. As a result, text positioning is more flexible.
  • channel providers have the option of creating a better, customized visualization of subtitles by controlling image quality, color scheme, font and screen positioning. They can use different subtitle image qualities for different stream qualities (SD, HD, UHD).

However, there are some very limiting concerns when it comes to DVB bitmap subtitles:

  • Very limited support on client devices when these subtitles are transmitted over standard HTTP adaptive streaming protocols (DASH, HLS).
  • In some countries (e.g. the U.S.), they do not meet standard captioning requirements, which mandate that users be allowed to customize the visual appearance of subtitles (e.g. by increasing text size), and as such they cannot be used.

DVB bitmap subtitles at Zattoo

Many European channel providers on our platform still use this subtitling format, and due to the constraints mentioned above, we were until recently unable to carry these subtitles to our end users. However, at Zattoo we aim to provide the best possible TV experience, so we invested work in bridging the gap and making these subtitles available.

For a long time, our real-time system for recording streams from satellites and preparing them for delivery over HTTP adaptive streaming protocols simply discarded DVB bitmap subtitle streams, due to the limited ability to carry them over the aforementioned protocols.

Our goal was to keep using HTTP adaptive streaming protocols and to internally transform DVB bitmap subtitles into a textual form that is supported on every client-side device and can be carried over these protocols.

To achieve efficient, accurate, real-time text extraction from DVB bitmaps (images), we decided to use the open source OCR (optical character recognition) engine Tesseract, the de facto industry standard for reliable text recognition. Hewlett-Packard initially developed the engine (starting in 1985) and later open sourced it; since 2006, further development has been sponsored by Google. Its modern recognizer is based on neural networks, and ready, pre-trained models exist for many languages, so a suitable model can simply be picked up for text recognition. More details on the Tesseract architecture can be found here.

Implementation findings

During implementation, we came across a couple of interesting findings.

The C++ Tesseract API we used is quite simple and easy to use

To initialize the engine, the path to the trained data files, the language and the engine mode should be specified.

#include <memory>

#include <tesseract/baseapi.h>

// trainedDataPath points at the directory containing the .traineddata files.
auto tessAPI = std::make_unique<tesseract::TessBaseAPI>();
auto res = tessAPI->Init(trainedDataPath.c_str(), "fra",
                         tesseract::OEM_TESSERACT_ONLY);

There are 2 main conceptual engine modes:

  • tesseract::OEM_TESSERACT_ONLY — the legacy mode, present since Tesseract version 3. It is based on classical machine learning and computer vision algorithms; it is fast, relatively cheap on resources, and works well on input images with structured text.
  • tesseract::OEM_LSTM_ONLY — the LSTM (long short-term memory) recognizer, based on neural networks and deep learning. Introduced in version 4 with the aim of improving the recognition rate, it is significantly more computationally expensive than the legacy mode (selecting it is shown in the sketch after this list).
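
For completeness, a minimal sketch of selecting the LSTM recognizer, under the same assumptions (tessAPI object, trainedDataPath variable) as the initialization snippet above:

// Same initialization as before, but with the LSTM-based recognizer
// introduced in Tesseract 4; everything else stays unchanged.
auto res = tessAPI->Init(trainedDataPath.c_str(), "fra",
                         tesseract::OEM_LSTM_ONLY);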

We experimented with both modes and found no significant difference in recognition rate for DVB subtitle images.

Once the object has been initialized, it is ready to be used for text extraction. Assume that the DVB bitmap subtitle image data has been extracted from a stream (e.g. by using the ffmpeg API avcodec_decode_subtitle2) and that the image, in RGB24 format with a given height and width, is in a buffer:

std::vector<uint8_t> imageBuffer;
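
As a side note, here is a minimal sketch of how such a buffer could be produced from a decoded AVSubtitle; the helper name rectToRGB24 is ours, and it assumes the palettized bitmap layout ffmpeg uses for DVB subtitle rects:

extern "C" {
#include <libavcodec/avcodec.h>
}
#include <cstdint>
#include <vector>

// Converts one palettized subtitle rect (as produced by
// avcodec_decode_subtitle2) into a packed RGB24 buffer: data[0] holds
// per-pixel palette indices, data[1] the palette as 32-bit ARGB entries.
std::vector<uint8_t> rectToRGB24(const AVSubtitleRect* rect) {
    const uint8_t* indices = rect->data[0];
    const auto* palette = reinterpret_cast<const uint32_t*>(rect->data[1]);
    std::vector<uint8_t> rgb(static_cast<size_t>(rect->w) * rect->h * 3);
    for (int y = 0; y < rect->h; ++y) {
        for (int x = 0; x < rect->w; ++x) {
            const uint32_t argb = palette[indices[y * rect->linesize[0] + x]];
            const size_t o = (static_cast<size_t>(y) * rect->w + x) * 3;
            rgb[o]     = (argb >> 16) & 0xFF; // red
            rgb[o + 1] = (argb >> 8) & 0xFF;  // green
            rgb[o + 2] = argb & 0xFF;         // blue
        }
    }
    return rgb;
}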

The following API could be used to extract the text:

char *res = tessAPI->TesseractRect(
    imageBuffer.data(), // pointer to the beginning of the image
    3,                  // bytes per pixel (RGB24)
    width * 3,          // bytes per line
    0,                  // left
    0,                  // top
    width,              // image width in pixels
    height);            // image height in pixels

The result is a UTF-8 string containing the extracted subtitle text (the caller is responsible for freeing the returned buffer with delete[]).

The OCR engine produces results in real time, with sub-200 ms latency at the 90th percentile, which enables us to use the transformation process in our pipeline for delivering live content.

Improving recognition accuracy

At Zattoo, we were in a position to observe the text recognition process continuously for a large number of channels, constantly transforming subtitle images to textual form, analyzing the results and applying techniques to achieve the best possible accuracy (> 95%). Some of these techniques are described below.

Truncating background

When the subtitle text occupies a relatively small portion of the image, we observed frequent failures in text extraction. An example of an image that would frequently result in a recognition failure:

To solve this problem, we experimented with trimming unnecessary background noise before submitting the image buffer to the OCR engine:

As a result, we achieved reliable recognition in 100% of these cases.
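
A minimal sketch of such a trim, assuming a dark background and the RGB24 buffer from above; the helper name, luminance threshold and margin are ours:

#include <algorithm>
#include <cstdint>
#include <vector>

// Crops the image to the bounding box of "bright" (non-background) pixels,
// keeping a small margin so glyphs are not clipped at the edges.
std::vector<uint8_t> trimBackground(const std::vector<uint8_t>& rgb,
                                    int width, int height,
                                    int& outWidth, int& outHeight) {
    int minX = width, minY = height, maxX = -1, maxY = -1;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const size_t o = (static_cast<size_t>(y) * width + x) * 3;
            // Treat near-black pixels as background.
            if (rgb[o] + rgb[o + 1] + rgb[o + 2] > 3 * 32) {
                minX = std::min(minX, x); maxX = std::max(maxX, x);
                minY = std::min(minY, y); maxY = std::max(maxY, y);
            }
        }
    }
    if (maxX < 0) { // nothing but background; return the image unchanged
        outWidth = width; outHeight = height;
        return rgb;
    }
    const int margin = 8;
    minX = std::max(0, minX - margin); minY = std::max(0, minY - margin);
    maxX = std::min(width - 1, maxX + margin);
    maxY = std::min(height - 1, maxY + margin);
    outWidth = maxX - minX + 1; outHeight = maxY - minY + 1;
    std::vector<uint8_t> out(static_cast<size_t>(outWidth) * outHeight * 3);
    for (int y = 0; y < outHeight; ++y)
        std::copy_n(&rgb[(static_cast<size_t>(y + minY) * width + minX) * 3],
                    static_cast<size_t>(outWidth) * 3,
                    &out[static_cast<size_t>(y) * outWidth * 3]);
    return out;
}

The trimmed buffer and its new dimensions are then handed to TesseractRect exactly as shown earlier.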

Inverting colors

Although the recognition rate was already quite good, we wanted to improve it further. An idea that proved successful was born while looking at how the usual subtitle image appears: a black background with white or colored text. In these cases, the recognition rate in LSTM mode was ~20% worse than in legacy Tesseract mode.

We knew that the OCR engine’s neural network had mostly been trained on text where the background is white and the text is black or colored, so we decided to invert the colors of the DVB subtitle image before asking the OCR engine to extract the text:

This simple and cheap technique brought about an even higher recognition success rate, especially in LSTM mode, which ended up matching the legacy mode’s recognition rate after the transformation was applied.
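
The inversion itself is trivial; a minimal sketch, again assuming the RGB24 buffer from above:

#include <cstdint>
#include <vector>

// Complementing every byte of the RGB24 buffer turns light text on a dark
// background into dark text on a light background.
void invertColors(std::vector<uint8_t>& rgb) {
    for (auto& channel : rgb)
        channel = 255 - channel;
}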

Reducing the number of color components from 3 to 2

We started observing a relatively low success rate for cases in which the image has 3 dominant color components, e.g.:

As you can see, in this example we have 3 different colors for the background, the text border and the text itself. During experimentation we found a relatively simple technique that increases the accuracy significantly. In all these cases we observed that the text border color is black, so we came to the idea of changing the background color to black as well, effectively reducing the number of dominant color components to 2, e.g.:
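
A minimal sketch of this repaint, assuming the dominant background color can be sampled from the top-left pixel of the RGB24 buffer (a simplification; the helper name and tolerance are ours):

#include <cstdint>
#include <cstdlib>
#include <vector>

// Repaints every pixel close to the sampled background color as black,
// reducing the image to two dominant color components (text and black).
void backgroundToBlack(std::vector<uint8_t>& rgb) {
    if (rgb.size() < 3) return;
    const int bgR = rgb[0], bgG = rgb[1], bgB = rgb[2]; // sampled background
    const int tolerance = 24; // per-channel distance still treated as background
    for (size_t o = 0; o + 2 < rgb.size(); o += 3) {
        if (std::abs(rgb[o] - bgR) <= tolerance &&
            std::abs(rgb[o + 1] - bgG) <= tolerance &&
            std::abs(rgb[o + 2] - bgB) <= tolerance) {
            rgb[o] = rgb[o + 1] = rgb[o + 2] = 0;
        }
    }
}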

Finally, after applying the color inversion technique, the image submitted to the Tesseract engine would appear as follows:

The final result is complete subtitle coverage on all channels streamed over Zattoo.

Examples

A sample with input DVB subtitle images and recognized text on some of our channels:

Mise en orbite !

Bonjour, docteur.

Ca fait un bail.

Vous détenez quelque chose

qui m’appartient.

On veut rentrer chez nous.

- Venez vite !

- Rattrapez-le !

If you want to try it out, join Zattoo, tune into a channel, turn on subtitles and enjoy.
