The journey of utilizing AI for content archiving


by Svenja Simon and Sebastian Döring

Hi and welcome back to our series about AI in the media. In our last article, Karthik, one of our team members, gave you a comprehensive overview of our team’s work. In case you missed it: We are AI Platform Services, a small team of six AI enthusiasts consisting of product owners, AI engineers and architects. This time, we would like to take you on a trip into our world and show you one of our applied use cases, called “Automated Tagging”.

The work of the DokuCenter

A lot of the content that ProSiebenSat.1 broadcasts on our different TV channels is newly produced. This means that a team of editors, camera operators, sound engineers and cutters creates it from scratch. But some of the content shot for, say, an episode of “Galileo” might also be suitable for an episode of “Abenteuer Leben”. In other words: It makes sense to reuse some of the content. For that to work, however, the content needs to be archived so that editors can take scenes or even short sequences from “Galileo” and use them for “Abenteuer Leben”. Archiving does not just mean storing content safely along with its title and transmission date. It is a more complex process, at the end of which a whole set of further information is stored. That is why we call it content documentation instead of archiving.

A unit called DokuCenter, made up of trained media documentalists, is responsible for documenting and archiving our produced content. The media documentalists examine the material and document, for each scene, what is visible, what happens and what is relevant for reuse. Think of a report about a farm in Denmark. In our DokuCenter, the media documentalist would note the following for the first scene, which only lasts a couple of seconds:

‘Historical farm (Denmark) in exterior view. House with thatched roof. Farmer Hans Müller walks into the picture. Close-up on a modern combine harvester (neutral). Farmer gets into combine harvester….’

This short example gives you a first impression of the complexity of documenting content. The complexity is also reflected in the time needed to document one hour of content: the manual documentation of a one-hour TV report can easily take several hours, depending on the format that needs to be archived. With new content being produced daily, this leads to a backlog of content waiting to be archived, and to some content that cannot be documented at all because there are simply not enough media documentalists. Moreover, material that is not enriched with metadata cannot be archived and therefore cannot be reused!

You might ask yourself: What part does AI have in all of this? Well, let us explain…

The project

In order to investigate the potential of AI models for audio, image and video data across different business units within ProSiebenSat.1, we started a research project in 2017. The project was funded by the “Bayerisches Staatsministerium für Wirtschaft und Medien, Energie und Technologie” (the Bavarian State Ministry for Economic Affairs and Media, Energy and Technology).

The aforementioned DokuCenter was one of the teams we focused on in this project. Due to its close relation to this kind of data, the potential for reducing manual effort and the chance to expand media documentation, the DokuCenter was a good candidate for our research. In order to find the right solution for supporting the media documentalists in their daily work, we asked our colleagues and ourselves three main questions:

1. Which programs/formats are best suited for (semi-)automatic documentation?

2. How can automatically generated tags be as close as possible to the ones that documentalists create manually, and thus reach a quality that meets the standards of media documentalists?

3. How can the data be integrated into the documentation process — without much manual effort?

Research phase

Our first step was to talk to our colleagues from the DokuCenter. We wanted to get a better understanding of the requirements we needed to consider and to start sketching a possible solution. Additionally, we had to make a central decision at the beginning of the project: Which model did we want to use to document content automatically? This question was less about the “capabilities” of the models, i.e. the recognition of objects, actions or persons in moving images, and more about their origin. In general, three main approaches can be distinguished:

a) In-house development or training of models

b) Freely available algorithms and models (“OpenSource”)

c) Commercially offered services and models

In the end, we decided to combine all three approaches. Since pre-trained “OpenSource” models also allow further training and thus adaptation to one’s own data, we emphasized that approach over the others. This decision was also based on the assumption that we would get sufficient training and test data from the DokuCenter. In retrospect, we know that this way of thinking was only partly correct: the improvement of the “OpenSource” models via transfer learning was discarded due to the immense manual effort involved in transforming the available data into a format suitable for training.

The pre-trained “OpenSource” models held another challenge for us. Although these models achieve good results on freely available validation and test data sets, they could not fully meet our expectations in terms of precision and recall, especially when applied to “real life” video content.
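To make the notion of precision and recall concrete in this context: for a single video, you can compare the set of machine-generated tags with the tags a media documentalist assigned. Here is a minimal sketch of that comparison; the tag values are purely illustrative and not taken from our actual evaluation.

```python
# Minimal sketch: precision/recall of machine-generated tags compared to
# the tags a media documentalist assigned. All tag values are illustrative.

def precision_recall(predicted: set[str], reference: set[str]) -> tuple[float, float]:
    """Precision: share of generated tags that are correct.
    Recall: share of the documentalist's tags that were found."""
    true_positives = predicted & reference
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical tags for one scene of the Danish farm report
machine_tags = {"farm", "tractor", "person", "video game"}
documentalist_tags = {"farm", "thatched roof", "combine harvester", "farmer"}

p, r = precision_recall(machine_tags, documentalist_tags)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.25, recall=0.25
```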

Nevertheless, a positive conclusion could be drawn at the end of this project phase. In general, the idea of automatically generated tags was promising: automatic tagging can in principle support the existing documentation and archiving process for suitable formats such as magazines and documentaries. Even though not all prerequisites for full productive usage were in place yet, a new and viable basis to build on had been created.

Getting productive

The second project phase began in spring 2019 and was intended to create a practical solution for the “DokuCenter”. In order to reach that goal, we defined two main areas to focus on:

1. Increase the quantity and quality of machine-generated content metadata

2. Full integration of the entire automated documentation process (as far as possible for the selected formats)

In order to realize both points, two central aspects of the overall system had to be completely transformed: first, the infrastructure the system runs on, and second, the machine learning models used within it. Since the option of focusing on the “OpenSource” models had been discarded, we decided to change our strategy and to completely replace the existing combination of all three approaches with models and services from commercial vendors. Advantages of commercial models over “OpenSource” models include:

1. Continuous development and improvement of the models

2. Easy utilization through APIs and/or use of cloud services

3. More room on our side to focus on testing, the overall implementation and postprocessing

4. Easy exchangeability, provided the overall system is prepared for it

However, it is important to note that the choice of a “model strategy” always depends on the respective use case. For this specific application, using commercial models brought the desired improvements. The quantity of data provided by the models in the form of objects, actions and concepts increased significantly, and the same improvement could be observed in the quality of the data, reflected in higher values for our key measure, precision.
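To illustrate how easily such commercial services can be consumed via an API, here is a minimal sketch using the Google Cloud Video Intelligence API as one example of a commercially offered labeling service. We do not name our actual vendor here, so treat the choice of service, the input URI and the timeout purely as illustrative assumptions.

```python
# Illustrative only: shot-level label detection with one example of a
# commercial video-tagging service (Google Cloud Video Intelligence).
# The input URI is a placeholder, not our actual storage location.
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()
operation = client.annotate_video(
    request={
        "input_uri": "gs://example-bucket/report.mp4",  # placeholder
        "features": [videointelligence.Feature.LABEL_DETECTION],
    }
)
result = operation.result(timeout=600)

# Each label annotation carries the detected concept plus per-segment confidences,
# which is exactly the raw material our postprocessing works with.
for annotation in result.annotation_results[0].shot_label_annotations:
    for segment in annotation.segments:
        print(annotation.entity.description, round(segment.confidence, 2))
```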

However, improving the quantity and quality of the machine-generated data was only the first step towards a useful system. As mentioned at the beginning, the infrastructure of the overall system also had to be adapted. To do so, we took the following steps:

1. Moving the system from owned and operated servers into a managed Kubernetes cluster: This helped us to focus more on developing a solution rather than dealing mainly with operations and maintenance.

2. Implementing a fully automated process of generating content metadata tags (sketched after this list). This included:

a. The automatic download of content that is ready to be documented

b. The preprocessing of the content in order to prepare it for the machine learning models

c. The postprocessing and storing of the generated metadata tags

3. Integrating our system into the one that is already used for archiving by the media documentalists
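To give you an idea of how these pieces fit together, here is a highly simplified sketch of such a pipeline. All function names, the Tag structure and the control flow are hypothetical placeholders for illustration, not our actual implementation.

```python
# Simplified sketch of the automated tagging pipeline described above.
# All function names, structures and steps are hypothetical placeholders.
from dataclasses import dataclass


@dataclass
class Tag:
    label: str
    confidence: float
    start_s: float  # first appearance in seconds
    end_s: float    # end of consecutive visibility in seconds


def download_ready_content(content_id: str) -> str:
    """Fetch a video that is ready to be documented; return a local file path."""
    ...


def preprocess(video_path: str) -> str:
    """Transcode/segment the video into the format expected by the ML services."""
    ...


def generate_tags(prepared_path: str) -> list[Tag]:
    """Call the commercial tagging service(s) and collect the raw tags."""
    ...


def postprocess(raw_tags: list[Tag]) -> list[Tag]:
    """Apply thresholds and filters (see the postprocessing section below)."""
    ...


def store_tags(content_id: str, tags: list[Tag]) -> None:
    """Write the final tags into the archiving system used by the documentalists."""
    ...


def run(content_id: str) -> None:
    """End-to-end flow: download, preprocess, tag, postprocess, store."""
    video = download_ready_content(content_id)
    prepared = preprocess(video)
    tags = generate_tags(prepared)
    store_tags(content_id, postprocess(tags))
```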

The postprocessing in particular turned out to be another important step towards improving the quality of the data we provide. Quality and relevance are two highly important aspects in this use case, since the archiving system otherwise mainly holds human-curated and therefore high-quality data. We asked our colleagues from the DokuCenter to judge the tag suggestions for a number of videos in order to determine the current precision of our system and to identify room for improvement. Based on these findings, we established a postprocessing stage consisting of the following steps (a simplified sketch follows the list):

1. Applying thresholds to tags

a. Based on the data obtained from the DokuCenter’s judgments, we were able to define a minimum quality standard.

b. Furthermore, we identified the consecutive visibility of a tag as an important criterion for its correctness as well as for its relevance. That is because media documentalists only document content details that are visible for a certain amount of time.

2. Filtering of the tags

a. Additionally, we identified tags that are generally of no interest for our use case, e.g. tags that merely point out that a human or a face is visible. We also filtered out all tags related to video games, since the content we process does not contain any.
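To make this more tangible, here is a minimal sketch of what such a postprocessing stage can look like, combining a confidence threshold, a minimum consecutive visibility and a blocklist. The concrete threshold values, the Tag structure and the blocklist entries are illustrative assumptions, not our production settings.

```python
# Minimal sketch of the postprocessing: confidence threshold, minimum
# consecutive visibility, and a blocklist of irrelevant tags.
# All concrete values below are illustrative, not our production settings.
from dataclasses import dataclass


@dataclass
class Tag:
    label: str
    confidence: float
    start_s: float  # start of consecutive visibility in seconds
    end_s: float    # end of consecutive visibility in seconds


MIN_CONFIDENCE = 0.8        # example value, derived from the DokuCenter's judgments
MIN_VISIBLE_SECONDS = 3.0   # example value for the minimum consecutive visibility
BLOCKLIST = {"person", "face", "video game"}  # generally irrelevant for this use case


def postprocess(tags: list[Tag]) -> list[Tag]:
    kept = []
    for tag in tags:
        if tag.confidence < MIN_CONFIDENCE:
            continue  # below the minimum quality standard
        if tag.end_s - tag.start_s < MIN_VISIBLE_SECONDS:
            continue  # not visible long enough to be documented
        if tag.label.lower() in BLOCKLIST:
            continue  # irrelevant for the archiving system
        kept.append(tag)
    return kept


# Example: only the combine harvester survives all three filters.
raw = [
    Tag("combine harvester", 0.93, 12.0, 18.5),
    Tag("face", 0.99, 12.0, 20.0),
    Tag("tractor", 0.55, 14.0, 19.0),
    Tag("dog", 0.91, 30.0, 31.0),
]
print([t.label for t in postprocess(raw)])  # ['combine harvester']
```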

By means of these adaptations, we were able to raise the quality and relevance to a level sufficient for productive usage. We are, however, still working on further improvements, and we keep an eye on developments in the service landscape in general.

Conclusion

To sum up, we can state that the goal of using video mining to support the manual content archiving process for selected broadcast formats has been achieved. Nevertheless, it should be pointed out that we did not get there through a simple “plug & play” of “OpenSource” models or of a solution from a commercial vendor.

In our next article, we will discuss the struggles of training, testing and applying machine learning models to “real life” content. Stay tuned!
