How ProSiebenSat.1 automatically creates thumbnails that capture your attention!
Welcome to a post from P7S1’s AI Products department. In this article, Anca and Julia will guide you through our automatic thumbnail extraction service, a solution that was developed during our “unscheduled time.” This cool set-up allows us to spend 10% of our working time on creativity, innovation, and collaboration on side projects.
Goals of the process
Our automatic thumbnail extraction service is now part of our production video mining platform ICON (Intelligent Content) and supports our colleagues from Program Data Services (PDS) in semi-automatically selecting high-quality thumbnails for video content. With this service we want to achieve the following goals:
• By utilizing Machine Learning and Image Processing techniques, we provide high-quality episodic thumbnails for video content where such images are otherwise scarce.
• We assist the editors of PDS in providing high-quality thumbnails for their clients and associated distribution platforms, for example, publishing providers, EPG platforms, Teletext, OTT/IPTV platforms, or print media and the press.
• We improve the presentation of our content catalog, where currently generic thumbnails are used (i.e., the same image for a whole format or season). Unique thumbnails per episode thereby improve the user experience.
The immediate business benefits for our colleagues from PDS are reduced manual effort in providing thumbnails and, as a result, faster and broader coverage of TV formats that can be equipped with individual preview pictures per video clip.
Requirements for the perfect thumbnail
Before we go into more technical details regarding the approach, we would like to mention the requirements the extracted thumbnails should meet. First, they need to have a certain image quality and second, they need to show relevant content from the video. Regarding image quality, the thumbnails need to be sharp, have a certain symmetry, and have good illumination.
Regarding relevant content, they should either portray the host of the respective show with the most relevant guests in each specific episode or display the most relevant visual content from each episode. For instance, in a cooking show, a thumbnail with nicely arranged food on a plate is highly relevant to the content.
Therefore, our technical approach (see Figure 1 below) is twofold: 1) We extract all keyframes with sufficient image quality; 2) We rank the keyframes based on content relevancy.
We use sharpness, brightness, and symmetry values to ensure image quality. Open-source libraries for computer vision, such as OpenCV, allow us to easily derive those parameters. To avoid images that are too blurry, we consider both the sharpness of the entire image and the sharpness of the people in focus. Each TV format has a certain range for these values to make sure the quality metrics are adapted to the format specifics. If, for example, a show is filmed in darker settings, the accepted illumination range needs to be different from that of a brightly lit TV show to produce high-quality thumbnails.
To account for the visual relevance of the content, we focus on two types of thumbnails. First, we aim for those images containing the most relevant people in the episode, and second, the thumbnails displaying the most relevant visual concepts of the episode.
In order to find images with the most relevant people, we first use Amazon Rekognition to detect faces. We then cluster similar faces and focus on the largest clusters. The assumption behind this is that a TV host or an important guest is most likely to appear most frequently in an episode and is thereby represented in the most prominent face clusters.
Thus, we can limit the number of clusters we want to derive thumbnails from. Additionally, we use open-source libraries for face recognition to derive face features, which help us determine whether eyes are open and mouths are closed, so that we avoid images of people in the middle of speaking.
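The clustering step can be sketched as follows. A face detector or recognition library would supply an embedding vector per detected face; here we use toy 2-D vectors and a simple greedy centroid clustering, which is an assumption for illustration, not our actual clustering algorithm. The key idea survives the simplification: the largest clusters likely belong to the host and main guests.

```python
import numpy as np

def cluster_faces(embeddings: np.ndarray, threshold: float = 0.6) -> list:
    """Greedy clustering: assign each embedding to the first cluster whose
    centroid is within `threshold`, otherwise start a new cluster.
    Returns member-index lists, largest cluster first."""
    clusters = []
    for i, emb in enumerate(embeddings):
        for c in clusters:
            centroid = c["sum"] / len(c["members"])
            if np.linalg.norm(emb - centroid) < threshold:
                c["sum"] += emb
                c["members"].append(i)
                break
        else:
            clusters.append({"sum": emb.copy(), "members": [i]})
    return sorted((c["members"] for c in clusters), key=len, reverse=True)

# Toy "embeddings": five detections of person A, two of B, one of C.
rng = np.random.default_rng(0)
person_a = rng.normal([0.0, 0.0], 0.05, size=(5, 2))
person_b = rng.normal([3.0, 3.0], 0.05, size=(2, 2))
person_c = rng.normal([6.0, 0.0], 0.05, size=(1, 2))
ranked = cluster_faces(np.vstack([person_a, person_b, person_c]))
```

Keeping only the top one or two clusters bounds the number of candidate thumbnails per episode.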
Each TV format has predefined configurations that can be adjusted to address the specifics of the formats and the needs of the business. For instance, the thumbnails for the show “Rosins Restaurants” (Rosin’s restaurants) should not just contain images of an empty restaurant with empty tables but rather a restaurant scene filled with people at the tables. To ensure this, we apply object detection on the video using Amazon Rekognition and then introduce constraints on the returned visual tags defining how they should co-occur for some formats.
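A co-occurrence constraint of this kind can be expressed as a small per-format rule set over the labels returned by object detection. The rule structure, format identifiers, and label names below are illustrative assumptions; in production the configuration is richer and maintained per TV format.

```python
# Per-format co-occurrence rules (illustrative): whenever the trigger label
# appears in a frame, the companion label must appear too.
FORMAT_RULES = {
    "rosins_restaurants": {
        "require_with": [("Restaurant", "Person")],  # no empty restaurants
    },
}

def frame_allowed(format_id: str, labels: set) -> bool:
    """Return True if the frame's detected labels satisfy all rules
    configured for this format; formats without rules always pass."""
    rules = FORMAT_RULES.get(format_id, {})
    for trigger, companion in rules.get("require_with", []):
        if trigger in labels and companion not in labels:
            return False
    return True

empty_scene = {"Restaurant", "Table", "Chair"}
busy_scene = {"Restaurant", "Table", "Person", "Food"}
```

For example, `frame_allowed("rosins_restaurants", empty_scene)` rejects the frame, while the busy scene with people at the tables is accepted.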
For the relevance of visual tags, we also apply a TF-IDF (Term Frequency-Inverse Document Frequency) metric to select the most frequent tags and then account for the hierarchical relationship between them: a few tags for different specific animals may carry little weight individually, but together they lend weight to their shared higher-order parent tag ‘pets’.
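This weighting can be sketched as follows: episodes play the role of documents, detected tags play the role of terms, and each specific tag also counts towards its parent concept. The tiny hierarchy, the smoothed IDF formula, and the toy corpus are assumptions for illustration.

```python
import math
from collections import Counter

# Toy tag hierarchy: specific tags roll up to a parent concept.
PARENT = {"Dog": "Pets", "Cat": "Pets", "Hamster": "Pets"}

def tfidf_scores(episode_tags: list, corpus: list) -> dict:
    """episode_tags: tags detected in one episode (one entry per detection);
    corpus: one tag set per episode in the format's back catalogue.
    Specific tags also count towards their parent tag, so several rare
    animal tags jointly promote 'Pets'."""
    counts = Counter(episode_tags)
    for tag in episode_tags:
        if tag in PARENT:
            counts[PARENT[tag]] += 1
    n = len(corpus)
    scores = {}
    for tag, tf in counts.items():
        # Document frequency, counting parent matches as well.
        df = sum(1 for ep in corpus
                 if tag in ep or any(PARENT.get(t) == tag for t in ep))
        idf = math.log((1 + n) / (1 + df)) + 1  # smoothed IDF
        scores[tag] = tf * idf
    return scores

episode = ["Dog", "Cat", "Hamster", "Sofa"]
corpus = [{"Sofa", "Lamp"}, {"Dog", "Ball"}, {"Kitchen"}]
scores = tfidf_scores(episode, corpus)
```

Here ‘Dog’, ‘Cat’, and ‘Hamster’ each appear once, but their rolled-up parent ‘Pets’ accumulates the highest score, which matches the intuition in the text.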
In the development phase of the service, we relied heavily on feedback from our stakeholders and conducted extensive user acceptance tests with them, to understand which images they deem ‘usable’ and which they reject for ‘bad quality’ or ‘bad content’, referring to image quality and content relevance, respectively. Below in Figure 2, you can see some examples.
The last steps
Our solution went into production at the beginning of 2022. To ensure the quality and scalability of this service over time, we monitor the results and keep adding new TV formats to the processing chain. Regarding quality, for each processed episode we track how many candidate thumbnails are filtered out at each processing step (e.g., for not reaching the specified sharpness or illumination thresholds), so that we can easily see the cause of a shrinking thumbnail pool. The editors who use this service pick the final thumbnail from a ranked selection of images we propose to them. Therefore, we also track how many of the proposed thumbnails our colleagues select as usable, to learn what makes a better thumbnail and to notice when something stops working for a TV format.
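The per-step monitoring idea can be sketched as a simple filter funnel that records how many candidates each step drops. The frame fields, step names, and thresholds are illustrative assumptions, not our production values.

```python
from collections import Counter

def run_pipeline(frames: list, steps: list):
    """Apply filter steps in order; record how many frames each step drops
    so the cause of a shrinking candidate pool is easy to spot."""
    dropped = Counter()
    survivors = list(frames)
    for name, predicate in steps:
        kept = [f for f in survivors if predicate(f)]
        dropped[name] = len(survivors) - len(kept)
        survivors = kept
    return survivors, dropped

# Illustrative candidate frames with precomputed quality metrics.
frames = [
    {"id": 1, "sharpness": 150, "brightness": 120},
    {"id": 2, "sharpness": 40,  "brightness": 110},  # too blurry
    {"id": 3, "sharpness": 200, "brightness": 20},   # too dark
    {"id": 4, "sharpness": 180, "brightness": 140},
]
steps = [
    ("sharpness", lambda f: f["sharpness"] >= 100),
    ("illumination", lambda f: 60 <= f["brightness"] <= 200),
]
survivors, dropped = run_pipeline(frames, steps)
```

If a newly onboarded format suddenly loses most candidates at one step, the counter for that step points directly at the threshold that needs retuning.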
We hope you liked this glimpse into our approach to automated thumbnail extraction and hope to see you again in our next AI blog post.