On topic annotation: how to extract relevant labels from videos?

Axel de Romblay
Dailymotion
Feb 4, 2019

Dailymotion is a global video platform hosting hundreds of millions of videos in more than 20 languages. Our purpose is to share the most compelling music, entertainment, news and sports content around, thanks to partnerships with the world’s leading publishers and content creators, such as France Télévisions, Le Parisien, CBS, BeIN Sports, CNN, GQ, Universal Music Group, VICE, and more.

In order to characterize the content of our video catalog, the Data Team is working on automatic algorithms to extract topics from videos.

“But why do we, at Dailymotion, care about accurately categorizing content at scale?”

  • Watching interface: push videos with trending and popular topics, recommend videos related to a topic.
  • Search engine: retrieve videos from a given topic.
  • SEO and acquisition: increase the external visibility of our video catalog and get more new visitors.

“What technical challenges do we face for a relevant topic annotation algorithm?”

  • Relevance & quality of the topics: we want to have relevant and specific/precise topics. E.g : “2018 FIFA World Cup” vs “Football”.
  • High precision/coverage tradeoff: we want to tag (or cover) the maximum videos with at least one topic and with a minimum error rate. E.g: we can have a 100% coverage with (almost) random topics vs 50% coverage with very accurate topics.
  • Fast and up-to-date annotation: we need a fast annotation pipeline that proposes updated topics. E.g: “Juventus” for a video related to Cristiano Ronaldo vs “Real Madrid”.
  • Multi-lingual annotation: we need to tag videos for all the languages. E.g: French, English, Korean videos…

Demystifying our topic annotation pipeline

In this section, we will present the different steps of the topic annotation pipeline currently running at Dailymotion.

Let’s get a bit technical!

There exist two types of data describing a video:

  • Raw data (aka metadata): the description, the frames and the audio.
  • Additional data: the channel, the country/language, date, …

In order to have a robust model, it is important to work as much as possible on (raw) metadata.

Before going into details, here is an overview of the current pipeline:

Schema of the pipeline currently running at Dailymotion. Each step is presented below.

NB: In this post, we will focus on the pipeline running on the video descriptions only.

1. Text extractor and language detector

This step takes all the metadata related to the video as input and outputs the description and the associated language.

This first step is very important since it impacts all subsequent steps of the pipeline. It is a complex task that we won’t detail in this post; you can find all the explanations here:
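To give a rough idea of what this step produces, here is a minimal sketch based on the open-source langdetect package. Both the package choice and the metadata field names are assumptions made for illustration, not the actual implementation running at Dailymotion.

```python
# Minimal sketch of the "text extractor and language detector" step.
# langdetect and the metadata field names are assumptions for illustration.
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make language detection deterministic across runs


def extract_text_and_language(video_metadata):
    """Concatenate the textual metadata of a video and detect its language."""
    # "title" and "description" are hypothetical metadata fields.
    text = " ".join(
        filter(None, [video_metadata.get("title"), video_metadata.get("description")])
    )
    return text, detect(text)


print(extract_text_and_language({"description": "Tata Motors acquisition of Jaguar"}))
# ('Tata Motors acquisition of Jaguar', 'en')
```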

2. Topic maker on text

This step takes the video description and the corresponding detected language as inputs and outputs candidate topics related to the description.

Our approach is different from traditional Natural Language Processing (aka NLP) tasks based on semantic analysis of the text. Instead, we use a framework that maps an unordered set of words (the video description) to universal Wikidata entities (candidate topics).

NB: this task is called “Named Entity Linking” (aka NEL).

Wikidata is a free, open knowledge graph with tens of millions of regularly updated and inter-connected entities:

Example of a Wikidata entity defined by its id: “Q15869”. More details here.
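If you want to explore an entity yourself, here is a minimal sketch that fetches the entity above through Wikidata’s public EntityData endpoint; the endpoint is a standard one, but the snippet is just an illustration and not part of our pipeline:

```python
import requests

# Fetch the Wikidata entity "Q15869" via the public EntityData endpoint.
entity_id = "Q15869"
url = f"https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.json"
entity = requests.get(url, timeout=10).json()["entities"][entity_id]

print(entity["labels"]["en"]["value"])    # English label of the entity
print(list(entity["claims"].keys())[:5])  # a few properties linking it to other entities
```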

Now, let’s explain how we do the mapping:

Preprocessing phase

As usual, we first need to clean the descriptions and tokenize the words.

We remove low-frequency words, one-character words, … We also have to deal with overlapping words: since words may overlap or be substrings of one another, we need to detect their boundaries.

Example: in “jaguar cars”, we discard the token “jaguar” and keep “jaguar cars” instead.
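Here is a minimal sketch of this preprocessing, assuming a small dictionary of known surface forms (“spots”); the dictionary, the tokenization rule and the length threshold are made up for illustration:

```python
import re

# Illustrative dictionary of surface forms ("spots") known to map to entities.
KNOWN_SPOTS = {"jaguar", "jaguar cars", "tata motors"}
MIN_TOKEN_LENGTH = 2  # drop one-character words


def extract_spots(description):
    """Tokenize the description and keep the longest non-overlapping known spots."""
    tokens = [t for t in re.findall(r"[a-z0-9']+", description.lower())
              if len(t) >= MIN_TOKEN_LENGTH]
    spots, i = [], 0
    while i < len(tokens):
        # Prefer the longest n-gram starting at position i ("jaguar cars" beats "jaguar").
        for n in (3, 2, 1):
            if i + n > len(tokens):
                continue
            candidate = " ".join(tokens[i:i + n])
            if candidate in KNOWN_SPOTS:
                spots.append(candidate)
                i += n
                break
        else:
            i += 1  # no known spot starts at this token
    return spots


print(extract_spots("Tata Motors acquisition of Jaguar Cars"))
# ['tata motors', 'jaguar cars']
```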

Disambiguation phase

For a given word, we would like to pick the appropriate sense in case of polysemy.

For example: “jaguar” can refer to the car or the animal… Actually, there exist many Wikidata IDs referring to this word: “Q35694”, “Q26742231”, “Q2382933”, “Q650601”, …

The disambiguation process is a combination of two features:

  • The commonness (or prior probability) of the sense given the word. It is computed, by parsing the Wikipedia dataset, as the probability that the word points to the sense. Back to the example, the animal is more likely to be the appropriate sense…
  • The relatedness score between the word and the sense. The idea is to pick the sense that best fits the context of the text: we use a voting scheme where all the other words vote for the sense. The vote of a word for a sense is based on the overlap between their in-linking pages in Wikipedia. Back to the example, if our description is “Tata Motors acquisition of Jaguar”, the word jaguar is related to the car…

Then, we pick the sense with the highest disambiguation score, which is the sum of the two weighted features.
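As a toy illustration of this score, here is a sketch for the spot “jaguar”; the commonness/relatedness values, the weight, and the mapping of Wikidata IDs to senses are all assumptions made for the example:

```python
# Toy disambiguation for the spot "jaguar" in "Tata Motors acquisition of Jaguar".
# All numbers are made up; the ID-to-sense mapping is illustrative only.
candidate_senses = {
    "Q35694": {"commonness": 0.70, "relatedness": 0.05},     # the animal sense (assumed)
    "Q26742231": {"commonness": 0.20, "relatedness": 0.60},  # the car-maker sense (assumed)
}

ALPHA = 0.5  # assumed weight balancing commonness and relatedness


def disambiguation_score(features):
    """Weighted sum of the commonness (prior) and the relatedness (context vote)."""
    return ALPHA * features["commonness"] + (1 - ALPHA) * features["relatedness"]


best_sense = max(candidate_senses, key=lambda qid: disambiguation_score(candidate_senses[qid]))
print(best_sense)  # 'Q26742231' -> the context makes the car maker win over the animal
```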

Pruning phase

Once we have mapped each word to a unique sense (or Wikidata entity), we want to select only the meaningful entities to generate candidate topics.

Back to our example: “Tata Motors acquisition of Jaguar”, we might discard “acquisition” which is not meaningful enough…

The pruning phase is a combination of two features:

  • The link probability of a word: the probability that an occurrence of the word in Wikipedia is a link pointing to some Wikipedia page. Back to our example: “Tata Motors acquisition of Jaguar”, the word “Tata Motors” is very likely to be a link in Wikipedia, as we can see in the figure below:
“Tata Motors” is very likely to be a link in Wikipedia.
  • The (averaged) coherence of the word within the description. For each word (and its unique associated sense thanks to the disambiguation phase), we compute a coherence score with every other word and average all the scores. The coherence score between two words (or senses) is also based on the overlap between their in-linking pages in Wikipedia. Back to our example: “Tata Motors acquisition of Jaguar”, the words “Tata Motors” and “Jaguar” are coherent with each other since both of them are car companies.
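Putting the two pruning features together, here is a toy sketch; the scores, the equal weighting and the threshold are illustrative assumptions, not the production values:

```python
# Illustrative pruning of candidate entities extracted from
# "Tata Motors acquisition of Jaguar". All numbers are made up.
candidates = {
    "Tata Motors": {"link_probability": 0.85, "avg_coherence": 0.70},
    "Jaguar":      {"link_probability": 0.60, "avg_coherence": 0.65},
    "acquisition": {"link_probability": 0.05, "avg_coherence": 0.10},
}

PRUNING_THRESHOLD = 0.4  # assumed cut-off


def pruning_score(features):
    """Simple average of link probability and coherence; the real weighting may differ."""
    return 0.5 * features["link_probability"] + 0.5 * features["avg_coherence"]


candidate_topics = [spot for spot, feats in candidates.items()
                    if pruning_score(feats) >= PRUNING_THRESHOLD]
print(candidate_topics)  # ['Tata Motors', 'Jaguar'] -- 'acquisition' is pruned
```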

3. Topic filter on text

This step takes a set of candidate topics as input and outputs/selects the accurate ones.

Finding accurate topics related to a video is not an easy task. Very often, our candidate topics are not accurate enough for several reasons:

  • The required precision threshold is very high (for example if we target a 99% precision for our topics).
  • The description is ambiguous or not representative of the video (in which case accurate topics would have to come from other metadata such as the frames or the audio).

Thus, we need to build an automatic (machine learning) model that first ranks the candidate topics and then selects the accurate ones.

Our model will mimic human decisions.

Ranking of the candidate topics

  • Feature engineering

For each candidate topic, we compute features that will help the model decide whether the (candidate) topic is central or off.

NB: a “central” candidate topic is a topic we associate with the video; if it is “off”, we discard it.

In addition to all the previous features (disambiguation score, averaged coherence score, …), we compute features related to the (Wikipedia) popularity of the topic, the match (occurrence and position in the description/title), the affinity between the candidate topics and the coherence with the channel (for example a candidate topic about Basketball is more likely to be central for the channel BeIN Sports than CNN…).

We are also working on integrating topic and channel embeddings.

  • Training of an optimal machine learning model on the features

In order to build our training set, we first had to manually label, on a small set of videos, which candidate topics are central or not (I know that’s a boring task, but Amazon Mechanical Turk can do it for you ;))

NB: in fact, validating topics is not an exact science. In some ambiguous cases, people might argue a topic is central to the video whereas others might argue the contrary! Fortunately, we can overcome this issue by averaging the votes…

Then, we train (or fit) a model on a subset of the training set and find the optimal hyper-parameters on the rest of it (aka the validation set): the optimal model has learnt (complex) rules on the features that automatically decide, for a new video, which candidate topics are likely to be central or off. Concretely, the estimator outputs a centrality score for every candidate topic of a given video, which allows us to do the ranking (see the sketch below).

Schema of the machine learning model
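As a minimal sketch of this ranking step, the snippet below trains a gradient boosting classifier with scikit-learn on placeholder features; the model choice, the feature set and the library are assumptions for illustration, not necessarily the production setup:

```python
# Minimal sketch of the ranking model. The features and labels are placeholders;
# gradient boosting and scikit-learn are assumptions, not the production setup.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# One row per (video, candidate topic): disambiguation score, avg coherence,
# popularity, match score, channel affinity, ... plus the human label (central=1 / off=0).
X = np.random.rand(1000, 5)              # placeholder for the real feature matrix
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # placeholder for the manual labels

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier(n_estimators=200, max_depth=3)
model.fit(X_train, y_train)

# The centrality score used to rank the candidate topics of a video:
centrality_scores = model.predict_proba(X_val)[:, 1]
```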

Selection of the candidate topics

The selection allows us to push accurate topics to the Dailymotion interface.

For the last step, we compute a centrality threshold on the validation set above which candidate topics are considered to be central to the video. We determine it by setting a minimum precision for our topics.

Once we obtain the centrality threshold, we can apply it, deduce the associated coverage on our video catalog, and we are done! :)

NB: the coverage on our video catalog is relatively close to the coverage on the validation set.
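Continuing the training sketch above (reusing y_val and the centrality scores), here is how such a centrality threshold could be derived from a target precision; the 95% target and the use of scikit-learn’s precision_recall_curve are assumptions for illustration:

```python
# Sketch: pick the centrality threshold that reaches a minimum precision
# on the validation set, keeping coverage as high as possible.
from sklearn.metrics import precision_recall_curve

MIN_PRECISION = 0.95  # assumed target precision, not the production value

# y_val and centrality_scores come from the ranking sketch above.
precision, recall, thresholds = precision_recall_curve(y_val, centrality_scores)

# Lowest threshold whose precision reaches the target (heuristic choice).
eligible = [t for p, t in zip(precision[:-1], thresholds) if p >= MIN_PRECISION]
centrality_threshold = min(eligible) if eligible else thresholds[-1]

# Candidate topics scoring above the threshold are considered central.
is_central = centrality_scores >= centrality_threshold
```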

What’s next?

Our solution meets all the criteria for a relevant topic annotation pipeline (relevance of the topics, good precision/coverage tradeoff, up-to-date & multi-lingual annotation).

In addition to other exciting projects (e.g: recommendation systems), the Data Team keeps on working on improving the characterization of our video catalog by:

  • Working on the frames to cover videos with no (or incomplete) descriptions, or to get more topics per video.
  • Working on the topics to build an automatic categorization of our topics (example: “2018 FIFA World Cup” < “Association Football” < “Sports”).
  • Working on both descriptions and topics/categories to get a “contextual” categorization of our videos.
