On Language Detection: Classification x User Consumption

I wrote this post when I worked at Dailymotion; it first appeared on the company’s engineering blog. You can also check Colomb Thomas’s article on PHP7 deployment at Dailymotion here on Medium.

What language are we talking about?

At Dailymotion, we aim to serve the right content to the right person at the right time. Except for funny cat videos, whose cuteness is independent of any language, we want to deliver intelligible content that the user will actually enjoy. But the truth is that nowadays, many people speak more than one language fluently.

© Jakub Marian (overlay), Tindo / fotolia.com (blank map)

We can therefore serve multilingual content to part of our audience. On our side, we end up with two linked issues concerning language detection:

  • a content-centric issue: understanding which language a video is in, and whether language matters for it (for example, language is not crucial when you watch a football match, but it can be when you watch a political debate),
  • a user-centric issue: knowing which language(s) a user can understand.

In this article, we will cover the content-centric issue and how we determine which language a video is in. Note that our primary goal here is to minimize false positives in language detection.

Language detection

For each of our videos, there are two main places where we can detect languages. The first is the textual fields, a.k.a. metadata, provided by the owner (the video’s title and description). The second is the video itself, with its frames and audio content. For this video of beautiful corgi puppies, we can run language detection on the textual metadata field “Miniature Corgi Puppies Are Unspeakingly Adorable” or on the video content itself, as we will see below.

Metadata language analysis with our custom detection tool: the BubbleTea API

To detect the language of the textual metadata, we rely on several well-known machine learning classifiers such as Langdetect and CLD2. These classifiers have been trained on carefully scraped web pages. On top of them, we built BubbleTea, our custom language detection API. Using a voting system, we avoid most false positives and only assign a language to a text when we are sure of it.
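The voting idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the actual BubbleTea voting rule is not described here, so the `vote_language` function, the `min_agreement` parameter and the two toy detectors are hypothetical stand-ins rather than the real classifiers.

```python
from collections import Counter
from typing import Callable, Iterable, Optional

# A detector maps a text to a language code, or None when it cannot decide.
Detector = Callable[[str], Optional[str]]


def vote_language(text: str, detectors: Iterable[Detector],
                  min_agreement: float = 1.0) -> Optional[str]:
    """Return a language only if enough detectors agree, else None.

    min_agreement is the fraction of detectors that must return the same
    language; 1.0 (unanimity) is an assumed, conservative default.
    """
    votes = [detect(text) for detect in detectors]
    if not votes:
        return None
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return None
    language, n_votes = counts.most_common(1)[0]
    # Abstentions count against agreement, so the function prefers
    # answering "unknown" over risking a false positive.
    if n_votes / len(votes) >= min_agreement:
        return language
    return None


# Toy stand-ins for real classifiers such as Langdetect or CLD2.
def always_french(text: str) -> Optional[str]:
    return "fr"


def french_if_accented(text: str) -> Optional[str]:
    return "fr" if any(c in "éèàùç" for c in text) else None
```

With unanimity required, a title is labelled only when every detector returns the same language. Lowering `min_agreement` trades some precision for coverage, which goes against the stated goal of minimizing false positives.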

While this gives us the metadata language and whether it would be fit for presentation, this does not guarantee that the video content itself is in this detected language. In the case below, the metadata are in French but the content itself is in English.

Video content language detection

For the video content itself, there are several ways to identify the spoken language. We could analyze the audio stream with spoken language detection services, transcribe the audio with speech-to-text tools such as Watson, Microsoft Translator or Uberi’s speech recognition library, or bring OCR into the game to read subtitles and any other words written in a video frame. These tools are very interesting, but they are also costly in terms of computation power or time, and since we would need to run them on all our videos, they would add a significant load to our workflow.

Moreover, if we consider video mixes such as these awesome cute animal compilations, there is no “true” language for this content. These videos are language-independent (just like the cuteness of puppies). So we decided to try an alternative approach that combines metadata identification and content usage.

Metadata x Usage knowledge approach

To get more language information on our popular content, we query the BubbleTea API to identify the language of the metadata, and then observe how the audience actually watches the content. Because in the end, the best person to tell us what you can understand is you :)

After some initial analyses to identify the origin of our visitors and which features were the most meaningful (their country and their language), we observed how “engaged” our audience was when watching a video. Using a variant of the Herfindahl index called the Smoothed Herfindahl Index (SHerf), which applies custom weights to the concentration values of each origin, we identified where the audience of a video was most concentrated. This feedback improved our understanding of where a video is likely to be viewed and appreciated. We could then begin to answer questions like:

  • Is the language provided by the video uploader correct?
  • Is this video content language-dependent?

After defining a concentration threshold, we distinguished two main types of content: high-concentration content, suitable for one main origin, and low-concentration content, which is not very language-specific and thus suitable for many origins. For reference, the original Herfindahl index is defined as

$$H = \sum_{i=1}^{n} s_i^2$$

where s_i is the market share of firm i in the market, and n is the number of firms.
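Carrying the definition over to our setting, an audience origin plays the role of a firm and its share of views plays the role of market share. Here is a minimal Python sketch of the plain, unsmoothed index. The custom weights that SHerf adds are not reproduced, and the function names and the 0.5 threshold are illustrative assumptions rather than production values.

```python
from typing import Mapping


def herfindahl_index(views_by_origin: Mapping[str, float]) -> float:
    """Plain Herfindahl index: sum of squared view shares across origins.

    Ranges from 1/n (views spread evenly over n origins) up to 1.0
    (all views coming from a single origin).
    """
    total = sum(views_by_origin.values())
    if total <= 0:
        raise ValueError("need at least one view to compute shares")
    return sum((count / total) ** 2 for count in views_by_origin.values())


def concentration_label(views_by_origin: Mapping[str, float],
                        threshold: float = 0.5) -> str:
    """Label a video as 'high' or 'low' concentration.

    The 0.5 threshold is a placeholder chosen for this example only.
    """
    return "high" if herfindahl_index(views_by_origin) >= threshold else "low"
```

A video watched almost entirely from one origin scores close to 1 and gets the "high" label, while a compilation watched evenly across many origins scores close to 1/n and gets the "low" label.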

The downside of both methods is that we need either enough text for the metadata classification or enough audience usage to compute SHerf. This means that for videos with little metadata or very low view counts, we need to come up with alternative solutions.

What we get

To illustrate our approach, we looked at the results for Dailymotion’s most viewed videos over a month. So what does the metadata x usage approach give us, and what did we find?

Is the language provided by the video uploader correct?

For our corpus of videos, we found that 1 time out of 4, the language provided by the video owner did not match the language detected by BubbleTea. This shows that we cannot rely only on the provided metadata. This 1 in 4 error rate is a lower bound for the catalog as a whole, since the videos we analyzed are deemed of good overall quality. For much of the rest of the catalog, the error rate of the owner-supplied language is higher (mainly “English by default”).
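The comparison behind this figure can be sketched as follows. This is a hypothetical illustration: the `mismatch_rate` function is not part of BubbleTea, and skipping videos where the detector abstains is an assumption about how such a rate might reasonably be computed.

```python
from typing import Iterable, Optional, Tuple


def mismatch_rate(pairs: Iterable[Tuple[str, Optional[str]]]) -> Optional[float]:
    """Share of videos whose declared language differs from the detected one.

    Each pair is (declared_language, detected_language). Videos where the
    detector abstained (None) are skipped rather than counted as mismatches.
    Returns None when no video could be compared.
    """
    compared = 0
    mismatches = 0
    for declared, detected in pairs:
        if detected is None:
            continue
        compared += 1
        if declared != detected:
            mismatches += 1
    return mismatches / compared if compared else None
```

Skipping abstentions keeps the rate focused on the cases where the detector was confident, in line with the goal of avoiding false positives.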

When we compare declared and BubbleTea-detected languages to what our users’ viewing behaviour tells us they understand, we find that the declared language matched the viewers’ consumed languages in less than 40% of cases. BubbleTea-detected languages performed slightly better, which indicates that metadata language detection beats the declared language in this case. By itself, however, BubbleTea is still not the optimal solution for detecting all suitable languages for a video.

In more than 60% of our cases, the metadata x usage knowledge revealed suitable languages other than the declared one.

Is this video language-dependent?

Thanks to the metadata x usage knowledge, we are now able to measure how tightly a video is tied to a given language in terms of audience understanding. Videos with low SHerf values are very loosely connected to specific languages: typical examples are compilations of fail videos, sports videos, popular music clips, nursery rhymes and blockbuster trailers. Videos with high SHerf values are more tightly tied to a specific audience: TV news segments, documentaries in non-English languages, and videos of regional events such as debates, local sports or sitcoms.

Additionally, we discovered that we could stretch the target audience of these videos from one specific language to almost three (ok, 2.9 to be precise). This means that most videos are not targeted towards a single audience language: a video’s audience coverage is not limited by the language it is spoken in. We can now widen the exposure of such videos to almost three times as many audience languages. Thanks to the metadata x usage approach, we have a way of automatically identifying such content and even qualifying it. For example, we know that football goals and highlights are language-independent.


With the metadata x usage approach, we also ran into unexpected findings. Our approach bubbled up videos in language A that had subtitles in language B, which means they are suitable for both languages A and B. This clearly widens a video’s potential audience, and we did not even need to bring OCR into the game. We also detected rising video trends that are not local to a given place, such as English nursery rhymes being popular in the Middle East. This is interesting both for understanding suitable languages for a video and for detecting new trends.

Take this away

With this methodology, we combined intrinsic video attributes with our acquired knowledge of viewing behaviour to obtain additional details about our content. We get the best of both worlds: we obtain the language of a video’s content AND we can either widen the potential audience of a video (low SHerf values) or narrow it if the video is very language-specific (high SHerf values). This process brought us language insights using a from-audience-to-content workflow. It can also go the other way around.

In this article, we wanted to show that having data is great, but identifying the job to be done is at least as important as the data itself. You will not know what to do with your data unless you carefully understand and describe your problem. Once we identified the different language issues at stake, we were able to address each one with the appropriate answer. With a naive approach, we would have simply classified the metadata language and deemed the result to be the truth. By also integrating audience behaviour in the process, we managed to temper this “truth” into something more realistic and interesting to work with. Long story short: having data is great, but thinking outside the box to combine knowledge and available tools is even better!

Senior Data Scientist @ Renault Digital, PhD, Forever Learner, former-Presidential Innovation Fellow.