Why video understanding is not an easy feat (1/2)

Facial recognition, logo recognition, mood understanding, scene segmentation, sound analysis and so on. Video understanding has come a long way in recent years. However, even though many players claim they already use video understanding artificial intelligence, there is still substantial progress to make before a machine can fully process and use what’s in a video.

The main reason is simple: a video is an evolving item, in which each image is linked to the others. It is one thing to understand an image. It is another to make assertions about the causal links between millions of images within the same piece of content. The gap between the two is tremendous, and it explains why video understanding is not such an easy feat…


But first things first: as you probably hear a lot about artificial intelligence, machine learning, algorithms and deep learning these days, let’s get things straight together. The process of video understanding relies on algorithms. An algorithm is a finite and unambiguous series of operations and instructions for solving a class of problems in order to get a result. An algorithm is not a secret spell. Nor is it a dangerous robot that thinks by itself. Video understanding, though eminently complex, is based on mathematical principles, and we’ll see a bit later why it is essential to keep that in mind.

Let’s get back to the difference between processing images and processing videos, and do the math together. In both cases, you’ll have to deal with images, as a video is simply a sequence of images, typically 25 per second. The order of magnitude varies a lot, however. Tagging images is humanly possible: a trained human is able to tag 4 images a second. A 2-hour movie contains 180,000 images, so a single movie would take a human 12.5 hours to handle. Indexing 10,000 movies, which is not such a huge amount, would then take 125,000 hours to process manually.
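The arithmetic above can be sketched in a few lines of Python, using the figures from the text (25 frames per second, 4 images tagged per second):

```python
FPS = 25                # frames per second in a typical video
MOVIE_HOURS = 2         # length of one movie
TAG_RATE = 4            # images a trained human can tag per second
CATALOG_SIZE = 10_000   # number of movies to index

frames_per_movie = FPS * MOVIE_HOURS * 3600           # 180,000 images
hours_per_movie = frames_per_movie / TAG_RATE / 3600  # 12.5 hours of tagging
hours_for_catalog = hours_per_movie * CATALOG_SIZE    # 125,000 hours total

print(frames_per_movie, hours_per_movie, hours_for_catalog)
```

At roughly 2,000 working hours per year, that last figure represents decades of full-time labor for a single annotator.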

Now forget about the good old movie and imagine that your goal is to process YouTube videos for an advertising campaign. Given the billions of hours streamed every day on YouTube, how long would it take to process “only” a million of them in order to create the most accurate campaign? Let’s face it: tagging videos is not a job for humans. And even if a human could do it in a reasonable amount of time, the main problem would lie elsewhere: how do you give global coherence to all these images that make up a scene or a movie?


Video understanding can only exist if two conditions are met: automation and causality.

Causality can be summed up with a simple example: in a movie scene, two characters are kissing. If you take single images from the scene, you’ll think it can be considered a love scene. They’re kissing, duh. Now if you analyze the whole scene, image after image, you’ll notice that before the kiss the characters were arguing, and after the kiss they were crying. Instead of a love scene, you might be dealing with a breakup. And if you link that scene to the previous one, you might understand that the breakup follows a sex scene with another character, leading you to believe that someone cheated on someone. Now your perception of the scene has fully evolved. Of course: you did not have the context. Neither did your algorithm.

Extracting random images from a scene or a movie will never allow you to truly understand what’s in it. This is the main challenge of understanding a video as a whole: the algorithm must link pieces of information and create meaning from them, as if it were solving a puzzle. Making this meaning possible involves, for instance, properly tracking every scene in a piece of content and understanding every transition from one image to the next.
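As a purely illustrative sketch of the idea in Python: a frame-level tag is re-interpreted using the tags of the surrounding frames, the way the kiss example above is re-read once the arguing and crying around it are taken into account. The labels and the rule are invented for the example, not taken from any real system:

```python
from collections import Counter

def contextual_label(frame_labels, i, window=2):
    """Re-interpret frame i's tag using a window of surrounding frames.

    frame_labels: per-frame tags from an image-level classifier.
    A kiss framed by arguing and crying reads as a breakup, not romance.
    (Hypothetical rule for illustration only.)
    """
    context = frame_labels[max(0, i - window): i + window + 1]
    counts = Counter(context)
    if frame_labels[i] == "kiss" and counts["arguing"] + counts["crying"] >= 2:
        return "breakup"
    return frame_labels[i]

scene = ["arguing", "arguing", "kiss", "crying", "crying"]
print(contextual_label(scene, 2))  # the isolated frame says "kiss"; the context says "breakup"
```

A real system would replace the hand-written rule with a temporal model, but the principle is the same: the meaning of a frame depends on its neighbors.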

The second condition is automating recognition. And this is where things start to get really challenging.


The main difficulty of recognition lies in its diversity. Whether you want to recognize a face, a logo, an object or a mood, you will not train your algorithm the same way and will not focus on the same aspects of recognition, which makes it even harder to check all the boxes of video understanding.

Let’s start with facial recognition.

Contrary to what many people would think, facial recognition is not about comparing calculated points of interest between different faces. Not anymore, at least. Since the rise of deep learning and big data, the tremendous amount of data collected and new computing power have made it possible to train deep neural networks. From now on, the network itself is “free” to choose how to characterize a face and compare it to another. Our engineers do not compare the faces; they make sure our network is doing it properly.
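A hedged sketch of what that looks like downstream, in Python: the network (not shown here) maps each face to an embedding vector, and the only explicit operation left to the engineer is a similarity test between vectors. The embeddings and the threshold below are toy values for illustration, not the output of a real model:

```python
import math

def cosine_similarity(a, b):
    """Similarity between two face embeddings (vectors a network has learned)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def same_person(emb1, emb2, threshold=0.8):
    # The network's learned "heuristic" is baked into the embeddings;
    # engineers only tune the comparison threshold, not the features.
    return cosine_similarity(emb1, emb2) >= threshold

# Toy 4-dimensional embeddings; real networks output hundreds of dimensions.
alice_a = [0.9, 0.1, 0.3, 0.2]
alice_b = [0.8, 0.2, 0.3, 0.1]
bob     = [0.1, 0.9, 0.2, 0.7]
print(same_person(alice_a, alice_b))  # True
print(same_person(alice_a, bob))      # False
```

Note what is missing: nothing in this code says “eyes”, “nose” or “mouth”. The features live inside the network that produced the vectors.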

Let’s take a minute to understand why this is a real change in the way we comprehend machine learning. Graph theory, algorithms and artificial intelligence rely on heuristics. A heuristic is a calculation method that provides a feasible answer to a complex problem in a short amount of time. Not necessarily the best solution, but one that gives you a correct result in due time. In our case: comparing faces by calculating points of interest between two different ones. The real shift is that we no longer quite understand the reasoning used to achieve this result. We only analyze the result. Pretty much like a human brain would do.

If you had to answer the question “how do you know that this is a face?”, it would be hard for you to explain. Of course, you could argue that it’s a face because it has “two eyes, a mouth and a nose”. However, you’d be able to recognize a face even without all these points of interest. You could say it’s a face just because you feel it is one. Summing up a list of identifiable items is not how the brain works, and it will never produce the same result as letting your brain assemble many variables to know for sure that it’s a face. Applying this observation to deep learning, engineers have allowed algorithms to learn their own heuristics, even though it meant losing part of the understanding. And they realized this was far more efficient than trying to “command” the algorithm.

The algorithm now has its own way of solving the problem, which has led to the very common feeling that “no one can understand what the machine does anymore”. As it relies less on physical characteristics and more on purely efficient mathematical operations, the algorithm’s heuristic becomes obscure to human eyes. But still: it remains the best way for the algorithm to improve at the game of faces.

Strangely enough, becoming the face master does not make it the best candidate at the game of logos or objects. A critical gap exists between the different kinds of recognition. And that’s precisely what we’ll try to explain in the second part of this article.