Why video understanding is not an easy feat (2/2)
Back in 2015, researchers at the University of Cambridge and Stanford University showed that, given enough data, their algorithm could predict your personality traits better than another human being, based on your Facebook likes. It needed only 10 likes to know you better than your work colleagues, and only 300 to outperform your husband or wife. Kind of creepy? Well, not so much if you stop looking at the problem with human eyes.
When it comes to algorithms, and to video understanding in particular, the common mistake is to mix up how humans frame a problem with how an algorithm solves it.
We falsely assume that problems we find complex will also be complex from a computer science point of view. On the contrary, an issue that is simple for the human mind can be highly difficult for a machine. Predicting people's wishes and reactions from their previous behavior seemed impossible; the "Facebook likes" example showed how wrong we were.
When it comes to video understanding, some false beliefs can likewise distort our perception of the problems that actually arise when trying to automate all kinds of recognition.
A FACE IS NOT A LOGO
Now that we are clear on how the algorithm works (even though we do not know for a fact how it does it), it is important to distinguish two steps in recognition: detection and identification. In other words, knowing that there is something to recognize, and knowing what or who it is.
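The two-step split can be sketched in a few lines of code. This is a minimal illustration, not a real recognition system: `detect_faces` and `identify_face` are hypothetical stand-ins for trained models, and every region and name below is made up.

```python
from dataclasses import dataclass

@dataclass
class Region:
    """A candidate bounding box in a video frame."""
    x: int
    y: int
    w: int
    h: int

def detect_faces(frame):
    """Step 1 (detection): find regions that *might* contain a face.

    A real detector would scan the frame; this stub pretends
    it found two candidate regions.
    """
    return [Region(10, 20, 64, 64), Region(121, 40, 64, 64)]

def identify_face(frame, region, known_people):
    """Step 2 (identification): decide *whose* face a region holds.

    A real identifier would compare a face embedding against a
    database; this stub just picks a name deterministically.
    """
    index = (region.x + region.y) % len(known_people)
    return known_people[index]

def recognize(frame, known_people):
    """Full pipeline: 'something is there', then 'it is X'."""
    results = []
    for region in detect_faces(frame):
        name = identify_face(frame, region, known_people)
        results.append((region, name))
    return results

hits = recognize(frame="fake-frame", known_people=["Brad Pitt", "Tom Cruise"])
for region, name in hits:
    print(name)
```

The point of the separation is that each stage can fail independently: a system can detect a face without knowing whose it is, or miss the detection entirely and never get the chance to identify anyone.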

The good thing with faces is that they pretty much all look alike and share important similarities. Apart from babies and very unusual faces, comparing them mostly relies on the same physical reference points. A logo, on the contrary, can take infinite forms, from a single arrow to a complex picture, which makes merely detecting the presence of a logo very tricky. And what about the KFC logo, which is also a face? Now you start to understand.
OBJECTS: TO INFINITY AND BEYOND
Face recognition, logo recognition and scene classification all remain relatively simple because they follow finite rules. Faces almost all share the same reference points, as do settings (differentiating a mountain scene from a sea scene). And even though logos are trickier to detect, there is a limited number of them in the world. Worst case scenario, you would "only" need to process them all and add them to your database. Even then, you would have to make sure you can recognize a logo in every situation, from a billboard to a creased t-shirt.
Objects, however, follow infinite rules, mostly because the rules depend on each object type and category. In a nutshell, you would have to recognize every type of object in every category to make sure you can recognize it in every situation. Take a chair, for instance. You can train your algorithm to recognize a specific type of chair, say the Bodil Accent Chair from Made. But you will have to repeat the same exercise for every kind of chair that differs too much from that model. Only three legs? Not a chair. Leather instead of plastic? Not a chair. Round back? Still not a chair.
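The brittleness described above can be caricatured in code. The sketch below hard-codes a "chair" as a fixed set of attributes; every attribute name and value is an invented assumption for illustration, not a real model.

```python
# A hand-written "chair" rule set, as a caricature of recognition
# that only matches one specific model of chair.
CHAIR_RULES = {
    "legs": 4,
    "material": "plastic",
    "back_shape": "square",
}

def is_chair_by_rules(obj):
    """Returns True only if the object matches every hard-coded rule."""
    return all(obj.get(key) == value for key, value in CHAIR_RULES.items())

# The model chair the rules were written for:
bodil_like = {"legs": 4, "material": "plastic", "back_shape": "square"}

# Perfectly valid chairs the rules nevertheless reject:
three_legged = {"legs": 3, "material": "plastic", "back_shape": "square"}
leather = {"legs": 4, "material": "leather", "back_shape": "square"}
round_back = {"legs": 4, "material": "plastic", "back_shape": "round"}

print(is_chair_by_rules(bodil_like))    # True
print(is_chair_by_rules(three_legged))  # False: "only three legs? Not a chair."
print(is_chair_by_rules(leather))       # False
print(is_chair_by_rules(round_back))    # False
```

Every new kind of chair forces a new rule set, which is exactly why this approach cannot scale to an open-ended world of objects.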
An Ikea coffee table might well be mistaken for an Amazon nightstand, and the algorithm would not even raise an eyebrow (metaphorically speaking). Given that there is an infinite number of objects with an infinite number of forms, object detection remains the holy grail of recognition.
FROM IMAGE INDEXATION TO VIDEO UNDERSTANDING
Recognition automation opens tremendous new opportunities for both the media and advertising industries. It makes it possible to scan almost every part of any content automatically and understand what is going on: who is on screen, what they are doing, the mood, the brands and objects appearing, and so on. Understanding what the audience is watching in real time brings contextualization to content and invents new forms of interaction.
Imagine pushing Ronaldo's new Nike shoes every time someone watches a video of him on YouTube, or promoting an actor's new movie because the viewer is watching his latest interview on a television channel. Recognition could give television a new life and open new advertising territories. It also enables a new mindset for digital marketing.

However, one should not think that fully understanding a piece of content is already a done deal. That is what makes the big difference between factitious recognition and real artificial intelligence. In the first case, you do not train the algorithm: you just give it a list of commands to apply in a specific context. In the second case, you teach the algorithm the gift of perception, which is supposed to be purely human. The algorithm gets to create its own rules, from natural images, to distinguish one face, logo or scene from another. You do not merely tell the algorithm "this is Brad Pitt, and this is Tom Cruise"; you train it to work out by itself why each one is different. This is a slow and tricky process.
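The contrast between the two cases can be sketched on toy data. Below, the first function is a hand-written command; the second derives its own decision rule from labeled examples, here via a tiny nearest-centroid classifier. All the 2-D "feature vectors" and labels are made-up numbers for illustration only.

```python
# Approach 1: a list of commands -- the rule is written by hand.
def rule_based(features):
    return "face" if features[0] > 0.5 else "logo"

# Approach 2: the algorithm derives its own rule from labeled
# examples, by averaging each class into a centroid.
def train_centroids(examples):
    centroids = {}
    for label in {label for _, label in examples}:
        points = [f for f, l in examples if l == label]
        centroids[label] = tuple(sum(dim) / len(points) for dim in zip(*points))
    return centroids

def predict(centroids, features):
    """Assign the label of the nearest learned centroid."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(centroids[label], features))

examples = [((0.9, 0.8), "face"), ((0.8, 0.9), "face"),
            ((0.1, 0.2), "logo"), ((0.2, 0.1), "logo")]
model = train_centroids(examples)
print(predict(model, (0.85, 0.9)))  # "face" -- a rule the data produced
print(predict(model, (0.15, 0.1)))  # "logo"
```

The second approach is the interesting one: nobody wrote the boundary between "face" and "logo", it emerged from the examples, which is (in miniature) what training on natural images means.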
AT THE DAWN OF VIDEO UNDERSTANDING
The reality is that AI on this topic (like many others) is in its early stages, and even though facial recognition is advanced, no one can claim that video understanding is a mastered subject. The algorithm must be able both to automate recognition and to create meaning between the different items of a scene, and between the different scenes of a piece of content. Light years away from the "4 tags per second" human expert.
What still separates artificial intelligence from complete video understanding is the inability to translate everything into mathematical principles. Take the chair example again. We explained why the infinite number of chair types makes it very hard to classify all chairs. Still, you could try, or you could describe the components and attributes of a chair as accurately as possible. But the day an unconventional chair appears that matches none of them, everything the algorithm has learned becomes useless: everything you thought was objective, tangible information summarizing a chair is absent from the new example, and there is no objective reason for the algorithm to assess that it actually is one. Same thing if an actor imitates another so well that it becomes confusing. Same thing if a logo is a face and needs to be recognized on an object.

All these specific yet endless examples are hard to process because they do not rely on strict mathematical principles, only on common sense and experience, which are still very human. A day will come when video understanding will include common sense, but as of today, let's say that we humans still hold this competitive advantage over AI. However, automation is and will remain the only way to scale and fully deliver video understanding. And that is precisely what we work on at Reminiz.
Recent video technologies have opened very exciting fields for video understanding automation: split screens, picture-in-picture, special effects and filters are all challenges that explain why video understanding is not such an easy feat, but also why it has created, and will keep creating, amazing opportunities for the media industry in the years to come.

