Hello, World: Building an AI that understands the world through video
Machines today can identify objects in images, but they are unable to fully decipher the most important aspect: what’s actually happening in front of the camera. At TwentyBN, we have created the world’s first AI technology that shows an awareness of its environment and of the actions occurring within it. Our system observes the world through live video and automatically interprets the unfolding visual scene. Check it out yourself:
Deep Learning as a succession of bold attempts
Evolution is rarely linear. As with other technologies before it, deep learning has advanced in a series of step functions defined by sudden, often unexpected, bursts of capability. Each step function fundamentally pushed the envelope beyond what computers were previously able to achieve. One of the first such breakthroughs came in 2012, when Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton showed that deep neural networks, trained using backpropagation, could beat state-of-the-art systems in image recognition. Since then, similar breakthroughs have occurred in previously intractable problems, ranging from machine translation and voice synthesis to beating the world champion in the game of Go. Importantly, each milestone seemed out of reach at the time, which made its achievement all the more surprising.
At TwentyBN, we believe that the next breakthrough will concern video understanding. As with previous discoveries in deep learning, there has been a widespread consensus that human-level video understanding lies far in the future. The best available technology is merely capable of identifying or segmenting objects and people in individual frames. Human-level visual intelligence seems far beyond what is technically possible.
Providing AI systems with an awareness of the physical world
Today, we present the first proof point that our conviction is paying off. We show that neural networks trained with backpropagation are perfectly capable of learning enough about the physics of the world to understand what is happening in visual scenes. The video footage presented here shows a deep neural network that was trained on hundreds of thousands of labeled video clips of highly complex human actions. The training videos were designed to capture many of the complex and subtle ways in which objects and humans can move and physically influence each other in the world. The resulting system must learn to understand the verbs (motions), not just the nouns (objects), of a video’s content. It also has to infer a wealth of cues about object behaviors and relations to predict the correct labels.
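The sketch below (in PyTorch) illustrates the general recipe of end-to-end supervised training on labeled clips. The small 3D convolutional architecture, the class count and the tensor shapes are illustrative assumptions, not a description of our production system.

```python
import torch
import torch.nn as nn

class VideoActionClassifier(nn.Module):
    """Minimal sketch of an end-to-end video classifier (illustrative only)."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            # 3D convolutions operate over (time, height, width) jointly,
            # so motion cues are learned rather than hand-engineered.
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # pool over the whole clip
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, frames, height, width)
        return self.classifier(self.features(clip).flatten(1))

# Plain supervised learning with backpropagation on labeled clips:
model = VideoActionClassifier(num_classes=174)    # hypothetical number of action classes
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

clips = torch.randn(8, 3, 16, 112, 112)           # dummy batch of 16-frame clips
labels = torch.randint(0, 174, (8,))              # dummy action labels
optimizer.zero_grad()
loss = criterion(model(clips), labels)
loss.backward()
optimizer.step()
```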
Many of the videos were recorded by crowd workers using a web platform we built specifically for recording training material that teaches neural networks “intuitive physics”. The videos range from simple hand motions and actions (like swiping left or jumping forward) to complex object manipulations and interactions (such as “Trying to pour water into a cup, but missing so it spills next to it”). Children learn about the world in part by running countless “experiments” during daily play. While we cannot ask our networks to do the same, we tried to achieve an analogous result by asking them instead to watch the videos in our database and draw correct inferences about them.
The tasks were designed to be so hard that, to solve them, the network has to develop a deep understanding of many real-world concepts. For example, the network is asked to tell “Throwing [something] in the air and catching it” from “Throwing [something] in the air and letting it fall”, and “Putting something behind something” from “Pretending to put something behind something, but not actually leaving it there”. In the demonstrations above and below, we intentionally used the placeholder “[something]” to put the emphasis on the motions rather than the objects.
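To illustrate what such templated labels might look like as training data, here is a hypothetical record layout in Python. The clip IDs, object nouns and field names are invented for the example; the templates follow the classes quoted above.

```python
# Hypothetical records for templated action classes. The "[something]"
# placeholder keeps the class about the motion, not the particular object.
clips = [
    {"clip_id": 1021, "object": "a tennis ball",
     "template": "Throwing [something] in the air and catching it"},
    {"clip_id": 1022, "object": "a set of keys",
     "template": "Throwing [something] in the air and letting it fall"},
    {"clip_id": 1023, "object": "a coin",
     "template": "Pretending to put [something] behind [something], "
                 "but not actually leaving it there"},
]

# For training, the template itself serves as the class label;
# the object noun is optional metadata for reconstructing a full caption.
templates = sorted({clip["template"] for clip in clips})
template_to_index = {template: i for i, template in enumerate(templates)}
labels = [template_to_index[clip["template"]] for clip in clips]
```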
When training the first networks on this data, we were prepared to “tutor” them with architectural tricks and coaching signals so that they could learn at all on these extraordinarily difficult tasks. However, we found that as our database grew, the networks started to figure out by themselves the internal representations they needed to output correct solutions. While a lot of engineering, infrastructure and tricks are required to gather this much data and to have neural networks cope with it, we found that beyond these, once again, backpropagation and well-designed data go a much longer way than widely assumed.
Teaching machines visual common sense
Let’s put our video demonstration into context. The vast majority of commercial use cases in video understanding lie beyond the envelope of what current technology is capable of. The reason is that machines today lack a common-sense understanding of the real world. We humans rely constantly on our innate ability to understand and make inferences about our environment as we navigate the physical world. Our common-sense reasoning is built up through lifelong experience, beginning in early childhood and extending well into adulthood.
Videos are ideally suited for teaching machines a lot of knowledge about the world because they implicitly encode the physical constraints that define it. Unfortunately, most existing systems approach the video understanding problem with image understanding components. As a result, virtually all solutions rely on frame-by-frame analysis that can identify and segment objects in isolated frames, instead of understanding motion patterns and object behaviors based on holistically labeled video segments.
Our bold attempt to reimagine video understanding is starting to bear fruit. It has allowed us to successfully train neural networks end-to-end on a wide range of action understanding tasks (see the example predictions at the beginning of this post). Just a few months ago, no AI system appeared anywhere near solving these tasks.
The animation below illustrates this further using one of the more difficult examples from the dataset. A system trained to perform frame-by-frame analysis can at best predict the dominant object class in each frame. Our systems are trained to integrate information across the whole clip and can arrive at predictions like “someone is pretending to pick up a black remote control”.
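The contrast can be sketched in a few lines of PyTorch. Everything below is illustrative (the model choices, shapes and class count are assumptions): the point is only that a per-frame classifier discards temporal order, while a clip-level model integrates evidence over time before committing to a single prediction.

```python
import torch
import torch.nn as nn

num_classes = 174                            # hypothetical number of action classes
frames = torch.randn(16, 3, 112, 112)        # one clip as 16 RGB frames (dummy data)

# Frame-by-frame baseline: an image classifier applied to each frame independently.
# Averaging the per-frame scores throws away temporal order, so "picking up"
# and "pretending to pick up" become indistinguishable.
image_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, num_classes))
frame_logits = image_model(frames)           # (16, num_classes)
frame_vote = frame_logits.mean(dim=0)        # one prediction from averaged frame scores

# Clip-level model: a recurrent head reads the frame features in order
# and only then produces a single prediction for the whole clip.
temporal_model = nn.GRU(input_size=num_classes, hidden_size=256, batch_first=True)
clip_head = nn.Linear(256, num_classes)
_, hidden = temporal_model(frame_logits.unsqueeze(0))   # input: (1, 16, num_classes)
clip_logits = clip_head(hidden[-1])                     # one prediction per clip
```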
In the tradition of classic computer vision, skepticism towards end-to-end learning has driven a strong push towards “pixel labeling” and object segmentation since the initial ImageNet successes. The hope was that a system that can segment hands and objects (possibly using a neural network) could then be used to engineer a solution that detects, say, that an object is being picked up.
In complete opposition to that approach, we train our networks end-to-end on a lot of carefully collected data. This ensures that an understanding of objects and their relations has to emerge in response to the subtle distinctions required by the task. Over the coming months, we will continue our push towards the goal of building machines that can truly perceive the world like humans.
Join us this week at the RE•WORK Deep Learning Summit in Montréal and at Nvidia’s GPU Technology Conference 2017 in Munich, where we will give live demonstrations of our system in action. We will also explain how our systems at TwentyBN drive commercial value for our customers, and speak about our long-term AI and product agenda of learning common-sense world knowledge through video.
If you want to learn more about our work, please reach out to us.