Learning about the world through video

At TwentyBN, we build AI systems that enable a human-like visual understanding of the world. Today, we are releasing two large-scale video datasets (256,591 labeled videos) to teach machines visual common sense. The first dataset allows machines to develop a fine-grained understanding of basic actions that occur in the physical world. The second dataset of dynamic hand gestures enables robust cognition models for human-computer interaction. For more information, visit our dataset page, read our research paper, or contact us.

Example videos from our datasets

Video is becoming ubiquitous

Video plays an increasingly important role in our lives. As consumers, we collectively spend hundreds of millions of hours every day watching and sharing videos on services like YouTube, Facebook or Snapchat. When we are not busy gobbling up video on social media, we produce more of it with our smartphones, GoPro cameras and (soon) AR goggles. As a growing fraction of the planet’s population is documenting their lives in video format, we are transitioning from starring in our own magazine (the still image era) to starring in our own reality TV show.


Video is the next frontier in computer vision

Deep Learning has made historic progress in recent years by producing systems that rival — and in some cases exceed — human performance in tasks such as recognizing objects in still images. Despite this progress, enabling computers to understand both the spatial and temporal aspects of video remains an unsolved problem. The reason is sheer complexity. While a photo is just one static image, a video shows narrative in motion. Video is time-consuming to annotate manually, and it is computationally expensive to store and process.

Existing computer vision systems produce (at best) descriptions of the world that are not robust. Here are a couple of examples produced by a model that generates natural language descriptions of images (Source: Karpathy & Fei-Fei)

A novel approach to video understanding

One of the most important rate-limiting factors for advancing video understanding is the lack of large, diverse, real-world video datasets. Many video datasets published to date suffer from a number of shortcomings: they are often weakly labeled, lack variety, or underwent a high degree of editing and post-processing. A few notable exceptions, like DeepMind's recently released Kinetics dataset, try to alleviate this by focusing on shorter clips, but since they show high-level human activities taken from YouTube videos, they fall short of representing the simple physical object interactions that will be needed for modeling visual common sense.

1. “Something-something” dataset

This snapshot contains 108,499 annotated video clips, each between 2 and 6 seconds in duration. The videos show objects and the actions performed on them across 175 classes. The captions are textual descriptions based on templates, such as "Dropping something into something". Each template contains "something" slots that serve as placeholders for objects. This shared structure links the text and video representations and makes the learning problem easier for the network.
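The template mechanism can be illustrated with a short sketch. The helper below is hypothetical (not part of the TwentyBN tooling): the template string acts as the class label, and filling its "something" slots with concrete objects yields the caption for one particular video.

```python
# Hypothetical sketch of template-based captions. The template itself is the
# class label; filling its "something" slots produces a concrete caption.

def fill_template(template, objects):
    """Replace each 'something' slot with the next object, left to right."""
    caption = template
    for obj in objects:
        caption = caption.replace("something", obj, 1)
    return caption

template = "Dropping something into something"
print(fill_template(template, ["a coin", "a mug"]))
# -> Dropping a coin into a mug
```

Because every video filed under this template shares the same label, the network must learn what "dropping X into Y" looks like independently of which objects play the roles of X and Y.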


2. “Jester” dataset

This snapshot contains 148,092 annotated video clips, each about 3 seconds long. The videos cover 25 classes of human hand gestures, plus two "no gesture" classes that help the network distinguish specific gestures from unknown hand movements. The videos show human actors performing generic hand gestures in front of a webcam, such as "Swiping Left/Right," "Sliding Two Fingers Up/Down," or "Rolling Hand Forward/Backward." Predicting these textual labels from the videos requires a network capable of grasping concepts such as the degrees of freedom in three-dimensional space (surging, swaying, heaving, etc.).
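At inference time, the two "no gesture" classes give a recognizer a built-in reject option. The sketch below is illustrative only (class names and the confidence threshold are our own placeholders, not taken from the dataset specification): a prediction is accepted only if the top class is a real gesture and its score clears a threshold.

```python
# Illustrative sketch of using "no gesture" classes as a reject option.
# Class names and the threshold here are placeholders, not dataset spec.

NO_GESTURE = {"Doing other things", "No gesture"}

def interpret(probs, threshold=0.5):
    """Return the predicted gesture, or None if the model saw no gesture.

    probs: dict mapping class name -> probability for one video clip.
    """
    best = max(probs, key=probs.get)
    if best in NO_GESTURE or probs[best] < threshold:
        return None  # reject: unknown movement or low confidence
    return best

print(interpret({"Swiping Left": 0.8, "No gesture": 0.2}))
# -> Swiping Left
```

Training explicit reject classes, rather than relying on thresholds alone, lets the network learn what "not a gesture" actually looks like.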


Key characteristics of both datasets

  • Supervised learning: In contrast to other methods that seek to acquire common sense through predictive unsupervised learning, we phrase the task as a supervised learning problem. This makes the representation learning task more tractable and well-defined.
  • Dense captioning: The labels describe video content that is restricted to a short time interval. This ensures there is a tight synchronization between the video content and the corresponding caption.
  • Crowd-acted videos: In contrast to other academic datasets that source and annotate video clips from YouTube, we created our datasets using crowd acting. Our proprietary crowd acting platform allows us to ask crowd workers to provide videos given caption templates instead of the other way around. This facilitates the generation of labeled recordings rather than just the labeling of existing videos.
  • Human focused: With the exception of motion "textures" like ocean waves or leaves in the wind, most complex motion patterns we ever see are caused by humans. Our datasets are human-centered so that they contain the complex spatio-temporal patterns that encode articulation, degrees of freedom, and related features.
  • Natural video scenes: Our videos were captured with many different devices and varying zoom factors. The datasets feature scenes with natural lighting, partial occlusions, motion blur and background noise. This ensures that models trained on the datasets can transfer to real-world use cases with minimal domain shift.

The practical use of visual common sense

How do we go from an understanding of physical concepts to offering practical, real-world solutions? We believe the answer lies in transfer learning: reusing the representations a network learns on one task as the starting point for another, so that common-sense features acquired on our datasets can bootstrap downstream applications.
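The idea can be sketched in a few lines. Everything below is a toy stand-in, not an actual pretrained video model: the frozen "backbone" uses random weights where real pretrained weights would go, and only a small new classification head is trained on the target task.

```python
import numpy as np

# Toy transfer-learning sketch. The backbone weights below stand in for a
# network pretrained on a large video dataset; in practice they would be
# loaded, not sampled. Only the new head is trained on the target task.

rng = np.random.default_rng(0)

W_backbone = rng.normal(size=(64, 16))  # frozen, never updated

def features(x):
    """Frozen feature extractor: the transferred part of the model."""
    return np.tanh(x @ W_backbone)

# Toy binary target task: two clusters of 64-dim "clip" vectors.
X = np.vstack([rng.normal(0.5, 1, (50, 64)), rng.normal(-0.5, 1, (50, 64))])
y = np.array([1] * 50 + [0] * 50)

# New head: logistic regression trained by gradient descent on frozen features.
F = features(X)
w, b = np.zeros(16), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(F @ w + b)))
    grad = p - y
    w -= 0.1 * F.T @ grad / len(y)
    b -= 0.1 * grad.mean()

acc = ((1 / (1 + np.exp(-(F @ w + b))) > 0.5) == y).mean()
```

Because the backbone stays fixed, the head can be fit with very little labeled data from the target domain, which is exactly what makes pretraining on large video datasets practically valuable.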

How to get the data and where to benchmark your results

The two datasets are available for download on our website. You can find more information about the datasets and the technical specifics of our research in this technical report. If you want to benchmark your own model's accuracy on the datasets, you will be able to upload your results to our website to be ranked on a leaderboard. If you want to license our datasets for commercial use, please reach out to us.


