Towards Situated Visual AI via End-to-End Learning on Video Clips

Twenty Billion Neurons
twentybn
Oct 23, 2020

We are excited to announce the public release of Sense.

Sense is an open-source inference engine for neural network architectures that takes an RGB video stream as input and transforms it into a corresponding stream of labels in real time. Labels include day-to-day human actions (picking up objects, drinking water, fixing your hair, etc.), hand gestures, fitness exercises and more.

Architecture, Model, and Applications

A key feature of the architecture is that it is trained end-to-end to go from pixels to activity labels, rather than relying on bounding boxes, pose estimation, or any other form of frame-by-frame analysis as an intermediate representation.

Training data takes the form of short labelled video clips (average length about four seconds) showing a wide range of activities happening in front of the camera. Labels for the large majority of the data are “holistic” in the sense that a single label is assigned to the whole video clip.

Deployment of the trained network is online: the network transforms an input stream of pixels into an output stream of “concepts” on the fly.

The repository currently contains two 3D CNNs with different trade-offs between accuracy and computation. Both architectures consume 16 frames per second and produce 4 predictions per second, while maintaining a fairly small computational footprint. The smaller network runs efficiently on desktop and mobile CPUs, whereas the larger network works well on common accelerators (Apple’s A11 and up, or Qualcomm Snapdragon). Read the source code to find out more.
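To make the streaming behaviour concrete, here is a minimal sketch of a sliding-window inference loop in Python. The model wrapper and its predict call are hypothetical placeholders, not the actual Sense API; the point is the frame arithmetic: at 16 input frames per second and 4 predictions per second, the network emits one prediction for every 4 new frames.

    import collections

    import cv2  # OpenCV, for webcam capture

    WINDOW = 16  # frames held in the temporal window
    STRIDE = 4   # 16 fps input / 4 predictions per second = 4 new frames per prediction

    def run_stream(model):
        """Pixels in, labels out, on the fly (placeholder model API)."""
        frames = collections.deque(maxlen=WINDOW)
        capture = cv2.VideoCapture(0)
        fresh = 0
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            frames.append(frame)
            fresh += 1
            # Once the window is full, emit a prediction every STRIDE frames.
            if len(frames) == WINDOW and fresh >= STRIDE:
                label, confidence = model.predict(list(frames))  # placeholder call
                print(f"{label}: {confidence:.2f}")
                fresh = 0
        capture.release()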

Additionally, model weights pre-trained on several million labelled videos across several thousand classes are available on our SDK page. You can explore two potential applications of these models in our repository:

  • A real-time gesture and action classifier
  • A real-time, vision-based calorie counter that estimates the calories burned by the user working out in front of the camera (“pixels in — calories out”)

That’s not all, though: we also provide an easy-to-use script to train your own custom classifier by fine-tuning the final layers of our model on your dataset. Simply organise your dataset as specified (a sketch follows below) and launch our script. On some transfer-learning tasks, the model has proven powerful in a few-shot scenario with only 2–3 samples per class!
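For illustration, a fine-tuning dataset might be organised along these lines, with one folder per class and separate training and validation splits. The directory and class names below are placeholders; refer to the repository’s documentation for the exact layout the training script expects.

    dataset/
        videos_train/
            class_1/
                example_1.mp4
                example_2.mp4
            class_2/
                ...
        videos_valid/
            class_1/
                ...
            class_2/
                ...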

Both the examples we provide and custom classifiers can be easily deployed on an iOS device with a small app available through our sense-iOS repository.

Upcoming Use Cases

We plan to add more example use cases to Sense in the coming weeks and look forward to showcasing community projects.

In the past, in collaboration with our industrial partners, we have applied pre-trained networks like these across a range of use cases requiring real-time vision, including:

  • Gesture control (in smart home devices, smart kiosks, cars)
  • Human action recognition (in smart home devices, cars, public spaces, video calls)
  • Fitness tracking
  • Human-computer interaction (e.g., is the user talking to me or someone else?)
  • AR (gesturing from an “ego” perspective)
  • Enabling “digital humans” and virtual beings to interact with users

The network and data also power the tracking capabilities of the embodied AI fitness coach in this consumer app.

Collaborate with 20BN on Situated AI

We believe that an AI system that learns to process visual information online, in the context of interactions and conversations with humans, will play a key role in advancing AI towards visual grounding and, eventually, “reasoning”. We hope that this release will further the research community’s efforts to make AI systems more situated, and that it will drive the development of additional use cases and the adoption of situated, interactive AI.

Authors: Guillaume Berger, Antoine Mercier, Florian Letsch, Cornelius Boehm, Sunny Panchal, Nahua Kang, Mark Todorovich, Ingo Bax, Roland Memisevic

Image credit: An (Visual Artist @ 20BN)

