Towards Situated Visual AI via End-to-End Learning on Video Clips


We are excited to announce the public release of Sense.

Sense is an open source inference engine for neural network architectures that takes an RGB video stream as input and transforms it into a corresponding stream of labels in real time. Labels cover day-to-day human actions (picking up objects, drinking water, fixing your hair, etc.), hand gestures, fitness exercises, and more.

Architecture, Model, and Applications

A key feature of the architecture is that it is trained end-to-end, mapping pixels directly to activity labels rather than relying on bounding boxes, pose estimation, or any other form of frame-by-frame analysis as an intermediate representation.

Training data takes the form of short labelled video clips (average length about four seconds) showing a wide range of activities happening in front of the camera. Labels for the large majority of the data are “holistic” in the sense that a single label is assigned to the whole video clip.
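Clip-level labels of this kind can be represented very simply. The sketch below is purely illustrative — the field names and file names are hypothetical, not 20BN's actual annotation schema:

```python
# Hypothetical clip-level ("holistic") annotations: one label per whole clip.
# Field names and file names are illustrative, not 20BN's actual schema.
annotations = [
    {"clip": "clip_0001.mp4", "duration_s": 3.8, "label": "drinking water"},
    {"clip": "clip_0002.mp4", "duration_s": 4.1, "label": "picking up object"},
    {"clip": "clip_0003.mp4", "duration_s": 4.3, "label": "fixing hair"},
]

# As described above, clips average roughly four seconds in length.
avg_len = sum(a["duration_s"] for a in annotations) / len(annotations)
print(f"{len(annotations)} clips, average length {avg_len:.1f}s")
```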

The trained network is deployed online: it transforms an input stream of pixels into an output stream of “concepts” on the fly.

The repository currently contains two 3D CNNs with different trade-offs between accuracy and computation. The network architectures consume 16 frames per second and produce 4 predictions per second while maintaining a fairly small computational footprint. The smaller network runs efficiently on desktop and mobile CPUs, whereas the larger network works well on common accelerators (Apple’s A11 and up, or Qualcomm Snapdragon). Read the source code to find out more.
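The timing described above — consuming 16 frames per second and emitting 4 predictions per second — can be sketched as a sliding-window loop. This is a minimal illustration of the idea, not Sense's actual implementation; the window size and the `predict` stand-in are assumptions:

```python
from collections import deque

FRAME_RATE = 16          # frames consumed per second
PREDICTIONS_PER_SEC = 4  # predictions emitted per second
STRIDE = FRAME_RATE // PREDICTIONS_PER_SEC  # 4 frames between predictions
WINDOW = 16              # hypothetical temporal window of the 3D CNN

def predict(frames):
    """Stand-in for the 3D CNN forward pass; returns a dummy label."""
    return "label"

def stream_labels(frame_source):
    """Transform an incoming stream of frames into a stream of labels on the fly."""
    buffer = deque(maxlen=WINDOW)
    for i, frame in enumerate(frame_source, start=1):
        buffer.append(frame)
        # Once the buffer holds a full window, emit a prediction every STRIDE frames.
        if len(buffer) == WINDOW and i % STRIDE == 0:
            yield predict(list(buffer))

# Two seconds of video at 16 fps: after a 16-frame warm-up,
# predictions arrive 4 times per second.
labels = list(stream_labels(range(2 * FRAME_RATE)))
print(len(labels))  # 5: one at frame 16, then every 4 frames up to 32
```

In steady state this loop produces exactly FRAME_RATE / STRIDE = 4 predictions per second, matching the behaviour described above.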

Additionally, model weights pre-trained on several million labelled videos across several thousand classes are available on our SDK page. You can explore two example applications of these models in our repository.

That’s not all: we also provide an easy-to-use script for training your own custom classifier by fine-tuning the final layers of our model on your dataset. Simply organise your dataset as specified and launch the script. On some transfer learning tasks, the model has proven powerful in a few-shot setting with only 2–3 samples per class!
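To give an intuition for why so few samples can suffice: once the backbone is frozen, only a small head needs to be fit on top of its features. The sketch below illustrates that idea with synthetic embeddings and a nearest-centroid head — the class names and feature vectors are made up, and the actual fine-tuning script in the repository trains real network layers instead:

```python
import math
import random

random.seed(0)

# Pretend these are embeddings from the frozen Sense backbone (illustrative only):
# 3 samples per class in a tiny feature space -- a few-shot setting.
def noisy(center, scale=0.1):
    return [c + random.gauss(0, scale) for c in center]

centers = {"squat": [1.0, 0.0, 0.0], "lunge": [0.0, 1.0, 0.0]}
train = [(label, noisy(center))
         for label, center in centers.items()
         for _ in range(3)]

# The simplest possible "final layer" over frozen features:
# a nearest-centroid classifier fit from the few-shot samples.
centroids = {}
for label in centers:
    feats = [f for l, f in train if l == label]
    centroids[label] = [sum(col) / len(col) for col in zip(*feats)]

def classify(features):
    return min(centroids, key=lambda l: math.dist(centroids[l], features))

print(classify(noisy(centers["lunge"])))
```

With well-separated backbone features, even three samples per class pin down the centroids accurately, which is the same intuition behind fine-tuning only the final layers.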

Both the examples we provide and custom classifiers can be easily deployed on an iOS device with a small app available through our sense-iOS repository.

Upcoming Use Cases

We plan to add more example use cases to Sense in the coming weeks and look forward to showcasing community projects.

In collaboration with our industrial partners, we have previously applied pre-trained networks like these across a range of use cases requiring real-time vision.

The network and data also power the tracking capabilities of the embodied AI fitness coach in this consumer app.

Collaborate with 20BN on Situated AI

We believe that an AI system that learns to process visual information online, in the context of interactions and conversations with humans, will play a key role in advancing AI towards visual grounding and, eventually, “reasoning”. We hope that this release will further the research community’s efforts on making AI systems more situated, and that it will drive the development of additional use cases and the adoption of situated, interactive AI.

Authors: Guillaume Berger, Antoine Mercier, Florian Letsch, Cornelius Boehm, Sunny Panchal, Nahua Kang, Mark Todorovich, Ingo Bax, Roland Memisevic

Image credit: An (Visual Artist @ 20BN)

twentybn

We teach machines to perceive the world like humans.
