VIOLIN: Do you really understand videos?

A new task and dataset that requires multi-modal understanding of videos and text

Rohit Pillai
The Startup
4 min read · May 3, 2020


Our media consumption has shifted dramatically toward visual content over the past decade. Between 2013 and 2018, video consumption grew by a stunning 32% every year, and we’re showing no signs of slowing down. Social media platforms like TikTok, Pinterest and Instagram are filled almost exclusively with images and videos, and businesses are increasingly switching to visual analytics platforms like Power BI, Tableau and SAS Visual Analytics. With so much of our information coming from visual cues, the natural next step for machine learning researchers is to build models that can understand and analyze that content.

The field of computer vision has focused on understanding images since well before the advent of deep learning, and it has gotten very good at it. Today’s computer vision models can detect and classify objects in images (even when there are many of them) and retrieve images similar to an input image (think Google image search). They can even combine their knowledge of images with other modalities like text. One example is the Visual Entailment task: given an image as the premise and a natural language hypothesis, a model must judge whether the hypothesis can be confirmed from the image. Similarly, Visual QA and Visual Dialog require models to answer multiple complex questions about an image, and Image Captioning is another multi-modal task where models must generate a caption for a provided image.

While images have been explored extensively in both unimodal and multi-modal settings, the same cannot be said for videos. There are several challenging unimodal video tasks, such as action recognition, object detection and video classification, but only a handful that combine video with other modalities. Examples of multi-modal tasks include Video QA (answering questions about a video), Video Captioning (generating a caption for a video) and Video Reasoning (identifying events in a video, their causes and the events likely to follow).

To further multi-modal understanding of videos and text, the researchers at Microsoft Dynamics 365 AI Research proposed a new task, Video-and-Language Inference. It is essentially Visual Entailment, but for videos: given a video clip with aligned subtitles as the premise, paired with a natural language hypothesis about the video’s content, a model needs to infer whether the hypothesis is entailed or contradicted by the clip.
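To make the setup concrete, here is a minimal sketch of what a single example and the binary decision look like. The field names, clip file name, and subtitle lines are purely illustrative and are not VIOLIN’s actual file format.

```python
# A minimal, purely illustrative Video-and-Language Inference example.
# These field names are NOT VIOLIN's actual schema.
example = {
    "video_clip": "tv_show_clip_0042.mp4",  # hypothetical clip file
    "subtitles": [
        "00:00:01 --> 00:00:04  I told you I'd handle it.",
        "00:00:05 --> 00:00:08  Then why is everyone still waiting?",
    ],
    "hypothesis": "The first speaker has not finished the task yet.",
    "label": 1,  # 1 = hypothesis entailed by the clip, 0 = contradicted
}

def predict(model, example, threshold=0.5):
    """Binary decision: does the clip (with subtitles) entail the hypothesis?"""
    prob = model(example["video_clip"], example["subtitles"], example["hypothesis"])
    return "entailed" if prob >= threshold else "contradicted"
```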

To support this task, the team also created a new dataset called VIOLIN (VIdeO-and-Language INference). The dataset comprises 95,322 video-hypothesis pairs drawn from 15,887 video clips, spanning over 582 hours of video collected from popular TV shows like Friends and How I Met Your Mother as well as from movie clips on YouTube. While datasets like TVQA also use clips from TV shows, they focus on identifying explicit information in the video. To answer all of the examples in VIOLIN, however, a model must not only identify explicit information (e.g., recognizing objects and characters in the video) but also have in-depth commonsense reasoning abilities (e.g., inferring causal relations between events in the video). Explicit information identification accounts for 54% of VIOLIN and the remaining 46% requires commonsense reasoning, making this a significantly more challenging dataset than anything currently available.

Some examples of different types of reasoning from the dataset

The team also built a baseline model to benchmark against on this dataset. It combines video features, extracted with a CNN such as ResNet, with text features for both the subtitles and the hypothesis, encoded by a BERT-style encoder. The streams are merged by two fusion modules and an LSTM, and a fully-connected layer followed by a sigmoid produces the final entailment probability.

Baseline model architecture
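If it helps to picture the architecture, here is a minimal PyTorch sketch of a model in this spirit. It is not the authors’ implementation; the hidden sizes, pooling choices and fusion details are assumptions for illustration, and the inputs are assumed to be pre-extracted ResNet frame features and BERT token embeddings.

```python
import torch
import torch.nn as nn

class ViolinBaselineSketch(nn.Module):
    """Illustrative sketch in the spirit of the paper's baseline,
    not the authors' exact implementation."""

    def __init__(self, video_dim=2048, text_dim=768, hidden=512):
        super().__init__()
        # Project pre-extracted features into a shared space.
        self.video_proj = nn.Linear(video_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        # Two fusion steps: hypothesis-with-video and hypothesis-with-subtitles.
        self.fuse_video = nn.Linear(2 * hidden, hidden)
        self.fuse_subs = nn.Linear(2 * hidden, hidden)
        # LSTM over the fused sequence, then FC + sigmoid for the prediction.
        self.lstm = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 1)

    def forward(self, video_feats, sub_feats, hyp_feats):
        # video_feats: (B, Tv, video_dim); sub_feats, hyp_feats: (B, Tt, text_dim)
        v = self.video_proj(video_feats)
        s = self.text_proj(sub_feats)
        h = self.text_proj(hyp_feats)

        # Naive fusion: pair each hypothesis token with pooled video/subtitle context.
        v_ctx = v.mean(dim=1, keepdim=True).expand_as(h)
        s_ctx = s.mean(dim=1, keepdim=True).expand_as(h)
        fused_v = torch.relu(self.fuse_video(torch.cat([h, v_ctx], dim=-1)))
        fused_s = torch.relu(self.fuse_subs(torch.cat([h, s_ctx], dim=-1)))

        out, _ = self.lstm(torch.cat([fused_v, fused_s], dim=-1))
        pooled = out.max(dim=1).values
        return torch.sigmoid(self.classifier(pooled)).squeeze(-1)  # P(entailed)
```

The point is just the overall flow the paragraph describes: encode the video and text, fuse them, aggregate with an LSTM, and score the hypothesis with a single sigmoid output.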

So what are you waiting for? Go on and start working on this new challenge and hopefully you’re able to beat this baseline!

Here’s a link to our paper if you want to know more about VIOLIN’s composition, how it was built, or the proposed baseline model, and click here to see more of our publications and other work.

References

  1. Jingzhou Liu, Wenhu Chen, Yu Cheng, Zhe Gan, Licheng Yu, Yiming Yang, Jingjing Liu. VIOLIN: A Large-Scale Dataset for Video-and-Language Inference. The Conference on Computer Vision and Pattern Recognition (CVPR), 2020.


I’m an engineer at Microsoft Dynamics 365 AI Research, and I’ll be posting about our new NLP, CV and multimodal research. Check out https://medium.com/@rohit.rameshp