MultiView Transformers 📺 — Part II

Google Research 🚀 | CVPR’22 🏆 | State-Of-the-Art 🔥

Momal Ijaz
AIGuys
4 min read · May 21, 2022


This article is Part 2 of the sixth paper summary in the “Transformers in Vision” series, which covers recent papers on transformers in vision, submitted to top conferences in the 2020–2022 range.

✅ Background

This article is Part 2 of the MultiView Transformer summary. Part 1 gave a comprehensive overview of the MultiView Transformer (MVT) architecture, the current state-of-the-art deep learning architecture for video classification. MVT is a convolution-free, purely transformer-based neural network: it processes multiple views of a video (tubelets of varying frame length) with separate transformer encoders and then combines them with a single global encoder to predict the final class of the video.
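To make the multi-view tokenization concrete, here is a minimal PyTorch sketch (not the paper's code; the class name, embedding dimension, and tubelet sizes are illustrative assumptions) of how non-overlapping tubelets of different temporal extents can be projected into token sequences with a 3D convolution, the same trick ViViT-style models use. Note how the coarser view naturally yields far fewer tokens.

```python
# Minimal sketch of multi-view tubelet tokenization (illustrative, not the official MVT code).
import torch
import torch.nn as nn

class TubeletTokenizer(nn.Module):
    """Projects non-overlapping tubelets of size (t, h, w) into tokens using a
    3D convolution whose kernel and stride both equal the tubelet size."""
    def __init__(self, tubelet=(2, 16, 16), embed_dim=768, in_channels=3):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=tubelet, stride=tubelet)

    def forward(self, video):                      # video: (B, C, T, H, W)
        tokens = self.proj(video)                  # (B, D, T', H', W')
        return tokens.flatten(2).transpose(1, 2)   # (B, num_tokens, D)

video = torch.randn(1, 3, 32, 224, 224)               # one 32-frame clip
fine_view   = TubeletTokenizer(tubelet=(2, 16, 16))   # small tubelets -> many tokens
coarse_view = TubeletTokenizer(tubelet=(8, 16, 16))   # large tubelets -> few tokens
print(fine_view(video).shape)    # torch.Size([1, 3136, 768])  (16 * 14 * 14 tokens)
print(coarse_view(video).shape)  # torch.Size([1, 784, 768])   ( 4 * 14 * 14 tokens)
```

Each token sequence would then be fed to its own encoder before cross-view fusion and the global encoder described in Part 1.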

MultiView Transformer Architecture (MVT)

Experiments 🧪 and Results 🏆

The authors tested the MultiView Transformer architecture on five benchmark video classification datasets:

  1. Kinetics: A high-resolution human action video classification dataset with three variants, Kinetics-400/600/700, where 400, 600, and 700 denote the number of classes in each variant.
  2. Moments in Time: 800K labeled 3-second videos involving humans, animals, objects, etc.
  3. Epic-Kitchens 100: 90K video clips recorded in kitchens by a camera mounted on the cook's head.
  4. Something-Something V2: 220K+ video clips of humans interacting with objects.

MultiView Transformer was evaluated on all of the above datasets, and it outperformed all previous state-of-the-art methods by a good margin.

Performance of MVT on five benchmark datasets. The bold results show that it outperforms all convolutional and transformer-based vision models.

Comparing MVT with SlowFast, the previous “multi-view” convolutional video classification architecture, shows that MVT clearly outperforms it by a good margin on all datasets. Other transformer-based video classification models, such as ViViT (the Video Vision Transformer from Google), MViT, and TimeSformer (a video classification network from Facebook AI Research), also clearly outperform the convolutional SlowFast network but still fall short of MVT.

This astounding performance makes the MultiView Transformer stand out: it holds the status of state of the art in video classification on Papers With Code as of today, May 21st, 2022.

Ablation Studies 🧐

Some of the ablation studies that the authors performed on MVT using the Kinetics-400 dataset are as follows:

  1. Model-view assignment: The authors used two types of tubelets, smaller and larger. The smaller tubelets were of size 16x16x2 (corresponding to a larger overall number of tokens) and the larger tubelets were 16x16x8 (corresponding to a smaller overall number of tokens). They found that using larger models for the smaller views and smaller models for the larger views gave better results, since smaller views carry more detail while larger views capture the gist of the video, making classification easier.
  2. Cross-view fusion method: In MVT, the authors use multiple encoders, each processing a different view of the video sample. To fuse information between these per-view encoders, they explore three different fusion techniques, and attention-based cross-view fusion performs the best (a sketch of this kind of fusion follows this list).
  3. Number of views: Increasing the number of views from 2 to 3 on the Kinetics-400 dataset gave a 0.3% improvement in results.
  4. Location of fusion: The authors also observed that the best stage to perform fusion between the different encoders in MVT is the late stage, followed by the middle and then the early stage.
  5. Comparison with SlowFast: SlowFast is a well-known convolutional video classification architecture that processes a video at two different frame rates, slow and fast (taking every nth frame). The authors replaced the SlowFast convolutional backbones with transformer encoders to see whether this would beat MVT at video classification on Kinetics-400, but it did not, which shows that MVT's outstanding performance cannot be credited entirely to transformers: MVT's smart design choices also play a crucial part in its success.
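As mentioned in the cross-view fusion item above, here is a rough sketch of what attention-based cross-view fusion can look like (assuming PyTorch; the module name, dimensions, and normalization/residual details are assumptions, not the paper's exact implementation): tokens from one view act as queries and attend over the tokens of another view, so information flows between the per-view encoders.

```python
# Rough sketch of attention-based cross-view fusion (illustrative assumptions, not the official MVT code).
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Lets tokens of one view (queries) attend to the tokens of another view
    (keys/values), fusing information across the per-view encoders."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.norm_q  = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, view_a, view_b):        # (B, Na, D), (B, Nb, D)
        q  = self.norm_q(view_a)
        kv = self.norm_kv(view_b)
        fused, _ = self.attn(q, kv, kv)       # view_a queries view_b
        return view_a + fused                 # residual keeps the query view's tokens

coarse = torch.randn(1, 784, 768)    # tokens from the larger-tubelet view
fine   = torch.randn(1, 3136, 768)   # tokens from the smaller-tubelet view
print(CrossViewAttention()(coarse, fine).shape)   # torch.Size([1, 784, 768])
```

Because the fused output keeps the query view's sequence length, a block like this can be inserted between encoder layers of neighbouring views; the fusion-location ablation above suggests that doing so at later layers works best.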

Conclusion

MultiView Transformer is a convolution-free, transformer-based video recognition architecture developed by Shen Yan and his team while he was an intern at Google. The architecture went on to become the state of the art for video classification and presents some very smart and elegant ideas, such as:

a. Multi-view processing for better semantic understanding and feature learning of videos.

b. Cross-view attention fusion technique, which can be used whenever one wants to fuse information from two different streams of data.

c. An elegant combination of successful ideas from previous works, such as tubelets from ViViT, cross-attention fusion from CrossViT, and divided space-time attention from TimeSformer, into one strong video recognition architecture.

Still, there is considerable room for improvement. One could explore using the learned features for other vision tasks, such as segmentation and detection, with appropriate heads. Also, MVT gives better results after being pre-trained on large video or image datasets; removing this dependency so that the transformer-based model can converge on smaller video classification datasets is another important area of future research.

Happy Learning! ❤️
