MultiView Transformers📺 — Part I

Google Research 🚀 | CVPR 2022 🏆 | State-of-the-Art 🔥

Momal Ijaz
AIGuys
5 min read · May 1, 2022


This article is the sixth in the “Transformers in Vision” series, which summarizes recent papers, submitted to top conferences between 2020 and 2022, that focus on transformers in vision.

✅ Background

A video comprises multiple actions, which can occur over a short span of time or over longer durations. To develop a strong semantic understanding of a video, whether for classification, detection, or segmentation, we need a deep learning architecture that can comprehend both the shorter and the longer types of actions.

Previous attempts at understanding objects of varying resolution in an image involved building pyramidal neural network architectures. In videos, approaches for capturing and understanding actions occurring at varying spatio-temporal resolutions have included different input sampling rates and two-stream designs (a fast, densely sampled stream and a slow, sparsely sampled stream). These approaches relied on convolutional architectures.

For transformer-based image classification, fusing information from image crops of varying resolutions has already given promising results (CrossViT [1]). However, fusing information from multiple temporal resolutions in a transformer-based video classification architecture had not been tested so far, and that’s exactly what this paper does!

The authors combined the good practices for video classification from ViT (Vision Transformer), ViViT (Video Vision Transformer), and CrossViT [1], and achieved state-of-the-art performance in video classification on 5 benchmark datasets.

MultiView Transformer 📺:

This paper is very clearly written and presents a beautiful technique for fusing information across varying temporal resolutions in a video to develop a better semantic understanding of its contents. Let’s break the architecture down to understand the secret sauce behind the magic!

MultiView Transformer 📺 for Video Classification

1. MultiView Tokenization 🔣:

First of all, the input videos need to be tokenized before they can be passed to a transformer architecture. Just like in ViViT, the authors tokenize the input videos by dividing them into “tubelets”. To capture actions happening at varying temporal resolutions, the authors create different views of the video, using a different tube size to tokenize the video in each view. There are two types of views:

Larger Views: The views in which the tube size is larger, and hence the overall number of tokens extracted is smaller.

Smaller Views: The views in which tube size is smaller, and hence the overall number of tokens extracted is larger.

Larger view vs. smaller view. Each slab is a frame and each tube is a token; a larger view has fewer tokens and a smaller view has more.

For extracting tokens for a view, we pass the video through a 3D convolutional layer with a kernel size equal to the tube size of the view. This can be seen as the equivalent of dividing the video into tubes and passing each tube through a linear layer for embedding.
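
To make this concrete, here is a minimal PyTorch sketch of tubelet tokenization; the tube sizes and embedding dimensions below are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class TubeletTokenizer(nn.Module):
    """Tokenizes a video into one 'view' by cutting it into non-overlapping tubelets.

    A 3D convolution whose kernel size and stride both equal the tube size is
    equivalent to splitting the video into tubes and passing each tube through
    a linear embedding layer.
    """
    def __init__(self, tube_size=(8, 16, 16), in_channels=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=tube_size, stride=tube_size)

    def forward(self, video):
        # video: (batch, channels, time, height, width)
        x = self.proj(video)                  # (B, embed_dim, T', H', W')
        return x.flatten(2).transpose(1, 2)   # (B, num_tokens, embed_dim)

# A larger view (bigger tubes) yields fewer tokens than a smaller view (smaller tubes).
video = torch.randn(1, 3, 32, 224, 224)
larger_view = TubeletTokenizer(tube_size=(16, 16, 16), embed_dim=768)(video)
smaller_view = TubeletTokenizer(tube_size=(4, 16, 16), embed_dim=384)(video)
print(larger_view.shape)   # torch.Size([1, 392, 768])  -- fewer tokens
print(smaller_view.shape)  # torch.Size([1, 1568, 384]) -- more tokens
```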

2. MultiView Transformer 🚀:

The MultiView Transformer (MVT) comprises n encoders, where n is the number of views, and each encoder is responsible for processing the tokens of one view. The architecture of each encoder is kept very close to the original Transformer’s [2] encoder block, and each encoder has L layers.

Spatial Attention: Within each layer, the authors compute attention only between tokens from the same temporal index, which saves a lot of computation and makes the model efficient. Correlations between tokens from different temporal indices are computed by a main global encoder, which I’ll discuss in an upcoming section.

Spatial attention computation in the MultiView Transformer 🚀. Frames of the same color share the same temporal index; the attention encoding of the red token is computed w.r.t. the other tubes in the light-blue frames.
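
Below is a minimal sketch of this restricted attention pattern. It assumes the view’s tokens are already grouped by temporal index, and the shapes and layer sizes are illustrative rather than the paper’s exact settings.

```python
import torch
import torch.nn as nn

def spatial_self_attention(tokens, mha):
    """Self-attention restricted to tokens that share the same temporal index.

    tokens: (batch, time, space, dim) -- one view's tokens, grouped by temporal index.
    mha:    an nn.MultiheadAttention module created with batch_first=True.

    Folding the time axis into the batch axis means every token only attends to
    tokens from its own temporal index, which is far cheaper than full
    spatio-temporal attention over time * space tokens.
    """
    b, t, s, d = tokens.shape
    x = tokens.reshape(b * t, s, d)   # one attention "sequence" per temporal index
    out, _ = mha(x, x, x)             # standard multi-head self-attention
    return out.reshape(b, t, s, d)

# Illustrative usage: 4 temporal indices with 196 spatial tokens each.
mha = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
tokens = torch.randn(2, 4, 196, 768)
encoded = spatial_self_attention(tokens, mha)
```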

CrossView Fusion 💣:

There is a lateral connection between consecutive view encoders that fuses information from the previous view into the tokens of the current one; in particular, larger views attend to smaller views. This sets up a chain of information flow from smaller views to larger views, which lets the model attend first to very short actions, then to progressively longer ones, up until the largest view, which has information encoded from all temporal resolutions.

CrossView attention in the MultiView Transformer 🚀. View i+1 is the smaller view, which is attended to by the larger view i.

To fuse information from a smaller view into a larger view, the smaller view’s tokens, after the self-attention encoding layer, are passed through a linear layer that projects them up to a dimension matching the token dimension of the next larger view. These up-projected tokens are passed to the next encoder’s cross-view attention (CVA) layer. In CVA, the up-projected tokens from the smaller view act as keys and values, whereas the tokens of the larger view act as queries.

First, the tokens in each view’s encoder are self-encoded (by passing them through the multi-headed self-attention layer). These self-encoded tokens are then passed to the CVA layer of the next larger view, to allow the larger view tokens to fuse information from the smaller view.
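
Here is a minimal PyTorch sketch of this fusion step, assuming the two views’ tokens are available as (batch, tokens, dim) tensors; the residual connection and all dimensions are illustrative assumptions, not taken verbatim from the paper.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Sketch of cross-view fusion: a larger view attends to a smaller view.

    The smaller view's tokens are first projected up to the larger view's
    dimension, then used as keys and values, while the larger view's own
    tokens act as queries.
    """
    def __init__(self, small_dim, large_dim, num_heads=8):
        super().__init__()
        self.up_proj = nn.Linear(small_dim, large_dim)  # match the larger view's token dim
        self.cva = nn.MultiheadAttention(large_dim, num_heads, batch_first=True)

    def forward(self, large_tokens, small_tokens):
        kv = self.up_proj(small_tokens)                        # (B, N_small, large_dim)
        fused, _ = self.cva(query=large_tokens, key=kv, value=kv)
        return large_tokens + fused                            # residual connection (assumed)

# Illustrative usage with assumed shapes: the larger view has fewer, wider tokens.
large = torch.randn(1, 392, 768)
small = torch.randn(1, 1568, 384)
out = CrossViewAttention(small_dim=384, large_dim=768)(large, small)
```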

3. Fusion Locations:

Rather than including a CVA layer in every encoder block of a given view encoder, the authors treated the fusion location as a design choice. Their ablation studies showed that fusing information from smaller views into a larger view encoder gave the best results when the CVA layers were placed in the late blocks, followed by the mid blocks, and the worst results when they were placed in the early blocks. So they put CVA layers in the mid and late encoder blocks.

Fusion location example for a single view encoder. The late blocks are the best fusion location, followed by mid, then early.
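
As a simple illustration of that design choice, the sketch below marks which blocks of a hypothetical 12-layer view encoder would get a CVA layer; the block count and indices are assumptions for illustration only.

```python
# Hypothetical 12-block view encoder: CVA layers only in the mid and late blocks,
# since the ablations above found early-block fusion to work worst.
num_blocks = 12
cva_block_indices = {5, 11}          # assumed "mid" and "late" positions (0-based)

blocks = []
for idx in range(num_blocks):
    blocks.append({
        "self_attention": True,                         # every block has spatial self-attention
        "cross_view_attention": idx in cva_block_indices,
    })
```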

4. Global Encoder:

Finally, to compute correlations among all views rather than only between consecutive ones, the authors extract a CLS token from each view. This token represents the information encoded from the smallest view up to that view. These CLS tokens are then passed to a simple, standard global encoder. The output of this encoder is another CLS token, which is passed to a multi-layer perceptron that predicts the final class of the video!

CLS tokens from each view Encoder go into the main global encoder, to get the final class of the video!
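
A minimal sketch of this final stage is shown below, assuming the per-view CLS tokens have already been projected to a common dimension; the learnable global CLS token and all sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GlobalEncoder(nn.Module):
    """Sketch of the global encoder: aggregate one CLS token per view, then classify.

    A learnable global CLS token is prepended, a standard transformer encoder
    mixes information across the per-view CLS tokens, and an MLP head predicts
    the video class. All sizes are illustrative assumptions.
    """
    def __init__(self, dim=768, num_heads=8, num_layers=2, num_classes=400):
        super().__init__()
        self.global_cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))

    def forward(self, view_cls_tokens):
        # view_cls_tokens: (batch, num_views, dim) -- one CLS token per view encoder
        b = view_cls_tokens.size(0)
        x = torch.cat([self.global_cls.expand(b, -1, -1), view_cls_tokens], dim=1)
        x = self.encoder(x)
        return self.head(x[:, 0])   # classify from the output global CLS token

# Illustrative usage: 3 views, each contributing a 768-d CLS token.
logits = GlobalEncoder()(torch.randn(2, 3, 768))  # (2, 400) class logits
```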

Voila!

Be proud: you just learned how a super novel (2022), state-of-the-art video classification network 🔥, invented by Google, works. Go brag about it 😁 by sharing what you learned on LinkedIn: your thoughts on why the architecture works well, how you would have done things differently, and what not!

We will dive into the experiments and results of the network in Part II of this article.

Happy Learning! ❤️

References 📗:

[1] Chen, Chun-Fu, et al. “CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification.” 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021): 347–356.

[2] Vaswani, Ashish et al. “Attention is All you Need.” NIPS (2017).
