ViViT 📹 Video Vision Transformer

ICCV 2021 ✨, Google Research

Momal Ijaz
AIGuys
8 min read · Mar 4, 2022

--

This article covers the fourth paper of the “Transformers in Vision” series, which summarizes recent papers (submitted between 2020 and 2022 to top conferences) that focus on transformers in vision.

*NerdFacts-🤓 contain additional intricate details; you can skip them and still get the high-level flow of the paper!

✅ Background

After the booming entry of vision transformers, aka ViT, into the world of computer vision, researchers started applying them to other computer vision tasks. ViT was originally tested on image classification, and later works created ViT-inspired networks for other image-based tasks: DeiT was evaluated on image classification, Swin was tested on image classification, object detection, and semantic segmentation, and so on. You might want to know what ViT is before going further.

Along these lines, parallel research was active on performing similar tasks on videos instead of images. But… Sam 👦🏻 might ask: why do I need a separate algorithm for videos?

You, explaining to Sam why we need different models for videos

Well, you can tell Sam👦🏻: Hey! Performing object detection, classification, and segmentation on videos is different from just applying the same image algorithm to every frame, because that approach ignores the temporal relations between frames, which are a very important factor for developing a strong understanding of the content.

One of the main differences between image and video classification is the addition of a fourth dimension, time, to the input. Today’s paper is from Google’s research lab, and in it the authors explore four different model variants for performing video classification with a pure transformer-based 🚀 architecture on several video classification benchmark datasets.

ViViT 📹

Let’s dive into ViViT, aka the video vision transformer. In ViViT, the authors wanted to extend ViT to videos so that it captures the temporal correlation between frames as effectively as possible, yielding a pure transformer-based 🚀 model that performs better than other networks for video classification on benchmark datasets!

1. Embedding Video Clips 🎞️

For classifying video samples, we need to pass them through our model. But how exactly do we do that? In ViT, the authors had to perform a similar classification task, but on images. So what did they do? They cut an image into small patches of the same size and treated each patch as a “token”. Each token is flattened, passed through a linear layer, added to its respective position embedding, and fed to the model.

How we pass images through a ViT model
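If you like to think in code, here is a minimal sketch (PyTorch, not the authors’ code) of that patchify-plus-project step; the image size, patch size, and embedding dimension are just illustrative assumptions.

```python
# A minimal sketch of ViT-style patch embedding, assuming a 224x224 RGB image
# and 16x16 patches. Not the authors' code.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided Conv2d is equivalent to "cut into patches + flatten + linear layer".
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                                    # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)     # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)      # prepend learnable CLS token
        return torch.cat([cls, tokens], dim=1) + self.pos_embed
```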

But a video is not just one image: each sample is made of multiple frames. The authors of ViViT propose two methods for embedding video samples before passing them through the model.

1.1 Uniform Frames Sampling — (like ViT)

In this embedding approach, we apply ViT-like patch creation to each frame of the video sample, so each token is a patch extracted from a single frame. This approach encodes the spatial and temporal location of each patch poorly: even if we add positional embeddings to the tokens, the tokens themselves carry no information about the exact frame and time index at which that patch appears in the video sample.

ViViT 📹 Uniform Frame Sampling
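A hedged sketch of what uniform frame sampling could look like, reusing a 2D patch-projection layer like the one above; the shapes and the helper name `uniform_frame_tokens` are illustrative assumptions, not the paper’s code.

```python
# Uniform frame sampling (sketch): every sampled frame is patchified independently
# and the tokens of all frames are simply concatenated, so time is only represented
# implicitly through the 1D position embeddings added later.
import torch
import torch.nn as nn

def uniform_frame_tokens(video, proj):
    # video: (B, T, 3, H, W); proj: a Conv2d(3, dim, patch_size, stride=patch_size)
    B, T, C, H, W = video.shape
    frames = video.reshape(B * T, C, H, W)
    tokens = proj(frames).flatten(2).transpose(1, 2)    # (B*T, N, dim)
    return tokens.reshape(B, T * tokens.shape[1], -1)   # (B, T*N, dim)
```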

1.2 Tubelet Embedding 🥖

Instead of extracting patches from each frame to make tokens, the authors suggest a new type of token that captures the time dimension as well. Yes! Instead of extracting a flat patch from a single frame of the video sample, we extract a series of patches spanning several frames, aka a tube.

ViViT 📹 Tubelet Embedding

We can see that in tubelet embedding, each token is a tubelet, and each token captures how its patch changes over time in the video sample. This beautiful temporal change track was missing from the simple patch-based token.
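Conceptually, tubelet embedding is just a 3D convolution whose kernel and stride equal the tubelet size, so each output activation corresponds to one non-overlapping spatio-temporal tube. A minimal sketch, with illustrative sizes (2 frames × 16 × 16 pixels per tubelet):

```python
# Tubelet embedding (sketch): a Conv3d extracts spatio-temporal tubes and linearly
# projects each one to a single token. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TubeletEmbed(nn.Module):
    def __init__(self, dim=768, t=2, p=16):
        super().__init__()
        self.proj = nn.Conv3d(3, dim, kernel_size=(t, p, p), stride=(t, p, p))

    def forward(self, video):                  # video: (B, 3, T, H, W)
        x = self.proj(video)                   # (B, dim, T/t, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # one token per tubelet: (B, N, dim)
```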

2. Video Vision Transformers 📹 ✨

In this paper, the authors propose four different variants of pure transformer-based video classification models inspired by ViT. Let’s look at each model quickly:

Model-1: Spatio Temporal attention 🔍

ViViT📹 Series MODEL-1

The first model is pretty straightforward: we tokenize a video sample using the tubelet embedding approach and treat each tubelet as a token. Each token is passed through a patch embedding layer, a position encoding is added to it, and all tokens go through a standard transformer encoder. In addition to these tubelet tokens, we also pass an additional learnable parameter, the CLS token, through the transformer. The encoded CLS token output by the encoder is passed through an MLP (a simple feed-forward network), and a softmax activation gives us a probability distribution over the target labels for the video.

P.S. Remember the red-bordered multi-headed self-attention layer in the transformer block; it is what gets altered in models 3 and 4.
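Here is a rough sketch of Model 1, assuming the tubelet tokens are already computed; PyTorch’s stock `nn.TransformerEncoder` stands in for the ViT-style encoder, and the hyperparameters (depth, heads, token count) are illustrative, not the paper’s exact configuration.

```python
# Model 1 (sketch): joint spatio-temporal attention over all tubelet tokens plus a CLS token.
import torch
import torch.nn as nn

class ViViTModel1(nn.Module):
    def __init__(self, dim=768, depth=12, heads=12, num_classes=400, num_tokens=1568):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_tokens + 1, dim))   # fixed token count assumed
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                        # tokens: (B, N, dim) tubelet tokens
        cls = self.cls.expand(tokens.shape[0], -1, -1)
        x = torch.cat([cls, tokens], dim=1) + self.pos
        x = self.encoder(x)                           # attention among ALL tokens at once
        return self.head(x[:, 0])                     # classify from the encoded CLS token
```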

Model-2: Factorized Encoder-Decoder 💔

ViViT📹 Series MODEL-2

The second model is called the factorized encoder model because we no longer use a single encoder over all tokens of the video sample. Instead, we first split the video into small clips.

(The left part of the figure) Each clip is passed through the spatial transformer, and we get one encoding vector per clip. These encoding vectors, one per clip, along with a temporal CLS token, are added to their respective position embeddings and passed through a temporal transformer encoder, which is quite similar to the standard transformer encoder. The output of the temporal transformer is its encoded temporal CLS token, which is passed through an MLP head to produce a classification label for the video. For the temporal transformer, each token is a vector extracted from one clip, so every token comes from a different temporal index in the video.

(The right part of the figure) But how does the spatial transformer treat each clip? Each clip is divided into tubelets, and each tubelet is treated as a token. We pass these tubelet tokens, plus a spatial CLS token, through the spatial transformer, which returns an encoded feature vector for each token along with the encoded CLS token. The paper mentions two possible outputs: perform global average pooling (GAP) over all encoded tokens and return that, OR just return the encoded CLS token of the clip. For the spatial transformer, each token is a tubelet extracted from one clip, so all tokens come from the same temporal index but different spatial indexes.
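A hedged sketch of the factorized encoder, assuming the tubelet tokens arrive grouped by temporal index as a `(B, T, N, dim)` tensor; the encoder depths and the CLS-based readout are illustrative choices (the paper also mentions the GAP alternative).

```python
# Model 2 (sketch): a spatial encoder produces one vector per temporal index (clip),
# then a temporal encoder fuses those vectors and classifies from its CLS token.
import torch
import torch.nn as nn

class ViViTFactorisedEncoder(nn.Module):
    def __init__(self, dim=768, num_classes=400):
        super().__init__()
        mk = lambda depth: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 12, dim * 4, batch_first=True), depth)
        self.spatial, self.temporal = mk(12), mk(4)          # depths are illustrative
        self.spatial_cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.temporal_cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):               # tokens: (B, T, N, dim), grouped by temporal index
        B, T, N, D = tokens.shape
        x = tokens.reshape(B * T, N, D)
        cls = self.spatial_cls.expand(B * T, -1, -1)
        x = self.spatial(torch.cat([cls, x], dim=1))[:, 0]   # one CLS vector per clip
        x = x.reshape(B, T, D)
        cls_t = self.temporal_cls.expand(B, -1, -1)
        x = self.temporal(torch.cat([cls_t, x], dim=1))[:, 0]
        return self.head(x)
```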

Model-3: Factorized Self attention 🌜🌛

ViViT📹 Series MODEL-3

The third model is identical to the first one, except that the transformer encoder block it uses is not the standard block from the original transformer. This new block is again very similar to the standard one; the only difference is the multi-headed self-attention. The MSA layer is factorized, or broken into two parts, so unlike model 1, attention is not computed among all tokens at once.

The first (spatial) self-attention layer computes attention among all tokens extracted from the same temporal index (as in, among all tokens from the same clip), and then the temporal self-attention layer computes attention among tokens that share a spatial index but come from different temporal indexes.
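A sketch of one factorized self-attention block, assuming the tokens keep a `(B, T, N, dim)` layout so we can reshape them for the two attention passes; layer norms are included, but the MLP sub-layer of a full transformer block is omitted for brevity.

```python
# Model 3 block (sketch): spatial MSA within each temporal index, then temporal MSA
# across temporal indexes at a fixed spatial index.
import torch
import torch.nn as nn

class FactorisedSelfAttentionBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                        # x: (B, T, N, dim)
        B, T, N, D = x.shape
        s = self.norm1(x).reshape(B * T, N, D)   # group tokens sharing a temporal index
        x = x + self.spatial_attn(s, s, s)[0].reshape(B, T, N, D)
        t = self.norm2(x).permute(0, 2, 1, 3).reshape(B * N, T, D)   # same spatial index over time
        t = self.temporal_attn(t, t, t)[0].reshape(B, N, T, D).permute(0, 2, 1, 3)
        return x + t                             # an MLP sub-layer would normally follow
```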

Model-4: Factorized dot product attention 🍒

MSA Layer of Model-4 of ViViT📹

Model 4 is exactly like model 1 in terms of architecture, so I am not repeating the figure here. But here again, the only difference is the MSA layer of the transformer block. In model 4, the authors went to a more fine-grained level and factorized the dot-product attention heads between the spatial and temporal dimensions: half of the heads in the MSA layer compute dot-product self-attention among tokens from the same temporal index (the spatial heads), and the remaining heads compute it among tokens sharing a spatial index across different temporal indexes (the temporal heads).

This model is different from model 3 because there the authors computed spatial and then temporal attention using all heads in two separate MSA layers, whereas in model 4 they use different heads within the same MSA layer to compute spatial and temporal self-attention.
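And a sketch of that factorized dot-product attention layer, splitting the heads half-and-half between spatial and temporal attention; it assumes PyTorch 2.x for `F.scaled_dot_product_attention`, and the even head split plus the single output projection are simplifications of my own.

```python
# Model 4 MSA (sketch): spatial heads attend over the N tokens of one temporal index,
# temporal heads attend over the T tokens of one spatial index, inside a single layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorisedDotProductAttention(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.h, self.dh = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                                    # x: (B, T, N, dim)
        B, T, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.reshape(B, T, N, self.h, self.dh)
        q, k, v = map(split, (q, k, v))
        hs = self.h // 2                                     # first half: spatial heads
        # Spatial heads: attention over the N tokens sharing a temporal index.
        qs, ks, vs = (t[..., :hs, :].permute(0, 1, 3, 2, 4) for t in (q, k, v))  # (B,T,hs,N,dh)
        spatial = F.scaled_dot_product_attention(qs, ks, vs).permute(0, 1, 3, 2, 4)
        # Temporal heads: attention over the T tokens sharing a spatial index.
        qt, kt, vt = (t[..., hs:, :].permute(0, 2, 3, 1, 4) for t in (q, k, v))  # (B,N,hs,T,dh)
        temporal = F.scaled_dot_product_attention(qt, kt, vt).permute(0, 3, 1, 2, 4)
        out = torch.cat([spatial, temporal], dim=3).reshape(B, T, N, D)
        return self.out(out)
```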

Whew!… that was quite a video classification model search. So, now Sam👦🏻 wants to know which model was the best?

3. RESULTS 🥇🥈🥉

ViViT best performing models on 5 benchmark video classification datasets

The authors compared all four models with convolutional SOTA networks and other transformer-based networks in terms of FLOPs, accuracy, and the trade-off between them, and reported the best-performing of the four models on each dataset.

It can be seen that the second model, ViViT FE (Factorized Encoder), performed better than the other model variants across the 5 datasets, including Kinetics-400, Kinetics-600, Epic Kitchens, and Something-Something v2.

So among all four models, Model 2 was the best performer — 🏆MODEL 2.

On the Kinetics datasets, the authors initialized model 2’s bigger variants from weights pre-trained on the private JFT-300M image dataset. Since video classification datasets as large as ImageNet or JFT do not generally exist, pre-training directly on video was not an option. The authors observed that initializing a video classification model with image classification weights improves its accuracy.

[NerdFact-🤓: Why did Model 2 outperform all the other models? Divided space-time attention!]

All ViViT model variants try to combine the spatial and temporal tokens effectively to develop a better semantic representation of the video. Model 2 first attends to each clip individually and then attends to all clips together. This approach is called divided space-time attention, because we first attend to the spatial aspect and later to the temporal aspect of the video sample. The idea of divided space and time attention has also proven useful for video understanding at Facebook AI Research: in their TimeSformer paper, they explored several spatial and temporal attention schemes and found the divided space-time attention approach to stand out among all of them.

What will happen if we reverse the order of space and time attention computation in model 2?

How about using a 3D convolutional embedding for tubelets? The authors used it and found it better than a linear embedding layer.

How can we improve model 2?

Are CLS tokens a better representation of spatial attention than pooled spatially encoded features? I think CLS tokens.

What will happen if we convert model 2 to a hybrid architecture, using a convolutional network for spatial encoding and a transformer for temporal encoding?

Happy Learning! ❤️
