ViViT : A Video Vision Transformer

Nitin Limhan
Machine Intelligence and Deep Learning
12 min read · May 2, 2022

Blog by Nitin Limhan and Rutuja Pisal


In recent years, pure-Transformer models have advanced the state of the art on many standard datasets for sequence modelling and transduction problems. They were first introduced in the “Attention Is All You Need” paper and quickly became the preferred model in natural language processing (NLP), but their applications in the vision domain remained limited. A Google Research team then introduced the paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, which applied Transformer-based models to image classification. Inspired by this, the Vision Transformer was extended to video classification in the paper “ViViT : A Video Vision Transformer”.

This is the paper we explore in this blog, where we will see how Video Vision Transformers work.

To fully comprehend the paper, one must first understand the ideas of Self-attention and Vision Transformers, before moving on to Video Vision Transformers.

What is Self-Attention?

Self-attention, sometimes called intra-attention, is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. Self-attention has been used successfully in a variety of tasks, including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations.
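To make this concrete, below is a minimal sketch of single-head scaled dot-product self-attention in PyTorch. The function name, dimensions and toy inputs are our own illustrative choices, not code from any of the papers.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project tokens to queries, keys, values
    scores = q @ k.T / k.shape[-1] ** 0.5        # every position scored against every position
    weights = F.softmax(scores, dim=-1)          # attention weights over the whole sequence
    return weights @ v                           # each output is a weighted sum of all values

x = torch.randn(10, 64)                          # a toy sequence of 10 tokens
w_q, w_k, w_v = (torch.randn(64, 32) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)           # (10, 32)
```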

What is a Transformer?

The Transformer is a model architecture that foregoes recurrence in favor of drawing global dependencies between input and output entirely with an attention mechanism. It allows for substantially more parallelization and, in the original paper, reached a new state of the art in translation quality after only twelve hours of training on eight P100 GPUs.

What is ViT?

The Vision Transformer, or ViT, is a model for image classification that employs a Transformer-like architecture over patches of the image. An image is split into fixed-size patches, each of which is linearly embedded; position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder. To perform classification, the standard approach of adding an extra learnable “classification token” to the sequence is used.
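A minimal sketch of this tokenization step is shown below, assuming a PyTorch setting; the module name PatchEmbed and the default sizes are illustrative assumptions, not the official ViT code.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into patches, embed them, prepend a class token, add positions."""
    def __init__(self, img_size=224, patch_size=16, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided 2D convolution is equivalent to "split into patches + linear projection".
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, img):                                # img: (B, 3, H, W)
        x = self.proj(img).flatten(2).transpose(1, 2)      # (B, N, dim) patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)    # learnable classification token
        x = torch.cat([cls, x], dim=1)
        return x + self.pos_embed                          # add learnable position embeddings

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))         # (2, 197, 768), fed to a Transformer encoder
```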

Transformers were initially used only in NLP, but with ViT they are now challenging Convolutional Neural Networks (CNNs), which have been the backbone of vision problems for many years. The fundamental difference between the two is local vs. global context. CNNs only look for dependencies in and around a small pixel window, which focuses on the local context, whereas Transformers, because self-attention lets every token embedding attend to every other, capture the global context. Transformers can therefore understand the complete picture quickly and perform better.

In the case of videos, the implementation is a little more complicated. One needs to know what spatial and temporal dependencies are in order to grasp it better. When we segment an image and try to find out how a pixel is related to its neighbours or to other pixels in the image, these are spatial dependencies. In applications such as activity recognition we have to process multiple frames, because there are dependencies along the time axis as well; these are temporal dependencies.

ViViT : Pure-Transformer architecture for video classification

It can be observed from ViViT’s design that it uses the same Transformer encoder and the same position and token embedding mechanisms as ViT. The only difference is that more than one frame is tokenized in one go, and the Transformer is left to learn the spatial and temporal dependencies. The first thought that comes to mind is to give the entire video as input and let the Transformer learn, which is in fact the first model proposed by the authors. The issue is the number of tokens, because the attention mechanism requires computation between every pair of tokens. This is especially critical for videos, since any video task has to process far more tokens than an image task.
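To get a feel for the scale, here is a rough back-of-the-envelope count, assuming 224 x 224 frames, 16 x 16 patches and a 32-frame clip (illustrative numbers, not taken from the paper):

```python
frames, patches_per_frame = 32, (224 // 16) ** 2    # 32 frames, 14 * 14 = 196 patches each
tokens = frames * patches_per_frame                 # 6,272 tokens for the whole clip
pairwise = tokens ** 2                              # ~39 million query-key pairs per attention layer
print(tokens, pairwise)
```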

Let’s see how these problems are resolved by the authors.

Embedding Video Clips

Two simple methods were considered for mapping a video to a sequence of tokens. Positional embeddings are added to these tokens, which are then given as input to the Transformer. The two embedding methods are :

  • Uniform Frame Sampling

This is a simple and straightforward approach to tokenizing the input video clip. First, frames are sampled uniformly from the input clip. Second, each 2D frame is embedded separately using the same mechanism as ViT. Finally, all of these tokens are concatenated together. Intuitively, this process simply constructs one very large 2D image to be tokenized following ViT.

Uniform Frame Sampling

  • Tubelet Embedding

Tubelet embedding starts by extracting non-overlapping spatio-temporal tubes from the input video clip and then linearly projecting them. This method extends ViT’s embedding to 3D and corresponds to a 3D convolution.

Tubelet embedding

As can be seen from the figure, the tubelet embedding method fuses spatio-temporal information during the tokenization process. In uniform frame sampling, on the other hand, only spatial information is fused during tokenization, and temporal information from different frames is fused later by the Transformer. A small sketch of both schemes follows below.
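The sketch below assumes a PyTorch setting; the clip size, sampling stride and tubelet size are illustrative assumptions.

```python
import torch
import torch.nn as nn

video = torch.randn(2, 3, 32, 224, 224)     # (batch, channels, frames, height, width)
dim = 768

# Uniform frame sampling: embed each sampled frame independently with a 2D projection,
# then concatenate the per-frame tokens along the sequence axis.
proj2d = nn.Conv2d(3, dim, kernel_size=16, stride=16)
frames = video[:, :, ::4]                                        # keep every 4th frame
per_frame = [proj2d(frames[:, :, i]).flatten(2).transpose(1, 2)  # (2, 196, 768) per frame
             for i in range(frames.shape[2])]
uniform_tokens = torch.cat(per_frame, dim=1)                     # (2, 8 * 196, 768)

# Tubelet embedding: extract non-overlapping spatio-temporal tubes and linearly project
# them, which is exactly a strided 3D convolution.
proj3d = nn.Conv3d(3, dim, kernel_size=(2, 16, 16), stride=(2, 16, 16))
tubelet_tokens = proj3d(video).flatten(2).transpose(1, 2)        # (2, 16 * 196, 768)
```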

Models Proposed

In this paper, the authors propose four different Transformer-based models for video classification. They start with a simple extension of ViT that models pairwise interactions between all spatio-temporal tokens, and then develop more efficient variants that factorise the spatial and temporal dimensions of the input video at various levels of the Transformer architecture.

Model 1: Spatio-temporal Attention

In this model, all spatio-temporal tokens extracted from the video are simply forwarded through the Transformer encoder. In contrast to a CNN, each Transformer layer models all pairwise interactions between spatio-temporal tokens; since the first token of the first frame can interact with the last token of the last frame, this gives long-range interactions across the video from the very first layer. As a result, the model has complexity quadratic in the number of tokens, which is addressed in the models that follow.

Model 1 : Spatio-temporal Attention 📹
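A minimal sketch of Model 1, assuming a PyTorch setting (the class token, position embeddings and classification head are omitted):

```python
import torch
import torch.nn as nn

tokens = torch.randn(2, 16 * 196, 768)      # all spatio-temporal tokens of a clip, flattened
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12)
out = encoder(tokens)                       # every token attends to every other token,
                                            # so cost grows quadratically in 16 * 196 = 3136 tokens
```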

Model 2 : Factorised Encoder

In this model, just as the name suggests, the encoder is factorised: every frame is first passed through a spatial encoder on its own, and the output features from the spatial encoder are then passed through a temporal encoder. This corresponds to a late fusion of temporal information, as the spatial information is obtained first and only then fused across time. There are more parameters in this model, since there are two encoders instead of one, but fewer floating-point operations, because not everything interacts with everything else directly. The tokens still interact at long range, but only indirectly, so the complexity of this model is much lower than that of Model 1.

Model 2 : Factorised Encoder 📹
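A minimal sketch of the factorised encoder, assuming a PyTorch setting; the encoder depths and the use of mean pooling to get a per-frame representation are simplifying assumptions.

```python
import torch
import torch.nn as nn

def encoder(depth, dim=768):
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

spatial_enc, temporal_enc = encoder(12), encoder(4)

tokens = torch.randn(2, 16, 196, 768)                      # (batch, frames, spatial tokens, dim)
b, t, n, d = tokens.shape

frame_tokens = spatial_enc(tokens.reshape(b * t, n, d))    # each frame encoded independently
frame_repr = frame_tokens.mean(dim=1).reshape(b, t, d)     # one pooled vector per frame
clip_repr = temporal_enc(frame_repr).mean(dim=1)           # temporal encoder fuses frames late
```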

Model 3 : Factorised Self-attention

In this model the authors use just one encoder, but each encoder block contains two self-attention operations: a spatial self-attention block and a temporal self-attention block, as can be seen in the figure below. The model has the same number of Transformer layers as Model 1, since there is only one encoder, but its complexity matches Model 2, because not everything interacts with everything else directly. Every block first computes spatial self-attention (among all tokens extracted from the same temporal index) and then temporal self-attention (among all tokens extracted from the same spatial index). In simple words, the block takes the input tokens, reshapes them so that the tokens of each frame form one sequence, and performs spatial attention; it then reshapes them again so that tokens sharing a spatial index form one sequence, performs temporal attention, and passes the output on to the next Transformer block. The authors also experimented with the order of the two operations, spatial-first versus temporal-first, and found that it does not matter.

Model 3 : Factorised Self-attention 📹
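A minimal sketch of one factorised self-attention block, assuming a PyTorch setting; the residual connections, LayerNorm and MLP of a full Transformer block are omitted.

```python
import torch
import torch.nn as nn

dim, heads = 768, 12
spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

x = torch.randn(2, 16, 196, dim)                   # (batch, frames, spatial tokens, dim)
b, t, n, d = x.shape

# Spatial self-attention: attend among the 196 tokens of the same frame.
xs = x.reshape(b * t, n, d)
xs, _ = spatial_attn(xs, xs, xs)

# Temporal self-attention: attend among the 16 tokens that share the same spatial index.
xt = xs.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
xt, _ = temporal_attn(xt, xt, xt)

out = xt.reshape(b, n, t, d).permute(0, 2, 1, 3)   # back to (batch, frames, tokens, dim)
```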

Model 4 : Factorised Dot-product Attention

The final model proposed is factorised dot-product attention, which tries to get the best of both worlds: it has the same computational complexity as Models 2 and 3 while retaining the same number of parameters as the unfactorised Model 1. The idea is to split the attention heads into two halves and to modify the keys and values of each query so that they attend only over tokens from the same spatial or temporal index. For half of the attention heads, the model attends over tokens from the same temporal index (the spatial dimension) by computing Y_s = Attention(Q, K_s, V_s); for the other half, it attends over tokens from the same spatial index (the temporal dimension) by computing Y_t = Attention(Q, K_t, V_t). Finally, the outputs of all heads are combined by concatenating them and applying a linear projection.

Model 4 : Factorised Dot-product Attention 📹
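A minimal sketch of factorised dot-product attention, assuming a PyTorch setting. For readability each branch is written as single-head attention and the final linear projection is omitted; in the paper the heads of one multi-head layer are split between the two branches.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    w = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return w @ v

b, t, n, d = 2, 16, 196, 64
q, k, v = (torch.randn(b, t, n, d) for _ in range(3))

# Spatial half of the heads: keys/values come only from tokens with the same temporal index.
y_s = attention(q, k, v)                                   # (b, t, n, d)

# Temporal half of the heads: keys/values come only from tokens with the same spatial index.
q_t, k_t, v_t = (z.transpose(1, 2) for z in (q, k, v))     # (b, n, t, d)
y_t = attention(q_t, k_t, v_t).transpose(1, 2)             # back to (b, t, n, d)

# The outputs of all heads are concatenated and would then be linearly projected.
y = torch.cat([y_s, y_t], dim=-1)                          # (b, t, n, 2 * d)
```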

Initialisation by leveraging pretrained models

Because Transformers lack some of the inductive biases of convolutional networks, ViT has only been shown to be effective when trained on large-scale datasets. Compared with their image counterparts, even the largest video datasets, such as Kinetics, have orders of magnitude fewer labelled examples. As a result, training large models from scratch to high accuracy is extremely challenging. To sidestep this issue and enable more efficient training, the authors initialise the video models from pretrained image models.

This raises three problems, which are as follows :

1. Positional Embedding

Problem : Video models have n_t times more tokens than the pretrained image model, where n_t is the number of temporal indices.

Solution : Initialise the positional embeddings by “repeating” them temporally, so that all tokens with the same spatial index have the same embedding.
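A minimal sketch of this solution, assuming a PyTorch setting and frame-by-frame token ordering; the shapes are illustrative.

```python
import torch

pos_2d = torch.randn(1, 196, 768)       # pretrained image position embeddings (class token ignored here)
n_t = 16                                 # number of temporal indices in the video model

# Repeat along time: with tokens ordered frame by frame, every token that shares a
# spatial index ends up with the same positional embedding.
pos_3d = pos_2d.repeat(1, n_t, 1)        # (1, 16 * 196, 768)
```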

2. Transformer Weights for Model 3 : This problem arises only when Model 3 is used, since Model 3 has two self-attention blocks instead of one.

Problem : Model 3 has two MSA modules.

Solution : Initialise the spatial MSA module from the pretrained module and initialise all weights of the temporal MSA with zeroes.

3. Embedding Weights, E : The third and most important problem concerns the embedding weights. It arises because the ViT models are pretrained on images and therefore have 2D filters, whereas ViViT’s tubelet embedding requires a 3D filter.

Problem : The pretrained model has 2D filters, but tubelet embedding requires a 3D filter.

Solution : 1. Inflating Filters, or 2. Central Frame Initialisation.

So there are two solutions to the above problem. Filter inflation replicates the 2D image weights t times along the temporal dimension, where t is the temporal extent of the tubelet filter, and divides them by t; this amounts to averaging over the frames covered by a tube. The second solution, central frame initialisation, was the one adopted: it keeps the image weights at the central temporal position and initialises everything else to zero. A sketch of both initialisations follows.
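The sketch below assumes a PyTorch setting; the filter shape follows the 16 x 16 x 2 tubelets used later in the experiments.

```python
import torch

w_2d = torch.randn(768, 3, 16, 16)       # pretrained ViT embedding filters: (out, in, height, width)
t = 2                                     # temporal extent of a tubelet

# 1. Filter inflation: replicate the 2D filter t times along time and divide by t,
#    i.e. average over the frames covered by a tube.
w_inflated = w_2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t   # (768, 3, 2, 16, 16)

# 2. Central frame initialisation (the option adopted in the paper): place the image
#    weights at the central temporal position and set everything else to zero.
w_central = torch.zeros(768, 3, t, 16, 16)
w_central[:, :, t // 2] = w_2d
```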

Experimental Setup

Experiments were performed using the ViT-Base architecture, with tubelets of size 16 x 16 x 2. The models were trained using synchronous SGD and a cosine learning-rate schedule, and were initialised from image models pretrained on the ImageNet-21K and JFT datasets.

Ablation Study

1. Input Encoding Methods : Uniform Vs Tubelet Embedding

For this experiment, the ViViT-Base architecture with the spatio-temporal attention model was used on the Kinetics 400 dataset. The table below shows the results. Uniform frame sampling achieves an accuracy of 78.5%. When the different tubelet-embedding initialisations were tried, better results were obtained: tubelet embedding with the “central frame” initialisation outperformed the more commonly used “filter inflation” method, reaching an accuracy of 79.2%.

ViViT-B / Spatio-temporal Attention on Kinetics 400 dataset

2. Model Variants (1 to 4) : Comparison of Unfactorized and Factorized Models

The next experiment compares the model variants. On the Kinetics 400 dataset, the unfactorised model (Model 1) performs best, but it overfits on smaller datasets such as Epic Kitchens (EK).

The authors also investigate a second baseline (last row), based on Model 2, in which no temporal Transformer is used and the frame-level representations from the spatial encoder are simply average-pooled before classification. From the table below it can be observed that the accuracy of this average-pooling baseline drops, which means the EK dataset needs an architecture that actually models temporal relationships rather than simply averaging frames.

Comparison of model architectures using ViViT-B as the backbone

Furthermore, metrics such as floating-point operations (FLOPs), parameter count and runtime are also used to compare the model variants. On these metrics, the unfactorised model requires the most computation and hence the most FLOPs. The factorised models use fewer FLOPs, since attention is computed independently over the spatial and temporal dimensions. Model 4 adds no additional parameters over the unfactorised Model 1 and uses the least compute. The temporal Transformer encoder in Model 2 operates on only n_t tokens, which is why there is barely a change in compute and runtime over the average-pooling baseline, even though accuracy improves substantially.

Comparison to state-of-the-art

On the Kinetics 400 dataset, the various ViViT variants are compared to the state-of-the-art models. First, a brief overview of the metrics used to compare them :

· Top-1 : a prediction counts as correct if the ground-truth class is the class with the highest predicted probability.

· Top-5 : a prediction counts as correct if the ground-truth class is among the five classes with the highest predicted probabilities. (A small sketch of how these two metrics are computed follows this list.)

· Views : the term can be explained with the example of the X3D-XXL model, which uses 10 x 3 views, meaning 10 temporal crops and 3 spatial crops.

· TFLOPs : trillions (10¹²) of floating-point operations required for inference.
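Below is a small sketch of how the Top-1 and Top-5 metrics are computed, assuming a PyTorch setting; the logits and labels are random placeholders.

```python
import torch

logits = torch.randn(8, 400)                        # scores for 8 clips over 400 Kinetics classes
labels = torch.randint(0, 400, (8,))                # ground-truth class per clip

top5 = logits.topk(5, dim=-1).indices               # the 5 highest-scoring classes per clip
top1_acc = (top5[:, 0] == labels).float().mean()    # correct if the single best guess matches
top5_acc = (top5 == labels[:, None]).any(-1).float().mean()   # correct if it is among the top 5
```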

Now, moving on to the table: ViViT-L/16x2 FE, with 92.7% Top-5 accuracy at 1x1 views, underperforms the best 3D convolutional network, X3D-XXL, which reaches 94.6%. On the Top-1 metric, ViViT-L/16x2 FE with 1x3 views and 81.7% accuracy outperforms TimeSformer-L at 80.7%. Finally, when these networks are pretrained on the large JFT dataset to push accuracy further, ViViT-H/14x2 (JFT) outperforms all the other state-of-the-art models.

Comparison to state-of-the-art across multiple datasets

On the Kinetics 600 dataset, the best 3D convolutional variant is X3D-XL, which reaches 81.9% Top-1 and 95.5% Top-5 accuracy. X3D-XL is outperformed by the Transformer-based TimeSformer-L, which reaches 95.6% on the Top-5 metric. ViViT-L/16x2 FE outperforms the previous state of the art by a small margin of 0.7% on the Top-1 metric, but still lags behind on Top-5. Finally, when leveraging pretrained image weights from JFT, ViViT-H/14x2 (JFT) performs best, with 85.8% Top-1 and 96.5% Top-5 accuracy.

As the table above shows, ViViT outperforms the state-of-the-art models across numerous datasets.

Conclusion and Future Work

  • ViViT successfully applies regularisation and factorisation at different levels to produce state-of-the-art models for video tasks.
  • The model presents a unique way of factorising the spatial and temporal dimensions, which increases efficiency and scalability.
  • The removal of dependence on image-pretrained models is an area for clear future improvements.
  • As a next step, the aim should be to go beyond video classification towards more complex tasks.

References:

[1] “Attention Is All You Need”, 2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, https://arxiv.org/abs/1706.03762

[2] “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, 2020, Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, https://arxiv.org/abs/2010.11929

[3] “ViViT : A Video Vision Transformer”, 2021, Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, Cordelia Schmid, https://arxiv.org/abs/2103.15691
