Exploring ViViT: A Deep Dive into Google’s Video Generation Techniques
My motivation to read this paper
Recently, I delved into a fascinating paper by Google Research on Genie, a model designed for generating controllable videos. This paper introduced the concept of the ST-transformer. However, the explanation and diagrams provided were quite brief, leaving me curious for more details.
In my quest for a deeper understanding, I turned to the paper on C-ViViT (the video encoder introduced in Phenaki), which serves as an enhancement of the ST-transformer approach. Interestingly, C-ViViT itself is based on another method called ViViT.
Following this trail led me to the original ViViT paper, presented at ICCV 2021. Here, I’d like to share a brief introduction to this foundational work.
Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid. ViViT: A Video Vision Transformer. In ICCV, 2021.
Overview of the proposed model architectures in the ViViT paper
ViViT is a video classification model that captures both spatial and temporal information by converting videos into sequences of image tokens, using only the transformer’s encoder and entirely avoiding convolutional processing.
In the paper, the authors created multiple architectures and conducted meticulous ablation studies to fine-tune the model. As a result, ViViT achieved state-of-the-art performance on several video classification benchmarks, surpassing the performance of previously proposed 3D convolutional models.
ViViT explores several methods for encoding videos across space and time. The initial step is to patchify the frames, as in the Vision Transformer (ViT), and embed the patches.
One of the simplest approaches is to embed the patches of each frame independently and concatenate them across time. Another method embeds patches taken from tubelet-shaped regions spanning several frames, similar to a 3D convolution. Evidently, the smaller the tubelets, the more tokens are produced and the higher the computational load.
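To make the tubelet idea concrete, here is a minimal sketch of tubelet embedding in PyTorch. The class name, the 2x16x16 tubelet size, and the 768-dimensional embedding are illustrative assumptions on my part, not the paper's implementation; the point is that a 3D convolution whose kernel size equals its stride is equivalent to flattening each non-overlapping tubelet and projecting it with a linear layer.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Sketch: map a video (B, C, T, H, W) to a sequence of tokens.

    A 3D convolution with kernel_size == stride extracts non-overlapping
    tubelets and linearly projects each one to embed_dim.
    """
    def __init__(self, in_channels=3, embed_dim=768, tubelet_size=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=tubelet_size, stride=tubelet_size)

    def forward(self, video):                 # video: (B, C, T, H, W)
        x = self.proj(video)                  # (B, D, T/t, H/h, W/w)
        return x.flatten(2).transpose(1, 2)   # (B, num_tokens, D)

# Example: a 32-frame 224x224 clip with a 2x16x16 tubelet yields
# (32/2) * (224/16) * (224/16) = 16 * 14 * 14 = 3136 tokens.
tokens = TubeletEmbedding()(torch.randn(1, 3, 32, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 768])
```

Halving the tubelet edge lengths multiplies the number of tokens, which is why smaller tubelets quickly become expensive for the attention layers that follow.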
- Model 1: The baseline model in this paper; it corresponds to the model referred to as Joint Space-Time in the TimeSformer paper. Please see my previous blog post for details.
- Model 2: A late-fusion approach that encodes each frame with a spatial transformer (Ls layers) and then feeds the resulting per-frame z_cls tokens into a temporal transformer (Lt layers). Because the two transformers are separate, this architecture is more lightweight in computational complexity, O((nh*nw)² + nt²), compared to Model 1, O((nh*nw*nt)²). A minimal sketch of this Factorised Encoder appears after this list.
I believe the purpose of feeding the spatially encoded z_cls into the temporal transformer in the second stage is to reduce computational complexity. My understanding is that, because each z_cls already summarizes a frame spatially, the temporal transformer can focus on capturing object movements (actions) across frames.
- Model 3: This model is the same as the Divided Space-Time model in the TimeSformer paper.
My intuition is that this approach seems capable of capturing the regions that require attention in both space and time.
- Model 4: This model has the same computational complexity as Models 2 and 3, but the same number of parameters as Model 1. It resembles Model 3 in that it factorises the spatial and temporal dimensions, but it differs in that it modifies multi-head dot-product attention itself: different heads compute attention for each token over the spatial and the temporal dimensions separately.
This approach is a bit interesting because it wasn’t the method used in TimeSformer.
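As mentioned under Model 2, here is a minimal sketch of the Factorised Encoder, assuming PyTorch and using torch.nn.TransformerEncoder as a stand-in for the spatial and temporal transformers. The layer counts, the CLS-token handling, and all hyper-parameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FactorisedEncoder(nn.Module):
    """Sketch of Model 2: a spatial transformer per frame, then a temporal
    transformer over the per-frame z_cls tokens. Shapes are illustrative."""
    def __init__(self, dim=768, heads=12, ls=12, lt=4, num_classes=400):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                                        batch_first=True)
        self.spatial = nn.TransformerEncoder(make_layer(), num_layers=ls)   # Ls layers
        self.temporal = nn.TransformerEncoder(make_layer(), num_layers=lt)  # Lt layers
        self.cls_spatial = nn.Parameter(torch.zeros(1, 1, dim))
        self.cls_temporal = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):              # tokens: (B, T, N, D) patch tokens
        b, t, n, d = tokens.shape
        x = tokens.reshape(b * t, n, d)     # every frame becomes its own batch item
        cls = self.cls_spatial.expand(b * t, -1, -1)
        x = self.spatial(torch.cat([cls, x], dim=1))[:, 0]    # per-frame z_cls: (B*T, D)
        x = x.reshape(b, t, d)                                # sequence of frame embeddings
        cls = self.cls_temporal.expand(b, -1, -1)
        x = self.temporal(torch.cat([cls, x], dim=1))[:, 0]   # video-level representation
        return self.head(x)
```

The key point is that the quadratic attention cost is paid over nh*nw patch tokens per frame in the first stage and over nt frame embeddings in the second stage, instead of over all nh*nw*nt tokens at once as in Model 1.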
Something I’ve been thinking about after reading this paper
The submission deadline for ICCV 2021 was March 17, 2021 (11:59 PM Pacific Time), and the references in the submitted paper cite the arXiv version of TimeSformer.
That TimeSformer preprint was posted to arXiv on Tue, 9 Feb 2021, 19:49:33 UTC. The authors likely had to hurry to study, implement, and validate against TimeSformer within the remaining month before the deadline (just one month!). Could this be the reason why, although the paper claims to achieve SOTA in the results that follow, the comparison does not seem entirely fair?
Let’s take a look at the contents of the paper
Before diving into the experiments, it’s important to consider some key aspects of using Transformers to handle video data. Unlike CNN-based methods, Transformers have fewer built-in inductive biases, which means they typically require a large amount of training data. However, since video datasets are far less abundant than image datasets, the pre-training strategy becomes crucial.
Therefore, the approach is to pre-train the model on abundant image datasets before fine-tuning it on video datasets. However, it is then essential to decide how to initialize the parameters that are incompatible with, or do not exist in, the image-pretrained model.
Positional Encoding:
Since we use n_t frames of video data, it is necessary to extend the positional encoding p accordingly. The authors have opted to repeat the same p across the frames.
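Here is a minimal sketch of this temporal repetition, assuming the video tokens are laid out frame by frame (the layout and shapes are my assumptions for illustration):

```python
import torch

# p_image: positional embedding learned during image pre-training,
# shape (1, n_hw, d) where n_hw is the number of spatial patches per frame.
n_hw, n_t, d = 196, 16, 768
p_image = torch.randn(1, n_hw, d)

# Repeat the same spatial positional embedding for every frame so that it
# matches the n_t * n_hw video tokens.
p_video = p_image.repeat(1, n_t, 1)   # (1, n_t * n_hw, d)
print(p_video.shape)                  # torch.Size([1, 3136, 768])
```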
Embedding weight:
When pre-training on image datasets, the linear transformation E is learned as a 2D tensor. To adapt it for video data, it needs to be extended into a 3D tensor. One approach is to "inflate" it: replicate E along the temporal dimension, average (divide by t), and use the result as the 3D E.
Another method, which the authors call "central frame initialization," places the E learned from the image dataset only at the central position in the temporal dimension, while setting all other temporal positions to zero.
Additionally, since Model 3 has two Multi-Head Self-Attention (MSA) layers, the spatial MSA is initialized using the pre-trained weights from the image dataset, while the temporal MSA is initialized to zero before fine-tuning.
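To make the two initialization strategies for E concrete, here is a minimal sketch. I assume a convolution-style filter layout of (D, C, h, w) for the image-pretrained E and (D, C, t, h, w) for the video version; the function name and shapes are mine, not the paper's notation.

```python
import torch

def inflate_embedding(E_2d, t, mode="central"):
    """Sketch: extend a 2D patch-embedding filter (D, C, h, w) learned on
    images into a 3D tubelet filter (D, C, t, h, w) for video.

    "average": replicate along time and divide by t (filter inflation).
    "central": place E_2d at the central temporal index, zeros elsewhere.
    """
    D, C, h, w = E_2d.shape
    if mode == "average":
        return E_2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t
    E_3d = torch.zeros(D, C, t, h, w)
    E_3d[:, :, t // 2] = E_2d    # central frame initialization
    return E_3d

E_2d = torch.randn(768, 3, 16, 16)   # image-pretrained patch embedding
E_3d = inflate_embedding(E_2d, t=2)  # tubelet embedding for a 16x16x2 tubelet
```

With central frame initialization, the model initially behaves as if it only looked at the central frame of each tubelet and learns to use the other frames during fine-tuning.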
Experimental Results
L is the number of transformer layers, each with a self-attention block of NH heads and hidden dimension d.
- ViT-Base (ViT-B): L=12, NH=12, d=768
- ViT-Large (ViT-L): L=24, NH=16, d=1024
- ViT-Huge (ViT-H): L=32, NH=16, d=1280
- ViViT-B/16x2: ViT-B with a 16x16x2 (h x w x t) tubelet
The authors initially tested the different encoding methods and found that the central frame approach yielded the highest accuracy. Therefore, they selected this method for the subsequent experiments.
The authors then compared the performance across different model types. Model 2 shows higher performance overall.
The authors observed overfitting on the relatively smaller datasets, Epic Kitchens (EK) and Something-Something v2 (SSv2), even with pre-training on ImageNet. Therefore, they use the Factorised Encoder (Model 2) and incorporate additional regularization during training to address this issue when evaluating the model.
The authors report that they achieved up to approximately 5% improvement in accuracy with SSv2, similar to the results observed with EK.
Among the other experimental results, an interesting finding comes from the experiment in which the number of input frames was varied. It appears that accuracy plateaus after a certain number of frames (in the case of ViViT-L/16x2, Model 2).
Table 5 presents a comparison with other methods. Overall, ViViT-L/16x2 shows favorable results. However, it is important to note that the concurrently published TimeSformer does not incorporate the regularization techniques considered in the proposed method. The absence of comparisons with a TimeSformer variant that includes the same regularization, or with a TimeSformer pre-trained on the JFT dataset, may make the results appear somewhat less fair.
Thank you!
Written by Taks.skyfoliage.com
This post is republished from skyfoliage.com