Scaling Vision Transformers

Sieun Park · Published in CodeX · Aug 5, 2021

Modern deep learning puts a lot of faith in scale. Large neural networks with billions or even trillions of parameters perform remarkably well, so how well an architecture scales matters. There has been prominent work on effective scaling methodology for CNNs and transformers. The Vision Transformer (ViT) is a fully transformer-based architecture that showed performance comparable to state-of-the-art CNNs on image classification. How should we scale vision transformers? What happens when we scale the data and the model while training ViT?

This recent (June 2021) paper studies the properties of scaling vision transformers by experimenting with varying data and model sizes. As a conclusion, the paper suggests a scaling law for vision transformers, which serves as a guideline for scaling them. The paper also proposes architectural changes to the ViT pipeline. As of August 4, the proposed network achieves the state of the art on ImageNet with 90.45% top-1 accuracy.

We explored the concept of ViT in a previous post. This post is a summary of the paper “Scaling Vision Transformers”.

Architectural changes

The paper introduces changes to the ViT framework and draws conclusions about the hyperparameter space, as listed below.

  • Decoupled weight decay for the “head”: The paper finds that the preferred weight decay strength for few-shot learning differs between the final linear layer (head) and the backbone. The figure above demonstrates that a small weight decay for the body combined with a large weight decay for the head is largely beneficial (a minimal weight-decay sketch follows this list).
  • Saving memory by removing the [class] token: Adding a [class] token to the original 256 patch-encoding tokens results in 257 tokens, and on TPU hardware this increase from 256 to 257 causes a roughly 50% memory overhead because tensor shapes are padded. Instead of using the class token, global average pooling or multihead attention pooling is used to aggregate the patch encodings (see the pooling sketch after this list). The head is simplified by removing the final non-linear projection.
  • According to the right figure, this modification does not significantly affect performance.
  • Scaling up the data from 300M to 3B images improves the performance of both small and large models.
  • Memory-efficient optimizers: Because billions of parameters are trained, the storage needed for the Adam optimizer’s state becomes a major bottleneck (an additional 16 GiB, according to the paper). The paper proposes using Adam with half-precision momentum, or a modified Adafactor optimizer, which reduces the optimizer overhead from 2x to 0.5x the size of the weights (see the half-precision Adam sketch after this list).
  • Additional training techniques: The paper experiments with the effects of common training techniques used to improve model performance, such as learning rate schedules.
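
To make the decoupled weight decay concrete, here is a minimal PyTorch sketch (my own illustration, not the paper’s code) that puts the head and the body in separate optimizer parameter groups with different weight decay strengths. The decay values are placeholders, and the `heads` naming follows torchvision’s ViT implementation.

```python
import torch
from torchvision.models import vit_b_16  # torchvision's ViT, used here only for illustration

model = vit_b_16()

# In torchvision's ViT the classifier parameters live under "heads";
# split them from the backbone so each group gets its own weight decay.
head_params = [p for n, p in model.named_parameters() if n.startswith("heads")]
body_params = [p for n, p in model.named_parameters() if not n.startswith("heads")]

optimizer = torch.optim.AdamW(
    [
        {"params": body_params, "weight_decay": 0.01},  # small decay for the body
        {"params": head_params, "weight_decay": 1.0},   # large decay for the head
    ],
    lr=1e-3,  # placeholder learning rate, not the paper's setting
)
```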
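
Below is a minimal sketch of how a pooling-based head might look, assuming the encoder already returns the patch tokens: global average pooling replaces the [class] token, and the head is a single linear layer with no extra non-linear projection. The module name and shapes are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class GAPHead(nn.Module):
    """Head that averages the patch tokens instead of reading a [class] token."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        # single linear layer: the final non-linear projection is removed
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim) -- patch encodings only, no [class] token
        pooled = tokens.mean(dim=1)  # global average pooling over the 256 patches
        return self.classifier(pooled)

# dummy encoder output: batch of 8 images, 256 patch tokens of width 768
logits = GAPHead(dim=768, num_classes=1000)(torch.randn(8, 256, 768))
print(logits.shape)  # torch.Size([8, 1000])
```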
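
And a rough sketch of the half-precision-momentum idea: a simplified Adam variant that keeps the first moment in bfloat16, so that piece of optimizer state takes half the memory of the weights. This illustrates the concept only; it is not the paper’s actual optimizer.

```python
import torch

class HalfMomentumAdam(torch.optim.Optimizer):
    """Simplified Adam that stores the first moment in bfloat16."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:
                    state["step"] = 0
                    state["m"] = torch.zeros_like(p, dtype=torch.bfloat16)  # momentum in bf16
                    state["v"] = torch.zeros_like(p)                        # second moment in fp32
                state["step"] += 1
                m = state["m"].float()                     # upcast for the update
                m.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                state["m"] = m.to(torch.bfloat16)          # store back at half precision
                state["v"].mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
                m_hat = m / (1 - beta1 ** state["step"])   # bias correction
                v_hat = state["v"] / (1 - beta2 ** state["step"])
                p.addcdiv_(m_hat, v_hat.sqrt().add_(group["eps"]), value=-group["lr"])
```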

Scaling ViT

CNNs are typically scaled by the following three factors:

  • width: # channels of each layer
  • depth: # layers
  • resolution: input image size

Because ViT is built around attention, there are more parameters that control its scale. These include the patch size, the number of encoder blocks (depth), the dimensionality of the patch embeddings and self-attention (width), the number of attention heads, and the hidden dimension of the MLP blocks (MLP-width).
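
As a rough illustration (my own sketch, not the paper’s code), these knobs can be collected into a small config object with a back-of-the-envelope parameter estimate that only counts the attention projections and MLP weights. The B and L settings below are the standard ones from the original ViT paper.

```python
from dataclasses import dataclass

@dataclass
class ViTConfig:
    """Illustrative collection of the knobs that control ViT scale."""
    patch_size: int  # side length of each image patch
    depth: int       # number of encoder blocks
    width: int       # dimensionality of patch embeddings / self-attention
    heads: int       # number of attention heads
    mlp_width: int   # hidden dimension of the MLP blocks

    def approx_params(self) -> int:
        # Per block: 4 * width^2 for the q/k/v/output projections,
        # plus 2 * width * mlp_width for the MLP; embeddings and norms ignored.
        per_block = 4 * self.width ** 2 + 2 * self.width * self.mlp_width
        return self.depth * per_block

# Standard variants from the original ViT paper (patch size 16):
vit_b = ViTConfig(patch_size=16, depth=12, width=768, heads=12, mlp_width=3072)
vit_l = ViTConfig(patch_size=16, depth=24, width=1024, heads=16, mlp_width=4096)
print(f"ViT-B ~{vit_b.approx_params() / 1e6:.0f}M, ViT-L ~{vit_l.approx_params() / 1e6:.0f}M")
# prints roughly 85M and 302M -- close to the reported 86M / 307M totals
```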

The paper uses the TensorFlow XLA compiler to optimize runtime speed and memory. With XLA trading off memory and speed, the authors evaluate a number of model architecture configurations, listed in the table below.

However, not every network could run on a single device. The authors perform the simulation described in the figure below to measure whether each configuration can actually be trained. The memory-related modifications from this paper enabled training of the models in the green and blue regions.

The ViT paper contains a study of effective rules for trading off the different components. The rule is to scale depth, width, MLP-width, and patch size simultaneously by a similar amount. The final models are selected in this manner, as shown by the diagonal pattern in the figure below (a toy version of the rule is sketched next).
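
A toy version of this rule, under my own simplifying assumptions: bump depth, width, and MLP-width up by a common factor and shrink the patch size by the same factor (smaller patches mean more tokens). Real configurations are snapped to hardware-friendly values, so this only shows the idea.

```python
def scale_config(base: dict, factor: float) -> dict:
    """Scale all ViT dimensions together by a similar amount (toy illustration)."""
    return {
        "depth": round(base["depth"] * factor),
        "width": round(base["width"] * factor),
        "mlp_width": round(base["mlp_width"] * factor),
        # smaller patches mean more tokens, so patch size shrinks as the model grows
        "patch_size": max(2, round(base["patch_size"] / factor)),
    }

base = {"depth": 12, "width": 768, "mlp_width": 3072, "patch_size": 16}
print(scale_config(base, 1.5))
# {'depth': 18, 'width': 1152, 'mlp_width': 4608, 'patch_size': 11}
```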

Insights on Scaling ViT

Most importantly, this paper studies patterns in the effect of changing the network size, data size, and training duration. In the experiments, a representation-quality metric measures how useful the learned features are. Precisely, it is measured by (i) few-shot transfer, training a linear classifier on frozen weights (a toy probe of this kind is sketched after the list below), and (ii) transfer by fine-tuning the whole model on all data. The paper suggests that:

  • The figure above plots the error rate as a function of compute. The performance of larger models seems to saturate past a certain point (~10% error on ImageNet).
  • When plotted solely against model size (top right figure) or dataset size (bottom right figure), the optimal training setting also increases in scale.
  • Scaling up compute, model, and data together improves representation quality. As depicted in the left and center figures, training with the largest model, dataset, and compute achieves the best performance (the lower-right point).
  • Smaller models (blue), or models trained on fewer images (small markers), fall off the curve when trained for longer.
  • Smaller models did not benefit from increasing the dataset size or compute, while large models significantly benefit even beyond 1B images.
  • The figure above studies the error rate of different-sized models with respect to the number of training steps.
  • Big models are more sample-efficient. For example, in 10-shot learning the Ti/16 model needs to see 100 times more images to reach the performance of the L/16 model, and 20 times more when fine-tuning. When sufficient data is available, training a larger model for fewer steps is better.
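
For concreteness, here is a toy version of the few-shot linear probe used to measure representation quality: fit a closed-form ridge-regression classifier on frozen features and report test accuracy. The feature dimensions, regularization strength, and random data are placeholders; the paper’s actual evaluation pipeline may differ.

```python
import numpy as np

def linear_probe(train_feats, train_labels, test_feats, test_labels, l2=1e-3):
    """Fit a ridge-regression classifier on frozen features and return test accuracy."""
    n_classes = train_labels.max() + 1
    Y = np.eye(n_classes)[train_labels]          # one-hot targets
    X = train_feats
    # closed-form ridge regression: W = (X^T X + l2*I)^-1 X^T Y
    W = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ Y)
    preds = (test_feats @ W).argmax(axis=1)
    return (preds == test_labels).mean()

# dummy example with random "features"; in practice these come from the frozen ViT encoder
rng = np.random.default_rng(0)
acc = linear_probe(rng.normal(size=(100, 64)), rng.integers(0, 10, 100),
                   rng.normal(size=(50, 64)), rng.integers(0, 10, 50))
print(f"probe accuracy: {acc:.2f}")
```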

However, the few-shot comparison seems somewhat unfair. The competing methods used unlabeled but in-domain data for pre-training, while the ViT-G model was able to leverage pre-training on much larger data collected for a different purpose.

I could be missing something, but the paper seems to make contradictory statements about the scaling properties of ViTs. The sentence “Both small and large models benefit from this change, by an approximately constant factor…” in Section 3.3 and “Further, when increasing the dataset size, we observe a performance boost with big models, but not small ones.” in Section 2.1 appear to contradict each other.

Summary

Let’s summarize the observations made in this study into a conclusion.

  • It is effective to scale total compute and model size simultaneously. In particular, not increasing a model’s size when extra compute becomes available is suboptimal.
  • Vision Transformer models trained on enough data roughly follow a (saturating) power law.
  • Larger models perform better in few-shot learning.
  • The paper proposes new training techniques that improve performance and reduce computational bottlenecks.

Vision transformers are an effective but not yet thoroughly researched branch of computer vision. Follow-up papers that discuss the various properties of ViT are attracting a lot of interest, particularly from researchers at Google Brain, and are well worth a read to understand ViTs further.
