Why this one (literally) small model spells big things for Vision Transformers.

Chris Ha
Oct 12, 2021


A brief anthology and history of Vision Transformer models

Ever since Transformers were introduced into the computer vision mainstream by the seminal ViT paper (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale), many papers and models have followed.

From: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Some papers focused on how to train them effectively and efficiently (DeiT: Training data-efficient image transformers & distillation through attention), while others aimed to go deeper (DeepViT) and deeper still (CaiT).

The transformer may also be reformulated to be more efficient for vision tasks (ResT: An Efficient Transformer for Visual Recognition).

While Transformers were favored for their model capacity and lack of spatial inductive bias, convolutions were reexamined for their inductive bias and parameter efficiency and reintroduced (CvT: Introducing Convolutions to Vision Transformers) in various (ConViT) forms (CCT). Some models focus on the patch embeddings or the patch merging between stages. Others take a more involved approach and enhance the QKV attention layers with convolutions. Even the MLP/FFN layers are not left alone, with certain models introducing convolutions to replace or augment them.

From CvT. We can see that convolutions are used for token embeddings and QKV projections.

Many architectures have emerged with sufficient robustness (RVT: Towards Robust Vision Transformer) or maturity to serve as baselines (PVT). Some even achieve this while relying only minimally on convolutions (Swin, CSWin). Many of these models are suitable for multiple tasks, including segmentation, and tend to be competitive with or better than CNN-based networks of similar FLOPs, parameter counts, or latency.

From PVTv2: Improved Baselines with Pyramid Vision Transformer. This model incorporates many ideas such as hierarchical construction, convolutional embeddings and feed forward networks.

Another noteworthy model is CoAtNet. Here Google researchers found a good hybrid design employing both convolutional stages and self-attention stages. They then scaled this design into a massive 2.44B-parameter, 2.5-TFLOPs monster that takes 20.1K TPU-core-days to train and set a new ImageNet top-1 accuracy SoTA of 90.88%! (As massive as this model is, it is still about half the FLOPs of the humongous ViT-G/14 at 5.1 TFLOPs, which also took more than 30K core-days.)

For those who are interested, I highly encourage reading the Google AI blog post or watching Yannic Kilcher's ML News segment on it.

To be frank, all of these architectures, ideas, and new SoTA numbers are exciting by themselves, but realizing that many of these papers explore orthogonal ideas (which could hence potentially be combined) makes the outlook for further research even more enjoyable.

Big transformers, small convolutions

Even from this brief collection of recent transformer-based models, the trend toward large transformer models is apparent.

The general experience is that convolutions are better for small, efficient models, while Transformers are better at larger model capacities.

Ever since the huge ViT models were introduced, it has seemed that transformer-based models are better suited to the large-capacity regime.

Taken from: CSWin Transformer. Here we can see how large "small/tiny" vision transformers are.

Among Vision Transformers, the small or tiny versions are around 20~30 million parameters and 4~5 GFLOPs. That is the size of ResNet-50. Although ResNet-50 is a ubiquitous and well-studied model, it is far from "small".

To put things in perspective, EfficientNets (v1) scale up from 5.3M parameters and 0.4 GFLOPs (B0, 77.1% top-1). RegNets can scale up from 3.2M parameters and 0.2 GFLOPs (RegNetY-200MF, 70.4% top-1). Although these smallest variants are far from SoTA performance, they remain functional in whatever regime they occupy.

Although it is not impossible for Vision Transformers to work in this mobile regime, they seem to require orders of magnitude more compute and data (such as JFT-300M or JFT-3B, neither of which is available outside Google) compared to a CNN of similar size.

Taken from Scaling Vision Transformers. Here S/28 has 5.4M parameters and 0.7 GFLOPs; S/16 has 5.0M parameters and 2.2 GFLOPs.

Recent efforts on hybrid architectures that utilize convolutions in various parts of the model seem to bridge this gap. PVTv2-B0, a hybrid model that uses convolutions extensively, has 3.4M parameters, 0.6 GFLOPs, and 70.5% top-1 performance.

MobileViT: small, yet significant

MV2 blocks are MobileNetV2 blocks.

However, one recent paper discussing very small models has me more excited than even the largest and beefiest of the architectures mentioned above. The paper in question is MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer.

Why, you might ask, is a Vision Transformer model focused on being light-weight and mobile-friendly worthy of interest?

MobileViT puts forward three compact networks. The smallest, XXS, has 1.3M parameters and 0.2 GFLOPs (69.0% top-1); XS has 2.3M parameters and 0.6 GFLOPs (74.8% top-1); and finally the "largest" sibling, S, has 5.6M parameters and 1.1 GFLOPs with 78.4% top-1 performance.

From MobileViT

As a side note, the FLOPs numbers above were not given in the original paper, so I took an independent implementation and ran torchinfo on it at 256 by 256 to get them. Although the authors justify the omission by noting that, even at similar FLOPs, actual runtime or latency can vary wildly depending on the architecture and platform, I would still have appreciated an objective number to compare models by.
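For reference, the measurement itself is straightforward. The sketch below shows roughly how such counts can be obtained; `mobilevit_xxs()` is a placeholder for whichever independent implementation is used, not an API from the paper.

```python
# Sketch of how the parameter/multiply-add counts above can be reproduced.
# `mobilevit_xxs` stands in for a constructor from some independent
# implementation; it is not part of the original paper's code.
from torchinfo import summary

model = mobilevit_xxs()  # hypothetical third-party constructor
stats = summary(model, input_size=(1, 3, 256, 256), verbose=0)
print(f"params:    {stats.total_params / 1e6:.1f}M")
print(f"mult-adds: {stats.total_mult_adds / 1e9:.2f}G")
```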

The paper emphasizes that latency, rather than FLOPs, was the primary concern, and that FLOPs could be further reduced through other optimizations.

A quick look into the MobileViT architecture

MV2 blocks are used mainly for downsampling

All of the MobileViT models share the same general construction and differ only in hidden dimensions. From an input size of 256x256, a mix of six convolutional layers downsamples the input to 32x32. Only the first layer is a vanilla 3x3 convolution; all subsequent layers are MobileNetV2 blocks, with a stride of 2 (for the downsampling blocks) or without.

A MobileViT block applies a transformer in a very specific way that the authors refer to as transformers as convolutions. The input tensor of dimension H x W x C first goes through a 3 x 3 convolution (n x n, where n could be any number, but the paper only ever discusses the case n = 3) and a 1 x 1 convolution to arrive at a representation of dimension H x W x d. This tensor is divided into non-overlapping patches of size h x w x d. Although h and w can take various values (the paper ablates h = w values of 2, 3, 4, and 8, and even varies them between stages), the final models settle on a uniform 2 x 2 x d.

Each of the patches is now "unfolded", forming an intermediate representation of dimension P x N x d (where P = w * h and N = H * W / P). Transformers are then applied to this representation.
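To make the unfolding concrete, here is a minimal sketch (my own illustration, not the authors' code) of the rearrangement using plain reshape/permute on a channels-first PyTorch tensor:

```python
import torch

def unfold_patches(x, h=2, w=2):
    # (B, d, H, W) -> (B, P, N, d) with P = h*w pixel positions per patch and
    # N = (H*W)/P patches; attention then mixes the N patches at each position.
    B, d, H, W = x.shape
    x = x.reshape(B, d, H // h, h, W // w, w)
    x = x.permute(0, 3, 5, 2, 4, 1)                     # (B, h, w, H/h, W/w, d)
    return x.reshape(B, h * w, (H // h) * (W // w), d)

def fold_patches(x, H, W, h=2, w=2):
    # Inverse of unfold_patches: (B, P, N, d) -> (B, d, H, W).
    B, P, N, d = x.shape
    x = x.reshape(B, h, w, H // h, W // w, d)
    x = x.permute(0, 5, 3, 1, 4, 2)                     # (B, d, H/h, h, W/w, w)
    return x.reshape(B, d, H, W)
```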

As this can be a bit difficult to follow, I will walk through the first MobileViT block in the XXS model as an example.

The first MobileViT block in XXS takes a 32 x 32 x 48 tensor as input. This is first convolved by a 3 x 3 layer and then expanded into a 32 x 32 x 64 tensor by a 1 x 1 layer. The tensor is divided into 256 (32 x 32 / 4) individual 2 (w) x 2 (h) x 64 patches. Each patch is "unfolded", or flattened, into a 4 x 64 patch (of "tokens"). We now have 256 patches of size 4 x 64. These are stacked into a 4 x 256 x 64 tensor and subsequently processed by a transformer for inter-patch self-attention.

After this, the patches are folded back into a 32 x 32 x 64 tensor, as before the transformer, and a 1 x 1 layer projects it back to the original input size of 32 x 32 x 48. A skip connection concatenates the original, unprocessed input, giving 32 x 32 x 96 (2x the input), and a final 3 x 3 convolution fuses them back down to the original input size of 32 x 32 x 48.
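Putting it together, a simplified sketch of such a block might look like the following (reusing the unfold/fold helpers above; the transformer internals here are generic stand-ins, not the authors' exact configuration):

```python
import torch
import torch.nn as nn

class MobileViTBlockSketch(nn.Module):
    # Illustrative only: the dimensions follow the description above, but layer
    # details (normalization, activations, MLP width) are simplified guesses.
    def __init__(self, c_in=48, d=64, depth=2, heads=4, patch=2):
        super().__init__()
        self.patch = patch
        self.local_rep = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1),   # n x n local representation
            nn.Conv2d(c_in, d, 1),                 # project C -> d
        )
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads,
                                           dim_feedforward=2 * d,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj_back = nn.Conv2d(d, c_in, 1)     # project d -> C
        self.fuse = nn.Conv2d(2 * c_in, c_in, 3, padding=1)

    def forward(self, x):
        B, C, H, W = x.shape
        p = self.patch
        y = self.local_rep(x)                                 # (B, d, H, W)
        y = unfold_patches(y, p, p)                           # (B, P, N, d)
        Bp, P, N, d = y.shape
        y = self.transformer(y.reshape(Bp * P, N, d))         # attend across patches
        y = fold_patches(y.reshape(Bp, P, N, d), H, W, p, p)  # (B, d, H, W)
        y = self.proj_back(y)                                 # (B, C, H, W)
        return self.fuse(torch.cat([x, y], dim=1))            # concat skip + 3x3 fusion

# For the XXS example above: a (1, 48, 32, 32) input with d = 64 gives
# P = 4 and N = 256, matching the numbers in the walkthrough.
```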

General purpose, not a one-trick pony

Image classification

In image classification, MobileViTs hold their own compared to other lightweight CNNs of similar parameter counts or heavier CNNs scaled down.

Object Detection

They also seem useful as backbones across other vision tasks, such as object detection or semantic segmentation.

One point where ViTs are still stumped is latency on mobile devices. Even the MobileViT put forward in the paper is around an order of magnitude slower than the years-old MobileNetV2 (0.92 ms for MobileNetV2 vs 7.28 ms for MobileViT). The paper does not describe the specific device, but one can assume it might be an iPhone with ARM-based instructions, considering the authors are affiliated with Apple. The authors believe that as mobile devices incorporate specialized kernels and hardware features to accelerate transformers, this shortcoming will be bridged.

Considering how prevalent transformers are in other domains such as language models, which are also useful on mobile devices, I too am confident that this day will soon arrive.

A hidden gem?

Often, papers employ specialized training methods only for their own model while leaving the models meant for comparison untouched. This paper introduces a multi-scale sampler and uses it to train not only its own model but also many of the CNNs used for comparison.

Fine-tuning with multiple input sizes is a useful tool for enhancing model generalization, and sometimes a specific image size can be chosen to enhance capability (see: Fixing the train-test resolution discrepancy: FixEfficientNet). The multi-scale sampler furthers this idea by choosing multiple image sizes, with appropriately scaled batch sizes, during training. Not only does this reduce training time, it also increases accuracy. It is intuitive and useful, and its effects are replicated in other models such as ResNet and MobileNetV2. Considering that the paper provides the source code for a PyTorch implementation in the paper itself, one can only hope this idea catches on, and its model along with it.
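To illustrate the core idea (this is my own conceptual sketch, not the paper's sampler), a batch generator could couple resolution and batch size so that the pixel budget per batch stays roughly constant:

```python
import random

def multi_scale_batches(num_samples, base_res=256, base_batch=128,
                        resolutions=(160, 192, 224, 256, 288)):
    # Conceptual sketch: each step draws a resolution and scales the batch size
    # inversely with image area, keeping pixels-per-batch (and thus memory use)
    # roughly level. The resolution choices here are illustrative defaults.
    indices = list(range(num_samples))
    random.shuffle(indices)
    i = 0
    while i < num_samples:
        res = random.choice(resolutions)
        bs = max(1, int(base_batch * (base_res / res) ** 2))
        yield res, indices[i:i + bs]
        i += bs

# Usage with a hypothetical loading helper:
# for res, batch_idx in multi_scale_batches(len(dataset)):
#     images, labels = load_batch(dataset, batch_idx, resize_to=res)
```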

Conclusion

MobileViT seems to be a strong counterexample to the idea that CNNs are good for small models and ViTs are good for large models. It is an effective model that exceeds even the best small CNNs in many areas. Although its mobile-device latency leaves much to be desired, I am convinced that even that will change as hardware adapts to accommodate transformers in mobile and embedded devices.
