What is the most important stuff in Vision Transformer?

Akihiro FUJII
Published in Analytics Vidhya
12 min read, Nov 1, 2021


This blog post describes the paper “Patches Are All You Need?” (2021), which was submitted to ICLR 2022 and was under review as of the end of October 2021. The ConvMixer proposed in this paper is built from convolutions and batch normalization (BN), and unlike previous models in the Vision Transformer family, it achieves good results even on small datasets such as CIFAR. We will then discuss whether patches are really the only important thing, from the point of view that the model contains both global and local information processing mechanisms. The article is organized as follows.

  1. Summary
  2. Vision Transformer
  3. Structure of the Proposed Model
  4. Results
  5. Are patches really all you need?
  6. Conclusion

1. Summary

The summary of this paper is as follows.

ConvMixer can be implemented in about six lines of PyTorch. The model is more efficient than ViT and MLP-Mixer and can achieve 96% accuracy even on small datasets such as CIFAR. From this result, the authors argue that patching the image may have been more critical than the transformer itself.

The ConvMixer proposed in this paper is built from convolutions and Batch Norm. In recent years, transformer-based models have been used in computer vision (CV) tasks such as image recognition, but this paper does not use a transformer. Nevertheless, it achieves better accuracy than the previous transformer-based models.

Based on this result, the authors argue that the recent breakthrough by transformer-based models may not be due to the transformer itself but rather to the “image patching process” that takes place there.

2. What is Vision Transformer?

First of all, I would like to explain ViT (Vision Transformer), which is the subject of comparison in this paper, and the transformer it is based on. So let’s start with the transformer.


The Transformer is a model proposed in the paper “Attention Is All You Need” (Vaswani et al., 2017). It relies on a mechanism called self-attention, uses neither a CNN nor an LSTM, and significantly outperformed the existing methods.

Note that the part labeled Multi-Head Attention in the figure below is the core part of the Transformer, but it also uses skip connections like ResNet.

Transformer architecture. from Vaswani et al., 2017

The attention mechanism used in the Transformer uses three variables: Q (Query), K (Key), and V (Value). Simply put, it calculates the association (attention weight) between a Query token and each Key token (a token is something like a word), and then takes the sum of the Values associated with the Keys, weighted by those attention weights.
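To make the Q, K, V calculation concrete, here is a minimal NumPy sketch of scaled dot-product attention as defined in Vaswani et al., 2017. The token counts and dimensions are arbitrary values I picked for illustration, and the inputs are random numbers rather than real embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v), weights (n_q, n_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # attention weights; each row sums to 1
    return weights @ V, weights          # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 query tokens (illustrative sizes)
K = rng.normal(size=(6, 8))    # 6 key tokens
V = rng.normal(size=(6, 16))   # one value vector per key
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)
```

Each row of `weights` is the attention one query token pays to all key tokens, which is what the attention visualizations in the Transformer paper display.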

Defining the Q, K, V calculation as a single head, the multi-head attention mechanism is defined as follows. The (single-head) attention mechanism in the above figure uses Q and K as they are. In the multi-head attention mechanism, however, each head has its own projection matrices W_i^Q, W_i^K, and W_i^V, and the attention weights are calculated from the features projected with these matrices.

Multi-Head Attention

If the Q, K, V used in this attention mechanism are calculated from the same input, it is specifically called Self-Attention. On the other hand, the upper attention layer of the Transformer’s decoder block is not a “self-” attention mechanism, since it calculates attention with Q from the decoder and K and V from the encoder output.
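The multi-head variant and the self-attention case can be sketched in the same way. This is only an illustration under my own assumptions: the projection matrices are random instead of learned, the sizes are arbitrary, and `W_o` is the output projection from Vaswani et al., 2017 that is applied after concatenating the heads.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: (heads, d_model, d_head); W_o: (heads * d_head, d_model)."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        q, k, v = Q @ Wq_i, K @ Wk_i, V @ Wv_i        # per-head projections
        w = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # per-head attention weights
        heads.append(w @ v)
    return np.concatenate(heads, axis=-1) @ W_o       # concatenate heads, project back

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 16, 4                 # illustrative sizes
d_head = d_model // n_heads
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
W_o = rng.normal(size=(n_heads * d_head, d_model))

# Self-attention: Q, K and V all come from the same input X.
Y = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)
print(Y.shape)
```

For the encoder-decoder attention in the decoder, the same function would be called with the decoder states as the first argument and the encoder output as the second and third.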

An example of attention in practice is shown in the figure below. The figure shows a visualization of the attention weights calculated for each Key token using the word “making” as a query. The transformer uses a multi-headed self-attention mechanism to propagate information to later layers, and each head learns different dependencies. The Key tokens (words) in the figure below are colored to represent the attention weight of each head.

Attention weights visualization. The image is quoted from Vaswani et al., 2017, and I have annotated it.

Vision Transformer (ViT)

Vision Transformer (ViT) is a model that applies the Transformer to the image classification task and was proposed in October 2020 (Dosovitskiy et al. 2020). The model architecture is almost the same as the original Transformer, but with a twist to allow images to be treated as input, just like natural language processing.

Vision Transformer architecture. The image is quoted from Dosovitskiy et al. 2020 and I have annotated it.

First, ViT divides the image into N “patches” of a fixed size, such as 16x16 pixels. Since each patch is 3D data (height x width x number of channels), it cannot be handled directly by a transformer, which expects a 2D input like language (a sequence of token vectors). ViT therefore flattens each patch and applies a linear projection, turning the image into 2D data (number of patches x embedding dimension). Each patch can then be treated as a token and input to the Transformer.
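The patching step can be sketched in a few lines of NumPy. The 224x224 input and the embedding dimension of 768 are values I chose for illustration, and `W_embed` is a random stand-in for the learned projection matrix.

```python
import numpy as np

def patchify(image, patch):
    """(H, W, C) image -> (N, patch*patch*C) matrix of flattened patches."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)              # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * C)     # one row per patch

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))          # illustrative input size
patches = patchify(image, patch=16)             # 14 x 14 = 196 patches
W_embed = rng.normal(size=(16 * 16 * 3, 768)) * 0.02   # hypothetical projection matrix
tokens = patches @ W_embed                      # one token vector per patch
print(patches.shape, tokens.shape)
```

The real ViT additionally prepends a class token and adds position embeddings to `tokens` before the Transformer encoder; both are omitted here.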

In addition, ViT uses the strategy of pre-training first and then fine-tuning. ViT is pre-trained with JFT-300M, a dataset containing 300 million images, and then fine-tuned on downstream tasks such as ImageNet. ViT is the first pure transformer model to achieve SotA performance on ImageNet, and this has led to a massive surge in research on transformers as applied to computer vision tasks.

However, training ViT requires a large amount of data. Transformers are less accurate than CNNs when data is scarce, but become more accurate as the amount of data grows, and they outperform CNNs when pre-trained on JFT-300M. For more details, please refer to the original paper.

Vision Transformer result.(Dosovitskiy et al. 2020)

3. Structure of the Proposed Model

Now, let’s get into the structure of the ConvMixer proposed in this paper. First, the architecture of the model is shown in Figure 2 below.

ConvMixer architecture (from https://openreview.net/forum?id=TVHS5Y4dNvM)

The general architecture of the model is the same as that of ViT: Patch Embedding, multiple passes through the ConvMixer Layer block, and classification using Global Average Pooling and a Fully-Connected layer. In terms of structure, there is no need to add Position Embedding vectors as in ViT, and the blocks themselves are relatively simple. The ConvMixer model (not the ConvMixer Layer, but the entire model) can be implemented in PyTorch with only six lines, as shown in Figure 3 below. Even though this implementation is written in a somewhat unusual, compact style, you can see that it is a model with a relatively simple structure.

ConvMixer code (from https://openreview.net/forum?id=TVHS5Y4dNvM)
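For readers who prefer a more spelled-out version, here is my own PyTorch sketch of the same structure. It is not the authors' six-line code: the `Residual` helper, the function name `conv_mixer`, and the small sizes in the usage lines are my choices for illustration. It also assumes a PyTorch version that supports `padding="same"` in `nn.Conv2d` (1.9 or later).

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Adds the input back to the output of the wrapped module."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def conv_mixer(dim, depth, kernel_size=9, patch_size=7, n_classes=1000):
    return nn.Sequential(
        # Patch embedding: a convolution whose kernel and stride equal the patch size.
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        *[nn.Sequential(
            # Depth-wise conv (groups=dim) with a residual connection: spatial mixing.
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(dim))),
            # Point-wise (1x1) conv: channel mixing.
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        nn.AdaptiveAvgPool2d((1, 1)),   # global average pooling
        nn.Flatten(),
        nn.Linear(dim, n_classes),      # classifier
    )

# Tiny, hypothetical configuration so the example runs quickly on a CPU.
model = conv_mixer(dim=64, depth=2, kernel_size=5, patch_size=4, n_classes=10)
logits = model(torch.randn(2, 3, 32, 32))   # a batch of two CIFAR-sized images
print(logits.shape)
```

The three parts of the `nn.Sequential` correspond to the patching, ConvMixer layer, and classification steps of the model.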

Let’s take a closer look at the structure of the model, which can be broken down into the following steps, each of which we will look at in a little more detail.


1. Patching

2. ConvMixer layer xN

— 2.1 Depth-wise Conv

— 2.2 Point-wise Conv

3. Classification Using Linear Layer

1. Patching

In ViT, patching was achieved by dividing the image into fixed-size pieces, flattening them, and then linearly projecting them to convert them into 2D data. ConvMixer does not need to convert the image (HxWxC) to 2D data because it uses convolutions. ConvMixer divides the image into patches in the same way, but it does so with a convolution whose kernel size and stride are both equal to the patch size p, which produces a feature vector for each patch. It is easy to understand if you think of it as [Conv -> BN] replacing the [Flatten -> Linear] process of ViT. The formula is as follows.

z_0 = BN(σ(Conv_{c_in→h}(X, stride=p, kernel_size=p)))
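To check that a convolution with kernel size = stride = p really computes the same linear map as [Flatten -> Linear], here is a small NumPy sketch. The sizes are arbitrary, the weights are random, and the bias is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
p, C, h = 4, 3, 8                          # patch size, input channels, embedding dim
size = 16                                  # illustrative image side length
image = rng.normal(size=(size, size, C))
kernel = rng.normal(size=(p, p, C, h))     # conv weights: (p, p, C_in, C_out)
n = size // p                              # patches per side

# Strided convolution: kernel size = stride = p, so the windows never overlap.
conv_out = np.zeros((n, n, h))
for i in range(n):
    for j in range(n):
        window = image[i * p:(i + 1) * p, j * p:(j + 1) * p, :]
        conv_out[i, j] = np.einsum("abc,abch->h", window, kernel)

# Flatten + linear: the same weights reshaped into a (p*p*C, h) matrix.
patches = (image.reshape(n, p, n, p, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(-1, p * p * C))
linear_out = patches @ kernel.reshape(p * p * C, h)

same = np.allclose(conv_out.reshape(-1, h), linear_out)
print(same)
```

So the projection itself is the same operation in both models. What differs in ConvMixer is that the result keeps its spatial layout (H/p x W/p x h) and is followed by the activation and BN.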


Incidentally, the activation function σ used here is called GELU (Gaussian Error Linear Unit, Hendrycks and Gimpel, 2016). This activation function uses the cumulative distribution function Φ of the standard Gaussian distribution, and is given by the following equation. It approaches ReLU as x goes to plus or minus infinity but behaves smoothly near zero.

GELU(x) = x * Φ(x)

GELU activation function (Hendrycks and Gimpel, 2016)
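The equation is easy to evaluate with only the Python standard library, because Φ can be written with the error function. This is a small sketch of the exact form (deep learning libraries often also offer a tanh approximation, which is not shown here).

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

def relu(x):
    return max(0.0, x)

# Far from zero GELU is close to ReLU; near zero it is smooth and slightly negative.
for x in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  relu={relu(x):+.4f}")
```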

2. ConvMixer layer xN

Next, we will look at the ConvMixer layer, which is the core technology of this paper. The structure of the layer is a combination of depth-wise conv and point-wise conv as shown below. Let’s look at each of them in detail.

2–1. Depth-wise Convolution

First, let’s look at Depth-wise Conv, which is located in the first half of the ConvMixer layer. As its name suggests, it performs separate convolution for each depth or each channel. The diagram is as follows. (The figure is taken from this blog post.)

normal convolution (The figure is taken from this blog post)
depth-wise conv (The figure is taken from this blog post)

The upper figure shows a normal convolution, and the lower one shows a depth-wise convolution, both applied to a 12x12x3 (HWC) image. For a normal convolution, the dimension of each kernel is [kernel size, kernel size, input channels], and the output of each kernel is aggregated over the input channels into a single feature map (one channel). Using as many such kernels as there are output channels (256 in this case) yields an output whose depth equals the number of output channels.

On the other hand, in depth-wise conv, each kernel is [kernel size, kernel size, 1] (multiple channels can also be treated as one group, but for simplicity I use 1). Unlike normal convolution, which spans all input channels, the convolution is done separately for each input channel. In other words, spatial (global) processing is done using only the information within each feature map.
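A naive NumPy sketch of depth-wise convolution makes the difference visible. I use the 12x12x3 input from the figure and assume a 5x5 kernel with no padding; the kernel size is my assumption, not something stated above.

```python
import numpy as np

def depthwise_conv2d(x, kernels):
    """x: (H, W, C); kernels: (k, k, C). Each channel is filtered by its own k x k kernel."""
    H, W, C = x.shape
    k = kernels.shape[0]
    out = np.zeros((H - k + 1, W - k + 1, C))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            window = x[i:i + k, j:j + k, :]                     # (k, k, C)
            out[i, j, :] = (window * kernels).sum(axis=(0, 1))  # no sum over channels
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(12, 12, 3))
kernels = rng.normal(size=(5, 5, 3))      # assumed 5x5 kernel, one per channel
y = depthwise_conv2d(x, kernels)          # channel count is unchanged
print(y.shape)

# Weight counts (biases ignored) for 3 input channels and a 5x5 kernel:
normal_params = 5 * 5 * 3 * 256           # normal conv producing 256 output channels
depthwise_params = 5 * 5 * 3              # depth-wise conv, one kernel per channel
print(normal_params, depthwise_params)
```

Because nothing is summed across channels, the depth-wise step only moves information around spatially, and the mixing between channels is left to the point-wise conv.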

2–2. Point-wise Conv

Next, let’s take a look at point-wise convolution, which is performed in the second half of the ConvMixer layer (the figure below is taken from this blog post).

point-wise conv (The figure is taken from this blog post)

This is the opposite of the depth-wise conv described earlier: each spatial position is processed across the entire depth (all channels) with a 1x1 kernel. In other words, local processing is done at each position while considering the information in all feature maps.
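A point-wise (1x1) convolution is simply the same linear map applied independently at every spatial position, which the following NumPy sketch shows. The shapes are illustrative and the weights are random.

```python
import numpy as np

def pointwise_conv2d(x, weight):
    """x: (H, W, C_in); weight: (C_in, C_out). A 1x1 conv is a per-position linear map."""
    return np.einsum("hwc,cd->hwd", x, weight)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 3))
weight = rng.normal(size=(3, 256))
y = pointwise_conv2d(x, weight)
print(y.shape)

# The output at one position depends only on the channels at that same position.
print(np.allclose(y[2, 3], x[2, 3] @ weight))
```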

3. Classification Using Linear Layer

This one is the same as the process done in regular ResNet and EfficientNet. The information in each feature map is converted to a single value by taking the average value (Global Average Pooling). Then the classifications are performed by weighting the values (Dense Layer).
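The classification head can be sketched as follows. The feature map size, the number of channels, and the ten classes are arbitrary values for illustration, and `W_cls` and `b_cls` are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(32, 32, 256))    # (H, W, channels) from the last block
W_cls = rng.normal(size=(256, 10)) * 0.01    # hypothetical classifier weights
b_cls = np.zeros(10)

pooled = features.mean(axis=(0, 1))          # global average pooling: one value per feature map
logits = pooled @ W_cls + b_cls              # dense layer: weighted sum per class
predicted_class = int(np.argmax(logits))
print(pooled.shape, logits.shape, predicted_class)
```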

In the transformer-based methods, the output is already two-dimensional (tokens x features), so it can be passed to the fully connected layers without this averaging process.

4. Results

First, let’s look at the results on ImageNet (see the figure below). These are the results of training on ImageNet without pre-training, and we can see that ConvMixer has higher accuracy for the same number of parameters.

ImageNet results (from https://openreview.net/forum?id=TVHS5Y4dNvM)

Another feature of ConvMixer is that it is accurate not only on large datasets like ImageNet but also on small datasets like CIFAR. Many papers proposing models in the ViT family, such as ViT and MLP-Mixer, report CIFAR-10 results after fine-tuning, but not the accuracy of training from scratch. I think this is because models in the ViT family require a lot of data, so their accuracy when trained from scratch is not very good. This paper presents the results of training on CIFAR from scratch and claims that ConvMixer is data-efficient, with an accuracy of about 96% (Table 3).

CIFAR10 results (from https://openreview.net/forum?id=TVHS5Y4dNvM)

5. Are patches really all you need?

Note! This section contains a lot of personal opinions.

Since the introduction of ViT at the end of 2020, various improvements have been proposed. Some of them use transformers, and some, such as this paper and MLP-Mixer (Tolstikhin et al., 2021), do not rely on transformers. So, were “patches” really what mattered, as this paper claims? In this section, let’s look at the architectures from a different perspective.

First, let’s compare three architectures: ViT, MLP-Mixer, and ConvMixer.

Vision Transformer architecture. The image is quoted from Dosovitskiy et al. 2020 and I have annotated it.
MLP-Mixer (Tolstikhin et al., 2021)
ConvMixer (from https://openreview.net/forum?id=TVHS5Y4dNvM)

The Transformer Encoder block of ViT, the Mixer layer of MLP-Mixer (MLP1 and MLP2), and the ConvMixer layer of ConvMixer all combine, within a single block, local information processing (information propagation within a patch) and global information processing (information propagation across patches).

For global information processing, ViT uses Self-Attention, MLP-Mixer uses MLP1, and ConvMixer uses depth-wise conv. In contrast, conventional CNN models like ResNet and EfficientNet slide a small fixed-size kernel (typically 3x3), so each of their layers processes and propagates information over only a narrow range.

The depth-wise conv used in ConvMixer may seem weak for global information processing. However, ConvMixer sets the kernel size of the depth-wise conv to 7 or 9 (3 is the usual choice), which I think is designed to process information over a wider, more global range. In fact, the accuracy decreases as the kernel size is reduced. As for local information processing, ViT uses the (position-wise) FFN, MLP-Mixer uses MLP2, and ConvMixer uses point-wise conv.

In this way, the networks of the ViT family follow the structure of the transformer block, which uses both global and local information processing. I believe that this mechanism for processing global information is what differs from conventional CNN models.
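As a rough back-of-the-envelope check of the kernel-size argument, the sketch below computes how many stacked stride-1 convolutions are needed before the receptive field is as wide as the whole grid of patches. The 224x224 input and patch size 7 are values I assume for illustration, and the formula ignores dilation and pooling.

```python
def receptive_field(kernel_size, depth):
    """Side length (in patches) seen by one output position after `depth` stride-1 convs."""
    return 1 + depth * (kernel_size - 1)

image_size, patch_size = 224, 7        # assumed input resolution and patch size
grid = image_size // patch_size        # number of patch features per side

for k in (3, 7, 9):
    layers_needed = next(d for d in range(1, 100) if receptive_field(k, d) >= grid)
    print(f"kernel {k}: one layer covers {k} of {grid} patches per side, "
          f"{layers_needed} layers until the receptive field is as wide as the grid")
```

Under these assumptions, a larger kernel applied to patch features reaches a grid-wide receptive field in far fewer layers than a 3x3 kernel, which fits the view that ConvMixer's depth-wise conv works as a fairly global operation.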

What are the other options for global processing besides transformers, MLP, and CNN? Although it is not a CV paper, “FNet: Mixing Tokens with Fourier Transforms” (Lee-Thorp et al., 2021) tests exactly this.

FNet (Lee-Thorp et al., 2021).

FNet employs the Fourier transform as the global process. Since it sums the tokens over a different basis, it is challenging to interpret physically, but it has achieved fairly good results. As the title says, this paper focuses on mixing tokens. In addition to Fourier transforms, the authors tried models that mix tokens (like words) with random matrices and linear transforms to account for global information.
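The Fourier mixing step itself is very small. As I understand the FNet paper, it applies a discrete Fourier transform along the token dimension and along the feature dimension and keeps only the real part. Here is a NumPy sketch with arbitrary sizes and random inputs.

```python
import numpy as np

def fourier_mixing(x):
    """x: (n_tokens, d_model). 2D DFT over tokens and features, keeping the real part."""
    return np.fft.fft2(x).real

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))           # 8 tokens with 16 features each (illustrative)
y = fourier_mixing(x)

# Global mixing: perturbing a single token changes every output row.
x2 = x.copy()
x2[0] += 1.0
changed_rows = np.any(~np.isclose(fourier_mixing(x2), y), axis=1)
print(y.shape, changed_rows.all())
```

There are no learned parameters in this step, which makes it a useful test of how much global token mixing alone contributes.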

Although patching may also help, this FNet paper suggests that considering global information in addition to local information has some effect.

6. Conclusion

In this post, I have explained ConvMixer, which can go beyond ViT with a straightforward model using CNNs. In the title, the authors claimed that patching is important. However, maybe it is also important to do global processing that cannot be considered in traditional CNNs.

— — — — — — — — — — — — — — — — — — –

About Me

Manufacturing Engineer/Machine Learning Engineer/Data Scientist / Master of Science in Physics / http://github.com/AkiraTOSEI/

Twitter : https://twitter.com/AkiraTOSEI