

ConvMixer: Patches Are All You Need? Overview and thoughts 🤷

CNNs don’t always have to progressively decrease resolution. A revolutionary idea that might shape the next-gen architectures for Computer Vision.



  • Transformers(NLP): Isotropic architecture+Self-attention
  • ViT: Transformers(NLP)+Patch representation
  • CNNs: Pyramid architecture(decreasing resolution)+Convolution
  • ConvMixer: Isotropic architecture+Patch representation+Convolution

Transformers are characterized by the extensive use of attention and by their isotropic architecture, which repeats the same block multiple times. They have demonstrated amazing performance on many problems, especially in natural language processing. However, the quadratic computational complexity of self-attention was a major bottleneck for vision, where image resolutions are very large. Therefore, CNNs remained the mainstream architecture in computer vision for many years.

This has changed as recent works on vision transformers proposed splitting images into patches and embedding each patch into a token that can be fed into a transformer. ViT demonstrated promising performance and was further improved in later works such as DeiT and Swin Transformer to outperform CNNs on many vision tasks. The authors of ConvMixer propose an architecture based on the idea that, while some gains of vision transformers are due to the powerful Transformer architecture, the patch representation itself could be an important factor.

The authors completely destroy the conventions of CNN architectures, namely the pyramid design of increasing feature sizes and decreasing resolution that hasn’t changed since AlexNet. ConvMixers are certainly one of the most revolutionary ideas in computer vision and isotropic vision architectures. They are not proposing a state-of-the-art network by any means but bring up an important discussion: patches work extremely well in convolutional architectures, and are worth focusing our attention on.

The authors:

  • Describe an extremely simple yet effective class of models that can literally fit in 280 characters and achieve 80% classification accuracy without extensive experiments.
  • Discuss how the new architecture can be interpreted in relation to ViTs and CNNs.

Original paper: Patches Are All You Need? (under review)

A rather unconventional title for a research paper 🤷

ConvMixer Architecture

The proposed architecture is very simple. It has a patch embedding stage followed by isotropically repeated convolutional blocks.

Patch embedding summarizes a p×p patch into an embedded vector of dimension h. The authors implement this with a single convolution with kernel size p, stride p, and h output channels, followed by a non-linearity. This surprisingly simple trick converts the n×n image into features of shape h × n/p × n/p.

The successive convolution and pooling of CNNs, and the transformer blocks of ViTs, are replaced by consistently repeated ConvMixer blocks. A single ConvMixer block is a slightly modified depthwise separable convolution, which is widely used in modern CNN architectures. In a typical CNN, the feature size is reduced step by step by pooling or strided convolutions while the number of channels is increased. In contrast, all intermediate features of ConvMixer have the same dimensions.

Similar to typical CNNs, the features are flattened via global average pooling, and inference is made using a softmax classifier in the final stage.
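To make the description above concrete, here is a minimal PyTorch sketch of the architecture. It is my own readable rewrite under a few assumptions, not the authors' code: GELU as the non-linearity, BatchNorm after every activation, a skip connection around the depthwise convolution only, and hypothetical names (`Residual`, `conv_mixer`) chosen for illustration.

```python
import torch
from torch import nn


class Residual(nn.Module):
    """Adds the input back to the output of the wrapped module (skip connection)."""

    def __init__(self, fn: nn.Module):
        super().__init__()
        self.fn = fn

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fn(x) + x


def conv_mixer(width: int, depth: int, kernel_size: int = 9,
               patch_size: int = 7, num_classes: int = 1000) -> nn.Sequential:
    def act_norm() -> list:
        # Assumption: GELU + BatchNorm follow every convolution.
        return [nn.GELU(), nn.BatchNorm2d(width)]

    blocks = []
    for _ in range(depth):
        blocks.append(nn.Sequential(
            # Depthwise conv (groups=width) mixes spatial locations per channel.
            # padding=kernel_size // 2 keeps the resolution for odd kernel sizes.
            Residual(nn.Sequential(
                nn.Conv2d(width, width, kernel_size, groups=width,
                          padding=kernel_size // 2),
                *act_norm())),
            # Pointwise (1x1) conv mixes channels at each location.
            nn.Conv2d(width, width, kernel_size=1),
            *act_norm()))

    return nn.Sequential(
        # Patch embedding: kernel size p, stride p, `width` output channels.
        nn.Conv2d(3, width, kernel_size=patch_size, stride=patch_size),
        *act_norm(),
        *blocks,
        # Global average pooling, then a linear layer that produces class logits
        # (the softmax is usually applied inside the loss).
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(width, num_classes),
    )


# Tiny configuration, just to check shapes.
model = conv_mixer(width=64, depth=2, num_classes=10).eval()
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    patch_features = model[:3](x)   # patch embedding only
    block_features = model[:5](x)   # patch embedding + both ConvMixer blocks
    logits = model(x)
print(patch_features.shape, block_features.shape, logits.shape)
```

With a 224×224 input and patch size 7, the features stay at 64 × 32 × 32 from the patch embedding through every block, which is the isotropic property discussed in this post.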

Interpreting ConvMixer as variants of other architectures

After reading the paper, many similar architectures came up in my mind. The authors also suggest similar concepts. I think the characteristics of similar architectures can be helpful for understanding the ConvMixer architecture.

MLP-Mixer vs. ConvMixer

The proposed ConvMixer architecture is, for the most part, an MLP-Mixer with convolutions. It works directly on embedded patches, and the resolution and size are consistent throughout the layers. Moreover, depthwise separable convolution separates channel-wise mixing from spatial mixing of information, similar to MLP-Mixer (even the skip-connections are the same).

Meanwhile, the network really is a fully convolutional neural network. All the operations of ConvMixer can be implemented using only activations, BN, and convolutions. Thus it is really just a CNN with some specific architectural hyper-parameters. Specifically,

  • Large downsampling in the initial layer, but no further downsampling in the main body of the network.
  • Isotropic architecture with the same resolution and number of channels in every layer.
  • Unconventionally large kernel sizes (e.g. kernel size 9), which we will discuss soon.

I think this is especially revolutionary because these choices result in a CNN that completely destroys the conventions of CNN architectures, namely the pyramid design of increasing feature sizes and decreasing resolution that really hasn’t changed since AlexNet.

However, we must note that the repeated isotropic blocks run on features of resolution 224/7=32, which is not small: most CNNs spend many of their layers on features of smaller resolution.
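The difference in resolution can be sketched with a few lines of arithmetic. The pyramid stride schedule below (a 4× stem followed by three 2× stages, roughly ResNet-style) is an illustrative assumption of mine, not something taken from the paper, and the function names are hypothetical.

```python
def pyramid_resolutions(input_size, strides):
    """Feature-map side length after each downsampling stage of a pyramid CNN."""
    sizes = []
    size = input_size
    for stride in strides:
        size //= stride
        sizes.append(size)
    return sizes


def isotropic_resolutions(input_size, patch_size, num_blocks):
    """ConvMixer downsamples once (patch embedding) and then keeps the resolution."""
    return [input_size // patch_size] * num_blocks


# Assumed ResNet-like schedule: 4x stem, then three 2x stages.
pyramid = pyramid_resolutions(224, [4, 2, 2, 2])
isotropic = isotropic_resolutions(224, 7, 4)
print(pyramid)    # [56, 28, 14, 7]
print(isotropic)  # [32, 32, 32, 32]
```

Under this assumption, the later pyramid stages work at 14×14 and 7×7, while every ConvMixer block keeps working at 32×32.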

The isotropic architecture is really similar to transformers(both NLP and vision), while the main computations are performed with convolutions instead of self-attention. Therefore, I personally understood the architecture as:

Designing CNNs like Transformers, using patch representations.

OK, revolutionary is cool. But, what are the advantages of designing CNNs like Transformers? And do they really perform well?

TL;DR: Yes, they show promising performance.

Motivation & Advantages: Large receptive field

We will discuss some theoretical advantages and motivations of patch representations and the ConvMixer architecture.

The authors suggest that the motivation of this work was to replace the mixing operations of MLP-Mixer with convolutions. Depthwise convolution mixes information across spatial locations, and pointwise convolution mixes information across channels.

MLP and transformers can model far-apart information during spatial location mixing, but convolutions can only mix information within the kernel size. However, the authors argue that this inductive bias of convolution is well-suited to vision tasks and leads to high data efficiency.

The ability to relate spatially far-apart information, a.k.a. the receptive field, can be controlled by the kernel size of the depthwise convolution. MLP-Mixer roughly corresponds to the case where the kernel size equals the input resolution. The authors strike a balance by using unconventionally large kernel sizes (e.g. 9) for the depthwise convolution, which grows the receptive field more quickly, and they find this beneficial.

Moreover, patch embeddings yield a larger receptive field because downsampling happens all at once and all layers operate in a small resolution.
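A rough calculation shows how quickly the receptive field grows. This sketch uses the standard theoretical receptive-field arithmetic for stacked convolutions (it says nothing about the effective receptive field after training), and the function names are hypothetical.

```python
import math


def receptive_field(patch_size, kernel_size, num_blocks):
    """Theoretical receptive field (in input pixels) after `num_blocks` blocks.

    The patch embedding (kernel p, stride p) sees p pixels, and neighbouring
    features are p pixels apart. Each stride-1 k x k depthwise conv then adds
    (k - 1) feature steps of p pixels. The 1x1 pointwise conv adds nothing.
    """
    return patch_size + num_blocks * (kernel_size - 1) * patch_size


def blocks_to_cover(image_size, patch_size, kernel_size):
    """Smallest number of blocks whose receptive field spans the whole image."""
    missing = image_size - patch_size
    return max(0, math.ceil(missing / ((kernel_size - 1) * patch_size)))


print(receptive_field(7, 9, 4))    # 231
print(receptive_field(7, 3, 4))    # 63
print(blocks_to_cover(224, 7, 9))  # 4
print(blocks_to_cover(224, 7, 3))  # 16
```

With patch size 7 on a 224×224 image, kernel size 9 covers the whole image after 4 blocks, while a conventional kernel size of 3 would need 16.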


Design parameters of ConvMixer include:

  • Patch size: The dimension of patches
  • Kernel size: Kernel size of the depthwise convolution
  • Width: Dimension h of the patch embeddings, which is kept consistent throughout the network and is equal to the dimension of the output features.
  • Depth: Number of ConvMixer blocks(layers) to use.

Network configurations are named ConvMixer-h/d, where h refers to the width and d refers to the depth of the network.

All these experiments were completed in a controlled environment, where potentially confounding configurations such as the data augmentation strategy or learning rate schedule were fixed during each experiment.

ConvMixer-1536/20 with 51.6M parameters, the largest and best-performing ConvMixer setup, demonstrates 81.6% ImageNet accuracy and outperforms similar-sized ResNets and ViTs. The experiment setup offers further insight into how to judge the performance of ConvMixers.
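As a sanity check on the 51.6M figure, the parameter count can be reproduced by hand. The sketch below assumes patch size 7, kernel size 9, 1000 output classes, a bias term on every convolution, and a BatchNorm (scale and shift) after every convolution; these are my assumptions about the layout, and the function names are hypothetical.

```python
def conv_params(c_in, c_out, kernel_size, groups=1):
    """Weights plus biases of a 2D convolution."""
    return (c_in // groups) * c_out * kernel_size * kernel_size + c_out


def bn_params(channels):
    """Learnable scale and shift of a BatchNorm layer."""
    return 2 * channels


def conv_mixer_params(width, depth, kernel_size, patch_size, num_classes=1000):
    # Patch embedding: conv with kernel p, stride p, on a 3-channel image.
    stem = conv_params(3, width, patch_size) + bn_params(width)
    # Depthwise conv: one k x k filter per channel (groups=width).
    depthwise = conv_params(width, width, kernel_size, groups=width) + bn_params(width)
    # Pointwise conv: 1x1 conv mixing all channels.
    pointwise = conv_params(width, width, 1) + bn_params(width)
    # Linear classifier on the pooled features.
    head = width * num_classes + num_classes
    return stem + depth * (depthwise + pointwise) + head


total = conv_mixer_params(width=1536, depth=20, kernel_size=9, patch_size=7)
print(total)                   # 51625960
print(round(total / 1e6, 1))   # 51.6
```

Almost all of the parameters sit in the pointwise convolutions (about 2.36M per block), so the count is driven by the width and depth rather than by the kernel size.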

The training configuration is similar to that of DeiT. Because DeiT is similar to the original ViT in terms of network architecture, the comparison should be a meaningful way to evaluate the effect of incorporating convolutions into the ViT design. I would say that the comparisons are contentious, given that ConvMixer was trained with limited compute and sub-optimal hyper-parameters.

When comparing similar-sized DeiTs and ConvMixers, ConvMixers performed slightly better in terms of accuracy for a given parameter count, although no ConvMixer model was able to beat the largest DeiT-B model. However, throughput was significantly worse, especially for networks with large kernel sizes. This is largely because ConvMixers use a much smaller patch size of 7: the authors observe similar throughput when running DeiT with patch size 7. Still, slow inference is a significant downside, because ConvMixer accuracy deteriorates significantly with larger patch sizes.

Three ResNets were trained with the same training configuration and were outperformed by ConvMixers. The Isotropic MobileNetv3 benchmark comes from previous work on repeating isotropic MobileNet blocks. Its building block is much more complex, and the authors suggest that its motivation is very different from that of this work. I can’t comment much on this since I am not familiar with that work, but it definitely seems worth checking out.

However, due to limited compute, the models were trained for substantially fewer epochs than competitors. According to the authors, there was no hyperparameter tuning; the hyperparameters were selected with common sense from one model. Thus the models could be over- or under-regularized, and the reported accuracies likely underestimate the true capabilities of the model.

On the choice of hyper-parameters, we observe two significant trends:

  • A small patch size is crucial for performance but requires significantly more compute.
  • Increasing the kernel size has significant benefits while requiring relatively little extra compute.
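Both trends are visible in a back-of-the-envelope count of multiply-accumulate operations (MACs) per ConvMixer block. This is a simplified sketch that ignores biases, activations, and normalization, and real throughput also depends on hardware and kernel implementations; the function name is hypothetical.

```python
def block_macs(width, kernel_size, image_size, patch_size):
    """Approximate MACs of one ConvMixer block (depthwise + pointwise conv)."""
    positions = (image_size // patch_size) ** 2        # spatial locations
    depthwise = width * kernel_size * kernel_size      # per location
    pointwise = width * width                          # per location
    return (depthwise + pointwise) * positions


base = block_macs(width=1536, kernel_size=9, image_size=224, patch_size=7)

# Halving the patch size side (14 -> 7) quadruples the number of locations.
patch_ratio = base / block_macs(1536, 9, 224, 14)

# Growing the kernel from 3 to 9 barely changes the total, because the
# pointwise convolution dominates when the width is large.
kernel_ratio = base / block_macs(1536, 3, 224, 7)

print(patch_ratio)             # 4.0
print(round(kernel_ratio, 3))  # 1.047
```

By this estimate, patch size 7 costs 4× the compute of patch size 14, while kernel size 9 costs under 5% more than kernel size 3 at width 1536.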


A special advantage of ConvMixer is the simplicity of its isotropic, basic architecture, which shows in the authors' PyTorch implementation. No CNN or ViT could possibly be implemented in just 280 characters of PyTorch. Apart from being able to “fit in a tweet”, being simple to implement can be an advantage because it potentially enables the application of complex architectural improvements, possibly from both the CNN and transformer domains.

‘def ConvMixr’..? 281 characters should be appropriate…🤔


  • Transformers(NLP): Isotropic architecture+Self-attention
  • ViT: Transformers(NLP)+Patch representation
  • CNNs: Pyramid architecture+Convolution
  • ConvMixer: Isotropic architecture+Patch representation+Convolution

ConvMixer is a very simple network architecture that combines the idea of patches in ViTs and convolution. I was surprised at how I had unintentionally accepted the idea that CNNs must progressively decrease in resolution, without knowing why.

The results are promising, but there is certainly room for improvement before ConvMixers become the baseline for next-gen vision models. While large receptive fields are certainly important, I feel that there is not enough theoretical support for the effectiveness and efficiency of ConvMixers. There seems to be more left to answer about the question: “why are patches all you need?”.

The authors also admit that they are suggesting a possibility and a proof of concept of a revolutionary idea. Still, huge congratulations to the authors for their amazing work! I’ll be looking forward to follow-up papers that further improve this method and possibly achieve state-of-the-art performance in many benchmarks. I personally think it is very probable.

Also, I think research that studies the characteristics of ConvMixer-based models would be interesting, because the architecture is a mix of two very different architectures. For example, how can we answer questions like: should we scale ConvMixers like EfficientNets, or like ViTs? Do they see more like ViTs or CNNs? How can we effectively apply them to pixel-level tasks like semantic segmentation?




Sieun Park
