Paper Explained: MLP-Mixer, An all-MLP Architecture for Vision

Nakshatra Singh
May 15 · 5 min read
MLP-Mixer consists of per-patch linear embeddings, Mixer layers, and a classifier head. Each Mixer layer contains one token-mixing MLP and one channel-mixing MLP, each consisting of two fully-connected layers and a GELU nonlinearity. Other components include skip-connections, dropout, layer norm on the channels, and a linear classifier head. The image is taken from page 2 of the paper.

This paper presents a neural network that is just a feed-forward multi-layer perceptron (MLP): no convolutions, no attention mechanism, no lambda layers, nothing of that sort. It is just matrix multiplications, non-linearities, normalisation, and skip connections (adapted from ResNets). The paper is similar in its abstractions to the recent SOTA paper known as ‘Vision Transformers’. I have written a blog explaining Vision Transformers meticulously; you can check it out here. 😌

MLP Mixer Architecture

The authors propose a classification architecture. As in Vision Transformers, we divide the input image into small patches (typically of size 16✕16); the image dimensions must be divisible by the patch size. We then operate on these patches as we propagate through the network. Unlike a convolutional neural network, where we shrink the spatial resolution while increasing the number of channels via feature maps, here one layer follows another, all of the same size, stacked until the end. So it is much like a Transformer; the difference, of course, lies in how the individual layers look. As in a Transformer, every patch is first fed through a fully-connected layer to bring it into a latent representation, also known as a latent embedding. Every patch in the image corresponds to one vector, and each patch is projected into the latent space using the same function.
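The per-patch embedding step can be sketched as follows. This is a toy NumPy version, assuming a 224✕224 RGB image, 16✕16 patches, and a 512-dimensional latent space; the projection weights are random here, whereas in the real model they are learned.

```python
import numpy as np

def patch_embed(image, patch_size=16, hidden_dim=512):
    """Split an image into non-overlapping patches and project each
    patch with the same shared linear map (toy sketch, random weights)."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    P = patch_size
    n_patches = (H // P) * (W // P)
    # Rearrange (H, W, C) -> (n_patches, P*P*C): unroll each patch to a vector.
    patches = image.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(n_patches, P * P * C)
    # One shared projection applied to every patch.
    rng = np.random.default_rng(0)
    W_proj = rng.standard_normal((P * P * C, hidden_dim)) * 0.02
    return patches @ W_proj  # the "table": one row per patch, hidden_dim channels

tokens = patch_embed(np.ones((224, 224, 3)), patch_size=16, hidden_dim=512)
print(tokens.shape)  # (196, 512): a 14x14 patch grid, each patch a 512-channel vector
```

The output is exactly the table described below: one row per patch, one column per channel, and that shape is preserved through the whole network.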

Let’s try to understand the mixer layer, which is the core of this architecture. Every patch fed to the network is unrolled into a vector; these vectors are then stacked on top of each other and can be interpreted as a table. Each row in this table represents a vector with 512 channels. Each Mixer layer contains two types of MLPs: token-mixing MLPs and channel-mixing MLPs.

The Mixer Layer- Explained

In token-mixing, we do the following: we transpose the table so that every row holds the same channel from all the patches. The first row then holds channel 1 for all the patches in the image, and each row is fed through the same fully-connected layer (a simple MLP). All the weights in these fully-connected layers are shared, which represents weight sharing across the same channel of different patches. The token-mixing MLPs allow communication between different spatial locations (tokens); they operate on each channel independently and take individual columns of the table as inputs. This lets us compute feature by feature (the 512 channels are nothing but feature maps). Token-mixing can be viewed as a single-channel depth-wise convolution; it is also known as the cross-location operation.
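The transpose-then-shared-MLP idea can be sketched in a few lines of NumPy. The table shape (196 tokens ✕ 512 channels) follows from the setup above; the hidden width `Ds = 256` is a hypothetical choice, and the weights are random stand-ins for learned parameters.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def token_mix(X, W1, b1, W2, b2):
    """Token-mixing: transpose the table so each row holds one channel
    across all patches, push every row through the same two-layer MLP,
    then transpose back."""
    return (gelu(X.T @ W1 + b1) @ W2 + b2).T

rng = np.random.default_rng(0)
S, C, Ds = 196, 512, 256   # tokens, channels, hidden width (Ds is an assumption)
X = rng.standard_normal((S, C))
out = token_mix(X,
                rng.standard_normal((S, Ds)) * 0.02, np.zeros(Ds),
                rng.standard_normal((Ds, S)) * 0.02, np.zeros(S))
print(out.shape)  # (196, 512)
```

Note that `W1` and `W2` act along the token axis, so their sizes depend on the number of patches, not on the number of channels; the same two matrices serve every channel, which is exactly the weight sharing described above.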

In channel-mixing, since the weights are shared, we can do the reverse trick: flip the table back so that rows are patches again, then apply the same shared computation to every patch. The channel-mixing MLPs allow communication between different channels; they operate on each token independently and take individual rows of the table as inputs. Channel-mixing can be viewed as a 1✕1 convolution; it is also known as the per-location operation. These two types of layers are interleaved to enable interaction along both input dimensions.
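Channel-mixing is the same trick without the transpose: each row of the table (one patch) goes through one shared MLP. Again a NumPy sketch with random weights; the hidden width `Dc = 2048` is a hypothetical choice.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def channel_mix(X, W1, b1, W2, b2):
    """Channel-mixing: every row (one patch) goes through the same
    two-layer MLP, mixing its channels; no transpose needed."""
    return gelu(X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(1)
S, C, Dc = 196, 512, 2048  # tokens, channels, hidden width (Dc is an assumption)
X = rng.standard_normal((S, C))
out = channel_mix(X,
                  rng.standard_normal((C, Dc)) * 0.02, np.zeros(Dc),
                  rng.standard_normal((Dc, C)) * 0.02, np.zeros(C))
print(out.shape)  # (196, 512)
```

Here the weight shapes depend only on the number of channels, and the same MLP serves every patch: the mirror image of token-mixing.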

Ultimately, each Mixer layer has two MLPs: through one, all of the channels are forward-propagated individually but in the same way (token-mixing); through the other, all of the patches are forward-propagated individually but in the same way (channel-mixing).

Mixer’s architecture is based entirely on multi-layer perceptrons (MLPs) that are repeatedly applied across either spatial locations or feature channels. The Mixer architecture relies only on basic matrix multiplication routines, changes to data layout (reshapes and transpositions), and scalar non-linearities.
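Putting the pieces together, one full Mixer layer is just those routines plus layer norm and skip connections, as in the figure at the top. A minimal NumPy sketch, assuming 196 tokens, 512 channels, and hypothetical hidden widths of 256 (token-mixing) and 2048 (channel-mixing); weights are random stand-ins.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Layer norm over the channel dimension (no learned scale/shift here)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mixer_layer(X, tok_W1, tok_W2, ch_W1, ch_W2):
    """One Mixer layer: token-mixing MLP with a skip connection,
    followed by a channel-mixing MLP with a skip connection."""
    # Token-mixing: norm, transpose to (channels, tokens), MLP, transpose back.
    Y = X + (gelu(layer_norm(X).T @ tok_W1) @ tok_W2).T
    # Channel-mixing: norm, then an MLP over the channels of each token.
    return Y + gelu(layer_norm(Y) @ ch_W1) @ ch_W2

rng = np.random.default_rng(0)
S, C, Ds, Dc = 196, 512, 256, 2048
X = rng.standard_normal((S, C))
out = mixer_layer(X,
                  rng.standard_normal((S, Ds)) * 0.02, rng.standard_normal((Ds, S)) * 0.02,
                  rng.standard_normal((C, Dc)) * 0.02, rng.standard_normal((Dc, C)) * 0.02)
print(out.shape)  # (196, 512): the table keeps its shape layer after layer
```

Nothing here beyond matrix multiplications, transpositions, a scalar non-linearity, and normalisation, which is the whole point.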

Specifications of the Mixer architectures

If you have seen the Vision Transformer paper or the Big Transfer paper, all of this is extremely similar in terms of architecture. The authors build a range of model sizes with different patch resolutions; the resolution is always the number after the slash (/), so Mixer-B/16 uses 16✕16 patches.

Table 1. Image taken from page 4 of the paper.

Compared to the Mixer, the Vision Transformer’s attention mechanism gives it compute and memory requirements that are quadratic in the sequence length: as the patch resolution is lowered, the number of patches in the image increases, and ViT suffers quadratically, whereas the Mixer’s cost grows only linearly.
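The scaling difference can be made concrete with a back-of-the-envelope count: self-attention over S tokens costs on the order of S² pairwise interactions, while a token-mixing MLP costs on the order of S times its hidden width per channel (the width `Ds = 256` below is a hypothetical value).

```python
# Quadrupling the number of patches multiplies the attention cost by 16
# but the token-mixing cost only by 4.
Ds = 256
for S in (196, 784, 3136):     # 224px images with 16, 8, and 4 pixel patches
    attn_cost = S * S          # pairwise token interactions
    mixer_cost = 2 * S * Ds    # two fully-connected layers over the token axis
    print(S, attn_cost, mixer_cost)
```

This is a crude count that ignores constant factors and the channel dimension, but it captures why finer patch grids hurt attention far more than they hurt the Mixer.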

Effects of Scale

Let’s analyse one of the tasks mentioned in the paper. There are many tasks mentioned; we’ll have a look at linear 5-shot ImageNet classification.

The image is taken from page 6 of the paper.

Let’s see top-1 accuracy for 5-shot linear ImageNet classification. Here is their definition of the 5-shot classifier: “we report the few-shot accuracies obtained by solving the L2-regularised linear regression problem between the frozen learned representations of images and the labels.” This is how it works: you train a linear classifier on the frozen representations the model gives you and evaluate top-1 accuracy. It’s a very particular task. We can clearly see that in this framing, this model scales much more favourably than other models. BiT-R152 is good on small datasets, but as the training-set size increases it plateaus and doesn’t improve much more, whereas the Mixer model scales really well.
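The evaluation protocol quoted above can be sketched as follows: fit an L2-regularised linear regression from frozen features to one-hot labels in closed form, then predict by argmax. This is a toy version on synthetic features, not the paper’s exact setup; the regularisation strength and the toy data are assumptions.

```python
import numpy as np

def linear_probe(features, labels, reg=1e-3):
    """Closed-form ridge regression from frozen features to one-hot labels."""
    n_classes = labels.max() + 1
    Y = np.eye(n_classes)[labels]        # one-hot targets
    d = features.shape[1]
    # Solve (X^T X + reg*I) W = X^T Y for the weight matrix W.
    return np.linalg.solve(features.T @ features + reg * np.eye(d),
                           features.T @ Y)

rng = np.random.default_rng(0)
# Toy "frozen representations": 5 examples per class, 3 classes, 16-dim features.
train_y = np.repeat(np.arange(3), 5)
class_means = 5 * rng.standard_normal((3, 16))
train_x = rng.standard_normal((15, 16)) + np.eye(3)[train_y] @ class_means
W = linear_probe(train_x, train_y)
preds = (train_x @ W).argmax(axis=1)
print((preds == train_y).mean())  # training accuracy of the linear probe
```

The representation itself is never updated; only this linear map is fit, which is why the metric isolates the quality of the learned features.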


This model benefits from scale a lot more: it is a simpler architecture, it has a higher throughput (no. of images/sec/core), and it is computationally more efficient. This paper is not very complicated, and its simple architecture is its selling point. The trade-off between accuracy and compute is fair. From a research perspective, it raises a lot of questions about inductive biases, how scale behaves, and whether you can get everything to work with only SGD and a lot of TPUs. 😶‍🌫️



  1. MLP-Mixer: An all-MLP Architecture for Vision.
