This paper presents a neural network that is just a feed-forward multi-layer perceptron (MLP): no convolutions, no attention mechanism, no lambda layers, nothing of that sort. It is just matrix multiplications, non-linearities, normalization, and skip connections (adapted from ResNets). The paper builds on abstractions elaborated in the recent SOTA paper known as ‘Vision Transformers’. I have written a blog explaining Vision Transformers meticulously; you can check it out here. 😌
MLP Mixer Architecture
The authors have proposed a classification architecture. As in Vision Transformers, we divide the input image into small mini patches (preferably of size 16×16); the image dimensions must be divisible by the patch size. We then operate on these mini patches as we propagate through the network. Unlike in a convolutional neural network, where we shrink the spatial resolution while increasing the number of channels via feature maps, here we have one layer after another, all of the same size, stacked until the end. So it is much like a transformer; of course, the difference lies in how the individual layers look. As in the transformer, every patch is first fed through a fully connected layer to bring it into a latent representation, also known as a latent embedding. Every patch in the image corresponds to one vector, and each patch is projected into the latent space using the same function.
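The patch extraction and shared projection can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code; the function name, the random-weight initialization, and the 16×16/512-dimensional sizes are illustrative assumptions:

```python
import numpy as np

def patch_embed(image, patch_size=16, hidden_dim=512, rng=np.random.default_rng(0)):
    """Split an image into non-overlapping patches and project each into a latent vector."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0, "image dims must be divisible by patch size"
    # unroll each patch into a flat vector of length patch_size * patch_size * C
    patches = image.reshape(H // patch_size, patch_size, W // patch_size, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * C)
    # one shared projection matrix maps every patch into the same latent space
    W_proj = rng.standard_normal((patch_size * patch_size * C, hidden_dim)) * 0.02
    return patches @ W_proj  # shape: (num_patches, hidden_dim)

tokens = patch_embed(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 512): a 224x224 image yields 14x14 = 196 patches
```

Note that the projection matrix is applied identically to every patch, which is exactly the "same function into the latent space" described above.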
Let’s try to understand what a mixer layer is, since this is the core of the architecture. Every patch fed to the MLP architecture is unrolled into a vector; these vectors are then stacked on top of each other and can be interpreted as a table. Each row in this table represents one patch's vector with, say, 512 channels. There are two types of MLPs inside a mixer layer: token-mixing MLPs and channel-mixing MLPs.
The Mixer Layer: Explained
In token-mixing, we transpose the table so that every row holds the same channel from all the patches: the first row now contains channel 1 of every patch in the image, and we feed each row through the same fully connected layer (a simple MLP). All the weights in this fully connected layer are shared, which amounts to weight sharing across the same channel of different patches. The token-mixing MLPs allow communication between different spatial locations (tokens); they operate on each channel independently and take individual columns of the table as inputs. This lets us compute feature by feature (the 512 channels are nothing but feature maps). Token-mixing can be viewed as a single-channel depth-wise convolution with a full receptive field; it is also known as the cross-location operation.
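The transpose trick above can be sketched directly. This is a hedged NumPy illustration under assumed sizes (196 patches, 512 channels, an assumed token-mixing hidden width of 256); the paper's real models use learned, trained weights:

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU non-linearity used inside Mixer's MLPs
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def token_mixing(X, W1, W2):
    """X: (num_patches, channels). One two-layer MLP is shared across all channels."""
    Xt = X.T                   # (channels, num_patches): each row is one channel of every patch
    out = gelu(Xt @ W1) @ W2   # mix information across spatial locations
    return out.T               # flip back to (num_patches, channels)

rng = np.random.default_rng(0)
X = rng.standard_normal((196, 512))
W1 = rng.standard_normal((196, 256)) * 0.02  # 256 is an assumed hidden width
W2 = rng.standard_normal((256, 196)) * 0.02
mixed = token_mixing(X, W1, W2)
print(mixed.shape)  # (196, 512)
```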
In channel-mixing, since the weights are shared, we can do the reverse trick: flip the table back into patches and perform the same shared computation on every patch. The channel-mixing MLPs allow communication between different channels; they operate on each token independently and take individual rows of the table as inputs. Channel-mixing can be viewed as a 1×1 convolution; it is also known as the per-location operation. These two types of layers are interleaved so that both input dimensions can interact.
Ultimately, each mixer layer contains two MLPs: one through which we forward-propagate all of the channels individually but in the same way, and a second through which we forward-propagate all of the patches individually but in the same way.
Mixer’s architecture is based entirely on multi-layer perceptrons (MLPs) that are repeatedly applied across either spatial locations or feature channels. The Mixer architecture relies only on basic matrix multiplication routines, changes to data layout (reshapes and transpositions), and scalar non-linearities.
Specifications of the Mixer architectures
If you have seen the Vision Transformer paper or the Big Transfer paper, all of this is extremely similar in terms of architecture. The authors build a number of differently sized models with different patch resolutions, and the patch resolution is always the number after the slash (/), e.g. Mixer-B/16 uses 16×16 patches.
Compared to the Mixer, the Vision Transformer's attention mechanism gives it compute and memory requirements that grow quadratically with the sequence length: as the patch resolution is lowered, the number of patches in the image increases, so ViT suffers quadratically, whereas the Mixer's cost grows only linearly.
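A back-of-envelope sketch makes the scaling difference concrete. These FLOP counts are my own rough approximations (self-attention scores cost roughly S²·C; a token-mixing MLP with a fixed hidden width Ds costs roughly 2·S·Ds per channel), not figures from the paper:

```python
def attention_mixing_flops(S, C):
    # self-attention: every patch attends to every other patch -> quadratic in S
    return S * S * C

def mlp_token_mixing_flops(S, C, Ds=256):
    # Mixer token-mixing: two matmuls with a fixed hidden width Ds -> linear in S
    return 2 * S * Ds * C

for S in (196, 784):  # 16x16 vs 8x8 patches on a 224x224 image
    print(S, attention_mixing_flops(S, 512), mlp_token_mixing_flops(S, 512))
```

Quadrupling the number of patches (196 → 784) multiplies the attention cost by 16 but the token-mixing MLP cost only by 4, under these assumptions.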
Effects of Scale
Let’s analyse this on one task mentioned in the paper. There are many tasks mentioned; we’ll be having a look at linear 5-shot ImageNet classification.
Let’s see top-1 accuracy for 5-shot linear ImageNet classification. Here is their definition of the 5-shot classifier: “we report the few-shot accuracies obtained by solving the L2-regularised linear regression problem between the frozen learned representations of images and the labels.” This is how it works: you train a linear classifier on the frozen representations the model gives you and evaluate its top-1 accuracy. It’s a very particular task. We can clearly see that, in this framing, this model scales much more favourably than other models. BiT-R152 is good on small datasets, but as the training-set size increases, it plateaus and doesn't improve much more. The Mixer model, however, scales really well.
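The L2-regularised linear regression probe has a simple closed form. Below is a hedged NumPy sketch of the idea on toy data; the feature dimension, regularisation strength, and function name are illustrative assumptions, not the paper's exact protocol:

```python
import numpy as np

def linear_5shot_probe(train_feats, train_labels, test_feats, num_classes, l2=1e-3):
    """Fit an L2-regularised linear map from frozen features to one-hot labels, in closed form."""
    Y = np.eye(num_classes)[train_labels]  # one-hot targets
    X = train_feats
    # ridge regression: W = (X^T X + l2*I)^-1 X^T Y
    W = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ Y)
    return (test_feats @ W).argmax(axis=1)  # predicted class per test image

rng = np.random.default_rng(0)
feats = rng.standard_normal((25, 64))    # toy stand-in: 5 classes x 5 shots of frozen 64-d features
labels = np.repeat(np.arange(5), 5)
preds = linear_5shot_probe(feats, labels, feats, num_classes=5)
print((preds == labels).mean())
```

The key point is that the backbone stays frozen; only this cheap linear map is fit, which is why the metric isolates representation quality.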
This model benefits from scale a lot more: it is a simpler architecture, it has a higher throughput (no. of images/sec/core), and it is computationally more efficient. This paper is not very complicated, and its simple architecture is its selling point. The trade-off between accuracy and compute is fair. From a research perspective, it raises a lot of questions about inductive biases, how scale behaves, and whether you can get everything to work with only SGD and a lot of TPUs. 😶🌫️
If you enjoyed this article and gained insightful knowledge, consider buying me a coffee ☕️ by clicking here. 🤤
If you liked this post, please make sure to clap 👏. 💬 Connect? Let’s get social: http://myurls.co/nakshatrasinghh.