gMLP: Winning over Transformers?

Tim
Feb 6, 2022 · 10 min read


Alright, we all know that transformers are cool. At least in NLP, these architectures are considered state-of-the-art (SOTA) for language modelling, and they help us perform beautifully on various downstream tasks such as named entity recognition (NER), question answering (QA), part-of-speech tagging (POS), etc.

But in this tutorial, we will dive into another architecture called the Gated Multilayer Perceptron (gMLP), proposed by the Google Research team.

The outline of this tutorial:

  1. Transformers: quick recap
  2. Why gMLP
  3. gMLP
  4. Outperforming Transformers?

Transformers: quick recap

As I mentioned above, transformer architectures are very powerful, and if you want to achieve really high performance on your particular task, you should consider using a pre-trained transformer. You can usually find them on Hugging Face.

I would recommend checking out this tutorial (a work of art, really) or this one if you are new to Transformers or want to refresh your knowledge.

Regarding the architecture itself, there is one very important component that helped transformers achieve supremacy: attention.

Attention mechanism

There is a great visualisation in this notebook, which helps us understand the idea behind the attention layer.

When an attention layer is added to a neural network, it is used to focus on the important neurons of the preceding layer. More generally, the idea of attention is to focus on the important parts of the input, the layers, and the information as a whole when the network sees a particular example.

Like this:

So what’s the problem?

An attention layer uses a number of matrices, which increase the parameter count. Now, imagine the whole network injected with attention mechanisms:

Hence, the number of parameters and the amount of data you need to train a Transformer are quite large. Here's a comparison table for training arguably the best-performing transformer architectures in NLP:

Please, do not cry when you see the size and training time
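To get a feel for where those parameters come from, here is a minimal PyTorch sketch (the BERT-base-like sizes of 768 and 12 heads are purely illustrative): a single self-attention layer already carries four d_model×d_model projection matrices (query, key, value and output), and a Transformer stacks dozens of such layers on top of equally large feed-forward layers.

```python
import torch.nn as nn

d_model, n_heads = 768, 12     # BERT-base-like sizes, used purely for illustration

# A single self-attention layer: query, key, value and output projections
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)

n_params = sum(p.numel() for p in attn.parameters())
print(f"{n_params:,}")         # roughly 2.4 million parameters for one layer
```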

Why gMLP?

As stated in the original gMLP article by Hanxiao Liu et al., 2021:

“On one hand, the attention mechanism [18] introduces the inductive bias that the spatial interactions should be dynamically parameterized based on the input representations. On the other hand, it is known that MLPs with static parameterization can represent arbitrary functions [19]. It therefore remains an open question whether the inductive bias in self-attention is essential to the remarkable effectiveness of Transformers.”

OK, OK. I know this sounds a bit scientific, so let's unpack those sentences. An inductive bias in machine learning is the set of assumptions a model relies on in order to generalize to new examples (ones not used during training). Spatial interactions is a reference to interactions between pieces of spatial information, i.e. information that has some notion of “position”; in NLP this means words and their positions relative to other words in a text. Dynamically parameterized based on the input representations means that the interactions between the words change with the input data, and so does the attention (as we saw in the example for the word “it” above): the attention mechanism recomputes its weights for every input, while an MLP learns a fixed weight matrix that it uses as-is during inference. The second sentence simply states that multilayer perceptrons can, in theory, approximate arbitrary functions (duh...).

So, gMLP was introduced to challenge transformers and to answer the question of whether self-attention is a necessity or a mere decoration in Transformer architectures.

gMLP

The architecture is named this way because it is an MLP with added gating. If you are not familiar with MLPs, check out these tutorials with great explanations:

A gMLP network consists of a stack of L blocks identical in size and structure (the authors index a particular block as Lx). Let n be the length of the token sequence (for language modelling purposes, and for simplicity, we may say that a word is a token). Let d be the dimension of a token (since machines don't read words, they work with numbers, and in machine learning we usually use embeddings to represent a particular token).

Like this:

Each vector (embedding) is a representation of a particular word

So let X be our embedding matrix with shape [n×d]. Then let U and V be the matrices that define linear projections along the channel dimension. For example, suppose the dimension of our input embeddings is 512, so the shape of X is [n×512]. If we want to change that dimension (say, to [n×216]) we can use matrix multiplication, and the process is called a linear projection. This article explains it wonderfully.
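As a quick sanity check, here is what that projection looks like in code (the numbers are just the ones from the example above):

```python
import torch

n, d = 11, 512            # sequence length and embedding dimension from the example
X = torch.randn(n, d)     # embedding matrix, shape [n x 512]

U = torch.randn(d, 216)   # projection matrix along the channel dimension
X_proj = X @ U            # linear projection: the shape becomes [n x 216]
print(X_proj.shape)       # torch.Size([11, 216])
```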

In our case, the workflow of one block can be divided into 3 simple steps:

  1. Z=σ(X×U)

We move our embeddings into a space with the dimension we are interested in by multiplying the initial matrix X by the linear projection matrix U (for instance, this layer might have more or fewer neurons than the embedding layer, so we may need to reshape X). Then we apply some activation function σ to this layer. Nothing mind-boggling.

  2. Znew=s(Z), the gating layer (described later)

  3. Y=Znew×V, the output of the block (also reshaped by the linear projection matrix V)

gMLP block

So, let us focus on gating, since this method is rather interesting.

Spatial Gating Unit

We wish to understand how tokens (or words) relate to each other and to capture this information. For this purpose, the researchers propose the Spatial Gating Unit (SGU), which can be described in a few steps:

  1. Remember Z from the first formula? Z is the input to the spatial gating unit. We want to enable cross-token interactions (to capture the relationships between the words). To do that, we first apply a linear projection along the sequence dimension: fW,b(Z) = WZ + b, where W is an n×n matrix.

Let's understand what we are doing here. We have a sequence of n tokens, each with an embedding of dimension d (d can change after the Z layer, but that's OK, since at this step we mostly care about the sequence length). We want to capture cross-token interactions. How can we do that? Imagine we have some n×n matrix with high values in the cells where the interaction between the corresponding tokens is important. Let's consider an example: “I like watching TV I usually do it with my family” (punctuation is omitted for simplicity). Imagine we get these values after the first layer (the Z transformation):

Matrix Z (let's imagine these are the outputs, and we can map the output vectors back to the initial tokens). Do not mix this up with the initial embedding matrix X!

Now, if we want to capture the important cross-token interactions for the word “it”, we can consider this (hypothetical) matrix of cross-interactions, where green denotes a higher value and red a lower one. Of course, depending on where the SGU is placed, the matrix values and their interpretation could differ.

Words like “watching”, “TV” and “do” could be considered important for the token “it”, so we assign higher values to them in our n×n (here 11×11) matrix.

The matrix W in the formula above is exactly the matrix we are looking for! The product of this matrix and the matrix Z (plus the token-specific biases b) is an n×d matrix in which every row now mixes information from all the tokens, i.e. it carries the important cross-token interactions.

  2. Now, having such a wonderful matrix, let's incorporate this knowledge into the Z matrix that we received as the input to the SGU. We do this by element-wise multiplication (called linear gating):

Linear gating mechanism (Znew in the second formula above)

And this is the output of the SGU! Important cross-token information is captured, everyone is happy.
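Putting the two SGU steps together, a minimal PyTorch sketch might look like this. Note that in the paper Z is actually split in half along the channel dimension and only one half is spatially projected; the sketch below follows the simpler formulation used in the text above. The near-zero/near-one initialisation of W and b comes from the paper, so the unit starts out close to an identity mapping.

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """Minimal sketch of the SGU as described above.

    Note: in the paper, Z is first split in half along the channel
    dimension and only one half is spatially projected; this sketch
    follows the simpler formulation used in the text.
    """
    def __init__(self, seq_len: int):
        super().__init__()
        # n x n spatial weights W plus per-token biases b
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        # Paper's initialisation: W near zero, b near one, so the unit
        # starts out close to an identity mapping
        nn.init.zeros_(self.spatial_proj.weight)
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, n, d_ffn)
        gate = self.spatial_proj(z.transpose(1, 2)).transpose(1, 2)  # f_{W,b}(Z) = W Z + b
        return z * gate                                              # element-wise gating
```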

The last thing to do is to capture the important information along the channel (or hidden) dimension, which is what the final linear projection (matrix V from step 3) does.

So, the whole gMLP block would look like this (with formulae):

gMLP block with formulae
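And here is a hedged sketch of the whole block, reusing the SpatialGatingUnit from the previous snippet and following the three steps above (GELU activation, layer normalisation and a residual connection around the block, as in the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class gMLPBlock(nn.Module):
    """Hedged sketch of one gMLP block, following the three steps above.

    Reuses the SpatialGatingUnit class sketched in the previous snippet.
    """
    def __init__(self, d_model: int, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_in = nn.Linear(d_model, d_ffn)    # matrix U
        self.sgu = SpatialGatingUnit(seq_len)       # s(.) from step 2
        self.proj_out = nn.Linear(d_ffn, d_model)   # matrix V

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.norm(x)
        z = F.gelu(self.proj_in(x))   # step 1: Z = sigma(X x U)
        z = self.sgu(z)               # step 2: Znew = s(Z)
        y = self.proj_out(z)          # step 3: Y = Znew x V
        return y + shortcut           # residual connection around the block
```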

Is it the same as attention?

While the matrix above looks similar to the attention mechanism, it is not the same. In MLPs, the weights stay fixed during inference, independent of the input. In attention, on the other hand, the weights change depending on the input, which can lead to better performance at inference time while making transformers much harder to train.
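The difference is easy to see in code: gMLP mixes tokens with a learned matrix that is frozen at inference time, while self-attention recomputes its mixing matrix from every new input (a toy illustration with random matrices):

```python
import torch

n, d = 11, 64
Z = torch.randn(n, d)                      # token representations for one input

# gMLP: a static mixing matrix, learned during training and then frozen;
# the same W multiplies every input sequence at inference time
W_static = torch.randn(n, n)
mix_static = W_static @ Z

# Self-attention: the mixing matrix is recomputed from the input itself,
# so it is different for every sequence the model sees
Wq, Wk = torch.randn(d, d), torch.randn(d, d)
scores = (Z @ Wq) @ (Z @ Wk).T / d ** 0.5
mix_dynamic = scores.softmax(dim=-1) @ Z
```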

Outperforming Transformers?

Let’s see some results. First of all,

Computer vision

Table taken from the original paper

As we can see, the gMLP models perform almost on par with Transformers, but with a smaller number of parameters, which means less training time, data (and money) for almost the same results! Of course, there were some nuances in training these models, so please refer to the original paper for specifics about regularisation, weight initialisation, input preparation, etc.

NLP

In the gMLP paper, the authors explored the architecture on the masked language modelling (MLM) task. Masked language modelling is a probabilistic approach to modelling the probability distribution over pieces of text (letters, subwords, words) given the neighbouring text. There are different metrics for evaluating such models (check out this article), and the authors chose one of the most popular ones: perplexity.
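For reference, perplexity is simply the exponential of the average cross-entropy over the predicted (masked) tokens:

```python
import math
import torch
import torch.nn.functional as F

# Perplexity is the exponential of the average cross-entropy
# over the predicted (masked) tokens.
vocab_size, n_masked = 30522, 10                    # BERT-sized vocab, illustrative numbers
logits = torch.randn(n_masked, vocab_size)          # model predictions at the masked positions
targets = torch.randint(0, vocab_size, (n_masked,))

cross_entropy = F.cross_entropy(logits, targets)
perplexity = math.exp(cross_entropy.item())
print(perplexity)
```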

The input/output protocols were the same as for BERT training.

The researchers state that gMLPs learn Toeplitz-like matrices as spatial weights.

Toeplitz-like matrices that gMLPs learn

This means that if we shift the input sequence, the output shifts correspondingly (shift invariance). As the authors state:

In this case, the learned fW,b(·) acts like a 1-d convolution whose kernel size equals the entire sequence length (unlike depthwise convolution with channel-specific filters, here the same W is shared across channels).

Here, fW,b(·) is the function from above that is used inside the SGU to capture cross-token relationships.
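To see why a Toeplitz spatial matrix behaves like a 1-d convolution, here is a small numerical check (hypothetical sizes): a matrix whose entries depend only on the offset i - j mixes tokens exactly the way a length-(2n - 1) convolution kernel shared across all channels would.

```python
import torch
import torch.nn.functional as F

n, d = 8, 4                          # hypothetical sequence length and channel width
w = torch.randn(2 * n - 1)           # one weight per diagonal offset (i - j)
b = torch.zeros(n)                   # token-specific biases

# Build the Toeplitz spatial matrix: W[i, j] = w[i - j + n - 1]
idx = torch.arange(n)
W = w[idx[:, None] - idx[None, :] + n - 1]

Z = torch.randn(n, d)

# The spatial projection the "MLP way": f_{W,b}(Z) = W Z + b
out_matmul = W @ Z + b[:, None]

# The same operation as a 1-d convolution over the sequence axis,
# with a single kernel of length 2n - 1 shared across all d channels
out_conv = F.conv1d(
    Z.t().unsqueeze(1),              # shape (d, 1, n): treat channels as batch items
    w.flip(0).view(1, 1, -1),        # reversed diagonal weights as the conv kernel
    padding=n - 1,
).squeeze(1).t() + b[:, None]

print(torch.allclose(out_matmul, out_conv, atol=1e-5))  # True
```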

Now we can compare the models (Transformers vs gMLP) in terms of perplexity and of accuracy on the SST-2 and MNLI datasets.

From gMLP original paper

As we can see, as the model size increases, gMLP starts to outperform BERT in terms of perplexity, as well as on the SST-2 dataset, while still not managing to outperform it on the MNLI-m data. This could mean that for the latter dataset the Transformer's inductive bias is a better fit for the task during fine-tuning. These results show that, self-attention or not, you can always increase the size of your model (as gMLP demonstrates in the table above) to improve the metrics on a downstream task.

Finally, the researchers proposed adding a tiny attention module to the SGU block to see whether this would increase the model's performance:

New SGU with tiny attention (1 head of size 64)
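Conceptually, the tiny attention is just a single head with a small head size whose output is added to the spatial gate inside the SGU before the element-wise multiplication (that is my reading of the paper's pseudocode; double-check the exact wiring there). A rough sketch of such a head:

```python
import torch
import torch.nn as nn

class TinyAttention(nn.Module):
    """Single attention head with a small head size (64 in the paper)."""
    def __init__(self, d_in: int, d_out: int, d_attn: int = 64):
        super().__init__()
        self.to_qkv = nn.Linear(d_in, 3 * d_attn)
        self.to_out = nn.Linear(d_attn, d_out)
        self.scale = d_attn ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_in)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        return self.to_out(attn @ v)   # added to the spatial gate inside the SGU
```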

And by introducing this (and calling the result aMLP, because of the attention + SGU hybrid), we finally have a model with a smaller number of parameters and better performance than BERT:

But we can outperform all of these models by using an EXTRA-large gMLP (see the last row). As we witnessed above, we can always increase the size of gMLP to improve performance on downstream tasks, independently of self-attention (since gMLP doesn't have it), and the last row of the table above proves it: lower perplexity and better metrics on the downstream tasks, at the cost of carrying almost one billion parameters. So for practical, real-life applications, we would probably use either gMLP or aMLP.

Conclusion

gMLP is a great new architecture that challenges some aspects of Transformers and achieves comparable or better performance in computer vision and NLP. In terms of scaling with larger datasets and more compute, gMLPs are also comparable to Transformers, so you can safely consider using this model for your applications!

Please check out the original paper once again, because there are important and interesting details that I did not mention here but that might be important for you.

If you want to train a gMLP model yourself, check out the code here:

https://github.com/lucidrains/g-mlp-pytorch
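From my recollection of that repo's README, basic usage looks roughly like the snippet below; treat the argument names as assumptions and double-check them against the current README:

```python
import torch
from g_mlp_pytorch import gMLP

# Hyperparameters here are illustrative; check the repo's README for the current API
model = gMLP(
    num_tokens = 20000,   # vocabulary size
    dim = 512,            # model / embedding dimension d
    depth = 6,            # number of stacked gMLP blocks L
    seq_len = 256         # sequence length n (needed for the n x n spatial weights)
)

tokens = torch.randint(0, 20000, (1, 256))
logits = model(tokens)    # (1, 256, 20000)
```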

And pay attention to this discussion, where people have left some comments regarding the implementation.

Important links (combined):

Thank you for your attention :)
