Activation function and GLU variants for Transformer models

Tarique Anwar
Apr 18, 2022 · 8 min read


Calling the first week of April 2022 an eventful one for AI and deep learning would be an understatement. Within the same week, Google and OpenAI showcased their models PaLM⁶ and DALLE 2.

PaLM is a 540-billion-parameter transformer-based language model capable of state-of-the-art performance on a multitude of natural language tasks; beyond those, it also demonstrates breakthrough capabilities on reasoning tasks. DALLE 2 is an AI model that can create realistic images and art from a description in natural language, and its architecture is also based on a transformer.

The results of both these models are nothing short of astonishing and if you haven’t already, do check them out.

Now, this article is not about those models, their overall architecture, or their wonderful results, but about a very small yet important aspect of these (or any) deep learning models: the activation function. We will also look at how variants of Gated Linear Units (GLU), which I'll define later, improve these models. I'll assume that you have seen activation functions (sigmoid and ReLU) in practice and know how they are applied in a perceptron of a neural network.

Sigmoid and ReLU

Activation functions are a way to introduce non-linearity into neural networks. Without this non-linearity, the overall learned function would be linear, which cannot account for the complex non-linear relationships that a neural network needs to learn.

In the initial days of deep learning, the sigmoid was used as an activation function, with the functional form:

σ(x) = 1 / (1 + exp(−x))

But it came with its own issues, chief among them outright failure to converge or very slow convergence.

ReLU, or Rectified Linear Unit, seemed to ‘rectify’ those issues and has been the standard ever since. The functional form of ReLU is:

ReLU(x) = max(0, x)

The main reason ReLU is used is that it is simple, fast, and empirically it seems to work well.
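To make these two definitions concrete, here is a minimal sketch in PyTorch (the helper function names are mine, for illustration only; PyTorch's built-ins compute the same thing):

```python
import torch

def sigmoid(x):
    # Squashes any real input into (0, 1); saturates for large |x|,
    # which contributes to the slow convergence mentioned above.
    return 1.0 / (1.0 + torch.exp(-x))

def relu(x):
    # Passes positive inputs through unchanged and zeroes out negatives.
    return torch.clamp(x, min=0.0)

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))   # values strictly between 0 and 1
print(relu(x))      # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])

# The built-ins give the same results:
print(torch.sigmoid(x), torch.relu(x))
```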

But with the emergence of Transformer based models, different variants of activation functions and GLU have been experimented with and do seem to perform better. Some of them are:

  • GeLU²
  • Swish¹
  • GLU³
  • GEGLU⁴
  • SwiGLU⁴

We will go over some of these in detail, but before that, let's see where exactly these activations are utilized in the Transformer architecture.

Feed Forward Layer in a Transformer

The Transformer⁵ alternates between a multi-head attention layer and what are called position-wise feed-forward networks (FFN). The encoder architecture from the original transformer paper is shown below:

The encoder in a Transformer architecture

I will not be going into the details of the attention layer since it's not important for our discussion. I will provide links below to some excellent resources if you want to get into the details of it.

Now, we will simplify and just look at the Attention and FFN layer.

The attention layer takes in a sequence of vectors (input embeddings) and outputs a sequence of vectors of the same length. The FFN takes these vectors and passes them through two learned linear transformations (represented by the matrices W₁ and W₂ and biases b₁ and b₂), with a ReLU activation applied between the two linear transformations (in other words, matrix multiplications).

Functionally, it can be represented as:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

In the original transformer paper, the authors described the above operation as two convolutions with kernel size 1. The dimensionality of the input and output remains the same.

In subsequent implementations, it was further refined to a version without biases, with no loss in performance:

FFN_ReLU(x, W₁, W₂) = max(0, xW₁)W₂
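As a rough illustration, here is a minimal PyTorch sketch of this bias-free FFN. The class name is mine, and the dimensions (d_model = 512, d_ff = 2048, the sizes of the original base model) are illustrative rather than taken from any particular implementation:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: two bias-free linear maps with a ReLU in between."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # x W1
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # (...) W2

    def forward(self, x):
        # x: (batch, seq_len, d_model) -> output has the same shape
        return self.w2(torch.relu(self.w1(x)))

ffn = FeedForward()
out = ffn(torch.randn(2, 10, 512))   # applied independently at every position
print(out.shape)                      # torch.Size([2, 10, 512])
```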

Visualizing the simplified pass from text through the self-attention block to FFN and output:

I have made some simplifications in the embedding layer and have not shown the Add & Norm layers in the pass above, to keep things simple. But it captures the essence of what occurs and where the activation function, in this case ReLU, is used.

ReLU was still the activation used in the original transformer paper, but subsequent models have adopted other activations. Two of the most used ones are:

GELU

The functional form of a Gaussian error linear unit is:

GELU(x) = x · Φ(x) = (x/2) · (1 + erf(x/√2))

where Φ is the cumulative distribution function of the standard normal distribution and erf is the Gaussian error function, given by:

erf(x) = (2/√π) ∫₀ˣ exp(−t²) dt

Now, if all these equations don’t make a lot of sense, let's visualize them graphically. A graph of GELU with ReLU is shown below:

GELU combines the effects of dropout, zoneout, and ReLU. ReLU and dropout both effectively multiply a neuron's input by zero or one: ReLU does this deterministically (based on the input's sign), while dropout does it stochastically. In GELU, inputs have a higher probability of being ‘dropped’ as x decreases, so the transformation applied to x is stochastic in spirit yet depends on the input itself. GELU also overcomes the limitation of ReLU being non-differentiable at zero.

Across several experiments on MNIST, CIFAR-10, and other datasets showcased in the paper, GELU outperforms the other activation functions in terms of test accuracy while still bearing a close resemblance to ReLU.

GELU is the activation used in the GPT large language models by OpenAI. So, as shown earlier in the feed-forward pass, GELU is simply used in place of ReLU.
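As a quick sanity check on the formula above, here is a small sketch that computes GELU via the error function; torch.nn.functional.gelu is called only to confirm the values match:

```python
import math
import torch
import torch.nn.functional as F

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi written via the Gaussian error function.
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

x = torch.linspace(-3, 3, 7)
print(gelu_exact(x))
print(F.gelu(x))   # PyTorch's built-in GELU uses the same erf form by default
```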

Swish

The functional form of Swish¹ is:

Swish(x) = x · σ(βx) = x / (1 + exp(−βx))

Here β is a parameter that can be trained.

Graphing Swish against ReLU for different values of β:

As we can see, with increasing values of β, Swish more and more closely resembles ReLU. Swish can be loosely viewed as a smooth function that non-linearly interpolates between the linear function and ReLU, and the degree of interpolation can be controlled by the model if β is set as a trainable parameter.

Like GELU, Swish is also differentiable at zero.

Swish was found using automated search techniques aimed at discovering novel activation functions with strong empirical performance. Replacing ReLU with Swish in deeper models, across various experiments on CIFAR, ImageNet, and machine translation tasks, yielded better results.
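Below is a minimal sketch of Swish as a PyTorch module with β as a learnable parameter. The module name is mine; with β fixed at 1, Swish coincides with what PyTorch ships as SiLU:

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    """Swish(x) = x * sigmoid(beta * x), with beta learned during training."""
    def __init__(self, beta=1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta))

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)

swish = Swish()
x = torch.linspace(-3, 3, 7)
print(swish(x))                        # smooth, with a small dip for negative x

# With beta fixed to 1, Swish coincides with SiLU:
print(torch.nn.functional.silu(x))
```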

Gated Linear Units

GLUs were introduced in a language modeling paper³. A GLU is a neural network layer defined as the component-wise product of two linear transformations (matrix multiplications) of the input, one of which is sigmoid-activated. It was the first time a non-recurrent approach was competitive with strong recurrent models on large-scale language tasks, before the advent of transformers.

GLU(x, W, V, b, c) = σ(xW + b) ⊗ (xV + c)

Here we have two trainable weight matrices, W and V, the second of which is used to compute the gating values. The gate provides an additional filter after the activation; it is learned during training and depends on the input itself. The ⊗ operation is element-wise multiplication.

Visualizing GLU in terms of matrix operations without the bias matrices b and c:

We see that the overlapping matrix entries, as shown in the last operation above, are multiplied together, so the output of xV + c acts as a filter on the sigmoid-activated half. Depending on the values in this filter, the corresponding entries of the sigmoid activation matrix become more prominent or are diminished.
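To make the gating concrete, here is a small sketch of the GLU computation with explicit matrices (the shapes and values are arbitrary; PyTorch also provides torch.nn.functional.glu, which applies the same idea to the two halves of a single projected tensor):

```python
import torch

torch.manual_seed(0)
d_in, d_out = 4, 3
x = torch.randn(2, d_in)                 # a batch of two input vectors

W = torch.randn(d_in, d_out)
V = torch.randn(d_in, d_out)
b = torch.randn(d_out)
c = torch.randn(d_out)

activated = torch.sigmoid(x @ W + b)     # sigmoid-activated half, entries in (0, 1)
gate = x @ V + c                         # linear half acting as the filter
glu_out = activated * gate               # element-wise (Hadamard) product

print(glu_out.shape)                     # torch.Size([2, 3])
```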

ReGLU, GEGLU and SwiGLU

Instead of the sigmoid, we can use other activations in the GLU as well. ReGLU, GEGLU, and SwiGLU are exactly what their names suggest: the sigmoid is replaced with ReLU, GELU, and Swish respectively.

As we saw earlier, the FFN of a transformer using the ReLU activation was:

FFN_ReLU(x, W₁, W₂) = max(0, xW₁)W₂

Replacing the ReLU part of the FFN with the GLU variants (again omitting the biases), we'll have:

FFN_GLU(x, W, V, W₂) = (σ(xW) ⊗ xV)W₂
FFN_ReGLU(x, W, V, W₂) = (max(0, xW) ⊗ xV)W₂
FFN_GEGLU(x, W, V, W₂) = (GELU(xW) ⊗ xV)W₂
FFN_SwiGLU(x, W, V, W₂) = (Swish₁(xW) ⊗ xV)W₂

where Swish₁ denotes Swish with β set to 1.

We observe that all these layers have three weight matrices (W, V, and W₂), as opposed to two in the original FFN. So, to keep the amount of computation constant, the number of hidden units (the second dimension of W and V, and the first dimension of W₂) is reduced by a factor of 2/3, and only then is the performance comparison made.
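Putting the pieces together, here is a minimal sketch of an FFN using the SwiGLU variant, with the hidden size scaled by 2/3 as described above. The class name and sizes are illustrative, and F.silu stands in for Swish with β = 1:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """FFN_SwiGLU(x) = (Swish_1(x W) ⊗ x V) W2, with all biases omitted."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        # Scale the hidden size by 2/3 so the three matrices (W, V, W2)
        # cost roughly the same as the two matrices of the original FFN.
        hidden = int(2 * d_ff / 3)
        self.w = nn.Linear(d_model, hidden, bias=False)
        self.v = nn.Linear(d_model, hidden, bias=False)
        self.w2 = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w(x)) * self.v(x))

ffn = SwiGLUFeedForward()
print(ffn(torch.randn(2, 10, 512)).shape)   # torch.Size([2, 10, 512])
```

With d_ff = 2048 the hidden size becomes 1365, so the three matrices hold roughly the same number of parameters as the two matrices of the original FFN.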

These GLU variants performed better on many downstream language understanding tasks, and they are simple architectural changes without any computational drawbacks. As to why they seem to perform better, we have a quote from the paper:

We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.

Owing to these results and the significant increase in quality compared to the standard ReLU, GELU, or Swish activations, the authors of the PaLM⁶ model decided to use the SwiGLU activation for the multilayer perceptron (MLP) intermediate activations.

What’s next?

ReLU is still one of the most used activation functions, as it is fast, simple, and works well for a whole range of tasks. But with GELU and Swish, we have viable alternatives with complementary properties that seem to work better on a lot of tasks. Combining these activations with the GLU and using the result in a transformer's FFN leads to quality improvements over the typically used ReLU or GELU activations.

If anything, this highlights how important a carefully chosen activation function is to the overall performance of a model, both in training time and in evaluation metrics on test data.

Have we arrived at the perfect activation (at least for transformer models) with these GLU variants? I strongly believe not.

Is there a perfect activation function that would work well for all kinds of models?

Or maybe there will be an entirely novel way to introduce non-linearity and emulate the properties of an activation function, one that won't depend entirely on the person training the model.

With new methods being tried out and experimented with, maybe we will find out.

If you’re interested in the transformer architecture, links to some amazing resources along with the original paper are below:

https://arxiv.org/abs/1706.03762

https://jalammar.github.io/illustrated-transformer/

https://medium.com/towards-data-science/attn-illustrated-attention-5ec4ad276ee3

References

[1] Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for Activation Functions.

[2] Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs).

[3] Dauphin, Y. N., Fan, A., Auli, M., & Grangier, D. (2016). Language Modeling with Gated Convolutional Networks.

[4] Shazeer, N. (2020). GLU Variants Improve Transformer.

[5] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need.

[6] Chowdhery, A., et al. (2022). PaLM: Scaling Language Modeling with Pathways.
