GLU: Gated Linear Unit implementation

Alvaro Durán Tovar
Deep Learning made easy
3 min read · Dec 3, 2020

From paper to code

Photo by Dima Pechurin on Unsplash

I have started looking into an architecture called TabNet that aims at interpretability for tabular problems. As part of that I'll write a couple of posts about some of its components, in this case about the GLU activation (gated linear unit). The next one will be about Ghost BatchNorm.

Gated Linear Unit

Related papers:
* TabNet: https://arxiv.org/abs/1908.07442
* Language modeling with Gated Convolutional Networks: https://arxiv.org/abs/1612.08083

The idea is simple. I want to allow the network to decide how much information should flow through a given path, like a logical gate, hence the name. How?

  • If we multiply X by 0, nothing passes.
  • If we multiply X by 1, everything passes.
  • If we multiply X by 0.5, half of it passes.

Does this remind you of something? It does, right? LSTMs! It's inspired by the idea of the gates of LSTMs, applied to convolutions and linear layers instead, but it's the same idea.

How can we obtain a value between 0 and 1?

C’mon that’s an easy one!… Yes! The sigmoid function!
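For reference, the sigmoid squashes any real number into the open interval (0, 1):

σ(x) = 1 / (1 + e^(-x))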

Implementation

The formula from the paper looks like this:

h(X) = (X·W + b) ⊗ σ(X·V + c)

σ means the sigmoid function and ⊗ is element-wise multiplication. So we have two sets of weights, W and V, and two biases, b and c. One naive way to implement this is:

  • X*W + b is just a linear transformation, so we can use a linear layer for it.
  • Same for X*V + c.
  • Then apply the sigmoid to one of them, multiply the two element-wise, and we are done (see the sketch below).
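Here is a minimal PyTorch sketch of that naive version. The constructor arguments are my own choice, but the linear1/linear2 names match what the next section refers to, and I'm calling it SlowGLU like in the experiments later on:

```python
import torch
from torch import nn

class SlowGLU(nn.Module):
    """Naive GLU: (X*W + b) * sigmoid(X*V + c) using two separate linear layers."""

    def __init__(self, input_size: int, output_size: int):
        super().__init__()
        self.linear1 = nn.Linear(input_size, output_size)  # X*W + b
        self.linear2 = nn.Linear(input_size, output_size)  # X*V + c

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate the first projection with the sigmoid of the second one
        return self.linear1(x) * torch.sigmoid(self.linear2(x))
```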

Making it faster

I said naive because there is a better (faster) way to implement it. Giving it a fancy name like FastGLU, it becomes:
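A sketch of what that fused version could look like (again, the constructor arguments and the internal layer name are assumptions on my part):

```python
import torch
from torch import nn

class FastGLU(nn.Module):
    """Fused GLU: one matrix multiplication producing both halves, then a split."""

    def __init__(self, input_size: int, output_size: int):
        super().__init__()
        # A single layer computing both X*W + b and X*V + c in one go
        self.linear = nn.Linear(input_size, output_size * 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        values, gates = self.linear(x).chunk(2, dim=-1)  # split into the two halves
        return values * torch.sigmoid(gates)
```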

What's the difference? In the first snippet we do two matrix multiplications (when calling linear1 and linear2); in this last one we do only one, followed by splitting the output appropriately, but it computes the same thing.
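A quick sanity check of the two sketches above (same shapes, same structure; the weights are of course initialized independently, so the outputs won't be numerically identical):

```python
x = torch.randn(32, 16)  # batch of 32 samples with 16 features

slow = SlowGLU(input_size=16, output_size=8)
fast = FastGLU(input_size=16, output_size=8)

print(slow(x).shape)  # torch.Size([32, 8])
print(fast(x).shape)  # torch.Size([32, 8])
```

By the way, PyTorch ships this split-and-gate pattern as a built-in, torch.nn.functional.glu, which does exactly the chunk-then-sigmoid trick used in FastGLU.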

How well does this compare with other activations?

From the paper:

[…]
In contrast, the gradient of the gated linear unit

∇[X ⊗ σ(X)] = ∇X ⊗ σ(X) + X ⊗ σ′(X)∇X

has a path ∇X ⊗ σ(X) without downscaling for the activated gating units in σ(X). This can be thought of as a multiplicative skip connection which helps gradients flow through the layers.
[…]

It reminds me of ResNets.

Experiment time

I ran some experiments to see what happens when I compare it with other activations: Sigmoid (because we are using a sigmoid here) and ReLU (because it's the de facto standard).
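The notebook linked below has the actual setup; just to make the comparison concrete, here is a rough sketch of that kind of experiment. The toy regression task, model sizes and training settings are all assumptions of mine, reusing the FastGLU sketch from above:

```python
import torch
from torch import nn

def make_model(kind: str, input_size: int = 16, hidden: int = 32) -> nn.Module:
    # One small model per activation: GLU block vs Linear+Sigmoid vs Linear+ReLU
    if kind == "glu":
        body = FastGLU(input_size, hidden)  # the sketch from above
    elif kind == "sigmoid":
        body = nn.Sequential(nn.Linear(input_size, hidden), nn.Sigmoid())
    else:  # "relu"
        body = nn.Sequential(nn.Linear(input_size, hidden), nn.ReLU())
    return nn.Sequential(body, nn.Linear(hidden, 1))

# Toy data: a noisy random linear target
torch.manual_seed(0)
X = torch.randn(512, 16)
y = X @ torch.randn(16, 1) + 0.1 * torch.randn(512, 1)

for kind in ["glu", "sigmoid", "relu"]:
    model = make_model(kind)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for step in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    print(f"{kind}: final MSE = {loss.item():.4f}")
```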

If we ignore the green and purple lines for now (2 linear layers + ReLU), we can see that the FastGLU and SlowGLU versions perform quite similarly, and better than the sigmoid version.

The purple line is so bumpy because the values are very small (the y axis is in log scale); it doesn't really change much. The green line suffers from dead ReLUs (a must read: https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b).

Notebook can be found here: https://github.com/hermesdt/machine_learning/blob/master/GLU.ipynb

Conclusions

Not many conclusions to be made from this, to be honest, but here are some:

  • ReLU actually did amazingly well for this specific case, but it's very unstable. This is just a toy to play with, but the lesson still applies to real-world projects: there is no silver bullet.
  • GLU is far more stable than ReLU and learns faster than sigmoid.
  • I also tried other fancy activations like Mish; they weren't better than GLU here.
  • Watch out for dead ReLUs.
