Gated Linear Unit — Enabling stacked convolutions to out-perform RNNs

Pragyan Subedi
3 min read · Feb 9, 2024


This article is a concise explanation of the Gated Linear Unit (GLU) gating mechanism introduced in the paper Language Modeling with Gated Convolutional Networks.

Main Idea

“The pre-dominant approach to language modeling to date is based on recurrent neural networks. Their success on this task is often linked to their ability to capture unbounded context.

We propose a novel simplified gating mechanism that outperforms Oord et al. (2016b) and investigate the impact of key architectural decisions.

To our knowledge, this is the first time a non-recurrent approach is competitive with strong recurrent models on these large scale language tasks.” — Research Paper

Replacing Recurrent Neural Networks with Convolutions For Language Modeling

Convolutions provide two explicit benefits over Recurrent Neural Networks for language modeling:

1. Convolutions can capture long-term dependencies through stacking, at O(N/k) time complexity

Recurrent Neural Networks view the input as a chain structure, which leads to a time complexity of O(N) for a context size of N.

Convolutional networks, however, can also represent large context sizes by stacking layers, extracting hierarchical features over larger and larger contexts that become increasingly abstract.

This allows them to model long-term dependencies by applying O(N/k) operations with a kernel of width k.
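
As a rough, hedged illustration (assuming a plain stack of 1-D convolutions of kernel width k, with no dilation), the receptive field grows by k - 1 positions per layer, so covering a context of N tokens needs on the order of N/k layers:

```python
import math

def layers_needed(context_size: int, kernel_width: int) -> int:
    """Smallest number of stacked layers L such that L * (k - 1) + 1 >= context_size."""
    return math.ceil((context_size - 1) / (kernel_width - 1))

if __name__ == "__main__":
    for n in (16, 64, 256):
        # Grows roughly like N / k: prints 4, 16, 64 for kernel width 5.
        print(n, layers_needed(n, kernel_width=5))
```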

2. The convolution operation can be parallelized across the sequence, in contrast to sequential recurrent computation

In recurrent networks, each output relies on the prior hidden state, limiting the ability to parallelize computations across sequence elements.

Conversely, convolutional networks excel in this regard, as they can process all input elements simultaneously.

Gating Convolutions Using the Gated Linear Unit

Recurrent Neural Networks use gating to capture long-term dependencies by mitigating the vanishing gradient problem, among other benefits such as smoothing gradient flow during backpropagation and controlling the information that flows through the network.

Following this notion of bringing the benefits of gating to convolutional networks, Gated Linear Units provide a linear path for the gradients while retaining non-linear capabilities.

The approach is as follows:

  1. All input words are represented by vector embeddings stored in a lookup table D^{|V|×e}, where |V| is the number of words in the vocabulary and e is the embedding size. For a sequence of N input words, the output of the embedding layer is therefore of shape (N, embedding_dimension).
  2. The output of the embedding layer is passed to the hidden layers, which carry out the convolution operation. Here, A = Word Embeddings (E) ∗ Weights (W) + Bias (b) and B = Word Embeddings (E) ∗ Weights (V) + Bias (c). Both weight tensors (W and V) are real-valued (R) with dimensions k × m × n (k = kernel/patch width, m = number of input features, n = number of output features).
  3. The gated output is obtained as the element-wise multiplication A ⊗ σ(B).
  4. Finally, the gated output is passed through a softmax output layer with its own weights.

Putting it all together, the equation hₗ(X) = (X ∗ W + b) ⊗ σ(X ∗ V + c) represents the output of layer l of a Gated Linear Unit.

It combines a linear transformation of the input with a gating mechanism, allowing the unit to selectively pass or block information based on the learned gating values. This architecture enables GLUs to capture complex patterns and dependencies within the data, making them valuable components in deep learning architectures.
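
A minimal sketch of such a gated convolutional block in PyTorch; the layer sizes, names, and the plain (non-causal) nn.Conv1d used here are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """Sketch of h(X) = (X * W + b) ⊗ σ(X * V + c) using 1-D convolutions."""

    def __init__(self, emb_dim: int, out_dim: int, kernel_width: int):
        super().__init__()
        # Two parallel convolutions: one for the linear path (weights W, bias b),
        # one for the gate (weights V, bias c).
        self.conv_linear = nn.Conv1d(emb_dim, out_dim, kernel_width)
        self.conv_gate = nn.Conv1d(emb_dim, out_dim, kernel_width)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, emb_dim, seq_len), the channels-first layout expected by Conv1d.
        a = self.conv_linear(x)       # X * W + b
        g = self.conv_gate(x)         # X * V + c
        return a * torch.sigmoid(g)   # element-wise gating

# Illustrative usage: embed a batch of token ids, then apply the gated block.
vocab_size, emb_dim, out_dim, k = 1000, 128, 128, 5
embedding = nn.Embedding(vocab_size, emb_dim)
block = GatedConvBlock(emb_dim, out_dim, k)

tokens = torch.randint(0, vocab_size, (2, 20))   # (batch, seq_len)
x = embedding(tokens).transpose(1, 2)            # (batch, emb_dim, seq_len)
h = block(x)                                     # (batch, out_dim, seq_len - k + 1)
```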

Note: The convolutions need to be designed so that they do not look into the future during training. This is done by zero-padding the beginning of the sequence with k-1 elements (shifting the convolutional inputs so the kernels cannot see future context).
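
A small sketch of that left-only zero-padding, assuming the same channels-first layout as above; after a convolution of width k the output again has one position per input token:

```python
import torch
import torch.nn.functional as F

k = 5
x = torch.randn(2, 128, 20)      # (batch, channels, seq_len)

# Zero-pad only the beginning of the time axis with k - 1 elements,
# so a kernel of width k never covers future positions.
x_causal = F.pad(x, (k - 1, 0))  # (left_pad, right_pad) on the last dimension
print(x_causal.shape)            # torch.Size([2, 128, 24])
```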

Why do Gated Linear Units work?

“The output of each layer is a linear projection X ∗ W + b modulated by the gates σ(X ∗ V + c). Similar to LSTMs, these gates multiply each element of the matrix X ∗ W + b and control the information passed on in the hierarchy.” — Research Paper

In short, Gated Linear Units allow the network to focus on relevant features by selectively passing or blocking information. Coupling linear units with the gates lets the layer retain its non-linear capabilities while allowing the gradient to propagate through the linear unit without scaling.
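
As an aside, PyTorch ships this gating as torch.nn.functional.glu, which splits its input in half along a chosen dimension and returns the first half multiplied by the sigmoid of the second; a minimal usage sketch:

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 16)      # last dimension is split in half into A and B
out = F.glu(x, dim=-1)      # A ⊗ σ(B), shape (4, 8)

# Equivalent explicit form:
a, b = x.chunk(2, dim=-1)
assert torch.allclose(out, a * torch.sigmoid(b))
```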
