GRUs and LSTMs for Natural Language Processing

Matthew Kramer · Published in CodeX · Nov 6, 2021

Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM) networks are recurrent neural networks (RNNs) that improve on vanilla RNNs and have proven quite useful in natural language processing tasks.

This article builds off my RNNs for Natural Language Processing article, which is recommended reading as a prerequisite.

Where Vanilla RNNs Fall Short

As described in RNNs for Natural Language Processing, vanilla RNNs struggle to retain the context of earlier words in longer sequences, a shortcoming commonly attributed to the vanishing gradient problem. For example, consider this script from Pulp Fiction:

“Jules: I think her biggest deal was she starred in a pilot.

Vincent: What’s a pilot?

Jules: Well, you know the shows on TV?

Vincent: I don’t watch TV.

Jules: Yes, but you’re aware that there’s an invention called television, and on that invention they show shows?

Vincent: Yeah.

Jules: Well, the way they pick the shows on TV is they make one show, and that show’s called a pilot. And they show that one show to the people who pick the shows, and on the strength of that one show, they decide if they want to make more shows. Some get accepted and become TV programs, and some don’t, and become nothing. She starred in one of the ones that became nothing.”

Let’s say we wish to build an NLP system to provide grammar and spelling suggestions. The system will take in the Pulp Fiction script and highlight words that don’t follow grammar conventions, much like Microsoft Word does.

Consider the last sentence in the script: She starred in one of the ones that became nothing. How would the system determine that ‘She’ is the correct pronoun? Reading the script, one can discern that ‘she’ is the correct pronoun from the first sentence: I think her biggest deal was she starred in a pilot.

However, as you can see, the sentence needed for determining the correct pronoun occurs much earlier in the script. A vanilla RNN would dilute this context, as it has to pass its hidden state through each unit of the input, which in this case spans about 100 words. Those 100 words are not useful for determining the correct pronoun and drown out the signal from the first sentence.

GRUs and LSTMs attempt to solve this problem by adding specialized gates to the network in order to ‘remember’ earlier parts of the sequence more easily and ‘forget’ irrelevant parts of the sequence.

Gated Recurrent Units (GRUs)

Recall from RNNs for Natural Language Processing that the basic unit of a vanilla RNN looks something like this:

A hidden state, h^t, is passed from one recurrent unit to the next, and at each step an output, ŷ, is produced. In a vanilla RNN, the hidden state is always updated to a new value at each unit, based on the network’s learned weights and biases. A GRU adds some additional steps that let the network choose whether to update the hidden state at each unit, which means the hidden state can be passed through almost unchanged.

A simplified version of the Gated Recurrent Unit can be summarized as:

  1. Similar to our vanilla RNN, concatenate the previous hidden state vector, h, and the input vector, x, to create: [x^t, h^{t-1}]
  2. Make two copies of the concatenated vector.
  3. Similar to our vanilla RNN, multiply one copy of the concatenated vector by weights (W_h), add biases (b_h), and apply the tanh activation function - this gives the candidate value (c^t) that we may update the hidden state with.
  4. Introduce a new set of weights (W_u) and biases (b_u), called the update weights/biases, multiply them with the second copy of the concatenated vector, and pass the result through the sigmoid function. Because of the sigmoid, the result is a number between 0 and 1, usually very close to one or the other. This result is referred to as the update value, u.
  5. Use the update value u to decide whether to propagate the previous hidden state unchanged or replace it with the candidate value. This can be represented as h^t = u * c^t + (1 - u) * h^{t-1}, which keeps the previous hidden state when u is 0 and updates it to the candidate value when u is 1.
  6. Finally, to get the current output value (ŷ), feed the result through the softmax function.

The following diagram provides a visual reference for each step in the simplified GRU.

Simple GRU visual diagram

This simplified GRU can also be expressed as the following series of equations (where * denotes element-wise multiplication):

Simplified GRU equations
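
For reference, the steps above can be written out roughly as follows (I use [x^t, h^{t-1}] for the concatenated vector; the output projection W_y, b_y before the softmax is an assumption here):

```latex
\begin{aligned}
c^t &= \tanh\!\left(W_h\,[x^t, h^{t-1}] + b_h\right) \\
u &= \sigma\!\left(W_u\,[x^t, h^{t-1}] + b_u\right) \\
h^t &= u * c^t + (1 - u) * h^{t-1} \\
\hat{y}^t &= \operatorname{softmax}\!\left(W_y\, h^t + b_y\right)
\end{aligned}
```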

The full version of the GRU uses an additional ‘relevance’ gate, which lets the network learn how relevant the previous hidden state is when computing each candidate value.

Full GRU equations
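
Written out, the full GRU with relevance gate r commonly takes a form like the following (the relevance gate scales the previous hidden state before the candidate value is computed):

```latex
\begin{aligned}
r &= \sigma\!\left(W_r\,[x^t, h^{t-1}] + b_r\right) \\
u &= \sigma\!\left(W_u\,[x^t, h^{t-1}] + b_u\right) \\
c^t &= \tanh\!\left(W_h\,[x^t, r * h^{t-1}] + b_h\right) \\
h^t &= u * c^t + (1 - u) * h^{t-1} \\
\hat{y}^t &= \operatorname{softmax}\!\left(W_y\, h^t + b_y\right)
\end{aligned}
```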

In summary, we give the network some additional parameters to learn so that it can ignore certain inputs in the sequence and remain resilient to longer, noisier sequences.
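
To make the mechanics concrete, here is a minimal NumPy sketch of a single full-GRU step following the equations above. The parameter names and shapes are illustrative choices, and in practice you would reach for a framework implementation such as torch.nn.GRU.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def full_gru_step(x_t, h_prev, p):
    """One step of the full GRU (update gate u and relevance gate r).

    x_t:    input vector at time t, shape (n_x,)
    h_prev: previous hidden state,  shape (n_h,)
    p:      dict of parameters, e.g. p["W_r"] has shape (n_h, n_x + n_h)
            and p["W_y"] has shape (n_y, n_h)
    """
    concat = np.concatenate([x_t, h_prev])        # [x^t, h^{t-1}]

    r = sigmoid(p["W_r"] @ concat + p["b_r"])     # relevance gate
    u = sigmoid(p["W_u"] @ concat + p["b_u"])     # update gate

    # Candidate value is computed from the relevance-gated hidden state.
    gated = np.concatenate([x_t, r * h_prev])
    c = np.tanh(p["W_h"] @ gated + p["b_h"])

    # Keep the old hidden state where u is near 0; adopt the candidate where u is near 1.
    h_t = u * c + (1.0 - u) * h_prev

    # Per-step output.
    y_t = softmax(p["W_y"] @ h_t + p["b_y"])
    return h_t, y_t
```

Processing a whole sequence is then just a loop over time steps that threads h_t from one call into the next.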

Long Short-Term Memory (LSTM)

Similar to the GRU, LSTMs enable networks to more easily ignore irrelevant parts of a sequence. However, LSTMs are a bit more complex, using an additional gate to achieve a similar effect to GRUs. Three gates are computed, each with its own weights and biases: the update gate, u, the forget gate, f, and the output gate, o. Rather than multiplying the previous state by the opposite of the update gate (1 - u), an LSTM applies the forget gate. The output gate is then used to weight the result produced by the update and forget gates. This can be represented in the following equations:

LSTM Equations
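
One common formulation, which keeps a separate memory cell c^t alongside the hidden state h^t and applies the forget gate to the previous cell value, looks like this (variants differ slightly in detail):

```latex
\begin{aligned}
\tilde{c}^t &= \tanh\!\left(W_c\,[x^t, h^{t-1}] + b_c\right) \\
u &= \sigma\!\left(W_u\,[x^t, h^{t-1}] + b_u\right) \\
f &= \sigma\!\left(W_f\,[x^t, h^{t-1}] + b_f\right) \\
o &= \sigma\!\left(W_o\,[x^t, h^{t-1}] + b_o\right) \\
c^t &= u * \tilde{c}^t + f * c^{t-1} \\
h^t &= o * \tanh(c^t) \\
\hat{y}^t &= \operatorname{softmax}\!\left(W_y\, h^t + b_y\right)
\end{aligned}
```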

Useful Resources
