GRUs vs. LSTMs

Tiger Shen
Paper Club
Aug 11, 2017

Notes on Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Overall impression: The authors seem to recognize that their study does not produce any novel ideas or breakthroughs (that’s okay! Not every study needs to). They show that gated units outperform vanilla recurrent units, confirming a result that had already been established independently. They are unable to clearly distinguish between the performance of the two gated units they tested. This paper mostly helped solidify my understanding of how GRUs and LSTMs are defined, how they are similar, and how they differ, and it also corrected my flawed assumption that GRUs are superior simply because they were developed later.

⁉️ Big Question

What techniques can be used to avoid the exploding and vanishing gradient problems in recurrent neural networks, thereby increasing their practicality and usefulness?

🏙 Background Summary

Recurrent Neural Networks (RNNs) were proposed several decades ago as an architecture that handles variable-length sequential input by way of a recurrent, shared hidden state. However, they remained largely impractical, due to the vanishing and exploding gradient problems during training, until the introduction of the long short-term memory (LSTM) recurrent unit: a more sophisticated activation function designed to reliably capture long-term dependencies.

Even setting aside the vanishing and exploding gradient problems, how would vanilla RNNs ever work without being able to track long-term dependencies and past state?

The LSTM was followed by the Gated Recurrent Unit (GRU), and both share the same goal: tracking long-term dependencies effectively while mitigating the vanishing/exploding gradient problems. The LSTM does so via input, forget, and output gates; the input gate regulates how much of the new candidate content is written to the cell state, the forget gate regulates how much of the existing memory is carried over (or forgotten), and the output gate regulates how much of the cell state is exposed to the next layers of the network. The GRU operates using a reset gate and an update gate. The reset gate sits between the previous activation and the candidate activation, allowing the unit to forget previous state, and the update gate decides how much of the candidate activation to blend into the new hidden state (the GRU has no separate cell state).
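
For concreteness, here is a minimal NumPy sketch of one time step of each unit, following the standard LSTM and GRU formulations rather than anything specific to this paper; the weight layout and dimensions are purely illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b are tuples of (input, forget, output, candidate) weights."""
    Wi, Wf, Wo, Wc = W
    Ui, Uf, Uo, Uc = U
    bi, bf, bo, bc = b
    i = sigmoid(Wi @ x + Ui @ h_prev + bi)        # input gate: how much new content to write
    f = sigmoid(Wf @ x + Uf @ h_prev + bf)        # forget gate: how much old memory to carry over
    o = sigmoid(Wo @ x + Uo @ h_prev + bo)        # output gate: how much memory to expose
    c_tilde = np.tanh(Wc @ x + Uc @ h_prev + bc)  # candidate cell content
    c = f * c_prev + i * c_tilde                  # additive memory update
    h = o * np.tanh(c)                            # only a gated view of the cell state is exposed
    return h, c

def gru_step(x, h_prev, W, U, b):
    """One GRU step; W, U, b are tuples of (update, reset, candidate) weights."""
    Wz, Wr, Wh = W
    Uz, Ur, Uh = U
    bz, br, bh = b
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev + br)              # reset gate: drop parts of the old state
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev) + bh)  # candidate activation
    return (1 - z) * h_prev + z * h_tilde               # blend old and new; the whole state is exposed
```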

Both LSTMs and GRUs can carry memory/state forward from previous activations, rather than replacing the entire activation the way a vanilla RNN does. This lets them remember features for a long time, and it lets error be backpropagated along this additive path without repeatedly passing through bounded nonlinearities, which reduces the likelihood of vanishing gradients.

The backpropagation piece of this claim is non-intuitive to me, and I’d be glad for a clear explanation.

LSTMs control the exposure of their memory content (the cell state) via the output gate, while GRUs expose their entire state to the rest of the network. The LSTM unit also has separate input and forget gates, while the GRU performs both of these operations together via its update gate (keeping a portion of the old state and filling the rest with the candidate).

❓ Specific question(s)

  • Do RNNs using recurrent units with gates outperform vanilla networks?
  • Does the LSTM or the GRU perform better as a recurrent unit for tasks such as music and speech prediction?

💭 Approach

The authors will build an LSTM model, a GRU model, and a vanilla RNN model and compare their performances using a log-likelihood loss function over polyphonic music modelling and speech signal modelling datasets.
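
As a rough illustration of what a log-likelihood sequence objective looks like, here is a sketch assuming binary, piano-roll-style targets with sigmoid outputs; this is not necessarily the paper’s exact setup for either task.

```python
import numpy as np

def sequence_nll(logits, targets, eps=1e-12):
    """Average per-timestep negative log-likelihood of binary targets under sigmoid outputs.

    logits, targets: arrays of shape (timesteps, dims). Illustrative only; the paper's
    actual output distributions for the music and speech tasks are not reproduced here.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))
    log_lik = targets * np.log(probs + eps) + (1.0 - targets) * np.log(1.0 - probs + eps)
    return -log_lik.sum(axis=1).mean()
```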

⚗️ Methods

The authors built models for each of their three test units (LSTM, GRU, and tanh) according to the following criteria:

  • similar numbers of parameters in each network, for fair comparison
  • RMSProp optimization
  • learning rate chosen through experimentation from 10 candidate points spanning exponents -12 to -6 (a sketch of how such candidates could be drawn follows this list)
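
For the learning-rate bullet, here is one plausible reading of how such candidates could be drawn, assuming the -12 to -6 range refers to base-10 exponents sampled log-uniformly; the paper’s exact procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 10 learning-rate candidates log-uniformly between 1e-12 and 1e-6
# (assumption: the "-12 to -6" range above means base-10 exponents).
exponents = rng.uniform(-12.0, -6.0, size=10)
candidates = 10.0 ** exponents

# Each candidate would then be trained/validated and the best performer kept.
print(np.sort(candidates))
```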

They tested their models across four music datasets and two speech datasets.

📓 Results

For the music datasets, all of the models performed relatively closely, with the GRU-RNN inching a bit ahead. For the speech datasets, the gated units clearly outperformed the tanh unit, with the GRU-RNN once again producing the best results in terms of both accuracy and training time. As this is the more challenging task, the authors derive more signal from it.

Seems like a leap to equate the more challenging task (speech prediction) with a higher signal. Is this okay? Performance is relative anyways, so the playing field is just as even for the hard task as the easier one.

However, the results are not enough to declare a winner between LSTMs and GRUs; this suggests that the better choice of gated unit may depend on the task at hand.

I was surprised to see the music dataset results; I was under the impression that gated units were a large improvement over vanilla RNNs. I wish the authors had addressed the vanishing/exploding gradient problem as they trained their vanilla RNN; this might be obvious to others, but it was not clear to me from reading the paper. My interpretation was “hey, there’s a huge problem with vanilla RNNs, but we’re training one and it’s fine”.

🤠 Conclusion

As noted in my overall impression above, the authors seem to recognize that their study does not produce any novel ideas or breakthroughs. They confirm that gated units outperform vanilla recurrent units, a result that had already been established independently, but they are unable to clearly distinguish between the performance of the two gated units they tested.

They do not point out flaws in their own study.

They vaguely propose further testing and experimentation on these gated units, perhaps on different datasets or tasks, as a next step.

⏩ Viability as a Project

This paper is comparing two existing techniques/architectures rather than proposing any novel ideas, so unless one were inclined to continue this research it would not be directly applicable as a project.

🔁 Abstract

The abstract does match what the authors said in the paper, and it does fit with my interpretation of the paper.

It is a bit unsatisfying for the conclusion to be that GRUs and LSTMs are “comparable”, but I suppose that’s better than manufacturing a reason for one to be superior to the other.

🗣 What do other researchers say?

  • Couldn’t find anything :/

🤷‍ Words I don’t know

  • logistic sigmoid function: the particular s-shaped (sigmoid) function σ(x) = 1 / (1 + e^(-x)), which squashes its input to the range (0, 1)
  • affine transformation: a transformation that preserves points, straight lines, planes, and parallelism (i.e., a linear map followed by a translation). It does not necessarily preserve angles or distances.
  • polyphonic: producing many sounds simultaneously
  • log-uniform: a distribution whose logarithm is uniformly distributed, so values are spread evenly on a log scale rather than a linear one. For example, a log-uniform distribution between 128 and 4000 puts as much probability between 128 and ~716 as between ~716 and 4000 (equal ratios); a short sketch follows.
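
Since the original illustration did not survive, here is a small illustrative sketch of drawing log-uniform samples between 128 and 4000:

```python
import numpy as np

rng = np.random.default_rng(0)

# Log-uniform sampling: draw uniformly in log-space, then exponentiate.
samples = np.exp(rng.uniform(np.log(128), np.log(4000), size=5))
print(np.round(samples, 1))  # on a linear scale these cluster toward the low end
```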
