Recurrent Models Overview

Recurrent Layers: SimpleRNN, LSTM, GRU

Jake Batsuuri
Computronium Blog
Mar 15, 2021

--

What’s SimpleRNN?

SimpleRNN is the recurrent layer object in Keras.

Remember that we feed in each data point as a sequence, for example an entire review, spread across a number of timesteps.

Now, the SimpleRNN processes data in batches, just like every other neural network. So this layer takes as input a tensor of shape (batch_size, timesteps, input_features).

All recurrent layers in Keras can be run in two different modes:

  • All of the successive outputs, aka the states: a 3D tensor of shape (batch_size, timesteps, output_features)
  • Just the last state: a 2D tensor of shape (batch_size, output_features)

Let’s look at writing a simple recurrent model:
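Here is a minimal sketch of such a model (the vocabulary size and layer width of 10,000 and 32 are assumptions, not values from the original code):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN

model = Sequential()
model.add(Embedding(10000, 32))  # map word indices into 32-dimensional vectors
model.add(SimpleRNN(32))         # by default, returns only the last state
model.summary()
```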

The model summary is:
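For the sketch above, the summary would come out roughly like this (the output shapes and parameter counts follow from the assumed sizes):

```
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, None, 32)          320000
simple_rnn (SimpleRNN)       (None, 32)                2080
=================================================================
Total params: 322,080
Trainable params: 322,080
Non-trainable params: 0
```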

We can see that the output here is the last state. If instead we want all of the states to be returned, we set return_sequences=True.

Then the model looks like:
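With the same assumed sizes as before:

```python
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))  # one 32-d output per timestep
model.summary()
```

The SimpleRNN output shape is now (None, None, 32): one state per timestep instead of only the last one.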

How do you increase the representational power of a network?

Remember that the Recurrent Layer combines the current input with the previous output, thereby preserving some kind of relationship between the time steps, given that the data is sequential.

Remember that with Convolutional Layers, stacking them makes the network learn ever more abstract concepts with each added layer.

To do this in a Keras model, every hidden recurrent layer has to return its full sequence of states to the layer above it:
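A sketch of a stacked version, again with assumed sizes:

```python
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))  # intermediate layers pass on
model.add(SimpleRNN(32, return_sequences=True))  # their full sequence of states
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32))                         # only the last layer keeps just the final state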

Although spatial hierarchy is an essential property of deep convolutional models, deepening recurrent models doesn’t give the same kind of theoretical boost. In practice, however, increasing the representational power this way does improve performance on certain tasks.

It just depends on whether the data supplied has higher-order abstract patterns that can be learned. Most of the time it doesn’t.

How do you train an RNN?
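As a concrete setup, here is a sketch that assumes the IMDB review dataset that ships with Keras, padded to a fixed length, with the SimpleRNN model from above plus a sigmoid classifier on top (the hyperparameters are assumptions):

```python
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

max_features = 10000  # vocabulary size (assumed)
maxlen = 500          # cut reviews off after 500 words (assumed)

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))  # binary sentiment prediction

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10, batch_size=128, validation_split=0.2)
```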

When we plot the results of this model training, we get:
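Assuming matplotlib and the history object returned by fit above, the plot can be produced like this:

```python
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.legend()
plt.show()
```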

We get a validation accuracy of about 85%, which isn’t all that great. Hence the need for something more powerful.

When we try to find out why, we realize that our recurrence has a bit of a decaying shape when it comes to retaining information from previous inputs. The last state it will remember, but a bunch of steps ago? Fuggetaboutit.

This is due to the vanishing gradient problem: as a network gets deeper, or is unrolled over more timesteps, the gradient signal that should update the earlier, further-away elements all but disappears.

What’s the LSTM Layer?

The Long Short-Term Memory layer was invented to solve the vanishing gradient problem. A simplistic analogy: imagine a conveyor belt running alongside your unrolled recurrent layer. On this conveyor belt you can store the previous states, and the current timestep can take any weighted combination of them, even stuff from long ago that would normally have all but vanished.

Code wise, it looks like this:
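A sketch, swapping the LSTM layer into the same assumed IMDB setup as before:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))                        # drop-in replacement for SimpleRNN(32)
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
```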

If a simple RNN has as input:

  • Input
  • State from previous

The LSTM has as input:

  • Input
  • State from previous
  • Long term information carrier

Let’s see its performance:
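Training it the same way as the simple RNN above, with the same assumed hyperparameters:

```python
history = model.fit(x_train, y_train,
                    epochs=10, batch_size=128, validation_split=0.2)
```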

This gives us an accuracy of 89%.

What’s the GRU Layer?

GRU and LSTM both try to solve the vanishing gradient problem. Accuracy-wise they’re on par with each other; however, the GRU might be a bit more efficient to run. A simple GRU RNN might look like:
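A sketch, with the same assumed IMDB setup:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(GRU(32))   # fewer parameters than an LSTM of the same width
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
```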

GRUs are simpler and are sometimes the preferred choice among language-focused practitioners, while others will say the LSTM is more sophisticated and so should theoretically offer better accuracy. The only consensus is that they are comparable and that you should probably try both on your task, favoring LSTM for language-related and more complex tasks, and GRU for simpler tasks with fewer data points.

Let’s see an example of GRU with temporal data:

This is a dataset of temperature, pressure, humidity, and so on. Since it’s in a CSV, let’s put it into an array:
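A sketch of the parsing step, assuming a file laid out like the Jena climate dataset: a header row, then one line per 10-minute observation, with a date-time string in the first column (the file name here is an assumption):

```python
import numpy as np

with open('jena_climate_2009_2016.csv') as f:  # assumed file name
    lines = f.read().split('\n')

header = lines[0].split(',')
lines = [line for line in lines[1:] if line]   # drop the header and any empty lines

float_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
    float_data[i, :] = [float(x) for x in line.split(',')[1:]]  # skip the date column
```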

For our prediction problem we define the following:

  • delay = 144: targets will be 24 hours in the future; the data comes in 10-minute steps, so there are 6 points per hour and 144 per day
  • lookback = 720: observations will go back 5 days (5 × 24 × 6 = 720)
  • steps = 6: observations will be sampled at one data point per hour

It’s always a good idea to normalize our data:
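A sketch, normalizing with the mean and standard deviation of the training portion only (the 200,000-timestep training split is an assumption):

```python
mean = float_data[:200000].mean(axis=0)
float_data -= mean
std = float_data[:200000].std(axis=0)
float_data /= std
```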

Our model does everything in batches, so we’ll need to create batches from our array:
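A sketch of a generator that yields such batches of sliding windows; the assumption that the temperature sits in column 1 of the array is mine:

```python
def generator(data, lookback, delay, min_index, max_index,
              shuffle=False, batch_size=128, step=6):
    """Yield (samples, targets) batches of sliding windows over `data`."""
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    while True:
        if shuffle:
            rows = np.random.randint(min_index + lookback, max_index, size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)

        samples = np.zeros((len(rows), lookback // step, data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            indices = range(row - lookback, row, step)
            samples[j] = data[indices]
            targets[j] = data[row + delay][1]  # temperature, `delay` steps ahead (column index assumed)
        yield samples, targets
```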

Just like in every machine learning problem, we’ll generate three sets of data, each served by its own generator (sketched after the list):

  • Training
  • Validating
  • Testing
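One way to wire up the three generators; the split points between the sets are assumptions:

```python
lookback = 720
step = 6
delay = 144
batch_size = 128

train_gen = generator(float_data, lookback, delay, min_index=0, max_index=200000,
                      shuffle=True, step=step, batch_size=batch_size)
val_gen = generator(float_data, lookback, delay, min_index=200001, max_index=300000,
                    step=step, batch_size=batch_size)
test_gen = generator(float_data, lookback, delay, min_index=300001, max_index=None,
                     step=step, batch_size=batch_size)

# How many batches to draw to cover each full set once
val_steps = (300000 - 200001 - lookback) // batch_size
test_steps = (len(float_data) - 300001 - lookback) // batch_size
```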

Our model is:
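A sketch of a single-GRU model for this regression problem; the layer width, epoch count, and steps per epoch are assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense

model = Sequential()
model.add(GRU(32, input_shape=(None, float_data.shape[-1])))
model.add(Dense(1))  # no activation: we regress the normalized temperature directly

model.compile(optimizer='rmsprop', loss='mae')
history = model.fit(train_gen, steps_per_epoch=500, epochs=20,
                    validation_data=val_gen, validation_steps=val_steps)
```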

This approach doesn’t regularize, so let’s add recurrent dropout:
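A sketch with dropout applied to both the input and the recurrent connections (the 0.2 rates are assumptions):

```python
model = Sequential()
model.add(GRU(32,
              dropout=0.2,             # dropout on the input connections
              recurrent_dropout=0.2,   # dropout on the recurrent connections
              input_shape=(None, float_data.shape[-1])))
model.add(Dense(1))
model.compile(optimizer='rmsprop', loss='mae')
```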

A model like this gets us a mean absolute error (MAE) of about 0.265, which when denormalized comes out to about 2.35 degrees Celsius. That’s a pretty okay prediction. We can tune this model further by stacking more layers or with hyperparameter optimization.

Other Articles

Up Next…

Coming up next is probably Recurrent Neural Networks and LSTM Layers. If you would like me to write another article explaining a topic in-depth, please leave a comment.

For the table of contents and more content click here.
