# Recurrent Models Overview

## Recurrent Layers: SimpleRNN, LSTM, GRU

# What’s SimpleRNN?

SimpleRNN is the most basic recurrent layer available in Keras.

`from keras.layers import SimpleRNN`

Remember that each data point we feed in is a whole sequence, for example an entire review, and that the length of that sequence is the number of timesteps.

Like every other neural network layer, SimpleRNN processes data in batches, so it takes as input a tensor of shape `(batch_size, timesteps, input_features)`.

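All recurrent layers in Keras can be run in two different modes:

- Return all of the successive outputs (the states), one per timestep, as a 3D tensor of shape `(batch_size, timesteps, output_features)`
- Return just the last state, as a 2D tensor of shape `(batch_size, output_features)`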
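To make the two modes concrete, here is a minimal NumPy sketch of what a SimpleRNN-style layer computes. It is an illustration under simple assumptions (a tanh activation, a zero initial state, random weights), and the function name `simple_rnn_forward` is made up for this example; it is not the Keras implementation.

```python
import numpy as np

def simple_rnn_forward(inputs, W, U, b, return_sequences=False):
    """Run a tanh RNN over inputs of shape (batch_size, timesteps, input_features)."""
    batch_size, timesteps, _ = inputs.shape
    units = U.shape[0]
    state = np.zeros((batch_size, units))  # initial state: all zeros
    outputs = []
    for t in range(timesteps):
        # Combine the current input with the previous output (the state).
        state = np.tanh(inputs[:, t, :] @ W + state @ U + b)
        outputs.append(state)
    if return_sequences:
        return np.stack(outputs, axis=1)  # (batch_size, timesteps, units)
    return state                          # (batch_size, units)

rng = np.random.default_rng(0)
batch_size, timesteps, input_features, units = 4, 10, 8, 32
x = rng.normal(size=(batch_size, timesteps, input_features))
W = rng.normal(scale=0.1, size=(input_features, units))
U = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)

all_states = simple_rnn_forward(x, W, U, b, return_sequences=True)
last_state = simple_rnn_forward(x, W, U, b)
print(all_states.shape)  # (4, 10, 32)
print(last_state.shape)  # (4, 32)
```

The last state is simply the final slice of the full sequence of states, which is why the second mode is the cheaper default when only a summary of the sequence is needed.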

Let’s look at writing a simple recurrent model:

    from keras.models import Sequential
    from keras.layers import Embedding, SimpleRNN

    model = Sequential()
    model.add(Embedding(10000, 32))
    model.add(SimpleRNN(32))
    model.summary()

The model summary is:

    ________________________________________________________________
    Layer (type)                 Output Shape              Param #
    ================================================================
    embedding_22 (Embedding)     (None, None, 32)          320000
    ________________________________________________________________
    simplernn_10 (SimpleRNN)     (None, 32)                2080
    ================================================================
    Total params: 322,080
    Trainable params: 322,080
    Non-trainable params: 0

The output shape of the SimpleRNN layer, `(None, 32)`, shows that only the last state is returned. If we instead enable all states to be returned:

    model = Sequential()
    model.add(Embedding(10000, 32))
    model.add(SimpleRNN(32, return_sequences=True))
    model.summary()

Then the model looks like:

    ________________________________________________________________
    Layer (type)                 Output Shape              Param #
    ================================================================
    embedding_23 (Embedding)     (None, None, 32)          320000
    ________________________________________________________________
    simplernn_11 (SimpleRNN)     (None, None, 32)          2080
    ================================================================
    Total params: 322,080
    Trainable params: 322,080
    Non-trainable params: 0
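The parameter counts in the summaries can be checked by hand. The helper names below are hypothetical, and the formulas assume a plain embedding table and a SimpleRNN with one bias per unit, which is consistent with the numbers reported above.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding_dim-sized vector per token in the vocabulary.
    return vocab_size * embedding_dim

def simple_rnn_params(input_dim, units):
    # Input weights + recurrent weights + one bias per unit.
    return input_dim * units + units * units + units

print(embedding_params(10000, 32))                              # 320000
print(simple_rnn_params(32, 32))                                # 2080
print(embedding_params(10000, 32) + simple_rnn_params(32, 32))  # 322080
```

Note that `return_sequences=True` changes only the output shape. The same weights are reused at every timestep, so the parameter count stays at 2080 in both summaries.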

# How do you increase the representational power of a network?

Remember that a recurrent layer combines the current input with its previous output, which preserves some of the relationship between time steps in sequential data.

Remember also that stacking convolutional layers lets each successive layer learn more abstract concepts. Recurrent layers can be stacked for the same reason.

In a Keras model, stacking recurrent layers requires every intermediate layer to return its full sequence of states, so that the next recurrent layer receives a 3D tensor:

    model = Sequential()
    model.add(Embedding(10000, 32))
    model.add(SimpleRNN(32, return_sequences=True))
    model.add(SimpleRNN(32, return_sequences=True))
    model.add(SimpleRNN(32, return_sequences=True))
    model.add(SimpleRNN(32))
    model.summary()
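A toy NumPy version makes it clearer why the intermediate layers must return full sequences. The function `rnn_layer` below is a hypothetical stand-in with random weights and no bias, not the Keras layer; it only tracks how shapes flow through a stack.

```python
import numpy as np

def rnn_layer(inputs, units, return_sequences, rng):
    """Toy tanh RNN layer with random weights; expects a 3D input."""
    if inputs.ndim != 3:
        raise ValueError("expected (batch_size, timesteps, features)")
    features = inputs.shape[-1]
    W = rng.normal(scale=0.1, size=(features, units))
    U = rng.normal(scale=0.1, size=(units, units))
    state = np.zeros((inputs.shape[0], units))
    states = []
    for t in range(inputs.shape[1]):
        state = np.tanh(inputs[:, t] @ W + state @ U)
        states.append(state)
    return np.stack(states, axis=1) if return_sequences else state

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 32))      # e.g. a batch of embedded tokens
h1 = rnn_layer(x, 32, True, rng)     # (2, 5, 32): still a sequence
h2 = rnn_layer(h1, 32, False, rng)   # (2, 32): last state only
print(h1.shape, h2.shape)

try:
    rnn_layer(h2, 32, False, rng)    # 2D input: no timesteps left to iterate over
except ValueError as err:
    print("cannot stack on a last-state output:", err)
```

Only the final recurrent layer can drop the time axis, because every layer before it has to hand a sequence to the next one.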

Although spatial hierarchy is an essential property of deep convolutional models, deepening a recurrent model does not come with the same theoretical justification. In practice, however, the added representational power does improve performance on certain tasks.

It depends on whether the data contains higher-order abstract patterns that can be learned. Most of the time it doesn't.

# How do you train a RNN?

    from keras.layers import Dense

    # max_features, input_train and y_train are assumed to be defined already
    # (vocabulary size, padded review sequences and their binary labels).
    model = Sequential()
    model.add(Embedding(max_features, 32))
    model.add(SimpleRNN(32))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
    history = model.fit(input_train, y_train,
                        epochs=10, batch_size=128, validation_split=0.2)

We can plot the results of this training run with:

    import matplotlib.pyplot as plt

    acc = history.history['acc']
    val_acc = history.history['val_acc']
    loss = history.history['loss']
    val_loss = history.history['val_loss']

    epochs = range(1, len(acc) + 1)

    plt.plot(epochs, acc, 'bo', label='Training acc')
    plt.plot(epochs, val_acc, 'b', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()

    plt.figure()

    plt.plot(epochs, loss, 'bo', label='Training loss')
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()

    plt.show()

The accuracy is about 85%, which isn't all that great. Hence the need for something more powerful.

When we try to find out why, we realize that the recurrence retains less and less information from an input the further back in the sequence it is. The last state, it will remember, but a bunch of steps ago? Fuggetaboutit.

This is due to the vanishing gradient problem: as a network gets deeper (and an unrolled recurrent layer is effectively as deep as its number of timesteps), the signal used to update the weights almost disappears by the time it reaches the elements furthest away.
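The decay can be seen numerically in a deliberately tiny setting. The sketch below assumes a scalar tanh RNN, `h_t = tanh(w * h_{t-1} + x_t)`, with a hypothetical recurrent weight of 0.9 and random inputs; it is an illustration of the effect, not a measurement taken from the model above.

```python
import numpy as np

def gradient_through_time(w, inputs):
    """Magnitude of d h_T / d h_(T-k-1) for k = 0, 1, ... in a scalar tanh RNN.

    Entry k measures how sensitive the final state is to the state
    k + 1 steps earlier.
    """
    h = 0.0
    local_grads = []
    for x in inputs:
        h = np.tanh(w * h + x)
        # d h_t / d h_(t-1) = w * (1 - h_t ** 2) by the chain rule.
        local_grads.append(w * (1.0 - h ** 2))
    # Chain the local gradients together, starting from the last step.
    return np.abs(np.cumprod(local_grads[::-1]))

rng = np.random.default_rng(0)
grads = gradient_through_time(w=0.9, inputs=rng.normal(size=50))
print(grads[0], grads[9], grads[49])  # 1, 10 and 50 steps back
```

Every factor in the product has magnitude at most 0.9 here, so the sensitivity to a state 50 steps back is bounded by 0.9 to the power 50, which is roughly 0.005. The training signal for early timesteps shrinks in the same way.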

# What’s the LSTM Layer?

Long Short-Term Memory (LSTM) layers were invented to address the vanishing gradient problem. As a simplified analogy, imagine a conveyor belt running alongside your unrolled recurrent layer. Information from previous states can be stored on this belt, and the current time step can pick up any weighted amount of it, even stuff from long ago that would normally have pretty much vanished.

In pseudocode, it looks like this:

    output_t = activation(dot(state_t, Uo) + dot(input_t, Wo) + dot(c_t, Vo) + bo)

    i_t = activation(dot(state_t, Ui) + dot(input_t, Wi) + bi)
    f_t = activation(dot(state_t, Uf) + dot(input_t, Wf) + bf)
    k_t = activation(dot(state_t, Uk) + dot(input_t, Wk) + bk)

    c_t+1 = i_t * k_t + c_t * f_t

A simple RNN takes as input:

- The current input
- The state from the previous step

The LSTM takes as input:

- The current input
- The state from the previous step
- A long-term information carrier
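As a rough illustration, the pseudocode above can be written as a single NumPy step. This is a sketch under assumptions the text does not spell out: sigmoid for `i_t` and `f_t`, tanh for `k_t` and the output, random weights stored in a dictionary `p`, and the hypothetical function name `lstm_step`. The gate names in the comments are the conventional ones. The output is computed from the incoming carry `c_t`, as in the pseudocode.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(input_t, state_t, c_t, p):
    """One LSTM step following the pseudocode; p maps weight names to arrays."""
    i_t = sigmoid(state_t @ p["Ui"] + input_t @ p["Wi"] + p["bi"])  # input gate
    f_t = sigmoid(state_t @ p["Uf"] + input_t @ p["Wf"] + p["bf"])  # forget gate
    k_t = np.tanh(state_t @ p["Uk"] + input_t @ p["Wk"] + p["bk"])  # candidate values
    output_t = np.tanh(
        state_t @ p["Uo"] + input_t @ p["Wo"] + c_t @ p["Vo"] + p["bo"]
    )
    # New carry: add gated new information, keep a gated part of the old carry.
    c_next = i_t * k_t + c_t * f_t
    return output_t, c_next

rng = np.random.default_rng(0)
features, units = 8, 4
p = {}
for gate in "iofk":
    p["U" + gate] = rng.normal(scale=0.1, size=(units, units))
    p["W" + gate] = rng.normal(scale=0.1, size=(features, units))
    p["b" + gate] = np.zeros(units)
p["Vo"] = rng.normal(scale=0.1, size=(units, units))

state, carry = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(10, features)):  # 10 timesteps
    state, carry = lstm_step(x_t, state, carry, p)
print(state.shape, carry.shape)
```

The carry is updated only by elementwise multiplication and addition, so when `f_t` stays close to 1 and `i_t` close to 0, old information passes through many steps almost unchanged. That is the conveyor belt from the analogy.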

Let’s see its performance:

    from keras.layers import LSTM

    model = Sequential()
    model.add(Embedding(max_features, 32))
    model.add(LSTM(32))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
    history = model.fit(input_train, y_train,
                        epochs=10, batch_size=128, validation_split=0.2)

This gives us an accuracy of 89%.

# What’s the GRU Layer?

GRU (Gated Recurrent Unit) and LSTM layers both try to solve the vanishing gradient problem. Accuracy-wise they are on par with each other, but a GRU has fewer parameters and so tends to be a bit cheaper to run. A simple GRU model might look like:

    from keras.models import Sequential
    from keras import layers
    from keras.optimizers import RMSprop

    # float_data, train_gen, val_gen and val_steps are built in the
    # temperature forecasting example later in this article.
    model = Sequential()
    model.add(layers.GRU(32, input_shape=(None, float_data.shape[-1])))
    model.add(layers.Dense(1))
    model.compile(optimizer=RMSprop(), loss='mae')
    history = model.fit_generator(train_gen,
                                  steps_per_epoch=500,
                                  epochs=20,
                                  validation_data=val_gen,
                                  validation_steps=val_steps)

GRUs are simpler, and some practitioners who focus on language tasks prefer them. Others argue that the LSTM is more sophisticated and should therefore, in theory, offer better accuracy. The only consensus is that the two are comparable and that you should probably try both on your current task, leaning towards LSTM for language-related and more complex tasks and towards GRU for simpler tasks with fewer data points.
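One concrete way to compare the cost of the layers is to count weights. The sketch below assumes the textbook formulation in which each gate or candidate transform has input weights, recurrent weights and one bias vector: a SimpleRNN has one such transform, a GRU three and an LSTM four. The helper name `recurrent_params` is hypothetical, and real implementations can differ slightly (some GRU variants carry an extra bias vector per gate).

```python
def recurrent_params(input_dim, units, n_transforms):
    # Each transform has input weights, recurrent weights and a bias vector.
    return n_transforms * (input_dim * units + units * units + units)

input_dim, units = 32, 32
simple = recurrent_params(input_dim, units, 1)  # 2080, as in the SimpleRNN summary
gru = recurrent_params(input_dim, units, 3)     # 6240
lstm = recurrent_params(input_dim, units, 4)    # 8320
print(simple, gru, lstm)
```

Under these assumptions a GRU needs three quarters of the weights of an LSTM of the same width, which is where the efficiency difference comes from.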

Let’s see an example of GRU with temporal data:

    import os

    data_dir = '/users/fchollet/Downloads/jena_climate'
    fname = os.path.join(data_dir, 'jena_climate_2009_2016.csv')

    with open(fname) as f:
        data = f.read()

    lines = data.split('\n')
    header = lines[0].split(',')
    lines = lines[1:]

    print(header)
    print(len(lines))

This is a dataset of weather measurements such as temperature, pressure and humidity. Since it’s in a CSV file, let’s load it into a NumPy array:

    import numpy as np

    # One row per timestep; the first CSV column is skipped.
    float_data = np.zeros((len(lines), len(header) - 1))
    for i, line in enumerate(lines):
        values = [float(x) for x in line.split(',')[1:]]
        float_data[i, :] = values

For our prediction problem we define the following:

- `delay = 144`: targets will be 24 hours in the future, since the data is recorded every 10 minutes and there are 6 such recordings in an hour
- `lookback = 1440`: observations will go back 10 days
- `step = 6`: observations will be sampled at one data point per hour
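The conversion between timesteps and wall-clock time is easy to check. The helper names below are hypothetical; the only assumption is the 10-minute recording interval stated above. For reference, 720 timesteps correspond to 5 days, and 1440 timesteps, the lookback value used in the generator code further down, correspond to 10 days.

```python
MINUTES_PER_TIMESTEP = 10  # one recording every 10 minutes

def timesteps_to_hours(n_timesteps):
    return n_timesteps * MINUTES_PER_TIMESTEP / 60

def timesteps_to_days(n_timesteps):
    return timesteps_to_hours(n_timesteps) / 24

print(timesteps_to_hours(6))    # 1.0  -> sampling every 6 timesteps is hourly
print(timesteps_to_hours(144))  # 24.0 -> the delay is one day
print(timesteps_to_days(720))   # 5.0
print(timesteps_to_days(1440))  # 10.0
print(1440 // 6)                # 240 timesteps per sample after hourly sampling
```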

It’s always a good idea to normalize our data:

    # Statistics come from the first 200000 timesteps only (the training data).
    mean = float_data[:200000].mean(axis=0)
    float_data -= mean
    std = float_data[:200000].std(axis=0)
    float_data /= std
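The effect of this normalization is easy to verify on a small example. The sketch below uses synthetic data as a stand-in for `float_data`, with a hypothetical split point `n_train` playing the role of 200000; the statistics are computed on the training rows only and then applied to every row.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: 1000 rows, 3 features on very different scales.
data = rng.normal(loc=[10.0, 1000.0, -5.0], scale=[2.0, 50.0, 0.5], size=(1000, 3))
n_train = 600  # hypothetical split point

mean = data[:n_train].mean(axis=0)
std = data[:n_train].std(axis=0)
normalized = (data - mean) / std

print(normalized[:n_train].mean(axis=0))  # approximately 0 for every feature
print(normalized[:n_train].std(axis=0))   # approximately 1 for every feature
```

Using only the training rows for the mean and standard deviation keeps information about the validation and test data out of the preprocessing step.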

Our model does everything in batches, so we’ll need to create batches from our array:

    def generator(data, lookback, delay, min_index, max_index,
                  shuffle=False, batch_size=128, step=6):
        if max_index is None:
            max_index = len(data) - delay - 1
        i = min_index + lookback
        while True:
            if shuffle:
                # Draw random target rows from the allowed range.
                rows = np.random.randint(min_index + lookback, max_index,
                                         size=batch_size)
            else:
                # Walk through the range in order, wrapping around at the end.
                if i + batch_size >= max_index:
                    i = min_index + lookback
                rows = np.arange(i, min(i + batch_size, max_index))
                i += len(rows)

            samples = np.zeros((len(rows), lookback // step, data.shape[-1]))
            targets = np.zeros((len(rows),))
            for j, row in enumerate(rows):
                indices = range(row - lookback, row, step)
                samples[j] = data[indices]
                targets[j] = data[row + delay][1]
            yield samples, targets
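To see what one batch looks like, the generator can be run on synthetic data. The sketch below repeats the generator so that it runs on its own, and uses a random array `fake_data` with an arbitrary 14 feature columns as a stand-in for `float_data`; both the array and its size are assumptions made for illustration.

```python
import numpy as np

def generator(data, lookback, delay, min_index, max_index,
              shuffle=False, batch_size=128, step=6):
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    while True:
        if shuffle:
            rows = np.random.randint(min_index + lookback, max_index,
                                     size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)
        samples = np.zeros((len(rows), lookback // step, data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            indices = range(row - lookback, row, step)
            samples[j] = data[indices]
            targets[j] = data[row + delay][1]
        yield samples, targets

# Synthetic stand-in for float_data: 5000 timesteps, 14 features.
fake_data = np.random.default_rng(0).normal(size=(5000, 14))
gen = generator(fake_data, lookback=1440, delay=144,
                min_index=0, max_index=4000, batch_size=128, step=6)
samples, targets = next(gen)
print(samples.shape)  # (128, 240, 14)
print(targets.shape)  # (128,)
```

Each sample holds 1440 // 6 = 240 hourly observations of all features, and each target is the value of column 1 exactly `delay` timesteps after the end of that window.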

Just like in every machine learning problem, we’ll generate 3 sets of data:

- Training
- Validation
- Testing

We create one generator per set:

    lookback = 1440
    step = 6
    delay = 144
    batch_size = 128

    train_gen = generator(float_data, lookback=lookback, delay=delay,
                          min_index=0, max_index=200000,
                          shuffle=True, step=step, batch_size=batch_size)
    val_gen = generator(float_data, lookback=lookback, delay=delay,
                        min_index=200001, max_index=300000,
                        step=step, batch_size=batch_size)
    test_gen = generator(float_data, lookback=lookback, delay=delay,
                         min_index=300001, max_index=None,
                         step=step, batch_size=batch_size)

    # How many steps to draw from val_gen in order to see the entire validation set.
    # Each step yields one batch, so the sample count is divided by batch_size.
    val_steps = (300000 - 200001 - lookback) // batch_size

    # How many steps to draw from test_gen in order to see the entire test set
    test_steps = (len(float_data) - 300001 - lookback) // batch_size

    print(val_steps)  # 769 (98559 samples // 128)

Our model is:

    from keras.models import Sequential
    from keras import layers
    from keras.optimizers import RMSprop

    model = Sequential()
    model.add(layers.GRU(32, input_shape=(None, float_data.shape[-1])))
    model.add(layers.Dense(1))
    model.compile(optimizer=RMSprop(), loss='mae')
    # Recent Keras versions accept generators directly in model.fit().
    history = model.fit_generator(train_gen,
                                  steps_per_epoch=500,
                                  epochs=20,
                                  validation_data=val_gen,
                                  validation_steps=val_steps)

This model has no regularization, so let’s add dropout, including recurrent dropout:

    from keras.models import Sequential
    from keras import layers
    from keras.optimizers import RMSprop

    model = Sequential()
    model.add(layers.GRU(32,
                         dropout=0.2,
                         recurrent_dropout=0.2,
                         input_shape=(None, float_data.shape[-1])))
    model.add(layers.Dense(1))
    model.compile(optimizer=RMSprop(), loss='mae')
    history = model.fit_generator(train_gen,
                                  steps_per_epoch=500,
                                  epochs=40,
                                  validation_data=val_gen,
                                  validation_steps=val_steps)

A model like this gets us a mean absolute error (MAE) of about 0.265, which is about 2.35 degrees once we denormalize it. That’s a pretty okay prediction. We can tune this model further by stacking more layers or by optimizing the hyperparameters.
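The conversion from normalized error back to degrees is a single multiplication. The sketch below uses a hypothetical helper, `denormalize_mae`, and a standard deviation inferred from the two figures quoted above rather than read from the dataset; in the actual pipeline the value would be the entry of `std` for the target column.

```python
def denormalize_mae(normalized_mae, target_std):
    # Targets were divided by their standard deviation, so an error in
    # normalized units scales back by that same standard deviation.
    # The mean cancels out because an error is a difference of two values.
    return normalized_mae * target_std

# Standard deviation implied by the two figures in the text
# (an inferred value, not one read from the dataset).
implied_std = 2.35 / 0.265
print(round(implied_std, 2))                          # 8.87
print(round(denormalize_mae(0.265, implied_std), 2))  # 2.35
```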

# Other Articles

This post is part of a series of stories that explores the fundamentals of deep learning:

1. **Linear Algebra Data Structures and Operations**: Objects and Operations
2. **Computationally Efficient Matrices and Matrix Decompositions**: Inverses, Linear Dependence, Eigen-decompositions, SVD
3. **Probability Theory Ideas and Concepts**: Definitions, Expectation, Variance
4. **Useful Probability Distributions and Structured Probabilistic Models**: Activation Functions, Measure and Information Theory
5. **Numerical Method Considerations for Machine Learning**: Overflow, Underflow, Gradients and Gradient Based Optimizations
6. **Gradient Based Optimizations**: Taylor Series, Constrained Optimization, Linear Least Squares
7. **Machine Learning Background Necessary for Deep Learning I**: Generalization, MLE, Kullback-Leibler Divergence
8. **Machine Learning Background Necessary for Deep Learning II**: Regularization, Capacity, Parameters, Hyper-parameters
9. **Principal Component Analysis Breakdown**: Motivation, Derivation
10. **Feed-forward Neural Networks**: Layers, definitions, Kernel Trick
11. **Gradient Based Optimizations Under The Deep Learning Lens**: Stochastic Gradient Descent, Cost Function, Maximum Likelihood
12. **Output Units For Deep Learning**: Stochastic Gradient Descent, Cost Function, Maximum Likelihood
13. **Hidden Units For Deep Learning**: Activation Functions, Performance, Architecture
14. **The Common Approach to Binary Classification**: The most generic way to setup your deep learning models to categorize movie reviews
15. **General Architectural Design Considerations for Neural Networks**: Universal Approximation Theorem, Depth, Connections
16. **Classifying Text Data into Multiple Classes**: Single-Label Multi-class Classification
17. **Convolutional Models Overview**: Convolutions, Kernels, Downsampling & Properties
18. **Working Understanding of Convolutional Models**: Creating, Preprocessing, Data Augmentation, Feature Extraction, Fine Tuning
19. **Convolutional Models for Sequential Data**: And easing into Recurrent Neural Networks
20. **Recurrent Models Overview**: Recurrent Layers: SimpleRNN, LSTM, GRU

# Up Next…

Coming up next is probably **Recurrent Neural Networks and LSTM Layers**. If you would like me to write another article explaining a topic in-depth, please leave a comment.

