Recurrent Models Overview
Recurrent Layers: SimpleRNN, LSTM, GRU
What’s SimpleRNN?
SimpleRNN is the most basic recurrent layer object in Keras.
from keras.layers import SimpleRNN
Remember that each input data point is a sequence, for example an entire movie review, and its length is the number of timesteps.
The SimpleRNN processes data in batches, just like every other neural network, so this layer takes as input a tensor of shape (batch_size, timesteps, input_features).
All recurrent layers in Keras can be run in two different modes, returning either:
- the full sequence of successive outputs, i.e. the state at every timestep
- just the last state
The sketch below shows the difference in output shape.
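A minimal sketch of the two modes (the layer size and the toy input shapes here are arbitrary, chosen only to show the output shapes):
import numpy as np
from keras.models import Sequential
from keras.layers import SimpleRNN

# A toy batch: 2 samples, 10 timesteps, 8 features per timestep.
x = np.random.random((2, 10, 8))

# Mode 1: return only the last state -> (batch_size, units)
last_only = Sequential([SimpleRNN(16, input_shape=(10, 8))])
print(last_only.predict(x).shape)   # (2, 16)

# Mode 2: return the state at every timestep -> (batch_size, timesteps, units)
all_states = Sequential([SimpleRNN(16, return_sequences=True, input_shape=(10, 8))])
print(all_states.predict(x).shape)  # (2, 10, 16)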
Let’s look at writing a simple recurrent model:
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN

model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32))
model.summary()
The model summary is:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_22 (Embedding)     (None, None, 32)          320000
_________________________________________________________________
simplernn_10 (SimpleRNN)     (None, 32)                2080
=================================================================
Total params: 322,080
Trainable params: 322,080
Non-trainable params: 0
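As a quick check on the 2,080 parameters reported for the SimpleRNN layer: it combines a 32-dimensional input (the embedding size) with its own 32-dimensional previous state, plus one bias per unit:
input_dim = 32   # output dimension of the Embedding layer
units = 32       # SimpleRNN units
params = units * input_dim + units * units + units   # W, U and the bias b
print(params)    # 2080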
We can see that the output here is just the last state. If we instead ask for the full sequence of states to be returned:
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
model.summary()
Then the model looks like:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_23 (Embedding)     (None, None, 32)          320000
_________________________________________________________________
simplernn_11 (SimpleRNN)     (None, None, 32)          2080
=================================================================
Total params: 322,080
Trainable params: 322,080
Non-trainable params: 0
How do you increase the representational power of a network?
Remember that a recurrent layer combines the current input with the previous output, thereby preserving a relationship between timesteps when the data is sequential.
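In the same dot-product pseudocode used for the LSTM further down, one step of a simple recurrent layer looks like this:
output_t = activation(dot(state_t, U) + dot(input_t, W) + b)
state_t+1 = output_t   # the output becomes the state fed into the next timestep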
Remember also that with convolutional layers, stacking them makes the network learn ever more abstract concepts with each additional layer.
To stack recurrent layers in a Keras model, every hidden recurrent layer except the last has to return its full sequence of states:
model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32))
model.summary()
Although spatial hierarchy is an essential property of deep convolutional models, deepening recurrent models doesn't give the same kind of theoretical boost. In practice, however, the extra representational power does improve performance on certain tasks.
It simply depends on whether the data contains higher-order abstract patterns that can be learned; most of the time it doesn't.
How do you train an RNN?
from keras.layers import Dense

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)
When we plot the results of this model training, we get:
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()
We get an accuracy of about 85%, which isn't all that great… hence the need for something more powerful.
When we try to find out why, we realize that our recurrence has a decaying memory of previous inputs. The last state it will remember, but a bunch of steps ago? Fuggetaboutit.
This is due to the vanishing gradient problem: as the unrolled network gets deeper, the signal used to update the weights all but disappears for elements further back in the sequence.
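A toy illustration of why this happens; the scalar weight and pre-activation values below are made up, purely to show the exponential decay:
import numpy as np

grad = 1.0
recurrent_weight = 0.9   # hypothetical scalar stand-in for the recurrent weight matrix
pre_activation = 0.5     # hypothetical pre-activation value at each timestep

# Backpropagating through 50 timesteps multiplies the gradient by
# (recurrent weight * tanh derivative) at every step.
for t in range(50):
    grad *= recurrent_weight * (1 - np.tanh(pre_activation) ** 2)

print(grad)  # on the order of 1e-8: the update signal from 50 steps back has all but vanished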
What’s the LSTM Layer?
Long Short-Term Memory (LSTM) layers were invented to solve the vanishing gradient problem. A simplistic, analogous explanation: imagine a conveyor belt running alongside your unrolled recurrent layer. On this conveyor belt you can store information from all the previous states, and the current timestep can take any weighted combination of them, even information from long ago that would normally have all but vanished.
Code-wise, it looks like this:
output_t = activation(dot(state_t, Uo) + dot(input_t, Wo) + dot(c_t, Vo) + bo)
i_t = activation(dot(state_t, Ui) + dot(input_t, Wi) + bi)
f_t = activation(dot(state_t, Uf) + dot(input_t, Wf) + bf)
k_t = activation(dot(state_t, Uk) + dot(input_t, Wk) + bk)
c_t+1 = i_t * k_t + c_t * f_t
If a simple RNN has as input:
- Input
- State from the previous timestep
The LSTM has as input:
- Input
- State from the previous timestep
- Long-term information carrier (the carry c_t)
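The extra machinery comes at a price in weights. As a rough check, ignoring the Vo carry term in the output equation (which Keras's standard LSTM does not include), an LSTM learns four sets of W, U and b instead of one, so LSTM(32) after a 32-dimensional embedding has about four times the parameters of the SimpleRNN above:
input_dim = 32
units = 32
lstm_params = 4 * (units * input_dim + units * units + units)
print(lstm_params)  # 8320, versus 2080 for SimpleRNN(32)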
Let's see its performance:
from keras.layers import LSTM

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)
This gives us an accuracy of 89%.
What’s the GRU Layer?
GRU and LSTM both try to solve the vanishing gradient problem. Accuracy-wise they're roughly on par with each other, but the GRU might be a bit more computationally efficient. A simple GRU model might look like:
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.GRU(32, input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=20,
                              validation_data=val_gen,
                              validation_steps=val_steps)
GRUs are simpler and are sometimes the preferred solution in practice, while others argue that the LSTM is more sophisticated and so should, in theory, offer better accuracy. The only real consensus is that the two are comparable, and that you should probably try both on your task, favoring LSTM for language-related and more complex tasks and GRU for simpler tasks with fewer data points.
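Since the two layers share the same interface in Keras, comparing them is usually a one-line change. A minimal sketch, assuming the same float_data array introduced in the temperature example below:
from keras.models import Sequential
from keras import layers

lstm_model = Sequential()
lstm_model.add(layers.LSTM(32, input_shape=(None, float_data.shape[-1])))  # swap layers.GRU for layers.LSTM
lstm_model.add(layers.Dense(1))
lstm_model.compile(optimizer='rmsprop', loss='mae')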
Let’s see an example of GRU with temporal data:
import os

data_dir = '/users/fchollet/Downloads/jena_climate'
fname = os.path.join(data_dir, 'jena_climate_2009_2016.csv')

f = open(fname)
data = f.read()
f.close()

lines = data.split('\n')
header = lines[0].split(',')
lines = lines[1:]

print(header)
print(len(lines))
This is a dataset of temperature, pressure, humidity and so on. Since it's in a CSV, let's put it into a NumPy array:
import numpy as np

float_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
    values = [float(x) for x in line.split(',')[1:]]
    float_data[i, :] = values
For our prediction problem we define the following (the sketch after this list spells out the arithmetic):
- delay = 144: targets will be 24 hours in the future; the data is recorded every 10 minutes, so there are 6 observations per hour
- lookback = 720: observations will go back 5 days
- step = 6: observations will be sampled at one data point per hour
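The arithmetic behind those numbers, assuming one observation every 10 minutes (note that the training code further down uses lookback = 1440, i.e. 10 days of history):
samples_per_hour = 60 // 10              # 6 observations per hour
delay = 24 * samples_per_hour            # 144 -> predict 24 hours into the future
lookback = 5 * 24 * samples_per_hour     # 720 -> look back 5 days
step = samples_per_hour                  # 6   -> keep one observation per hour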
It's always a good idea to normalize our data. Note that the mean and standard deviation are computed only on the first 200,000 timesteps, the portion we'll train on, so no information leaks in from the validation and test data:
mean = float_data[:200000].mean(axis=0)
float_data -= mean
std = float_data[:200000].std(axis=0)
float_data /= std
Our model does everything in batches, so we'll need a generator that yields batches of samples and targets from this array:
def generator(data, lookback, delay, min_index, max_index,
              shuffle=False, batch_size=128, step=6):
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    while 1:
        if shuffle:
            rows = np.random.randint(
                min_index + lookback, max_index, size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)

        samples = np.zeros((len(rows),
                            lookback // step,
                            data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]
        yield samples, targets
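As a quick sanity check (illustrative only, using the lookback and step values defined above and assuming the 14-feature Jena array), we can pull a single batch and inspect its shape:
check_gen = generator(float_data, lookback=720, delay=144,
                      min_index=0, max_index=200000, step=6)
samples, targets = next(check_gen)
print(samples.shape)  # (128, 120, 14) -> (batch_size, lookback // step, num_features)
print(targets.shape)  # (128,)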
Just like every machine learning problem we’ll generate 3 sets of data:
- Training
- Validating
- Testing
lookback = 1440
step = 6
delay = 144
batch_size = 128

train_gen = generator(float_data,
                      lookback=lookback,
                      delay=delay,
                      min_index=0,
                      max_index=200000,
                      shuffle=True,
                      step=step,
                      batch_size=batch_size)
val_gen = generator(float_data,
                    lookback=lookback,
                    delay=delay,
                    min_index=200001,
                    max_index=300000,
                    step=step,
                    batch_size=batch_size)
test_gen = generator(float_data,
                     lookback=lookback,
                     delay=delay,
                     min_index=300001,
                     max_index=None,
                     step=step,
                     batch_size=batch_size)

# How many batches to draw from val_gen in order to see the entire validation set
val_steps = (300000 - 200001 - lookback) // batch_size

# How many batches to draw from test_gen in order to see the entire test set
test_steps = (len(float_data) - 300001 - lookback) // batch_size

print(val_steps)
Our model is:
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.GRU(32, input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=20,
                              validation_data=val_gen,
                              validation_steps=val_steps)
This model doesn't use any regularization, so let's add dropout and recurrent dropout:
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.GRU(32,
                     dropout=0.2,
                     recurrent_dropout=0.2,
                     input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))

model.compile(optimizer=RMSprop(), loss='mae')
history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=40,
                              validation_data=val_gen,
                              validation_steps=val_steps)
A model like this gets us a MAE (mean absolute error) of about 0.265, which, when we denormalize, is about 2.35 degrees Celsius. That's a pretty okay prediction. We can tune this model further by stacking more recurrent layers or through hyperparameter optimization.
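For reference, converting the normalized MAE back to degrees is just a multiplication by the standard deviation of the temperature column, which is column index 1 here, matching the target used in the generator:
celsius_mae = 0.265 * std[1]  # std was computed before normalizing; column 1 is the temperature
print(celsius_mae)            # roughly 2.35 degrees Celsius, matching the figure above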
Other Articles
This post is part of a series of stories that explores the fundamentals of deep learning:
1. Linear Algebra Data Structures and Operations: Objects and Operations
2. Computationally Efficient Matrices and Matrix Decompositions: Inverses, Linear Dependence, Eigen-decompositions, SVD
3. Probability Theory Ideas and Concepts: Definitions, Expectation, Variance
4. Useful Probability Distributions and Structured Probabilistic Models: Activation Functions, Measure and Information Theory
5. Numerical Method Considerations for Machine Learning: Overflow, Underflow, Gradients and Gradient Based Optimizations
6. Gradient Based Optimizations: Taylor Series, Constrained Optimization, Linear Least Squares
7. Machine Learning Background Necessary for Deep Learning I: Generalization, MLE, Kullback-Leibler Divergence
8. Machine Learning Background Necessary for Deep Learning II: Regularization, Capacity, Parameters, Hyper-parameters
9. Principal Component Analysis Breakdown: Motivation, Derivation
10. Feed-forward Neural Networks: Layers, Definitions, Kernel Trick
11. Gradient Based Optimizations Under The Deep Learning Lens: Stochastic Gradient Descent, Cost Function, Maximum Likelihood
12. Output Units For Deep Learning: Stochastic Gradient Descent, Cost Function, Maximum Likelihood
13. Hidden Units For Deep Learning: Activation Functions, Performance, Architecture
14. The Common Approach to Binary Classification: The most generic way to set up your deep learning models to categorize movie reviews
15. General Architectural Design Considerations for Neural Networks: Universal Approximation Theorem, Depth, Connections
16. Classifying Text Data into Multiple Classes: Single-Label Multi-class Classification
17. Convolutional Models Overview: Convolutions, Kernels, Downsampling & Properties
18. Working Understanding of Convolutional Models: Creating, Preprocessing, Data Augmentation, Feature Extraction, Fine Tuning
19. Convolutional Models for Sequential Data: And easing into Recurrent Neural Networks
20. Recurrent Models Overview: Recurrent Layers: SimpleRNN, LSTM, GRU
Up Next…
Coming up next is probably Recurrent Neural Networks and LSTM Layers. If you would like me to write another article explaining a topic in-depth, please leave a comment.
For the table of contents and more content click here.