Video Frame Prediction with Keras

Tarun Paparaju
Dec 27, 2018


Video frame prediction is an application of AI that involves predicting the next few frames of a video given the previous frames. For example, a video frame predictor can be shown several movies of a specific genre, such as romance movies or action thrillers. It can learn the probability distribution of the frames in the second half of a movie given the frames in the first half. This means that if you give a trained video frame predictor the frames from the first half of a new romance or action thriller, it will be able to predict how the second half of the movie unfolds.

Thus, a video frame predictor requires two types of intelligence to solve its task: computer vision to interpret each frame of the movie, and sequence modelling to understand the order of frames in the movie.

I will now demonstrate how a video frame predictor can be built and trained using Keras with a TensorFlow backend in Python (I'm using TensorFlow 1.8 and Python 3.6). I will also show some sample output (predictions of the next video frames given the previous ones) from the trained video frame predictor. A video of moving squares will be used in this demonstration. The code can be found here.
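
If you want to make sure your environment matches, a quick version check can be run first. This snippet is my own addition, not part of the original code:

import sys
import keras
import tensorflow as tf

print('Python:', sys.version.split()[0])   # expecting 3.6.x
print('Keras:', keras.__version__)
print('TensorFlow:', tf.__version__)       # expecting 1.8.x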

Code Explanation

from keras.models import Sequential
from keras.layers.convolutional import Conv3D
from keras.layers.convolutional_recurrent import ConvLSTM2D
from keras.layers.normalization import BatchNormalization
import numpy as np
import pylab as plt

  • Imports all the required libraries and APIs. I will explain the meaning and use of Sequential, Conv3D, ConvLSTM2D and BatchNormalization later.
  • The NumPy package is for numerical programming in Python, and the Pylab package is for graphics and plotting (it is used here to save the output frames as images).

# We create a layer which take as input movies of shape
# (n_frames, width, height, channels) and returns a movie
# of identical shape.

seq = Sequential()
seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   input_shape=(None, 40, 40, 1),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())

seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())

seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())

seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())

seq.add(Conv3D(filters=1, kernel_size=(3, 3, 3),
               activation='sigmoid',
               padding='same', data_format='channels_last'))
seq.compile(loss='binary_crossentropy', optimizer='adadelta')

  • This code segment builds a sequential model in Keras. A Sequential model is formed by stacking layers one on top of another, so that the output of one layer becomes the input of the next. Many useful ML models can be built using Sequential().
  • In this model we stack several ConvLSTM2D layers one on top of another. This layer is the same as the classic LSTM layer in every respect except that the input and recurrent transformations are both 2-dimensional convolutions (instead of the usual linear transformations used in LSTMs, which multiply inputs and states by weight matrices). This gives our model the ability to interpret visual information in the form of 2D images as well as the ability to understand time sequences (sequential data), which is ideal for video frame prediction. I have explained Convolutional Neural Networks and Recurrent Neural Networks (including LSTMs) in detail in another blog.
  • Each ConvLSTM2D layer is followed by a BatchNormalization layer. Batch Normalization changes the distribution of inputs to the next layer; for example, the inputs to a layer can be made to have mean 0 and variance 1. This has been shown to speed up training (quicker convergence) and to let layers learn more independently of one another. For example, if one input feature has a range of -1 to 1 and another a range of 1 to 1000, the features should be normalized so that the amount by which they vary relative to each other (covariate shift) is reduced. Bringing the range and scale of a layer's hidden-unit values closer together reduces this shift and makes it easier to learn a good combination of these features. BatchNormalization normalizes each hidden unit using the mean and variance computed over the current batch, and then applies a learnable scale and shift; these parameters are adjusted during back-propagation so that the network can still learn the required mapping. Because the values are shifted and scaled, the learning of one layer's parameters becomes less dependent on the scale of the activations in previous layers. Thus, we use BatchNormalization to speed up training and improve each layer's independent learning, similar to how dropout helps each neuron learn independently. It also acts as a regularizer to some extent.
  • We finally use a Conv3D layer to extract important visual features from the outputs of our ConvLSTM2D layers (we use these features to produce the future video frames) and apply a sigmoid activation so the output frames have pixel intensities between 0 and 1 (grayscale). Conv3D is a convolutional layer that takes 3D feature maps and applies 3D kernels to produce new 3D feature maps containing features extracted from its inputs.
  • We use the binary cross-entropy loss function (because the target pixels are 0s and 1s) and the Adadelta optimizer.
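
To make the stacking concrete, here is a small sketch (my own addition, not part of the original code) that inspects the model's output shapes. Because return_sequences=True is set on every ConvLSTM2D layer, the model maps a whole movie to a movie of the same shape:

seq.summary()

# Feed a dummy movie through the network: the model maps
# (batch, n_frames, 40, 40, 1) -> (batch, n_frames, 40, 40, 1).
dummy = np.zeros((1, 15, 40, 40, 1))
print(seq.predict(dummy).shape)  # expected: (1, 15, 40, 40, 1)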

# Artificial data generation:
# Generate movies with 3 to 7 moving squares inside.
# The squares are of shape 1x1 or 2x2 pixels,
# which move linearly over time.
# For convenience we first create movies with bigger width and height (80x80)
# and at the end we select a 40x40 window.

def generate_movies(n_samples=1200, n_frames=15):
    row = 80
    col = 80
    noisy_movies = np.zeros((n_samples, n_frames, row, col, 1), dtype=np.float)
    shifted_movies = np.zeros((n_samples, n_frames, row, col, 1),
                              dtype=np.float)

    for i in range(n_samples):
        # Add 3 to 7 moving squares
        n = np.random.randint(3, 8)

        for j in range(n):
            # Initial position
            xstart = np.random.randint(20, 60)
            ystart = np.random.randint(20, 60)
            # Direction of motion
            directionx = np.random.randint(0, 3) - 1
            directiony = np.random.randint(0, 3) - 1

            # Size of the square
            w = np.random.randint(2, 4)

            for t in range(n_frames):
                x_shift = xstart + directionx * t
                y_shift = ystart + directiony * t
                noisy_movies[i, t, x_shift - w: x_shift + w,
                             y_shift - w: y_shift + w, 0] += 1

                # Make it more robust by adding noise.
                # The idea is that if during inference,
                # the value of the pixel is not exactly one,
                # we need to train the network to be robust and still
                # consider it as a pixel belonging to a square.
                if np.random.randint(0, 2):
                    noise_f = (-1)**np.random.randint(0, 2)
                    noisy_movies[i, t,
                                 x_shift - w - 1: x_shift + w + 1,
                                 y_shift - w - 1: y_shift + w + 1,
                                 0] += noise_f * 0.1

                # Shift the ground truth by 1
                x_shift = xstart + directionx * (t + 1)
                y_shift = ystart + directiony * (t + 1)
                shifted_movies[i, t, x_shift - w: x_shift + w,
                               y_shift - w: y_shift + w, 0] += 1

    # Cut to a 40x40 window
    noisy_movies = noisy_movies[::, ::, 20:60, 20:60, ::]
    shifted_movies = shifted_movies[::, ::, 20:60, 20:60, ::]
    noisy_movies[noisy_movies >= 1] = 1
    shifted_movies[shifted_movies >= 1] = 1
    return noisy_movies, shifted_movies

  • These lines of code artificially generate the data. Each movie contains a random number of squares (3 to 7) of random sizes that move linearly in a random direction at a constant speed. We group the frames into movies (sequences of frames). For convenience we first create frames of size 80 by 80 and then crop out a 40 by 40 window at the end.
  • We try to make the model more robust by adding a small amount of noise to the movie frames. The idea is that if, during inference, the pixel values inside a square are not exactly one, the network should still recognize those pixels as belonging to a square. The noisy movies act as inputs to the ConvLSTM network, while the shifted movies (the same movies moved one frame ahead) act as the ground-truth targets. A quick shape check follows below.
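
As a quick sanity check (my own addition, not part of the original code), you can generate a small batch and confirm the shapes match what the model's input layer expects:

# Generate a handful of movies and inspect their shapes.
sample_noisy, sample_shifted = generate_movies(n_samples=4, n_frames=15)
print(sample_noisy.shape)    # (4, 15, 40, 40, 1) after the 40x40 crop
print(sample_shifted.shape)  # (4, 15, 40, 40, 1); frame t shows the squares at time t + 1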

# Train the network

noisy_movies, shifted_movies = generate_movies(n_samples=1200)
seq.fit(noisy_movies[:1000], shifted_movies[:1000], batch_size=10,
        epochs=300, validation_split=0.05)

  • We train the network on the first 1,000 generated movies, keeping 5 percent of them aside for validation; this shows how well our model performs on unseen data. The remaining movies are reserved for testing later. We use 300 epochs and a batch size of 10.
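
Training for 300 epochs can take a while, so as an optional addition of my own (not part of the original code) you may want to checkpoint the best weights seen so far with a standard Keras callback:

from keras.callbacks import ModelCheckpoint

# Save the weights with the lowest validation loss
# ('conv_lstm_best.h5' is just a placeholder file name).
checkpoint = ModelCheckpoint('conv_lstm_best.h5', monitor='val_loss',
                             save_best_only=True, save_weights_only=True)

seq.fit(noisy_movies[:1000], shifted_movies[:1000], batch_size=10,
        epochs=300, validation_split=0.05, callbacks=[checkpoint])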

# Testing the network on one movie
# feed it with the first 7 positions and then
# predict the new positions
which = 1004
track = noisy_movies[which][:7, ::, ::, ::]

for j in range(16):
    new_pos = seq.predict(track[np.newaxis, ::, ::, ::, ::])
    new = new_pos[::, -1, ::, ::, ::]
    track = np.concatenate((track, new), axis=0)

# And then compare the predictions
# to the ground truth
track2 = noisy_movies[which][::, ::, ::, ::]
for i in range(15):
    fig = plt.figure(figsize=(10, 5))

    ax = fig.add_subplot(121)

    if i >= 7:
        ax.text(1, 3, 'Predictions !', fontsize=20, color='w')
    else:
        ax.text(1, 3, 'Initial trajectory', fontsize=20)

    toplot = track[i, ::, ::, 0]

    plt.imshow(toplot)
    ax = fig.add_subplot(122)
    plt.text(1, 3, 'Ground truth', fontsize=20)

    toplot = track2[i, ::, ::, 0]
    if i >= 2:
        toplot = shifted_movies[which][i - 1, ::, ::, 0]

    plt.imshow(toplot)
    plt.savefig('%i_animate.png' % (i + 1))

  • We test the model on a new movie (one that was not used for training) in this part of the code. We feed it the first 7 frames and then predict the following frames one at a time, appending each prediction to the input sequence before predicting the next frame. Finally, we plot the predicted frames side by side with the ground-truth frames.
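
Beyond the side-by-side images, a simple numeric comparison can also be made. This is a sketch of my own, not part of the original code, computing the mean squared error between each predicted frame and the corresponding ground-truth frame:

# Per-frame mean squared error between prediction and ground truth,
# for the frames that were actually predicted (indices 7 onwards).
for i in range(7, 15):
    pred = track[i, ::, ::, 0]
    truth = shifted_movies[which][i - 1, ::, ::, 0]
    mse = np.mean((pred - truth) ** 2)
    print('frame %d: MSE = %.4f' % (i + 1, mse))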

The testing phase concludes the code and this tutorial. I will publish blogs on CNNs and LSTMs soon, because these are some of the most important concepts in modern AI, and they may also help in understanding this post. I would like to thank the contributors to the code used in this tutorial for making this post possible.
