An introduction to ConvLSTM

Note: a Portuguese version of this article is available here


Nowadays it is quite common to find data in the form of a sequence of images. The most typical example is video on social networks such as YouTube, Facebook or Instagram. Other examples are:

  • Video calls
  • Movies and trailers
  • Satellite pictures
  • Security cameras

This article will introduce how to use sequences of images as input to a neural network model in a classification problem using ConvLSTM and Keras.

ConvLSTM theory

Data collected over successive periods of time is characterised as a time series. In such cases, an interesting approach is a model based on the LSTM (Long Short-Term Memory), a Recurrent Neural Network architecture. In this kind of architecture, the model passes the previous hidden state to the next step of the sequence, thereby holding information about the data it has seen so far and using it to make decisions. In other words, the order of the data is extremely important.

An LSTM cell

When working with images, the best approach is a CNN (Convolutional Neural Network) architecture. The image passes through convolutional layers, in which several filters extract important features. After passing through some convolutional layers in sequence, the output is connected to a fully-connected Dense network.

Convolution of an image with one filter

In our case of sequential images, one approach is to use ConvLSTM layers. A ConvLSTM is a recurrent layer, just like the LSTM, but the internal matrix multiplications are exchanged for convolution operations. As a result, the data that flows through the ConvLSTM cells keeps the input dimensions (3D in our case) instead of being just a 1D vector of features.

A ConvLSTM cell

A different approach from the ConvLSTM is the Convolutional-LSTM model, in which the image first passes through convolution layers and the result is flattened to a 1D array of extracted features. Repeating this process for all images in the sequence produces a set of features over time, and this is the input to the LSTM layer.
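
As a rough illustration (not from the original article), this Convolutional-LSTM alternative can be sketched in Keras by wrapping the convolution and flattening steps in TimeDistributed, so the same CNN is applied to every frame before the LSTM consumes the resulting feature vectors. The sizes below are arbitrary:

```python
# Sketch of the Convolutional-LSTM alternative: per-frame CNN features
# feeding an LSTM. All layer sizes here are illustrative choices.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(10, 3, 64, 64)),  # (time_steps, channels, rows, cols)
    # The same Conv2D is applied to each of the 10 frames independently
    layers.TimeDistributed(
        layers.Conv2D(8, (3, 3), padding="same", data_format="channels_first")),
    # Flatten each frame's feature maps into a 1D feature vector
    layers.TimeDistributed(layers.Flatten()),
    # The LSTM consumes the sequence of per-frame feature vectors
    layers.LSTM(32),
])
```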

Keras's ConvLSTM layer

From now on, the data format will be defined as "channels_first". This means images have the format (channels, rows, cols). A colour 300x300-pixel picture has 3 channels, one for each primary colour (red, green and blue), and therefore the format (3, 300, 300). If it were "channels_last", the Keras default for convolutional layers, the format would be (rows, cols, channels).

ConvLSTM layer input

The LSTM cell input is a set of data over time, that is, a 3D tensor with shape (samples, time_steps, features). The Convolution layer input is a set of images as a 4D tensor with shape (samples, channels, rows, cols). The input of a ConvLSTM is a set of images over time as a 5D tensor with shape (samples, time_steps, channels, rows, cols).

ConvLSTM layer output

The LSTM cell output depends on the return_sequences attribute. When it is set to True, the output is a sequence over time (one output for each input), that is, a 3D tensor with shape (samples, time_steps, features). When return_sequences is set to False (the default), the output is only the last value of the sequence, that is, a 2D tensor with shape (samples, features).

The Convolution layer output is a set of images as a 4D tensor with shape (samples, filters, rows, cols).

The ConvLSTM layer output is a combination of a Convolution output and an LSTM output. Just like the LSTM, if return_sequences = True, it returns a sequence as a 5D tensor with shape (samples, time_steps, filters, rows, cols). On the other hand, if return_sequences = False, it returns only the last value of the sequence as a 4D tensor with shape (samples, filters, rows, cols).
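
These shapes can be verified with a small sketch (not from the original article), using the "channels_first" format adopted here. The filter count, kernel size and image size are arbitrary illustrative values:

```python
# Check ConvLSTM2D output shapes for both return_sequences settings.
from tensorflow import keras

def convlstm_output_shape(return_sequences):
    layer = keras.layers.ConvLSTM2D(
        filters=8, kernel_size=(3, 3), padding="same",
        data_format="channels_first", return_sequences=return_sequences)
    # Input: 2 samples, 5 time steps, 3-channel 32x32 images
    return tuple(layer.compute_output_shape((2, 5, 3, 32, 32)))

print(convlstm_output_shape(True))   # (2, 5, 8, 32, 32): one output per step
print(convlstm_output_shape(False))  # (2, 8, 32, 32): only the last output
```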

Other parameters

The other ConvLSTM attributes derive from the Convolution and LSTM layers.

From the Convolution layer, the most important ones are:

  • filters: The number of output filters in the convolution.
  • kernel_size: The height and width of the convolution window.
  • padding: One of "valid" or "same".
  • data_format: Images format, if channel comes first ("channels_first") or last ("channels_last").
  • activation: Activation function. Conv2D defaults to the linear function a(x) = x, but ConvLSTM2D defaults to tanh, as in the LSTM.

From the LSTM layer, the most important ones are:

  • recurrent_activation: Activation function to use for the recurrent step. Default is hard sigmoid (hard_sigmoid).
  • return_sequences: Whether to return the last output in the output sequence (False), or the full sequence (True). Default is False.
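
Putting these parameters together, a minimal illustrative ConvLSTM2D instantiation looks like this (the filter count and kernel size are arbitrary choices):

```python
from tensorflow import keras

layer = keras.layers.ConvLSTM2D(
    filters=16,                           # output filters (from Convolution)
    kernel_size=(3, 3),                   # convolution window height and width
    padding="same",                       # keep rows and cols unchanged
    data_format="channels_first",         # images as (channels, rows, cols)
    activation="tanh",                    # ConvLSTM2D's own default
    recurrent_activation="hard_sigmoid",  # recurrent step activation
    return_sequences=True)                # one output per time step
```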

Practical example

The goal is to identify which genre appears during the trailer of a movie. To simplify, the possible genres are comedy, action, terror, thriller, musical and other, six categories in total. As an example, if a trailer scene shows an explosion, it should be classified as action; if the next scene shows a creepy clown, it should be classified as terror.

Model architecture

The model has a single input, the trailer frames in sequence, and 6 independent outputs, one for each category.

The model input is a video, with format (samples, frames, channels, rows, cols). Limiting the frames to 1000 per sample and each frame to a 3-channel 400x400-pixel picture, the final input format is (samples, 1000, 3, 400, 400), where samples is the number of trailers available for training.

In this example, we will use return_sequences = True. With a single output, the shape would be (samples, frames, categories); since the model has six independent outputs instead, the output is a set of six tensors of shape (samples, frames, 1), that is, (6, samples, 1000, 1) overall. The impact of using return_sequences is that the model will classify each frame individually, rather than the trailer as a whole.

The architecture starts with two ConvLSTM layers, each one followed by a BatchNormalization and a MaxPooling. It then splits into branches, one for each category. All the branches are identical: they start with one ConvLSTM layer followed by a MaxPooling, whose output is connected to a fully-connected Dense network. Finally, the last layer is a Dense with a single cell. The next image illustrates the simplified model with only two categories (comedy and other).

Simplified architecture with only two categories used (comedy and other)

The model code:


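Since the original listing is not reproduced here, below is a sketch of the architecture described above using Keras's functional API. The filter counts, the Dense width, and the use of MaxPooling3D with pool_size=(1, 2, 2) (pooling rows and cols while keeping the time axis) are illustrative choices:

```python
from tensorflow import keras
from tensorflow.keras import layers

CATEGORIES = ["comedy", "action", "terror", "thriller", "musical", "other"]
FRAMES, CHANNELS, ROWS, COLS = 1000, 3, 400, 400

inputs = keras.Input(shape=(FRAMES, CHANNELS, ROWS, COLS))

# Shared trunk: two ConvLSTM blocks, each followed by BatchNormalization
# (axis=2 is the filters axis here) and spatial MaxPooling.
x = inputs
for filters in (16, 32):
    x = layers.ConvLSTM2D(filters, (3, 3), padding="same",
                          data_format="channels_first",
                          return_sequences=True)(x)
    x = layers.BatchNormalization(axis=2)(x)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2),
                            data_format="channels_first")(x)

# One identical branch per category: ConvLSTM + MaxPooling, then a
# fully-connected network applied frame by frame, ending in a single cell.
outputs = []
for name in CATEGORIES:
    b = layers.ConvLSTM2D(8, (3, 3), padding="same",
                          data_format="channels_first",
                          return_sequences=True)(x)
    b = layers.MaxPooling3D(pool_size=(1, 2, 2),
                            data_format="channels_first")(b)
    b = layers.TimeDistributed(layers.Flatten())(b)
    b = layers.Dense(64, activation="relu")(b)
    outputs.append(layers.Dense(1, activation="sigmoid", name=name)(b))

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
```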
In practice, loading several videos into RAM in order to feed the model all at once is simply not viable. Such memory problems are quite common when working with images. A solution is to use a generator, which allows us to feed the model sample by sample using .fit_generator() instead of feeding everything at once with .fit().

Assuming that the trailers dataset is preprocessed and saved one trailer per file in the format trailer_{id}.npy, with id being a number between 0 and num_trailers:

Notice that the model has six independent outputs, and each one expects a 3D tensor with shape (samples, time_steps, 1), which is (1, 1000, 1) when training one sample at a time.

The generator can be defined as:
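
The original listing is not reproduced here, so the sketch below makes some assumptions: the labels are stored alongside each trailer as labels_{id}.npy with shape (6, 1000, 1), one binary sequence per category. The file layout and names are illustrative:

```python
import numpy as np

def trailer_generator(trailer_ids, data_dir="trailers"):
    """Yields one (input, expected_output) pair per trailer, indefinitely."""
    while True:
        for trailer_id in trailer_ids:
            # Assumed files: trailer_{id}.npy -> (1000, 3, 400, 400)
            #                labels_{id}.npy  -> (6, 1000, 1)
            x = np.load(f"{data_dir}/trailer_{trailer_id}.npy")
            y = np.load(f"{data_dir}/labels_{trailer_id}.npy")
            # Add the batch dimension: one sample per training step.
            inputs = x[np.newaxis]                        # (1, 1000, 3, 400, 400)
            outputs = [label[np.newaxis] for label in y]  # six (1, 1000, 1) arrays
            yield inputs, outputs
```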

Notice that the generator receives as input a list of the trailer ids available for training and, one by one, loads the files and returns them as (input, expected_output) pairs to train the model one sample at a time.

Then, we just need to train the model:
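
The original training listing is also not reproduced, so below is a self-contained, scaled-down sketch: a tiny stand-in model with the same input/output structure, trained one sample at a time from a generator. Recent Keras versions accept generators directly in .fit() (and Keras 3 removed .fit_generator()), so .fit() is used here; all sizes, names and the synthetic data are illustrative:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

CATEGORIES = ["comedy", "other"]     # reduced set, as in the simplified diagram
FRAMES, CHANNELS, ROWS, COLS = 4, 3, 8, 8

# Tiny stand-in for the article's model: one shared ConvLSTM, then one
# single-cell Dense head per category, classifying every frame.
inputs = keras.Input(shape=(FRAMES, CHANNELS, ROWS, COLS))
x = layers.ConvLSTM2D(4, (3, 3), padding="same",
                      data_format="channels_first",
                      return_sequences=True)(inputs)
x = layers.TimeDistributed(layers.Flatten())(x)
outputs = [layers.Dense(1, activation="sigmoid", name=name)(x)
           for name in CATEGORIES]
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")

def synthetic_generator():
    # Yields random (input, expected_output) pairs, one sample at a time.
    while True:
        frames = np.random.rand(1, FRAMES, CHANNELS, ROWS, COLS).astype("float32")
        labels = [np.random.randint(0, 2, (1, FRAMES, 1)).astype("float32")
                  for _ in CATEGORIES]
        yield frames, labels

# One pass over two synthetic samples, fed one by one.
history = model.fit(synthetic_generator(), steps_per_epoch=2,
                    epochs=1, verbose=0)
```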