Review of FutureGAN: Predict future video frames using Generative Adversarial Networks (GANs)

Future video frame prediction using GANs

Ankan Dash
Analytics Vidhya
4 min read · Aug 13, 2021


Proposed by Aigner et al., FutureGAN is a GAN based framework for predicting future video frames. Video prediction is the ability to predict future video frames based on the context of a sequence of previous frames. Unlike static images, videos contain complex transformations and motion patterns in the time dimension. Therefore, to accurately predict future video frames, the model needs to handle both the temporal and the spatial components. Typically, Recurrent Neural Networks (RNNs) are used to model the temporal dynamics. However, the authors of FutureGAN proposed the use of spatio-temporal 3d convolutions in a progressively growing manner to predict future video frames.

Key ideas of FutureGAN paper:

  1. FutureGAN uses an encoder-decoder GAN model to predict future frames of a video sequence conditioned on a sequence of past frames.
  2. To capture both the spatial and temporal components of a video sequence, spatio-temporal 3d convolutions are used in all encoder and decoder modules.
  3. It builds on the existing progressively growing GAN (ProGAN), which achieves high-quality results when generating high-resolution single images.
  4. The FutureGAN framework is applicable to various datasets without additional changes and delivers stable performance.

Why spatio-temporal 3d convolutions?

The question is: why use spatio-temporal 3d convolutions rather than a combination of CNNs and RNNs to handle the spatial and temporal domains?

  1. We know that a 2D convolution applied to an image outputs an image, and a 2D convolution applied to multiple images (treating them as different channels) also results in an image. Hence, 2D ConvNets lose the temporal information of the input signal right after every convolution operation.
  2. A 3D convolution, in contrast, preserves the temporal information of the input signal and produces an output volume. The same phenomenon applies to 2D and 3D pooling as well (see the sketch after this list).
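
A minimal sketch of this difference, assuming PyTorch (not code from the paper) and toy tensor sizes: stacking frames as channels and applying a 2D convolution collapses the time axis, while a 3D convolution keeps it.

```python
import torch
import torch.nn as nn

frames = 6                                      # number of input frames (toy value)
video_2d = torch.randn(1, frames, 64, 64)       # frames stacked as channels: (N, C, H, W)
video_3d = torch.randn(1, 1, frames, 64, 64)    # explicit time axis: (N, C, T, H, W)

conv2d = nn.Conv2d(in_channels=frames, out_channels=16, kernel_size=3, padding=1)
conv3d = nn.Conv3d(in_channels=1, out_channels=16, kernel_size=3, padding=1)

print(conv2d(video_2d).shape)   # torch.Size([1, 16, 64, 64])     -> time axis collapsed
print(conv3d(video_3d).shape)   # torch.Size([1, 16, 6, 64, 64])  -> time axis preserved
```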

FutureGAN model

Generator Network:

  1. The FutureGAN generator consists of an encoder and a decoder part. The authors apply the progressive growing (ProGAN) method to both the encoder and the decoder.
  2. The encoder learns a latent representation of the input. This latent representation is used by a decoder to generate the predictions.
  3. All convolutional layers use 3d convolutions. This allows the generator to appropriately encode and decode the input sequence’s spatial and temporal components.
  4. 3d convolutions with asymmetric kernel sizes and strides were employed for downsampling, and transposed 3d convolutions with asymmetric kernel sizes and strides were used for upsampling (a simplified sketch follows this list).
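
A simplified encoder-decoder generator sketch along these lines, assuming PyTorch; the layer counts, channel widths, and the particular asymmetric kernel/stride shapes are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TinyFutureGANGenerator(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        # Encoder: 3d convolutions; the asymmetric (1, 2, 2) kernel/stride halves
        # the spatial resolution while leaving the temporal axis untouched.
        self.encoder = nn.Sequential(
            nn.Conv3d(channels, 32, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv3d(32, 64, kernel_size=(1, 2, 2), stride=(1, 2, 2)),
            nn.LeakyReLU(0.2),
        )
        # Decoder: transposed 3d convolutions mirror the encoder and upsample
        # the latent representation back to the input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(64, 32, kernel_size=(1, 2, 2), stride=(1, 2, 2)),
            nn.LeakyReLU(0.2),
            nn.Conv3d(32, channels, kernel_size=3, padding=1),
        )

    def forward(self, past_frames):          # (N, C, T_in, H, W)
        latent = self.encoder(past_frames)   # latent spatio-temporal representation
        return self.decoder(latent)          # predicted frames (N, C, T_out, H, W)

x = torch.randn(1, 3, 6, 32, 32)             # 6 past RGB frames at 32x32 (toy sizes)
print(TinyFutureGANGenerator()(x).shape)     # torch.Size([1, 3, 6, 32, 32])
```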

Discriminator Network:

  1. The discriminator of the FutureGAN model is designed to distinguish between real and fake sequences.
  2. The discriminator network takes as input frames from the training set that represent the ground-truth sequence, as well as frames created by the generator.
  3. Aside from the bottleneck layers, the FutureGAN discriminator closely mimics the generator network’s encoder component. One significant difference is that there are no pixel-wise feature vector normalization layers in the discriminator (a minimal sketch follows this list).
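
A matching discriminator sketch, again assuming PyTorch and illustrative layer widths; as described above, it mirrors the encoder, omits pixel-wise feature normalization, and outputs an unbounded critic score.

```python
import torch
import torch.nn as nn

class TinyFutureGANDiscriminator(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        # Encoder-like feature extractor built from 3d convolutions,
        # without any pixel-wise feature vector normalization layers.
        self.features = nn.Sequential(
            nn.Conv3d(channels, 32, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv3d(32, 64, kernel_size=(1, 2, 2), stride=(1, 2, 2)),  # spatial downsampling
            nn.LeakyReLU(0.2),
        )
        self.score = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),      # collapse (T, H, W) to a single feature vector
            nn.Flatten(),
            nn.Linear(64, 1),             # unbounded critic score (WGAN-style, no sigmoid)
        )

    def forward(self, sequence):          # full sequence of frames: (N, C, T, H, W)
        return self.score(self.features(sequence))

seq = torch.randn(2, 3, 12, 32, 32)       # e.g. ground-truth or generated sequences
print(TinyFutureGANDiscriminator()(seq).shape)   # torch.Size([2, 1])
```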

FutureGAN training:

Similar to ProGAN, the authors start by training the model to take a set of 4x4 px frames as input and to output frames of the same resolution. After a certain number of training iterations, layers are gradually added to double the resolution. The resolution of the input frames always matches the network's current resolution.
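
A schematic sketch of this growth schedule, assuming PyTorch; grow_networks and the resolution list are placeholders rather than the authors' code, and only illustrate that the training frames are always resized to the network's current resolution.

```python
import torch
import torch.nn.functional as F

resolutions = [4, 8, 16, 32, 64, 128]       # resolution doubles at every growth step
frames = torch.randn(8, 3, 6, 128, 128)     # full-resolution training clips (N, C, T, H, W)

for res in resolutions:
    # Placeholder: add (and fade in) the new layers that handle this resolution.
    # grow_networks(generator, discriminator, res)

    # Resize the training frames so their resolution matches the current network state.
    current = F.interpolate(frames, size=(frames.shape[2], res, res),
                            mode="trilinear", align_corners=False)
    print(res, tuple(current.shape))         # e.g. 4 (8, 3, 6, 4, 4)

    # for step in range(steps_per_resolution):   # train G and D at this resolution
    #     ...
```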

Some other key features to stabilise training and prevent mode collapse include:

  1. Weight scaling.
  2. Feature normalization in the generator.
  3. Use of WGAN-GP loss with epsilon penalty.

WGAN-GP loss with epsilon penalty

The loss function combines the Wasserstein GAN with gradient penalty (WGAN-GP) loss with an epsilon-penalty term that prevents the loss from drifting. The authors selected the WGAN-GP loss to train their model because it improved the quality of the generated frames.
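
A hedged sketch of such a critic loss, assuming PyTorch; the lambda_gp and eps_drift values are the commonly used ProGAN-style defaults, given here as assumptions rather than the paper's exact settings. The discriminator argument can be any critic, for example the sketch above.

```python
import torch

def discriminator_loss(discriminator, real_seq, fake_seq, lambda_gp=10.0, eps_drift=1e-3):
    d_real = discriminator(real_seq)
    d_fake = discriminator(fake_seq)

    # Gradient penalty: evaluate the critic on random interpolations of real and
    # fake sequences and push the gradient norm towards 1.
    alpha = torch.rand(real_seq.size(0), 1, 1, 1, 1, device=real_seq.device)
    interp = (alpha * real_seq + (1 - alpha) * fake_seq).requires_grad_(True)
    d_interp = discriminator(interp)
    grads = torch.autograd.grad(outputs=d_interp.sum(), inputs=interp, create_graph=True)[0]
    grad_penalty = ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

    # Wasserstein term + gradient penalty + epsilon term that keeps D(real) from drifting.
    return (d_fake.mean() - d_real.mean()
            + lambda_gp * grad_penalty
            + eps_drift * (d_real ** 2).mean())
```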

FutureGAN generator during training

Results:

To assess their model, the authors conducted experiments on three datasets of increasing complexity: MovingMNIST, the KTH Action dataset, and the Cityscapes dataset. To quantitatively evaluate the models, the authors report the mean squared error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM) between the ground truth and the predicted frame sequences. The results are shown below:
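
A rough sketch of these per-frame metrics using scikit-image; averaging over the frames of a sequence as done here is an assumption, and the paper's exact evaluation protocol may differ.

```python
import numpy as np
from skimage.metrics import mean_squared_error, peak_signal_noise_ratio, structural_similarity

def evaluate_sequence(ground_truth, predicted, data_range=1.0):
    """ground_truth, predicted: arrays of shape (T, H, W) with values in [0, data_range]."""
    mse, psnr, ssim = [], [], []
    for gt, pred in zip(ground_truth, predicted):
        mse.append(mean_squared_error(gt, pred))
        psnr.append(peak_signal_noise_ratio(gt, pred, data_range=data_range))
        ssim.append(structural_similarity(gt, pred, data_range=data_range))
    return np.mean(mse), np.mean(psnr), np.mean(ssim)

gt = np.random.rand(6, 64, 64)                                  # 6 ground-truth frames (toy data)
pred = np.clip(gt + 0.05 * np.random.randn(6, 64, 64), 0, 1)    # noisy "predictions"
print(evaluate_sequence(gt, pred))
```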

Table showing FutureGAN model results

As we can see from the table, the FutureGAN model performed better than the fRNN model on the MovingMNIST dataset. However, the MCNet model performed better on the KTH Action dataset.

FutureGAN results for the MovingMNIST dataset. a: Input, b: Ground Truth, c: FutureGAN, d: fRNN
FutureGAN results for the KTH Action test split. a: Input, b: Ground Truth, c: FutureGAN, d: fRNN, e: MCNet
FutureGAN results for the Cityscapes dataset. a: Input, b: Ground Truth, c: FutureGAN

Conclusion:

Through the FutureGAN paper, the authors demonstrated the use of a GAN-based framework for future video frame prediction without the use of RNNs. The authors showed that the ProGAN approach can be used to generate realistic-looking future frames. They also stated that FutureGAN is a highly flexible model that can easily be trained on various datasets of different resolutions without prior knowledge about the data.

Reference:

FutureGAN: Anticipating the Future Frames of Video Sequences using Spatio-Temporal 3d Convolutions in Progressively Growing GANs
