Review: MCnet, the Motion-Content Network for video prediction

In my previous article I described FutureGAN (link to article), a GAN-based framework for future frame prediction. The key aspect of FutureGAN is that it does not use any form of RNN to model the temporal dynamics of videos; instead, it relies on spatio-temporal 3D convolutions to model both the content and the motion in videos. However, we noticed that it was not able to outperform the MCnet model, so in this article we will explore MCnet for video prediction.

Video prediction is challenging because, unlike static images, videos contain complex transformations and motion patterns in the time domain. To deal with this, MCnet takes the creative approach of splitting the video prediction task into two sub-problems.

MCnet key idea:

  1. MCnet decomposes motion and content, the two key components that generate the dynamics in videos.
  2. This is a clever trick, as it lets us treat video prediction as two separate tasks: predicting the content of the video, i.e. the spatial layout of an image frame, and predicting the motion in the video, i.e. its temporal dynamics.
  3. By modeling motion and content independently, frame prediction reduces to combining the content and motion features, which simplifies the complex task of next-frame prediction (see the sketch after this list).
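
To make the decomposition concrete, here is a minimal conceptual sketch in Python. The function and argument names are illustrative, not from the official implementation; the actual encoders and decoder are sketched in the sections below.

```python
# Conceptual sketch of MCnet's motion/content decomposition.
# All names are illustrative; the encoders/decoder are defined later.
def predict_next_frame(frames, content_encoder, motion_encoder, decoder):
    # Content pathway: spatial layout, taken from the most recent frame only.
    content_features = content_encoder(frames[-1])
    # Motion pathway: temporal dynamics, taken from frame-to-frame differences.
    differences = [b - a for a, b in zip(frames[:-1], frames[1:])]
    motion_features = motion_encoder(differences)
    # Prediction reduces to combining the two feature sets and decoding.
    return decoder(motion_features, content_features)
```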

MCnet architecture:

  1. The MCnet generator is built upon an Encoder-Decoder Convolutional Neural Network and a Convolutional LSTM.
  2. The Encoder-Decoder Convolutional Neural Network models the content.
  3. The Convolutional LSTM models the motion, i.e. the temporal dynamics.
  4. The discriminator is optimized for binary classification, that is, identifying video frames as real or fake (a sketch follows this list).
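
As a rough illustration of the discriminator's role, below is a minimal PyTorch sketch. The layer sizes are my own choices, not the paper's; only the real-vs-fake binary classification objective is taken from the paper.

```python
import torch.nn as nn

# Minimal discriminator sketch: a small CNN that outputs one logit per input,
# trained with binary cross-entropy to separate real from generated frames.
class FrameDiscriminator(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 1),
        )

    def forward(self, x):
        # Returns a logit; nn.BCEWithLogitsLoss turns this into the
        # real-vs-fake classification described above.
        return self.classifier(self.features(x))
```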

To model both the spatial and temporal dynamics, the MCnet generator comprises the following networks:

  1. Motion encoder
  2. Content encoder
  3. Multi-Scale Motion-Content Residual
  4. Combination Layers and Decoder

Motion encoder:

  1. The motion encoder captures the temporal dynamics of the scene's components, i.e. it models the motion in the video.
  2. It is implemented as a CNN combined with a ConvLSTM; a sketch follows the figure below.
MCnet Motion Encoder
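
Below is a PyTorch sketch of the motion pathway. The paper encodes frame differences with a CNN and feeds them through a ConvLSTM; the layer counts and channel sizes here are my own illustrative choices.

```python
import torch
import torch.nn as nn

# Minimal ConvLSTM cell: one convolution produces all four LSTM gates at once.
class ConvLSTMCell(nn.Module):
    def __init__(self, in_channels, hidden, kernel_size=3):
        super().__init__()
        self.gates = nn.Conv2d(in_channels + hidden, 4 * hidden, kernel_size,
                               padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# Motion encoder sketch: a CNN over frame differences, then a ConvLSTM
# that accumulates the temporal dynamics step by step.
class MotionEncoder(nn.Module):
    def __init__(self, in_channels=3, hidden=64):
        super().__init__()
        self.hidden = hidden
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 5, stride=2, padding=2), nn.ReLU(),
        )
        self.lstm = ConvLSTMCell(hidden, hidden)

    def forward(self, frames):
        # frames: (batch, time, channels, H, W); H and W divisible by 4 here.
        diffs = frames[:, 1:] - frames[:, :-1]  # frame-to-frame differences
        b, t, _, height, width = diffs.shape
        h = c = frames.new_zeros(b, self.hidden, height // 4, width // 4)
        for step in range(t):
            h, c = self.lstm(self.cnn(diffs[:, step]), (h, c))
        return h  # final motion features
```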

Content encoder:

  1. The content encoder extracts important spatial features from a single frame, such as the spatial layout of the scene and the salient objects in the video.
  2. It is implemented as a Convolutional Neural Network (CNN) that specializes in extracting features from a single frame; a sketch follows the figure below.
MCnet Content Encoder
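
The content pathway is simpler, since it looks at only the most recent frame. Again, the layer sizes below are illustrative, not the paper's exact configuration.

```python
import torch.nn as nn

# Content encoder sketch: a plain CNN over the last observed frame,
# producing spatial-layout features at the same resolution as the
# motion features (H/4 x W/4 for the sizes used here).
class ContentEncoder(nn.Module):
    def __init__(self, in_channels=3, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 5, stride=2, padding=2), nn.ReLU(),
        )

    def forward(self, last_frame):
        # last_frame: (batch, channels, H, W) -- a single frame, not a sequence.
        return self.cnn(last_frame)
```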

Multi-scale motion-content residual:

  1. To prevent the information loss caused by the pooling operations in the motion and content encoders, residual connections are used.
  2. These residual connections communicate the motion-content features at every scale into the decoder layers after the unpooling operations, as sketched below.
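
Here is a sketch of one such residual block, under the assumption that the fusion is a small convolutional stack over the concatenated motion and content features (the exact fusion layers are a detail I am paraphrasing, not quoting from the paper).

```python
import torch
import torch.nn as nn

# Motion-content residual sketch: at a given encoder scale, fuse the motion
# and content features and add them to the decoder activation at the matching
# scale, bypassing the information lost to pooling.
class MotionContentResidual(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, motion_feat, content_feat, decoder_feat):
        residual = self.fuse(torch.cat([motion_feat, content_feat], dim=1))
        # Added to the decoder activation right after its unpooling step.
        return decoder_feat + residual
```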

Combination layers and Decoder:

  1. The outputs of the two encoder pathways encode high-level representations of motion and content, respectively. Given these representations, the objective of the decoder is to generate a pixel-level prediction of the next frame.
  2. To this end, it first combines motion and content into a unified representation by concatenating the feature maps along the depth (channel) dimension; a sketch follows the figure below.
MCnet architecture
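
A sketch of this final stage, with illustrative layer sizes; the multi-scale residuals from the previous section would be added between the upsampling steps, but are omitted here for brevity.

```python
import torch
import torch.nn as nn

# Combination layers and decoder sketch: concatenate motion and content
# features along the channel ("depth") dimension, fuse them with a
# convolution, then upsample back to pixel space.
class CombinationDecoder(nn.Module):
    def __init__(self, channels=64, out_channels=3):
        super().__init__()
        self.combine = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(channels, out_channels, 4, stride=2, padding=1),
            nn.Tanh(),  # assumes frames are scaled to [-1, 1]
        )

    def forward(self, motion_feat, content_feat):
        fused = self.combine(torch.cat([motion_feat, content_feat], dim=1))
        return self.decode(fused)  # pixel-level prediction of the next frame
```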

MCnet algorithm overview:

MCnet algorithm
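
In short, the model predicts one frame at a time and feeds each prediction back as input to predict further into the future. A minimal sketch of that inference loop, with hypothetical wiring:

```python
# Recursive multi-step prediction: each predicted frame is appended to the
# history and used as input for the next step (here `model` is assumed to
# map a list of frames to the next frame, as in the earlier sketches).
def predict_recursively(frames, model, num_future=20):
    history = list(frames)
    predictions = []
    for _ in range(num_future):
        next_frame = model(history)   # one-step-ahead prediction
        predictions.append(next_frame)
        history.append(next_frame)    # feed the prediction back in
    return predictions
```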

Model training:

MCnet loss function and model training
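
As I read the paper, the generator is trained with an image reconstruction loss (a pixel-wise Lp term plus a gradient difference term that penalizes blur) combined with an adversarial term, while the discriminator is trained with the usual binary cross-entropy. Below is a hedged sketch of the generator's objective; the alpha and beta weights are illustrative, not the paper's values.

```python
import torch
import torch.nn.functional as F

def generator_loss(pred, target, disc_logits, alpha=1.0, beta=0.02):
    # Pixel-wise reconstruction term (an Lp loss with p = 2 here).
    l_img = F.mse_loss(pred, target)
    # Gradient difference term: match image gradients to reduce blurring.
    l_gdl = (
        F.l1_loss(torch.abs(pred[..., 1:, :] - pred[..., :-1, :]),
                  torch.abs(target[..., 1:, :] - target[..., :-1, :]))
        + F.l1_loss(torch.abs(pred[..., :, 1:] - pred[..., :, :-1]),
                    torch.abs(target[..., :, 1:] - target[..., :, :-1]))
    )
    # Adversarial term: push the discriminator to label predictions as real.
    l_adv = F.binary_cross_entropy_with_logits(
        disc_logits, torch.ones_like(disc_logits))
    return alpha * (l_img + l_gdl) + beta * l_adv
```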

Results:

MCnet results comparison: Given 10 input frames, the models predict 20 frames recursively, one by one.

In the figure above we can see that the MCnet + RES model (MCnet with residual connections) performs best of all the models on the KTH and Weizmann datasets. One interesting thing to note is that MCnet without RES performs worse than the ConvLSTM + RES model on the KTH dataset. This demonstrates the necessity of the RES connections in MCnet: by communicating motion-content features across scales, they mitigate the vanishing gradient problem and prevent information loss.

As we move forward in time, the SSIM and PSNR values drop dramatically due to increased uncertainty in predicting the spatial and temporal dynamics. This is also depicted in the diagram below.

MCnet results
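
For readers who want to reproduce such per-frame curves, here is a small PSNR helper (SSIM can be computed analogously, e.g. with skimage.metrics.structural_similarity); frames are assumed to be scaled to [0, 1].

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    # Peak signal-to-noise ratio between a predicted and a ground-truth frame.
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

# Scoring each future step separately exposes the drop over time:
# scores = [psnr(p, t) for p, t in zip(predicted_frames, true_frames)]
```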

Reference:

Villegas, R., Yang, J., Hong, S., Lin, X., and Lee, H. Decomposing Motion and Content for Natural Video Sequence Prediction. ICLR 2017.
