Video Pixel Networks in Tensorflow

Recently, DeepMind published a paper called Video Pixel Networks (VPN). It is a probabilistic video model that estimates the discrete joint distribution of the raw pixel values in a video. DeepMind has not released an open-source implementation of this paper. This blog post describes the VPN architecture and the details of my implementation, which you can find here: VPN in TF.

The VPN Architecture

The authors propose a generative video model based on deep neural networks that can model the joint distribution of the pixel values in a video. The model encodes the video frames and captures dependencies along the time dimension of the video, the two spatial dimensions of each frame, and the color-channel dimension. This makes it possible to model the stochastic transition from one pixel to the next without introducing any uncertainty caused by independence assumptions.

The architecture of the VPN consists of two parts: resolution preserving CNN encoders and PixelCNN decoders. The CNN encoders preserve at all layers the spatial resolution of the input frames. The outputs of the encoders are combined over time with a convolutional LSTM that also preserves the resolution. The PixelCNN decoders use masked convolutions to efficiently capture space and color dependencies and use a softmax layer to model the multinomial distributions over raw pixel values. The network also utilizes newly defined multiplicative units and corresponding residual blocks.
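To make the last point concrete, the decoder ends by producing, for every pixel and every color channel, a softmax over the 256 possible raw intensity values. Below is a minimal sketch of such an output head; the function name, shapes, and filter layout are my own assumptions, not code from the paper.

```python
import tensorflow as tf

def decoder_output_head(features):
    """Sketch of a PixelCNN-style output layer: for every pixel and every color
    channel, produce a 256-way distribution over the raw intensity values.
    `features` is assumed to have shape [batch, H, W, C_feat]."""
    logits = tf.keras.layers.Conv2D(3 * 256, 1, padding='same')(features)
    shape = tf.shape(features)
    # Reshape to [batch, H, W, color channel, intensity value].
    logits = tf.reshape(logits, [shape[0], shape[1], shape[2], 3, 256])
    return tf.nn.softmax(logits, axis=-1)  # one multinomial per pixel and channel
```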

Their main contribution in this paper is the ability to model the probability distribution of each pixel without introducing any independence assumptions: each pixel depends on all the previous frames and on the pixels already predicted in the current frame. They achieve this by feeding the current frame, containing all the pixels predicted in it so far, into the model when predicting the current pixel, and then sampling from the predicted distribution for the current pixel to obtain a new current frame to use for predicting the next pixel. So at inference time, generating a video tensor requires sampling T ∗ N^2 ∗ 3 variables, which for a second of video at 64 × 64 resolution is on the order of 10^5 samples. At training time, however, we can use masked convolutions to avoid feeding the current frame to the network once per pixel; I describe this in more detail in a later section.

(Fig. 1)
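To make the inference procedure above concrete, here is a minimal sketch of the per-pixel sampling loop. The function predict_logits stands in for the trained network and, like the shapes, is my own assumption rather than code from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_video(predict_logits, context_frames, T, N, rng=np.random.default_rng()):
    """Hypothetical sketch of VPN sampling. `predict_logits(frames, current)`
    stands in for the trained network: given the previous frames and the
    partially generated current frame, it returns logits of shape
    [N, N, 3, 256], one 256-way distribution per pixel and color channel."""
    frames = list(context_frames)                      # frames observed / generated so far
    for t in range(T):
        current = np.zeros((N, N, 3), dtype=np.uint8)  # frame being generated pixel by pixel
        for i in range(N):                             # rows
            for j in range(N):                         # columns
                for c in range(3):                     # R, G, B
                    logits = predict_logits(frames, current)
                    probs = softmax(logits[i, j, c])   # multinomial over intensities 0..255
                    current[i, j, c] = rng.choice(256, p=probs)
        frames.append(current)
    return frames                                      # T * N^2 * 3 sampled variables in total
```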

Resolution Preserving CNN Encoders

Given a set of video frames F0, …, FT, the VPN first encodes each of the first T frames F0, …, FT−1 with a CNN Encoder in order to predict FT. Each CNN Encoder is composed of k residual blocks (k = 8 in the original VPN architecture). All of these encoders preserve the spatial resolution of the input frames.
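As a rough illustration of what resolution preservation looks like in code, here is a Keras sketch of a per-frame encoder followed by a convolutional LSTM over time. The residual block here is a plain convolutional stand-in (the real blocks are the residual multiplicative blocks described below), and the filter counts and shapes are placeholders of mine.

```python
import tensorflow as tf

def plain_residual_block(x, filters):
    """Resolution-preserving residual block (plain conv stand-in for an RMB)."""
    h = tf.keras.layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    h = tf.keras.layers.Conv2D(filters, 3, padding='same')(h)
    return tf.keras.layers.Add()([x, h])  # 'same' padding keeps the resolution unchanged

def build_frame_encoder(frame_shape=(64, 64, 3), filters=64, k=8):
    """Per-frame CNN encoder: k resolution-preserving residual blocks."""
    inp = tf.keras.Input(shape=frame_shape)
    x = tf.keras.layers.Conv2D(filters, 3, padding='same')(inp)
    for _ in range(k):
        x = plain_residual_block(x, filters)
    return tf.keras.Model(inp, x)

# The per-frame encodings are aggregated over time with a convolutional LSTM
# that also preserves the spatial resolution.
encoder = build_frame_encoder()
frames = tf.keras.Input(shape=(None, 64, 64, 3))            # [batch, time, H, W, C]
encoded = tf.keras.layers.TimeDistributed(encoder)(frames)  # encode each frame
context = tf.keras.layers.ConvLSTM2D(64, 3, padding='same',
                                     return_sequences=True)(encoded)
```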

PixelCNN Decoders

The second part of the VPN architecture computes dependencies along the space and color dimensions. The outputs of the CNN Encoders provide information about the previous frames. The PixelCNN Decoders fuse this information with the current frame to make each pixel dependent on the previous pixels. They do so by using masked convolutions. The PixelCNN Decoders are composed of l resolution-preserving residual blocks (l = 12 in the original VPN architecture).

Multiplicative Units

The authors introduce multiplicative units (MUs) in their paper, which are constructed by incorporating LSTM-like gates into a convolutional layer (Fig. 2).

(Fig. 2)
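Here is a minimal sketch of a multiplicative unit, following my reading of the paper: three sigmoid gates and a tanh update, all computed by resolution-preserving convolutions and combined as g1 ⊙ tanh(g2 ⊙ h + g3 ⊙ u). The function below is my own formulation of that equation, not the authors' code.

```python
import tensorflow as tf

def multiplicative_unit(h, filters, kernel_size=3):
    """Multiplicative unit: LSTM-like gates around a convolution, computed as
    MU(h) = g1 * tanh(g2 * h + g3 * u). `filters` must equal the channel count
    of `h` so that the element-wise products are well defined."""
    conv = lambda: tf.keras.layers.Conv2D(filters, kernel_size, padding='same')
    g1 = tf.nn.sigmoid(conv()(h))  # output gate
    g2 = tf.nn.sigmoid(conv()(h))  # gate on the input itself
    g3 = tf.nn.sigmoid(conv()(h))  # gate on the candidate update
    u = tf.nn.tanh(conv()(h))      # candidate update
    return g1 * tf.nn.tanh(g2 * h + g3 * u)
```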

Residual Multiplicative Blocks

To allow for easy gradient propagation through many layers of the network, they stack two MU layers in a residual multiplicative block (Fig. 3), where the input has a residual (additive skip) connection to the output. For computational efficiency, the number of channels is halved for the MU layers inside the block by using 1×1 convolution layers.

(Fig. 3)
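Here is a sketch of a residual multiplicative block, reusing the multiplicative_unit sketch above; the function name and the assumption that the input channel count is even are mine.

```python
import tensorflow as tf

def residual_multiplicative_block(h, channels):
    """Residual multiplicative block: a 1x1 convolution halves the channels,
    two MU layers operate at the reduced width, another 1x1 convolution
    restores the channels, and the input is added back through a skip
    connection. `channels` is the (even) channel count of the input `h`."""
    x = tf.keras.layers.Conv2D(channels // 2, 1, padding='same')(h)  # halve channels
    x = multiplicative_unit(x, channels // 2)                        # first MU (3x3)
    x = multiplicative_unit(x, channels // 2)                        # second MU (3x3)
    x = tf.keras.layers.Conv2D(channels, 1, padding='same')(x)       # restore channels
    return h + x                                                     # residual connection
```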

My Implementation Details

To implement this architecture I needed a few tricks, because there are a lot of implementation details that they didn't mention in the paper.

The Masked Convolutions

In the paper they mention that they use masked convolutions in the PixelCNN Decoders, but they don't give any details about how they use them. Since the current frame is fused into the PixelCNN Decoders, I assumed the masking is applied in this fusion to force every pixel to depend only on the pixels predicted before it. These masked convolutions actually allow us to train the VPN much faster by feeding the whole ground-truth frame as the current frame at training time, while the masking ensures that no pixel depends on its own ground-truth value. A masked convolution can be implemented easily by zeroing the parts of the convolution filters that we want to mask.
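Below is a sketch of how such a masked convolution can be implemented by zeroing part of the kernel, in the spirit of the PixelCNN masks. This is my own simplified version: it only masks along the spatial dimensions, whereas the full VPN also masks along the color-channel dimension.

```python
import numpy as np
import tensorflow as tf

class MaskedConv2D(tf.keras.layers.Layer):
    """Convolution whose kernel is multiplied by a fixed binary mask on every
    forward pass, so each output pixel only sees input pixels at or before it
    in raster-scan order. Mask type 'A' also hides the center pixel (useful for
    the first layer that reads the current frame); type 'B' keeps it."""

    def __init__(self, filters, kernel_size=3, mask_type='B', **kwargs):
        super().__init__(**kwargs)
        self.filters, self.k, self.mask_type = filters, kernel_size, mask_type

    def build(self, input_shape):
        in_ch = int(input_shape[-1])
        self.kernel = self.add_weight(name='kernel',
                                      shape=(self.k, self.k, in_ch, self.filters),
                                      initializer='glorot_uniform')
        self.bias = self.add_weight(name='bias', shape=(self.filters,),
                                    initializer='zeros')
        mask = np.ones((self.k, self.k, in_ch, self.filters), dtype=np.float32)
        center = self.k // 2
        mask[center, center + (0 if self.mask_type == 'A' else 1):] = 0.0  # right of center
        mask[center + 1:] = 0.0                                            # rows below center
        self.mask = tf.constant(mask)

    def call(self, x):
        # Zero the "future" positions of the kernel before convolving.
        return tf.nn.conv2d(x, self.kernel * self.mask,
                            strides=1, padding='SAME') + self.bias
```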

The Mini-VPN and The Micro-VPN

While the proposed VPN architecture is the state-of-the-art architecture for video modeling, the size of the model makes it hard to train and overkill for small domains. So I tried a few variations of this network, reducing the number of filters in the convolution layers and the number of residual blocks in the encoder and the decoder.

The Mini-VPN:

This model contains half the number of residual blocks in both the encoder and the decoder, and 1/4 of the filters used in the original VPN architecture.

The Micro-VPN:

This model contains 1/4 of the number of residual blocks in both the encoder and the decoder, and 1/4 of the filters used in the original VPN architecture.
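Expressed as configurations, the variants look roughly like this. The block counts for the original model are the k = 8 and l = 12 from the paper; the filter count of 128 for the original model is a placeholder of mine, with the Mini and Micro numbers derived from the fractions above.

```python
# Hypothetical configuration sketch of the three variants.
VPN_CONFIGS = {
    'original': {'encoder_blocks': 8, 'decoder_blocks': 12, 'filters': 128},
    'mini':     {'encoder_blocks': 4, 'decoder_blocks': 6,  'filters': 32},  # half the blocks, 1/4 the filters
    'micro':    {'encoder_blocks': 2, 'decoder_blocks': 3,  'filters': 32},  # 1/4 the blocks, 1/4 the filters
}
```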

However, my implementation hasn't been trained and tested on the full Moving MNIST dataset because of a lack of computational power. Instead, it has been overfitted on a single sequence to verify the correctness of the implementation. The overfitting curves of the different architectures (Fig. 4) show that the original architecture was the fastest to converge (needing only 600 iterations). The second was the Micro-VPN architecture (needing 3K iterations), and the last was the Mini-VPN (needing 5K iterations). Both the Mini-VPN and the Micro-VPN architectures got stuck in a local optimum at a stage of training before convergence, while the original VPN architecture converged easily without getting stuck in the middle.

(Fig. 4)

Final Note

I can't say that my implementation is able to reproduce the results reported in the paper, because of the lack of computational power. I would be glad if someone fully trained it and shared the results.