# Intro

When learning a new task, humans rely on the prior experience and knowledge collected throughout their life. We know what objects can appear in what environments and how they can move and interact with each other. If we can give an intelligent agent the same prior knowledge that we have, it may be able to learn how to act more intelligently and to learn how to do this more quickly.

Humans naturally observe large quantities of video data. Over the course of our lives, we acquire the ability to reason about the future and to infer the future motion and actions of other animals and objects. The corresponding computational task, which may enable such reasoning, is video prediction. In this task, the model observes a number of frames “in the past” and needs to generate a number of frames “in the future”. While we can study this computational task independently of other considerations, we also expect that improvements in it will lead to improvements in representation learning ([25, chapter 15]).

While recent results [13,15,18,20,21] in video prediction show promise, the challenge of generating diverse, long-term, realistic videos is far from solved. In this post I will evaluate the quantitative metrics that are used for training and evaluating future prediction models. I will show when current approaches fail and why a particular metric (the mean squared error) might cause this. I will also suggest several ways the problems may be addressed in the future.

This post comes after my first experience with video prediction, working on a paper led by Drew Jaegle [21], where we encountered these problems over the course of our research. Here, I want to explain these problems, summarize our observations, and help other researchers in the area.

# 1.1. MSE problems

The mean squared error (MSE) loss (footnote 1) on predicted pixels, sometimes improperly called the L2 loss, is common: it is used as at least one component of the loss in most current models. In fact, it is still used as the only training loss in several state-of-the-art models, e.g. [14, 16, 19, 21].

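If $I_1[i,j]$ is the $(i,j)$-th pixel of the target image, $I_2[i,j]$ is the $(i,j)$-th pixel of the predicted image, and $N$ is the number of pixels, the MSE loss is given by:

$$
\mathrm{MSE}(I_1, I_2) = \frac{1}{N} \sum_{i,j} \left( I_1[i,j] - I_2[i,j] \right)^2
$$

The L2 metric is defined as the square root of the MSE, up to the constant factor $\sqrt{N}$:

$$
L_2(I_1, I_2) = \sqrt{\sum_{i,j} \left( I_1[i,j] - I_2[i,j] \right)^2} = \sqrt{N \cdot \mathrm{MSE}(I_1, I_2)}
$$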

These two losses share a range of properties, and I will use the two interchangeably. The main difference is that L2 satisfies the triangle inequality and thus is a true metric, whereas MSE is not. Using L2, it makes sense to talk about distance between two images.
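The difference between the two quantities can be checked numerically. The following is a minimal NumPy sketch; the function names `mse` and `l2_distance` are chosen here for illustration, and the L2 distance is taken to be the plain Euclidean norm of the pixel difference, which equals the square root of the MSE up to a constant factor.

```python
import numpy as np

def mse(target, pred):
    """Mean squared error: squared pixel differences averaged over all pixels."""
    diff = np.asarray(target, dtype=np.float64) - np.asarray(pred, dtype=np.float64)
    return float(np.mean(diff ** 2))

def l2_distance(target, pred):
    """Euclidean (L2) distance between two images; equals sqrt(N * MSE)."""
    diff = np.asarray(target, dtype=np.float64) - np.asarray(pred, dtype=np.float64)
    return float(np.sqrt(np.sum(diff ** 2)))

# Three one-pixel "images" with intensities 0, 1 and 2.
a, b, c = np.array([0.0]), np.array([1.0]), np.array([2.0])

# L2 respects the triangle inequality: d(a, c) <= d(a, b) + d(b, c).
print(l2_distance(a, c), l2_distance(a, b) + l2_distance(b, c))  # 2.0 2.0
# MSE does not: 4.0 on the left, 1.0 + 1.0 on the right.
print(mse(a, c), mse(a, b) + mse(b, c))                          # 4.0 2.0
```

Since the two differ only by a monotonic transformation, they rank candidate predictions for a fixed target in the same order.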

Do models trained using only an MSE pixel loss adequately capture the image qualities we care about? Here, I present three simple experiments that explore different undesirable qualities of the MSE loss. For the first two experiments it is helpful to consider the task of image-to-image translation as a simplification of future prediction (prediction of one frame into the future). In fact, most of my conclusions extend to any problem that is commonly formulated with MSE loss.

# 1.1.1. MSE does not measure what we want

In this experiment, the image (a) is perturbed with different kinds of noise in such a way that all perturbed images have the same MSE. While image (d) has strong visible artifacts, the change in image (b) is almost imperceptible; however, both (b) and (d) have the same MSE with respect to the target (a). Clearly, the MSE metric does not perfectly align with our understanding of natural images.

Imagine a model that, with a target (a), outputs image (b), and another that outputs (d). Both models will observe the same magnitude of loss, while yielding qualitatively different results. Arguably, in a variety of settings the property that we care about is generation of plausible-looking results, instead of minimizing the MSE loss. In those cases, should we look for a loss that is lower when the model returns image (b) than when it returns image (d)?
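It is easy to construct such equal-MSE perturbations. The sketch below does not reproduce the perturbations used in the figure; it uses a random array as a stand-in for image (a) and two hypothetical perturbations (a uniform brightness shift and a concentrated blotch) that are scaled to the same MSE. Pixel values are left unclipped for simplicity.

```python
import numpy as np

def mse(x, y):
    return float(np.mean((x - y) ** 2))

rng = np.random.default_rng(0)
target = rng.random((64, 64))   # stand-in for the target image (a)
budget = 0.01                   # MSE that every perturbed image should have

# Perturbation 1: a small uniform brightness shift spread over all pixels.
shifted = target + np.sqrt(budget)

# Perturbation 2: the same error energy concentrated in an 8x8 corner patch.
patch = np.zeros_like(target)
patch[:8, :8] = 1.0
patch *= np.sqrt(budget / np.mean(patch ** 2))
blotched = target + patch

# Both perturbed images are equally far from the target under MSE,
# although the error is distributed very differently.
print(mse(target, shifted), mse(target, blotched))
```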

# 1.1.2. MSE fails to make good compromises

To dig deeper into this experiment, let’s consider how the model that outputs image (b) will “correct” itself over the course of training. Somewhat simplistically (footnote 2), let’s assume that the model has no output bias, so that the correction is always ideal with respect to the loss. The trajectory of the model output will then look like the top row of the figure below.

The top sequence shows the trajectory that the model output follows under the MSE loss as it learns to produce the correct image. The trajectory will simply consist of the linear interpolation of pixels. We see that while the image at the end of the trajectory is realistic, the model spends most of its time outputting blurry images. In practice, there is often not enough data, and the models themselves do not have the network capacity to achieve zero loss. Whenever this holds, we can safely assume that most models optimizing MSE will produce blurry images even after convergence.

MSE loss yields intermediate images of poor quality, even though there is a natural sequence of images that converges to the desired target while remaining realistic. Those are simply images shifted by one pixel each, and the sequence is depicted in the bottom row of the figure above. While a loss that produces trajectories like this is not known, it would be of great use, as it might produce more realistic images when used for training.
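The two trajectories can be compared on a toy example. This is a sketch under simplifying assumptions: the "image" is a binary vertical bar, the target is the same bar moved four pixels to the right, and the fraction of pixels with intermediate intensity (called `ghost` here, a name made up for this example) serves as a crude proxy for blur.

```python
import numpy as np

def mse(x, y):
    return float(np.mean((x - y) ** 2))

# Binary image of a vertical bar; the target is the same bar 4 pixels to the right.
start = np.zeros((16, 16))
start[:, 4:8] = 1.0
target = np.roll(start, 4, axis=1)

# Trajectory 1: pixel-wise linear interpolation between start and target.
blend = [(1 - t) * start + t * target for t in np.linspace(0, 1, 5)]
# Trajectory 2: shift the bar one pixel at a time.
shift = [np.roll(start, k, axis=1) for k in range(5)]

for b, s in zip(blend, shift):
    ghost = float(np.mean((b > 0) & (b < 1)))  # share of half-transparent pixels
    print(f"blend: mse={mse(b, target):.3f} ghost={ghost:.2f} | "
          f"shift: mse={mse(s, target):.3f} ghost=0.00")
```

At every intermediate step the blended image has a lower MSE than the shifted one, even though only the shifted images contain a single sharp bar. A model trained with MSE is therefore rewarded for passing through the blurry states.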

In the next sections, I will discuss the properties of the MSE loss that give rise to this inability to produce realistic images in setups with insufficient network capacity, and what steps might be taken to correct it.

# 1.1.3. The uncertain future

A specific property of the video prediction task is that it rarely has a unique solution: nobody can perfectly predict the future. How will models trained with MSE behave when the future is uncertain? Babaeizadeh et al. [6] consider a simple non-deterministic setup where a geometric object starts moving in a random direction. Models trained with MSE, such as the one depicted below, tend to output the mean image, as it is the one that minimizes the MSE loss. However, even in this simple case, the mean image is not a valid image: it is a blurry superposition of all possible futures.

# 1.2. Where do these properties come from and how can we fix them?

We see that the MSE loss has a number of unintuitive properties that we do not want the trained model to inherit. As it turns out, there are good reasons for these properties.

# 1.2.1. The problem of expectations

Under uncertainty, MSE loss will produce the mean image of all possible futures, as the mean is the global optimum. However, as we have seen before, the expected value of the true distribution of possible future images in pixel space might actually have a very low probability, i.e. it might not be a real image! If we want the generated image to appear realistic, we would be happier with the mode of the distribution of future images, or, at the very least, some point sampled from the true distribution. In the previous example, a point sampled from the distribution would be one of the eight possible trajectories, without blur.
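A toy version of this setup makes the argument concrete. The sketch below is not the experiment from [6]; it assumes a single bright pixel that moves two pixels in one of eight equally likely directions and looks only at one future frame.

```python
import numpy as np

size, center = 9, 4
directions = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

# One possible future frame per direction: a single bright pixel that has moved.
futures = np.zeros((len(directions), size, size))
for k, (dy, dx) in enumerate(directions):
    futures[k, center + 2 * dy, center + 2 * dx] = 1.0

def expected_mse(pred):
    """MSE of one prediction averaged over all equally likely futures."""
    return float(np.mean((futures - pred) ** 2))

mean_image = futures.mean(axis=0)

print("mean image:", expected_mse(mean_image))
print("one sharp sample:", expected_mse(futures[0]))
print("brightest pixel in the mean image:", mean_image.max())
```

The mean image achieves a lower expected MSE than any of the sharp futures, yet it contains eight faint dots instead of one bright dot, so it is not a frame that could ever occur.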

One way to achieve this is by using an adversarial loss. Generative Adversarial Networks (GANs) [3] do not suffer from producing the mean image, as it is easily distinguishable from the true images. This is closely connected to the fact that GANs optimize the Jensen-Shannon divergence ([3], Theorem 1), as opposed to the forward Kullback-Leibler divergence that is minimized by likelihood-based losses such as MSE.

Besides the instabilities of the GAN training procedure, a problem with the naive application of GANs is that the generated motion might become detached from the motion in the observed frames. While the basic GAN setup can produce realistic single images, it never encourages a sequence of such images to look like a realistic sequence. For example, each image of the generated sequence might contain different objects or a person performing different kinds of motion. Applying an additional MSE loss might alleviate this problem by grounding each generated frame with the observed frame, but that gets us back to the problems inherent with the MSE loss. An underexplored alternative to the use of MSE is to let the GAN discriminator see the past sequence as well as the predicted sequence. The adversarial loss will then naturally force the model to generate motion that matches the observed one.

On a side note, an alternative solution would be to model the whole range of possible futures, a possibility explored in a very nice recent work by Denton and Fergus [14]. The proposed model is trained using a procedure that encourages it to resolve the uncertainty about the next frame. At test time, the distribution of possible next frames is explicitly predicted by the model.

# 1.2.2. The problem of Gaussians

The second problem of the MSE loss is more fundamental. An important fact from signal processing tells us that, assuming that the data is corrupted with multivariate uncorrelated Gaussian noise, MSE loss recovers the optimal signal ([23], Section 3.1.1). In our case, the “noise” corrupting the videos consists of the factors that the model is unable to learn, or that are truly undetermined by the observed video. In particular, it is often hard for the models to learn small details such as facial features or precise hand motion, as in the figure below. Since the model is unable to learn such details, it will effectively treat them as noise.

The “noise” corrupting the images in the top row of the figure above is strongly correlated in space and time. The pixels corresponding to the person’s trousers could be all grey or all black, but are unlikely to be mixed. This shows that assuming an uncorrelated Gaussian distribution of noise in pixel space is completely unrealistic. We want a noise model that naturally suits the kinds of correlated variations that natural images exhibit.
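The link between the Gaussian assumption and MSE can be written out explicitly. As a sketch, assume that the target image $I_1$, flattened into a vector of $N$ pixels, equals the prediction $I_2$ plus zero-mean Gaussian noise with covariance $\Sigma$ (the symbols $\Sigma$ and $\sigma$ are notation introduced here for illustration). The negative log-likelihood of the target is then

$$
-\log p(I_1 \mid I_2) = \frac{1}{2} (I_1 - I_2)^\top \Sigma^{-1} (I_1 - I_2) + \frac{1}{2} \log \det (2 \pi \Sigma)
$$

For uncorrelated noise with equal variance in every pixel, $\Sigma = \sigma^2 \mathbb{1}$, this reduces to

$$
-\log p(I_1 \mid I_2) = \frac{1}{2 \sigma^2} \sum_{i,j} \left( I_1[i,j] - I_2[i,j] \right)^2 + \frac{N}{2} \log (2 \pi \sigma^2)
$$

which is the MSE up to a positive scale factor and an additive constant. Maximizing the likelihood under this assumption is therefore the same as minimizing the MSE, while a correlated noise model corresponds to a non-diagonal $\Sigma$ that treats joint changes of neighboring pixels differently from independent ones.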

An alternative way to view this problem comes up in the variational autoencoder (VAE, [22, 24]). There, we are trying to increase the conditional probability of the output given the latent. Similar to the previous argument, it turns out that the MSE loss gives the optimal solution when we assume that the distribution over possible futures is an uncorrelated Gaussian. This assumption again fails to match real distributions. For example, if the camera starts moving in the video, we might see a new object entering the scene or we might not. There is no middle ground where we see parts of the object or the object blurred with the background — pixels on the new object are correlated, and should be considered by the loss as such.

The figure above gives a good example of a model being unable to fit all the data. When this happens, the Gaussian assumption leads to solutions that are not expected by humans. In this case, the model does not fit the motion of a person that is similar in color to the background (bottom row in the figure above), as this yields less loss than spending capacity to fit these cases. If we had a loss function invariant to such changes, meaning that the model receives the same amount of loss for incorrect motion no matter what the background is, this would be avoided.

As a side note, a useful concept for understanding what could be a suitable metric comes from the mathematical Lie theory. If we think about the space of natural images as a smooth manifold, all we need to do is find a local linearization or a chart of it that would allow us to define a metric. For an example of metrics on non-trivial manifolds, see [12].

# 1.2.3. Alternatives to MSE

One option that avoids some of these problems is a perceptual loss [4], which is an MSE loss defined in the feature space produced by a network trained on an appropriate task, such as object classification. The use of such a network allows us to define a loss in a space that is perhaps more suitable than pixel space, in that it is trained to capture relevant information about the image in a way that allows linear classification. This might force the space to assume the form of a linearization of the image manifold. There is indirect support for this explanation in works such as [9], where the authors show that linear interpolations along certain directions in the feature space can indeed lead to meaningful images.
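Structurally, a perceptual loss only swaps the space in which the squared error is computed. The sketch below illustrates this with a deliberately trivial feature extractor: `edge_features` is a hypothetical stand-in made of finite differences, not a trained network, and in practice `features` would be the activations of a pretrained classifier as in [4].

```python
import numpy as np

def perceptual_loss(features, target, pred):
    """MSE between feature vectors instead of between raw pixels."""
    return float(np.mean((features(target) - features(pred)) ** 2))

def pixel_features(img):
    """Identity features: the perceptual loss reduces to the pixel MSE."""
    return img.ravel()

def edge_features(img):
    """Toy stand-in for a trained network: vertical and horizontal differences."""
    return np.concatenate([np.diff(img, axis=0).ravel(), np.diff(img, axis=1).ravel()])

rng = np.random.default_rng(0)
target = rng.random((32, 32))
brighter = target + 0.1                                            # uniform shift
noisy = target + 0.1 * rng.choice([-1.0, 1.0], size=target.shape)  # per-pixel noise

for name, pred in [("brighter", brighter), ("noisy", noisy)]:
    print(name,
          perceptual_loss(pixel_features, target, pred),
          perceptual_loss(edge_features, target, pred))
```

Both perturbations have the same pixel MSE, but the feature-space loss separates them: the uniform shift leaves the toy features unchanged, while the noise does not. Which perturbations a real perceptual loss ignores depends entirely on the network that provides the features.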

Alternatively, a perceptual loss may force the model to focus on the parts of the image that are useful for object classification. As human perception naturally focuses on the salient objects in the scene, we may be able to produce realistic images more easily by modeling the process of finding salient objects. If the end goal of the model is to produce images that will be viewed by humans, it makes sense to specifically search for a metric that closely aligns with human judgement.

Not all uses of predictive models are centered on human perception. However, we can adapt the metric depending on the use case that we imagine for the model. A simple idea that might already work is to train specific loss networks. If the goal of a generative model is to produce reasonable human motion, as in [20], we can train the loss network to detect human pose. If the goal, as in [18], is to predict the outcome of robotic manipulation, it makes sense to train the loss network for object detection. This will ensure that the loss that the model receives focuses on the parts of the output that are important for the task.

In the next section, I will look at how the phenomena we have examined here, namely (1) the assumption of an uncorrelated Gaussian distribution and (2) defining the metric in pixel space, influence how video prediction models are benchmarked.

TL;DR — they limit our ability to properly evaluate models.

# 2. Evaluation

While training losses deserve our attention, evaluation is arguably a more pressing problem for the video prediction community. Without good evaluation practices, it is easy to be deceived that a particular method or philosophy is better than another, while the evidence is in fact insufficient.

# 2.1 The devil is in the (pixel-level) details

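The most commonly used evaluation metrics are PSNR and SSIM. PSNR is a slight modification of MSE that has a logarithmic scale, is inverted, and rescaled:

$$
\mathrm{PSNR}(I_1, I_2) = 10 \log_{10} \frac{\mathrm{MAX}^2}{\mathrm{MSE}(I_1, I_2)}
$$

Here $\mathrm{MAX}$ is the maximum possible pixel value, for example 255 for 8-bit images.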

PSNR has the same problems MSE does, and a model optimized for MSE will perform well on the PSNR metric. For instance, consider these generated sequences:

While the motion in the bottom sequence captures perceptual features slightly better, the biggest difference in PSNR comes from the background. The bottom model uses multi-layer residual connections, which allows the model to closely match the pixels of the background. For this reason, we ended up using the bottom model in our work [21].

We have seen that modeling the background matters for the relatively simple KTH actions dataset with just four different background types. UCF101, another dataset that consists of videos from YouTube, has much more challenging statistics in terms of both the objects and the motion, and it is much harder for the models to learn any predictions on this dataset. For UCF101, there is a yet more striking experiment that Villegas et al. present [13]. Consider the following sequences:

It turns out that a simple baseline, which generates a video by copying the last observed frame, is comparable in quality to most recent models as assessed by PSNR. The reason is, again, the fidelity of the background: the sequence on the bottom exactly matches all pixels that do not experience motion, which scores as well under the PSNR metric as a model that faithfully tries to produce a realistic video.
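The effect is easy to reproduce on synthetic data. The following sketch is not the UCF101 experiment from [13]; it assumes a static random background with a small bright square moving across it, and compares the copy-last-frame baseline with a hypothetical model that predicts the motion perfectly but renders the background slightly too bright.

```python
import numpy as np

def psnr(target, pred, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    err = np.mean((target - pred) ** 2)
    return np.inf if err == 0 else float(10.0 * np.log10(max_val ** 2 / err))

rng = np.random.default_rng(0)
background = 0.5 * rng.random((32, 32))  # static texture with values in [0, 0.5)

def render(t, bg):
    """Frame t: a bright 2x2 square at column t, moving right one pixel per frame."""
    img = bg.copy()
    img[10:12, t:t + 2] = 1.0
    return img

steps = range(6, 11)
last_observed = render(5, background)
future = [render(t, background) for t in steps]

# Baseline: repeat the last observed frame, i.e. predict no motion at all.
copy_scores = [psnr(f, last_observed) for f in future]
# Hypothetical model: perfect motion, but the background is rendered 0.1 too bright.
model_scores = [psnr(f, render(t, background + 0.1)) for t, f in zip(steps, future)]

print("copy last frame:", np.mean(copy_scores))
print("correct motion, biased background:", np.mean(model_scores))
```

Because the moving object covers only a handful of pixels, the baseline that ignores motion entirely obtains the higher PSNR on every frame.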

These two examples show two kinds of architectural choices that lead to superior evaluation scores. By allowing the model to copy pixels from previous images, greater detail can be achieved, while including an explicit foreground-background segmentation makes it possible to model every future pixel on stationary objects with absolute precision.

Currently, all state-of-the-art models use these methods in their architectures to some degree. While I cannot argue whether this is good or bad, I would like to see more progress on modeling the key part that defines video data: the motion. It seems wrong to me that the metrics that we, as a community, evaluate our models on do not adequately reflect the motion quality.

In the next sections, I will describe the existing and potential alternatives to PSNR.

# 2.2. How not to fix it

For the sake of completeness, I need to mention the second commonly used quantitative metric, SSIM [5]. Motivated by shortcomings of PSNR, it is used to complement it and most recent works indeed show both metrics.

SSIM operates on patches, rather than individual pixels, and compares both their first- and second-order statistics. While more faithful to human perception than PSNR, SSIM still suffers from similar problems. In particular, it still assumes the independence of individual patches, and consequently it is unable to incorporate context beyond the patch window. Indeed, we observed that most often the model that performs better on PSNR will also perform better on SSIM and vice versa. As such, SSIM sometimes provides more understanding of the model performance, but, even combined with PSNR, it falls short of providing a comprehensive assessment of a model’s performance.
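For reference, the per-patch SSIM index defined in [5] makes the use of first- and second-order statistics explicit. For two patches $x$ and $y$ with means $\mu_x, \mu_y$, variances $\sigma_x^2, \sigma_y^2$ and covariance $\sigma_{xy}$:

$$
\mathrm{SSIM}(x, y) = \frac{(2 \mu_x \mu_y + C_1)(2 \sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
$$

The constants $C_1 = (K_1 L)^2$ and $C_2 = (K_2 L)^2$ stabilize the division, where $L$ is the dynamic range of the pixel values and $K_1 = 0.01$, $K_2 = 0.03$ are the defaults suggested in [5]. The image-level score is the average of this index over all patch positions, so each patch contributes independently of the others.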

Mathieu et al. [2] propose an interesting evaluation procedure, where only regions of the image that contain motion contribute to the evaluation metric. Regions that have low optical flow are masked out. While this metric does focus on motion instead of background, it still uses MSE for evaluating motion. As was shown in the Training section, MSE loss does not always correspond to human judgment.
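The idea of a motion-masked metric can be sketched in a few lines. This is not the procedure from [2]: instead of optical flow, the sketch uses a thresholded frame difference as a crude, hypothetical motion detector, and the function name and `threshold` parameter are made up for this example.

```python
import numpy as np

def motion_masked_mse(prev_frame, target, pred, threshold=0.05):
    """MSE restricted to pixels that changed between prev_frame and target."""
    moving = np.abs(target - prev_frame) > threshold
    if not moving.any():
        return 0.0
    return float(np.mean((target[moving] - pred[moving]) ** 2))

rng = np.random.default_rng(0)
background = 0.5 * rng.random((32, 32))  # static texture with values in [0, 0.5)

def render(t):
    """Frame t: a bright 2x2 square at column t on the static background."""
    img = background.copy()
    img[10:12, t:t + 2] = 1.0
    return img

prev_frame, target = render(5), render(6)
copied = prev_frame  # copy-last-frame prediction

full = float(np.mean((target - copied) ** 2))
masked = motion_masked_mse(prev_frame, target, copied)
print("full-frame MSE:", full)
print("motion-masked MSE:", masked)
```

Restricting the average to moving pixels removes the advantage that a static background gives to the copy baseline, but the remaining comparison is still a pixel-wise squared error.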

For the reasons outlined above, other work (e.g. [15]) has used the Inception score as a metric. The Inception score was introduced in [7] as a metric for generative models and measures how diverse the generated samples are as well as whether in each particular sample the sequence has a clear motion class. However, the Inception score also has problems of its own, some of which are reported in [8]. Moreover, the Inception score relies on an additional trained classification model, which is used as a reference model. Thus, like a perceptual loss, the Inception score is only as reliable as the reference model. This makes it considerably harder to interpret such metrics.
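Given the class probabilities that the reference classifier assigns to each generated sample, the score from [7] is the exponential of the average KL divergence between the per-sample distribution and the marginal class distribution. A minimal NumPy sketch follows; the classifier itself is assumed to exist elsewhere, and the `eps` smoothing term is an implementation detail added here to avoid taking the logarithm of zero.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: array of shape (num_samples, num_classes) holding p(y | x) per sample."""
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y), averaged over samples
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident and diverse: every sample is assigned to a different class.
print(inception_score(np.eye(4)))
# Confident but collapsed: every sample is assigned to the same class.
print(inception_score(np.tile([1.0, 0.0, 0.0, 0.0], (4, 1))))
# Diverse but unconfident: the classifier is undecided on every sample.
print(inception_score(np.full((4, 4), 0.25)))
```

The score is highest, approaching the number of classes, only in the first case, and drops to 1 in the other two. Everything the score measures passes through `probs`, which is why its reliability is bounded by that of the reference classifier.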

# 2.3. How to fix it

As it seems that no general solution is adequate for the evaluation of video prediction, the solution that I see is to use case-specific metrics instead. In fact, for other tasks, such as image-to-image translation, some recent works have largely adopted perceptual experiments with human observers, e.g. using the MTurk platform, as the sole metric [11]. MTurk experiments are the natural choice if the task of the model is to produce images that look realistic to humans.

While MTurk experiments are great for this purpose, many more possible uses exist for video prediction. Although video prediction is a general unsupervised learning task, in practice it is often used for representation learning with a distinct ultimate task in mind. Many researchers capture this with transfer learning experiments, where a model is fine-tuned on a new case-specific task [13, 15]. This is the evaluation that I am arguing for when performance on that downstream task is the goal.

While neither of these two solutions is new, many recent papers include neither MTurk nor transfer learning experiments. Hopefully, this post can make a case for increased usage of and focus on these evaluations as opposed to PSNR and SSIM.

In a yet more involved setup, as in the work of Chelsea Finn and colleagues [17,18,19], there is a clear, well-defined task that the video prediction model will ultimately be used for. Specifically, it can be used for general-purpose robotic control tasks. It seems natural to evaluate the model with this direct signal of robotic control performance.

# Summary

In summary, I am calling for operationalizing evaluation using well-defined tasks whenever possible, as this is currently the only way to achieve sound evaluation. As an example, if video synthesis is the task that the model will be used for, the corresponding evaluation is through human observers. As different methods may excel on different metrics, I would prefer to see more research with a clear metric in mind, as in [19]. While there is work to be done for general-purpose video prediction, structuring research with the ultimate purpose in mind will ensure an efficient design process for achieving the goals we care about.

# References

[1] Lucas Theis, Aäron van den Oord, Matthias Bethge: A note on the evaluation of generative models. International Conference on Learning Representations (ICLR) 2016.

[2] Michael Mathieu, Camille Couprie, Yann LeCun: Deep multi-scale video prediction beyond mean square error. International Conference on Learning Representations (ICLR) 2016.

[3] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio: Generative Adversarial Nets. Advances in Neural Information Processing Systems (NIPS), 2014.

[4] Justin Johnson, Alexandre Alahi, Li Fei-Fei: Perceptual Losses for Real-Time Style Transfer and Super-Resolution. European Conference on Computer Vision (ECCV), 2016.

[5] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, Eero P. Simoncelli: Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, 2004.

[6] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy Campbell, Sergey Levine: Stochastic Variational Video Prediction. International Conference on Learning Representations (ICLR) 2018.

[7] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen: Improved Techniques for Training GANs. Advances in Neural Information Processing Systems (NIPS), 2016.

[8] Shane Barratt, Rishi Sharma: A Note on the Inception Score. arXiv preprint arXiv: 1801.01973, 2018.

[9] Paul Upchurch, Jacob Gardner, Geoff Pleiss, Robert Pless, Noah Snavely, Kavita Bala, Kilian Weinberger: Deep Feature Interpolation for Image Content Changes. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[10] Alex Lamb, Vincent Dumoulin, Aaron Courville: Discriminative Regularization for Generative Models. DeepVision workshop at IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[11] Qifeng Chen, Vladlen Koltun: Photographic Image Synthesis with Cascaded Refinement Networks. International Conference on Computer Vision (ICCV), 2017.

[12] Du Q. Huynh: Metrics for 3D Rotations: Comparison and Analysis. Journal of Mathematical Imaging and Vision, 2009.

[13] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, Honglak Lee: Decomposing Motion and Content for Natural Video Sequence Prediction. International Conference on Learning Representations (ICLR), 2017.

[14] Emily Denton, Rob Fergus: Stochastic Video Generation with a Learned Prior. arXiv preprint arXiv: 1802.07687, 2018.

[15] Emily Denton, Vighnesh Birodkar: Unsupervised Learning of Disentangled Representations from Video. Advances in Neural Information Processing Systems (NIPS), 2017.

[16] William Lotter, Gabriel Kreiman, David Cox: Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning. International Conference on Learning Representations (ICLR), 2017.

[17] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, Sergey Levine: One-Shot Visual Imitation Learning via Meta-Learning. Conference on Robot Learning (CoRL), 2017.

[18] Frederik Ebert, Chelsea Finn, Alex X. Lee, Sergey Levine: Self-Supervised Visual Planning with Temporal Skip Connections. Conference on Robot Learning (CoRL), 2017.

[19] Chelsea Finn, Sergey Levine: Deep Visual Foresight for Planning Robot Motion. International Conference on Robotics and Automation (ICRA), 2017.

[20] Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, Honglak Lee: Learning to Generate Long-term Future via Hierarchical Prediction. International Conference on Machine Learning (ICML), 2017.

[21] Andrew Jaegle, Oleh Rybkin, Konstantinos G. Derpanis, Kostas Daniilidis: Predicting the Future with Transformational States. arXiv preprint arXiv: 1803.09760, 2018.

[22] Diederik P Kingma, Max Welling: Auto-Encoding Variational Bayes. International Conference on Learning Representations (ICLR), 2014.

[23] Christopher M. Bishop: Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006.

[24] Danilo Jimenez Rezende, Shakir Mohamed, Daan Wierstra: Stochastic Backpropagation and Approximate Inference in Deep Generative Models. International Conference on Machine Learning (ICML), 2014.

[25] Ian Goodfellow, Yoshua Bengio, Aaron Courville: Deep Learning. MIT Press, 2016.

# Footnotes

1. While the absolute difference loss is not commonly used for video prediction, it suffers from similar problems.
2. This simplification is not usually adequate, as there is some amount of inductive bias baked into the network. However, the inductive bias in current networks is still nowhere close to the stage where it can ensure that the optimization trajectory goes through only realistic images.

# Acknowledgments

I thank Drew Jaegle, Kosta Derpanis and Stephen Phillips for a lot of useful comments, suggestions and additions.
