Paper Summary: Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation

Mike Plotz Sage
5 min read · Nov 19, 2018


Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/04.

Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation (2017) https://arxiv.org/abs/1712.00080 Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, Jan Kautz

This paper is quite a bit more detailed than previous papers I’ve summarized — both technically and in terms of comparison to prior work — so in the interests of getting something onto the page I’ll aim to describe just the approach as is, with an emphasis on intuition rather than formalism. The topic is video interpolation — inserting frames between existing frames in video — and the results are impressive: see the video below for some inspiration.

The idea behind the paper is that the difference between two frames of video can be thought of, to a first approximation, as an optical flow. That is, most pixels in one frame correspond to pixels in the other, shifted in some direction. There are well-known algorithms for calculating optical flow between images (including CNN-based approaches). Once you have an optical flow you can interpolate intermediate frames simply by translating pixels along the flow, scaled by the intermediate time t. This can be improved further by incorporating the time-reversed optical flow. Taken together, the paper refers to these forward and backward flows as bi-directional optical flows.
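
To make the flow-scaling idea concrete, here's a minimal sketch of the naive interpolation. This is just the baseline intuition, not the paper's method; it assumes PyTorch, frames as (N, 3, H, W) tensors, pixel-unit flows as (N, 2, H, W) tensors, and all the names are my own placeholders.

```python
import torch
import torch.nn.functional as F


def backward_warp(image, flow):
    """Sample `image` at locations displaced by `flow`, with bilinear interpolation."""
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=image.dtype, device=image.device),
        torch.arange(w, dtype=image.dtype, device=image.device),
        indexing="ij",
    )
    x = xs.unsqueeze(0) + flow[:, 0]  # displaced x positions
    y = ys.unsqueeze(0) + flow[:, 1]  # displaced y positions
    # grid_sample expects coordinates normalized to [-1, 1]
    grid = torch.stack([2.0 * x / (w - 1) - 1.0, 2.0 * y / (h - 1) - 1.0], dim=-1)
    return F.grid_sample(image, grid, align_corners=True)


def naive_intermediate(frame0, frame1, flow_0_to_1, flow_1_to_0, t):
    """Crude frame at time t: warp each endpoint toward t, then blend linearly."""
    # Assuming roughly linear motion, scale the input flows by the time offset.
    flow_t_to_0 = t * flow_1_to_0          # from time t back to frame 0
    flow_t_to_1 = (1.0 - t) * flow_0_to_1  # from time t forward to frame 1
    warped0 = backward_warp(frame0, flow_t_to_0)
    warped1 = backward_warp(frame1, flow_t_to_1)
    return (1.0 - t) * warped0 + t * warped1
```

Pixels that are occluded in one of the inputs get blended with the wrong content here, which is exactly the failure mode the visibility maps below are meant to fix.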

Unfortunately this approach doesn’t look great on its own, especially near the boundaries of fast-moving objects, because some pixels are occluded in one frame or the other. (Occasionally there are intermediate pixels that are occluded in both start and end frames, but this is relatively rare.) There’s also a problem of blurring.

So the authors compute visibility maps (to track occlusions) while simultaneously refining the initial estimate of the bi-directional flow maps in a second CNN. Jointly training flow map refinement and visibility prediction improved performance significantly, which makes a certain amount of sense, since I'd expect the same features to predict both flow and occlusion. There were further improvements in the design of the loss function, which I'll get to in a bit, but first some architecture details.

The architecture consists of two networks: a flow computation network and a flow interpolation network (see screenshot below). The flow computation network is only responsible for calculating the bi-directional optical flow given two input frames. Then, for each intermediate frame at time t, the flow interpolation network calculates an improved estimate of the partial flows Ft→0 and Ft→1 (actually the residuals ΔFt→0 and ΔFt→1) as well as the visibility maps, which are combined into a prediction of the intermediate frame. Both networks are U-Net architectures (paper summarized yesterday), each with 5 downsampling (average pooling) layers and 5 bilinear upsampling layers, with conv-LeakyReLU-conv blocks in between and U-Net-style skip connections (see screenshot above). More details: the flow computation network's first two layers used large filters (7x7 and 5x5), and the flow interpolation network's skip connections used features from both networks for a slight improvement.
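
Here's a minimal sketch of the kind of visibility-weighted blend described above; warped0/warped1 stand in for g(I0, Ft→0) and g(I1, Ft→1), vis0/vis1 for the visibility maps, and all names are placeholders rather than the paper's code.

```python
import torch


def fuse(warped0, warped1, vis0, vis1, t, eps=1e-8):
    """Blend the two warped frames, down-weighting pixels that are occluded."""
    w0 = (1.0 - t) * vis0  # temporal closeness to frame 0, gated by visibility
    w1 = t * vis1          # temporal closeness to frame 1, gated by visibility
    return (w0 * warped0 + w1 * warped1) / (w0 + w1 + eps)
```

The (1 − t) and t factors bias the blend toward the temporally closer input, while the visibility maps pull occluded pixels' weight toward zero so the other frame dominates there.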

It’s worth taking a closer look at the inputs to the flow interpolation network, since I’ve left out some important details. The estimates of the partial flows

F̂t→0 = −(1−t)·t·F0→1 + t²·F1→0

and

F̂t→1 = (1−t)²·F0→1 − t·(1−t)·F1→0

are just the bi-directional flows between the input frames, linearly combined with weights that depend on the intermediate time t. The g(·, ·) function is a backward warping function (implemented with bilinear interpolation) that effectively reverses the effects of an optical flow on an image. You’ll notice that two applications of g, namely g(I0, F̂t→0) and g(I1, F̂t→1), become inputs to the second network. My interpretation: these are the pre-images of the two partial flows from the start and end frames. These pre-images should already be pretty decent approximations to the interpolated frame, so intuitively the flow interpolation network shouldn’t have to do much work to improve them.
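
Written as code, these initial estimates are just the two input flows mixed with t-dependent weights; a quick sketch (tensor names are my own):

```python
import torch


def approx_intermediate_flows(flow_0_to_1, flow_1_to_0, t):
    """Initial guesses for Ft→0 and Ft→1, before the refinement network."""
    flow_t_to_0 = -(1.0 - t) * t * flow_0_to_1 + t * t * flow_1_to_0
    flow_t_to_1 = (1.0 - t) ** 2 * flow_0_to_1 - t * (1.0 - t) * flow_1_to_0
    return flow_t_to_0, flow_t_to_1
```

Note that the estimates reduce to the right endpoints: at t = 0 the flow toward frame 0 vanishes, and at t = 1 the flow toward frame 1 vanishes.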

§3.1 of the paper has more details on the thinking behind the above — it’s a bit heavy on the math, but still relatively readable, so I encourage you to take a look if you’re curious. One confusing bit (or at least confusing to me) is that the discussion starts out assuming you have ground truth bi-directional flows between the start and end frames, and only later do they talk about computing these flows with the flow computation network. So be forewarned.

Loss function. The authors also put careful thought into the design of the loss function, which was a linear combination of four losses: reconstruction, perceptual, warping, and smoothness (a rough sketch of how they combine follows the list).

  • Reconstruction loss was an L1 loss in RGB space against the ground-truth intermediate frame
  • Perceptual loss minimized blurring by encouraging objects to be more object-like (that's my interpretation, anyway); they used an L2 loss on the conv4_3 features of a VGG16 net pretrained on ImageNet
  • Warping loss used the backward warping function g to compare images to their flow-warped counterparts, for all relevant image pairs
  • Smoothness loss penalized the L1 norm of the gradient of the bi-directional flows to encourage neighboring pixels to have similar flow values
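
Here's a rough sketch of how the four terms might be combined. The weights are illustrative placeholders (the paper uses its own values), the VGG16 truncation targets conv4_3, and the warped image pairs are assumed to come from a backward warp like the earlier sketch.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen feature extractor for the perceptual term (up to conv4_3 in
# torchvision's layer indexing). In practice the inputs would be
# ImageNet-normalized before this pass.
_vgg_conv4_3 = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:22].eval()
for p in _vgg_conv4_3.parameters():
    p.requires_grad_(False)


def smoothness(flow):
    """L1 norm of the flow's spatial gradients (neighboring pixels should agree)."""
    dx = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs().mean()
    dy = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs().mean()
    return dx + dy


def total_loss(pred, target, warped_pairs, flows,
               w_rec=1.0, w_per=0.1, w_warp=1.0, w_smooth=0.1):
    """Weighted sum of reconstruction, perceptual, warping, and smoothness terms."""
    rec = F.l1_loss(pred, target)  # L1 in RGB space vs. ground-truth frame
    per = F.mse_loss(_vgg_conv4_3(pred), _vgg_conv4_3(target))  # perceptual term
    # warped_pairs: list of (reference_image, its_flow_warped_counterpart) tensors
    warp = sum(F.l1_loss(warped, ref) for ref, warped in warped_pairs) / len(warped_pairs)
    smooth = sum(smoothness(f) for f in flows)  # over both bi-directional flows
    return w_rec * rec + w_per * per + w_warp * warp + w_smooth * smooth
```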

Ablation studies. All the cool kids (responsible kids?) are doing it these days. One interesting finding was that predicting multiple intermediate frames in a single pass improved performance. This is intuitively not too surprising, though the mechanism isn't clear to me. They also tested removing flow refinement, visibility maps, and the various parts of the loss function. Interestingly, the smoothness loss term hurt the quantitative metrics a bit, but its output looked better to human observers! Perhaps this suggests that the metrics (again PSNR, SSIM, and a thing called interpolation error (IE)) are subtly flawed?
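
For reference, here's a quick sketch of two of those metrics. I'm reading interpolation error (IE) as the root-mean-squared difference between the interpolated and ground-truth frames, and skipping SSIM since it's more involved (scikit-image has an implementation). Pixel values are assumed to be in [0, 255].

```python
import numpy as np


def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB (higher is better)."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)


def interpolation_error(pred, target):
    """Root-mean-squared pixel difference (lower is better)."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    return np.sqrt(np.mean(diff ** 2))
```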

There’s also some discussion of training details and data augmentation, as well as datasets and results, none of which was too surprising. I would have liked to see some indication of how long training took and on what kind of hardware, but perhaps NVIDIA is playing it close to the vest on this front.
