Review: AdaConv — Video Frame Interpolation via Adaptive Convolution (Video Frame Interpolation)

Using a CNN to Interpolate Frames, Instead of Optical Flow + Pixel Synthesis

Sik-Ho Tsang
Analytics Vidhya
May 4, 2020

Video Frame Interpolation by Authors (From their website)

In this story, Video Frame Interpolation via Adaptive Convolution (AdaConv), by Portland State University, is reviewed. (It is called AdaConv because the name is used on the authors' website rather than in the paper, and their later papers also cite it as AdaConv.)

Conventionally, a two-stage approach is used: first, optical flow is used to estimate the motion; then, pixels are synthesized based on the estimated motion. In this paper, a one-stage approach is used, i.e. a convolutional neural network (CNN) synthesizes the pixels directly. This is a paper in 2017 CVPR with over 130 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Video Frame Interpolation
  2. Pixel Interpolation by Convolution
  3. Network Architecture & Loss Function
  4. Shift-and-stitch Implementation
  5. Training & Hyper-Parameter Selection
  6. Experimental Results

1. Video Frame Interpolation

(a) Conventional Approach, (b) CNN Approach
  • Given two video frames I1 and I2, the interpolation method aims to interpolate a frame Î temporally in the middle of the two input frames.
  • (a) Conventionally, as shown at the top part of the figure, the motion of each pixel in the middle frame is estimated first; the pixels in the input frames pointed to by that motion are then used to interpolate the pixel Î(x, y) of the middle frame.
  • (b) In this paper, a CNN is used instead. Given a large receptive field around (x, y) in both input frames, the trained CNN produces the pixel Î(x, y) directly, combining motion estimation and pixel synthesis into a single data-driven step.

2. Pixel Interpolation by Convolution

Pixel Interpolation by Convolution
  • Specifically, to estimate the convolutional kernel K for the output pixel (x, y), CNN takes receptive field patches R1(x, y) and R2(x, y) as input, where R1(x, y) and R2(x, y) are both centered at (x, y) in the respective input images.
  • The patches P1 and P2 that the output kernel will convolve in order to produce the color for the output pixel (x, y) are co-centered at the same locations as these receptive fields, but with a smaller size, as illustrated in the above figure.
  • A receptive field larger than the patch is used to better handle the aperture problem in motion estimation.
  • The default receptive field size is 79×79 pixels. The convolution patch size is 41×41 and the kernel size is 41×82, since one kernel convolves with both patches (see the sketch below).
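
To make the per-pixel convolution concrete, here is a minimal NumPy sketch (not the authors' code) of how one estimated 41×82 kernel, split into two 41×41 halves, combines the two patches into a single output color. The function and variable names are my own.

```python
import numpy as np

PATCH = 41  # per-frame patch / sub-kernel size used in the paper

def interpolate_pixel(kernel_41x82, patch1, patch2):
    """kernel_41x82: (41, 82) non-negative weights that sum to 1.
    patch1, patch2: (41, 41, 3) RGB patches centered at (x, y) in I1 and I2."""
    k1 = kernel_41x82[:, :PATCH]   # sub-kernel applied to frame 1
    k2 = kernel_41x82[:, PATCH:]   # sub-kernel applied to frame 2
    # Weighted sum over both patches gives the output color at (x, y).
    color = (k1[..., None] * patch1).sum(axis=(0, 1)) \
          + (k2[..., None] * patch2).sum(axis=(0, 1))
    return color                   # shape (3,)

# Usage with random data:
rng = np.random.default_rng(0)
k = rng.random((41, 82)); k /= k.sum()   # normalized, like the softmax output
p1 = rng.random((41, 41, 3)); p2 = rng.random((41, 41, 3))
print(interpolate_pixel(k, p1, p2))
```

Because the kernel coefficients are non-negative and sum to one (see the softmax constraint in Section 3.1), the output color is always a convex combination of the input pixels.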

3. Network Architecture & Loss Function

3.1. Network Architecture

Network Architecture
  • The CNN consists of several convolutional layers as well as down-convolutions as alternatives to max-pooling layers. ReLU and Batch Normalization (BN) are used.
  • The network is fully convolutional. Therefore, it is not restricted to a fixed-size input.
  • A critical constraint is that the coefficients of the output convolution kernel should be non-negative and sum up to one. Therefore, the final convolutional layer is connected to a spatial softmax layer to output the convolution kernel, which implicitly meets this important constraint.
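
As a rough illustration only, the following PyTorch sketch builds a kernel-estimation network in this spirit: strided "down-convolutions" in place of pooling, ReLU + BN, and a spatial softmax over the 41×82 output so the coefficients are non-negative and sum to one. The layer counts, channel widths, and the pixel-wise input/output shapes are my own assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class KernelEstimator(nn.Module):
    """Pixel-wise sketch: input is the two 79x79 RGB receptive-field patches
    stacked into 6 channels; output is one 41x82 interpolation kernel."""
    def __init__(self, kernel_h=41, kernel_w=82):
        super().__init__()
        self.kernel_h, self.kernel_w = kernel_h, kernel_w
        def block(cin, cout, stride):
            # Strided convolution acts as a down-convolution instead of pooling.
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )
        self.features = nn.Sequential(
            block(6, 32, 2),     # 79 -> 40
            block(32, 64, 2),    # 40 -> 20
            block(64, 128, 2),   # 20 -> 10
            block(128, 256, 2),  # 10 -> 5
        )
        self.head = nn.Conv2d(256, kernel_h * kernel_w, 5)   # 5x5 -> 1x1

    def forward(self, receptive_fields):                     # (B, 6, 79, 79)
        x = self.head(self.features(receptive_fields))       # (B, 41*82, 1, 1)
        x = x.flatten(1)                                      # (B, 3362)
        # Spatial softmax: non-negative coefficients that sum to one.
        x = torch.softmax(x, dim=1)
        return x.view(-1, self.kernel_h, self.kernel_w)      # (B, 41, 82)

# Quick shape check:
net = KernelEstimator()
print(net(torch.randn(2, 6, 79, 79)).shape)   # torch.Size([2, 41, 82])
```

In the actual fully convolutional setting, the network produces one kernel per output location in a single pass; this sketch handles one 79×79 receptive field at a time for clarity.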

3.2. Loss Function

  • One possible loss function is the difference between the interpolated pixel color and the ground-truth color:

E_color = Σ_i ‖ [P_{i,1} P_{i,2}] ∗ K_i − C̃_i ‖₁

  • where C̃_i is the ground-truth color at (x_i, y_i) in the interpolated frame, P_{i,1} and P_{i,2} are the two input patches, and K_i is the estimated kernel.
  • However, this color loss alone leads to blurry results.
  • To solve this, a gradient loss is also considered:

E_gradient = Σ_i Σ_{k=1…8} ‖ [G^k_{i,1} G^k_{i,2}] ∗ K_i − G̃^k_i ‖₁

  • where k denotes one of the eight directions in which the gradient is computed.
  • G^k_{i,1} and G^k_{i,2} are the gradients of the input patches P_{i,1} and P_{i,2}.
  • And G̃^k_i is the ground-truth gradient at (x_i, y_i) in the interpolated frame.
  • The gradients of the input patches are computed and then convolved with the estimated kernel, which yields the gradient of the interpolated image at the pixel of interest.
  • Thus, the final loss is the color loss and the gradient loss added together: E = E_color + E_gradient (a sketch of this combined loss is given below).
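
A minimal PyTorch sketch of this combined loss follows. It uses an L1 penalty and, as a simplification, computes the eight-direction gradients on the predicted and ground-truth images directly via shifted differences, rather than pre-computing the patch gradients and convolving them with the kernel as described above; since convolution is linear, the two are closely related. All names here are my own.

```python
import torch
import torch.nn.functional as F

# The eight neighbor directions along which gradients are taken.
SHIFTS = [(0, 1), (0, -1), (1, 0), (-1, 0),
          (1, 1), (1, -1), (-1, 1), (-1, -1)]

def gradient(img, dy, dx):
    # Forward difference toward the (dy, dx) neighbor (wrap-around at the
    # border is ignored for simplicity in this sketch).
    return torch.roll(img, shifts=(dy, dx), dims=(-2, -1)) - img

def combined_loss(pred, target, grad_weight=1.0):
    color = F.l1_loss(pred, target)
    grad = sum(F.l1_loss(gradient(pred, dy, dx), gradient(target, dy, dx))
               for dy, dx in SHIFTS) / len(SHIFTS)
    return color + grad_weight * grad

# Usage on dummy 41x41 RGB patches:
p = torch.rand(4, 3, 41, 41, requires_grad=True)
t = torch.rand(4, 3, 41, 41)
combined_loss(p, t).backward()
```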

4. Shift-and-stitch Implementation

Shift-and-stitch
  • The shift-and-stitch [17, 32, 39] approach is used, in which slightly shifted versions of the same input are passed through the network. Each pass returns sparse results that can be combined to form the dense representation of the interpolated frame, as shown above.
  • Considering a frame with size 1280×720, a pixel-wise implementation of the neural network would require 921,600 forward passes through the neural network.
  • The shift-and-stitch implementation only requires 64 forward passes for the 64 differently shifted versions of the input.
  • Compared to the pixel-wise implementation that takes 104 seconds per frame on an Nvidia Titan X, the shift-and-stitch implementation only takes 9 seconds.
  • To handle the boundary problem, simple zero padding is used (a sketch of shift-and-stitch is given below).
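
A minimal NumPy sketch of shift-and-stitch is below. It assumes the fully convolutional network produces one output pixel per 8 input pixels in each dimension (8 × 8 = 64 shifted passes, matching the number above); `run_network` is a hypothetical stand-in for the actual CNN forward pass.

```python
import numpy as np

def shift_and_stitch(frame1, frame2, run_network, stride=8):
    """run_network(a, b) -> sparse output with one pixel per `stride` input
    pixels, assumed to have shape (ceil(H/stride), ceil(W/stride), 3)."""
    H, W, _ = frame1.shape
    out = np.zeros((H, W, 3), dtype=frame1.dtype)
    for dy in range(stride):               # 8 vertical shifts
        for dx in range(stride):           # 8 horizontal shifts -> 64 passes
            shifted1 = np.roll(frame1, (-dy, -dx), axis=(0, 1))
            shifted2 = np.roll(frame2, (-dy, -dx), axis=(0, 1))
            sparse = run_network(shifted1, shifted2)
            # Stitch the sparse result back into its interleaved positions.
            target = out[dy::stride, dx::stride]
            target[...] = sparse[:target.shape[0], :target.shape[1]]
    return out
```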

5. Training & Hyper-Parameter Selection

5.1. Training Dataset

  • Publicly available videos from Flickr with a Creative Commons license are used. 3,000 videos are downloaded using keywords such as “driving”, “dancing”, “surfing”, “riding”, and “skiing”, which yields a diverse selection. The downloaded videos are scaled to a fixed size of 1280×720 pixels.

5.2. Hyper-Parameter Selection

  • In theory, the convolution kernel must be larger than the pixel motion between two frames in order to capture the motion (implicitly) to produce a good interpolation result.
  • Hence, a large kernel should be chosen. On the other hand, a large kernel involves a large number of values to be estimated, which increases the complexity of the network.
  • A convolution kernel that is large enough to capture the largest motion in the training dataset, which is 38 pixels, is therefore chosen.
  • Thus, a convolution kernel of size 41×82, applied to two 41×41 patches, is used.
  • This kernel is a few pixels larger than 38 pixels to provide pixel support for re-sampling.
  • The larger receptive field is set to 79×79 using a validation dataset, which achieves a good balance (see the size check below).
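
The size choices above can be summarized with a quick arithmetic check (my own restatement of the numbers quoted in this section):

```python
# Numbers quoted in this section.
max_motion = 38        # largest motion in the training dataset (pixels)
patch = 41             # per-frame patch / sub-kernel size
receptive_field = 79   # chosen on a validation dataset

assert patch > max_motion                                     # 41 > 38
print("re-sampling margin:", patch - max_motion, "pixels")    # 3
print("full kernel shape:", (patch, 2 * patch))               # (41, 82)
print("extra context beyond the patch:", receptive_field - patch)  # 38
```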

6. Experimental Results

6.1. SOTA Comparison

Average Interpolation Error
  • The Middlebury optical flow benchmark is used for evaluation.
  • Among the over 100 methods reported in the Middlebury benchmark, the proposed method achieves the best results on Evergreen and Basketball, the 2nd best on Dumptruck, and the 3rd best on Backyard (at that time).

6.2. Qualitative evaluation

Blur
  • Blur: The proposed method produces sharper images, especially in regions with large motion.
Abrupt brightness change
  • Abrupt brightness change: The proposed method generates more visually appealing interpolation results than flow-based methods.
Occlusion
  • Occlusion: The proposed method adopts a learning approach to obtain proper convolution kernels that lead to visually appealing pixel synthesis results in occluded regions, whereas optical flow is generally unreliable or unavailable there.

6.3. Occlusion Handling

Occlusion Handling
  • The green x is visible in both frames and the kernel shows that the color of this pixel is interpolated from both frames.
  • In contrast, the pixel indicated by the red x is visible only in Frame 2. The sum of all the coefficients in the sub-kernel for Frame 1 is almost zero, which indicates that Frame 1 does not contribute to this pixel.
  • Similarly, the pixel indicated by the cyan x is visible only in Frame 1 (a small sketch of this sub-kernel analysis is given below).
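
This per-frame contribution can be read directly off an estimated kernel by summing its two halves; a small NumPy sketch (my own, for illustration):

```python
import numpy as np

def frame_contributions(kernel_41x82):
    """Split a 41x82 kernel into its two 41x41 sub-kernels and sum each half
    to see how much each input frame contributes to the output pixel."""
    k1 = np.asarray(kernel_41x82)[:, :41].sum()   # weight taken from Frame 1
    k2 = np.asarray(kernel_41x82)[:, 41:].sum()   # weight taken from Frame 2
    return k1, k2                                 # k1 + k2 ~ 1 (softmax output)

# E.g. a pixel visible only in Frame 2 (like the red x) would give k1 ~ 0.
```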

6.4. Edge-aware pixel interpolation

Convolution kernels
  • The above figure shows how the kernels adapt to image features.
  • First, for all these kernels, only a very small number of kernel elements have non-zero values. Furthermore, all these non-zero elements are spatially grouped together. This corresponds well with a typical flow-based interpolation method.
  • Second, for a pixel in a flat region such as the one indicated by the green x, its kernel only has two elements with significant values. This is also consistent with the flow-based interpolation methods.
  • Third, more interestingly, for pixels along image edges, such as the ones indicated by the red and cyan x, the kernels are anisotropic and their orientations align well with the edge directions.

6.5. Run Time & Memory

  • On a single Nvidia Titan X, this implementation takes about 2.8 seconds with 3.5 gigabytes of memory for a 640×480 image,
  • 9.1 seconds with 4.7 gigabytes for 1280×720, and
  • 21.6 seconds with 6.8 gigabytes for 1920×1080.

6.6. Kernel Size

  • The motion that the method can handle is necessarily limited by the convolution kernel size: any motion beyond 41 pixels cannot currently be handled by the system.
Interpolation of a stereo image
  • The above figure shows a pair of stereo images from the KITTI benchmark.
  • When using the proposed method to interpolate a middle frame between the left and right view, the car is blurred due to the large disparity (over 41 pixels), as shown in (c).
  • After downscaling the input images to half of their original size, the proposed method interpolates well, as shown in (d) (a sketch of this workaround is given below).
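
A sketch of this workaround, using OpenCV for resizing; `interpolate_frame` is a hypothetical stand-in for the full AdaConv pipeline, and the resize-back-up step is my own addition rather than something stated in the paper.

```python
import cv2  # OpenCV, assumed available for resizing

def interpolate_large_motion(frame1, frame2, interpolate_frame, scale=0.5):
    # Downscale so that the motion/disparity fits within the 41-pixel kernel.
    small1 = cv2.resize(frame1, None, fx=scale, fy=scale)
    small2 = cv2.resize(frame2, None, fx=scale, fy=scale)
    mid_small = interpolate_frame(small1, small2)
    # Upscale the interpolated frame back to the original resolution.
    h, w = frame1.shape[:2]
    return cv2.resize(mid_small, (w, h))
```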

6.7. Others

  • The proposed method is unable to interpolate a frame at an arbitrary time; it can only interpolate at the temporal midpoint t = 0.5.

During the days of coronavirus, let me take on the challenge of writing 30 stories again this month. Is it good? This is the 5th story this month. Thanks for visiting my story.

Reference

[2017 CVPR] [AdaConv]
Video Frame Interpolation via Adaptive Convolution

