Google’s Generative Video Compression Technique Outperforms Traditional Neural Video Compression

Synced · Published in SyncedReview · Aug 4, 2021

While the increasing use of video streaming and conferencing has enabled new entertainment and remote-work opportunities, efficiently reducing the associated data transmission loads remains a challenge for most existing video compression techniques.

Video compression is the process of reducing the number of bits needed to store a video, by exploiting temporal and spatial redundancies, while preserving visual quality. Recent research has demonstrated the promising potential of neural networks for this task, as they can outperform the widely used non-neural standard High Efficiency Video Coding (HEVC).

In a new paper, a Google Research team takes a step forward in this field, proposing a neural video compression method based on generative adversarial networks (GANs) that outperforms previous neural video compression methods.

The team summarizes their contributions as:

  1. To the best of our knowledge, this is the first neural compression approach that is competitive with HEVC in terms of visual quality, as measured in a user study. We also show that approaches that are competitive in terms of PSNR fare much worse in terms of visual quality.
  2. We propose a technique to mitigate temporal error accumulation when unrolling, via randomized shifting of the residual inputs followed by un-shifting of the outputs. The technique is motivated by a spectral analysis, and we demonstrate its effectiveness both in our system and on a toy linear CNN model.
  3. We explore correlations between visual quality as measured by the user study and available video quality metrics. To facilitate future research, we release reconstructions on the MCL-JCV video dataset along with all the data obtained from our user studies.

The team’s approach uses three strategies to obtain high-fidelity reconstructions: 1) Synthesize plausible details in the I-frame; 2) Propagate those details wherever possible and as sharp as possible; 3) For new content appearing in P-frames, synthesize plausible details.
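At a high level, these three strategies amount to an I-frame branch that synthesizes detail and a P-frame branch that propagates it via flow and fills in residual content. The following minimal Python sketch illustrates only that data flow; every function in it is a hypothetical identity or zero stub, not the paper's actual networks.

```python
# Minimal sketch of the high-level codec loop described above.
# All functions are hypothetical stand-ins (identity / zero stubs); they only
# illustrate the data flow: I-frame -> synthesized details,
# P-frames -> flow warping + residual synthesis.
import numpy as np

def encode_iframe(frame):
    """Stand-in for the HiFiC-style I-frame auto-encoder (identity here)."""
    return frame.copy()

def estimate_and_code_flow(prev_recon, frame):
    """Stand-in for UFlow + the flow auto-encoder: returns flow and confidence."""
    flow = np.zeros(frame.shape[:2] + (2,))      # zero-motion stub
    confidence = np.ones(frame.shape[:2])        # fully confident stub
    return flow, confidence

def warp_and_blur(prev_recon, flow, confidence):
    """Stand-in for bicubic warping + scale-space blurring (identity here)."""
    return prev_recon

def encode_residual(frame, prediction):
    """Stand-in for the residual auto-encoder + generator (lossless stub)."""
    return frame - prediction

def reconstruct_video(frames):
    recons = [encode_iframe(frames[0])]                 # 1) synthesize I-frame details
    for frame in frames[1:]:
        flow, conf = estimate_and_code_flow(recons[-1], frame)
        prediction = warp_and_blur(recons[-1], flow, conf)   # 2) propagate details
        residual = encode_residual(frame, prediction)         # 3) synthesize new content
        recons.append(prediction + residual)
    return recons

video = np.random.rand(5, 64, 64, 3)          # toy 5-frame clip
print(len(reconstruct_video(video)))          # -> 5
```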

The proposed I-frame branch used to synthesize plausible details is based on a lightweight version of the architecture used in HiFiC, in which an encoder CNN maps the input image to a quantized latent. At a high level, the P-frame branch used to propagate those details comprises auto-encoders for both the flow and the residual. On the encoder side, the team employs a powerful optical flow predictor network, UFlow. The estimated flow is encoded into a quantized, entropy-coded flow latent, while the generator predicts both a reconstructed flow and a confidence mask. Intuitively, this mask predicts the per-pixel accuracy of the flow and determines how strongly each pixel is blurred by the “scale-space blur” component described next.

The approach first warps the previous reconstruction with the compressed flow using bicubic warping, then applies scale-space blurring, a lightweight variation of the “scale-space flow” approach that enables a more efficient implementation. Together, bicubic warping and blurring help to propagate details as sharply as possible while blurring smoothly where the flow is unreliable.
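As a rough illustration of scale-space blurring, the sketch below blurs the warped frame at a few fixed sigmas and linearly interpolates between adjacent blur levels per pixel. Mapping the confidence mask to a blur level as (1 − confidence) is an assumption made here for illustration; the paper's exact sigmas and mapping may differ.

```python
# Sketch of per-pixel scale-space blurring (assumes a grayscale image and a
# confidence mask in [0, 1]; the paper's exact sigmas/mapping may differ).
import numpy as np
from scipy.ndimage import gaussian_filter

def scale_space_blur(warped, confidence, sigmas=(0.0, 1.0, 2.0, 4.0)):
    # Build a stack of progressively blurred copies of the warped frame.
    stack = np.stack([gaussian_filter(warped, s) for s in sigmas])  # (L, H, W)

    # Low confidence -> blur more: map confidence to a fractional level index.
    levels = (1.0 - confidence) * (len(sigmas) - 1)                 # (H, W)
    lo = np.floor(levels).astype(int)
    hi = np.minimum(lo + 1, len(sigmas) - 1)
    frac = levels - lo

    # Per-pixel linear interpolation between the two nearest blur levels.
    rows, cols = np.indices(warped.shape)
    return (1 - frac) * stack[lo, rows, cols] + frac * stack[hi, rows, cols]

warped = np.random.rand(32, 32)        # toy warped reconstruction
confidence = np.random.rand(32, 32)    # toy per-pixel flow confidence
print(scale_space_blur(warped, confidence).shape)   # (32, 32)
```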

To synthesize plausible details for new content appearing in P-frames, the proposed approach employs the same lightweight version of the HiFiC architecture for the residual auto-encoder, and introduces an additional source of information for the residual generator that enables it to synthesize high-frequency details from the residual latent.
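As a sketch of what "an additional source of information" for the residual generator could look like in code, the snippet below simply concatenates an extra conditioning tensor channel-wise with the decoded residual-latent features; treating the warped previous reconstruction as that extra signal is an assumption made here for illustration only.

```python
# Sketch of feeding an extra conditioning signal to the residual generator.
# Assumption for illustration: the extra signal is the warped previous
# reconstruction (resized to the feature resolution), concatenated
# channel-wise with the decoded residual-latent features.
import numpy as np

def residual_generator_input(residual_latent_feats, extra_context):
    """residual_latent_feats: (H, W, C1); extra_context: (H, W, C2)."""
    return np.concatenate([residual_latent_feats, extra_context], axis=-1)

feats = np.random.rand(16, 16, 64)     # toy decoded residual-latent features
context = np.random.rand(16, 16, 3)    # toy extra conditioning signal
print(residual_generator_input(feats, context).shape)   # (16, 16, 67)
```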

The researchers also propose a technique to mitigate temporal error accumulation, which is crucial for obtaining high visual quality. To this end, and motivated by a spectral analysis, they adopt a training scheme that randomly shifts the residual inputs and then un-shifts the outputs.
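A minimal sketch of the shift/un-shift idea during training: circularly shift the residual input by a random offset, run it through the network, then shift the output back. The circular shift (np.roll) and the identity stand-in for the network are illustrative assumptions, not the paper's exact scheme.

```python
# Sketch of the randomized shift / un-shift step used to curb temporal
# error accumulation. The "model" is an identity stub for illustration.
import numpy as np

rng = np.random.default_rng(0)

def model(x):
    """Stand-in for the residual auto-encoder (identity here)."""
    return x

def shifted_forward(residual, max_shift=8):
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(residual, (dy, dx), axis=(0, 1))   # random spatial shift
    out = model(shifted)                                  # run the network
    return np.roll(out, (-dy, -dx), axis=(0, 1))          # undo the shift

residual = np.random.rand(64, 64)
print(np.allclose(shifted_forward(residual), residual))   # True for the identity stub
```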

The team evaluated their model on 30 diverse videos from MCL-JCV, which span a wide variety of motion from natural video, computer animation and classical animation. They compared their approach with an “MSE-only” baseline, “Scale-Space Flow” (SSF), and the non-learned HEVC, reporting results on non-overlapping 256×256 patches using the Perceptual Information Metric (PIM), an unsupervised perceptual quality metric introduced by Bhardwaj et al. in 2020.
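For reference, a patch-based evaluation of this kind can be set up by cutting each frame into non-overlapping 256×256 crops, as in the sketch below; dropping partial border patches here is an assumption, not necessarily how the authors handle frame borders.

```python
# Sketch of cutting a frame into non-overlapping 256x256 patches for
# patch-based metric evaluation. Partial border patches are dropped here,
# which is an illustrative assumption.
import numpy as np

def non_overlapping_patches(frame, size=256):
    h, w = frame.shape[:2]
    patches = []
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            patches.append(frame[y:y + size, x:x + size])
    return np.stack(patches)

frame = np.random.rand(1080, 1920, 3)            # a toy 1080p frame
print(non_overlapping_patches(frame).shape)      # (28, 256, 256, 3)
```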

Overall, the study shows that the proposed method is competitive with HEVC and outperforms previous neural video compression codecs, validating the promising potential of GANs for improving video compression performance.

The paper Towards Generative Video Compression is on arXiv.

Author: Hecate He | Editor: Michael Sarazen, Chain Zhang

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
