Review — Zhong ELECGJ’21: A GAN-Based Video Intra Coding (HEVC Intra)

Outperforms IPFCN, IPCNN, and Spatial RNN. Lower Complexity Than Zhu TMM’20

Published in Nerd For Tech, Mar 6, 2021

In this story, A GAN-Based Video Intra Coding (Zhong ELECGJ’21), by Sun Yat-Sen University, Southern Marine Science and Engineering Guangdong Laboratory, and Peng Cheng Laboratory, is briefly reviewed. In this paper:

A GAN is used to map the adjacent reconstructed signals to the prediction unit, in order to improve intra prediction accuracy.

This is a paper in 2021 ELECGJ, MDPI Journal of Electronics and Its Applications, with an impact factor of 2.412 (2019). (Sik-Ho Tsang @ Medium)

Outline

Proposed GAN
Experimental Results

1. Proposed GAN

**Proposed generative adversarial network framework**

The generator G predicts the coding block, while the discriminator D acts as a critic that judges whether a unit is genuine or generated.
The input is a 24 × 24 picture in which the bottom-right 16 × 16 region is the block to be predicted, while the remaining pixels are original pixels.
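The input layout described above can be sketched in a few lines. This is only an illustration, not the authors' code: the function name, the `hole` map, and the fill value of 0.0 are assumptions, and the 16 × 16 block size follows the size fed to the local discriminator (it is exposed as a parameter).

```python
CONTEXT = 24  # side length of the generator input, as stated in the text


def make_generator_input(patch, block=16, fill=0.0):
    """Blank out the bottom-right block of a CONTEXT x CONTEXT luma patch.

    Returns (masked, hole); hole[y][x] is 1 for pixels the generator must
    predict and 0 for the known context pixels.
    """
    assert len(patch) == CONTEXT and all(len(row) == CONTEXT for row in patch)
    start = CONTEXT - block  # first row/column of the block to predict
    hole = [[1 if y >= start and x >= start else 0 for x in range(CONTEXT)]
            for y in range(CONTEXT)]
    # Known pixels are kept, pixels to predict are replaced by `fill` (assumed).
    masked = [[fill if hole[y][x] else patch[y][x] for x in range(CONTEXT)]
              for y in range(CONTEXT)]
    return masked, hole
```

With the default block size, the known context is an L-shaped band 8 pixels wide above and to the left of the block.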

1.1. Generator

A 2-stage coarse-to-fine generator is used.
The coarse one shares the same parameters with the refinement network.
Compared to “Generative image inpainting with contextual attention”, some downsampling and dilated convolutions are removed, since the input is a small patch rather than a whole picture.
The contextual attention layer is also removed.
Exponential Linear Unit (ELU) is used as the activation for each convolution, except in the last layer.
The output of the last layer is clipped to [-1, 1].

1.2. Discriminator

The discriminator consists of a global discriminator and a local discriminator.
The global discriminator takes the whole 24 × 24 picture as input to judge the overall coherence of the completed image, while the local discriminator takes only the 16 × 16 block to be predicted as input to enhance regional consistency.
All convolutions use a 5×5 kernel and a stride of 2.

1.3. Loss Function

A pixel-wise l1 loss is used instead of Mean Square Error (MSE).
Since pixels closer to the known neighboring pixels have stronger spatial correlation with them, a spatially weighted l1 loss is introduced using a weight mask m.
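As a rough illustration of a spatially weighted l1 loss, the sketch below weights each pixel by a geometric decay in its distance to the known pixels. The decay form and the value gamma = 0.99 are assumptions borrowed from the spatially discounted loss of the inpainting paper cited above; the weight mask m actually used in the paper may be defined differently. Both function names are hypothetical.

```python
def weight_mask(block=16, gamma=0.99):
    """Hypothetical weights: gamma ** d, where d is the distance in pixels to
    the nearest known row above or known column to the left of the block."""
    return [[gamma ** (min(y, x) + 1) for x in range(block)]
            for y in range(block)]


def weighted_l1(pred, target, m):
    """Mean of m * |pred - target| over all pixels of the block."""
    total = 0.0
    count = 0
    for p_row, t_row, m_row in zip(pred, target, m):
        for p, t, w in zip(p_row, t_row, m_row):
            total += w * abs(p - t)
            count += 1
    return total / count
```

Pixels in the top row and left column of the block, which sit next to the known context, get the largest weight, and the bottom-right corner gets the smallest.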
Wasserstein GAN is adopted to improve the training stability of the GAN:
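For reference, the standard WGAN objective can be written as follows. The notation is the usual one from the WGAN literature and is assumed here; the paper's own symbols may differ.

$$
\min_{G} \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim \mathbb{P}_r}\left[D(x)\right] - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\left[D(\tilde{x})\right]
$$

Here $\mathcal{D}$ is the set of 1-Lipschitz functions, $\mathbb{P}_r$ is the distribution of real samples, and $\mathbb{P}_g$ is the distribution of samples $\tilde{x} = G(z)$ produced by the generator from its input $z$.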

More specifically, Wasserstein GAN with Gradient Penalty (WGAN-GP) is used, which extends WGAN with a gradient penalty term:
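In the standard WGAN-GP formulation (notation assumed here, not taken from the paper), the penalty term is:

$$
\lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}} \left( \left\lVert \nabla_{\hat{x}} D(\hat{x}) \right\rVert_2 - 1 \right)^2
$$

where $\hat{x}$ is sampled uniformly along straight lines between pairs of real and generated samples, and $\lambda$ weights the penalty.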

Since only the coding block at the bottom-right corner is predicted, the gradient penalty term should be applied only to pixels within the predicted block:
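Following the inpainting formulation that this design builds on, the masked penalty is commonly written as below. This is a reconstruction under that assumption, and the paper's exact form may differ.

$$
\lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}} \left( \left\lVert \nabla_{\hat{x}} D(\hat{x}) \odot (1 - m) \right\rVert_2 - 1 \right)^2
$$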

where m is a binary mask that takes the value 0 inside the bottom-right region, and ⊙ denotes pixel-wise multiplication.
The overall adversarial loss:

(Please read Wasserstein GAN for more details.)

1.4. Training Strategy

The training dataset is the New York city library dataset, which consists of a total of 2550 pictures of various sizes.
By traversing and cropping these pictures, a total of 2.4 million training images are obtained.
Unlike Zhu TMM’20, the original pixels fetched from the ground truth images are used for training.
Only the luminance component is used.
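The traverse-and-crop step can be pictured as a sliding window over each picture. The sketch below only enumerates crop positions. The 24 × 24 crop size matches the generator input, but the stride is an assumption, since the review does not state it, and the function name is hypothetical.

```python
def crop_positions(height, width, size=24, stride=24):
    """Top-left (y, x) corners of all size x size crops that fit in the picture.

    `stride` is a hypothetical parameter; a smaller stride gives overlapping
    crops and therefore more training samples per picture.
    """
    return [(y, x)
            for y in range(0, height - size + 1, stride)
            for x in range(0, width - size + 1, stride)]
```

For example, a 48 × 72 picture yields 6 non-overlapping crops with `stride=24` and 28 overlapping crops with `stride=8`.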

1.5. Integration into HEVC

**Illustration of the luma mode derivation.**

The proposed mode is treated as an additional prediction mode alongside the 35 conventional intra prediction modes within the CU intra mode.

**Illustration of the mode signaling for the luma modes.**

One signaling bit indicates whether a conventional intra mode or the proposed mode is used.
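A minimal sketch of how such a flag could be decided at the encoder is shown below, assuming the usual rate-distortion comparison. The function name, the dictionary layout, and the convention that a flag of 1 selects the GAN mode are all assumptions for illustration; they are not taken from the paper or from HM.

```python
def choose_luma_mode(conventional_costs, gan_cost):
    """Pick the luma mode with the lowest rate-distortion cost.

    conventional_costs: dict mapping a conventional intra mode index (0..34)
    to its RD cost; gan_cost: RD cost of the GAN prediction. Both are assumed
    to already include the cost of the one-bit flag.
    """
    best_mode = min(conventional_costs, key=conventional_costs.get)
    if gan_cost < conventional_costs[best_mode]:
        # Flag set (assumed value 1): no conventional mode index is needed.
        return {"gan_flag": 1, "mode": None}
    # Flag cleared: the conventional mode is signaled as usual.
    return {"gan_flag": 0, "mode": best_mode}
```

On a tie, this sketch keeps the conventional mode, which is an arbitrary choice.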

2. Experimental Results

HM-16.15 is used with the all-intra configuration.
The proposed stage_2 strategy outperforms the stage_1 strategy in all test cases: on the luminance component, stage_2 achieves an average BD-rate reduction of 1.6%, while stage_1 achieves an average of 1.2%.
This demonstrates the effectiveness of the two-stage coarse-to-fine generator network.

The SOTA approaches used for comparison are dedicated to 8 × 8 block prediction.
For this comparison, the proposed approach is reconfigured: the GAN still predicts the 16 × 16 block, but only 8 × 8 blocks can use the GAN intra prediction. When it is used, the 8 × 8 block copies the pixels at its corresponding location within the 16 × 16 predicted block.
In this setting, the proposed approach achieves a better coding gain and outperforms previous similar works: IPFCN [15], IPCNN [17], and Spatial RNN [18–19].
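The copy step described above amounts to slicing the 16 × 16 prediction. In this sketch, the function name and the convention that (x0, y0) is the top-left corner of the 8 × 8 block relative to the predicted 16 × 16 region are assumptions.

```python
def sub_block(pred16, x0, y0, size=8):
    """Copy the size x size block whose top-left corner is (x0, y0),
    given relative to the 16 x 16 predicted region."""
    assert 0 <= x0 <= len(pred16[0]) - size and 0 <= y0 <= len(pred16) - size
    return [row[x0:x0 + size] for row in pred16[y0:y0 + size]]
```

For instance, `sub_block(pred16, 8, 0)` returns the top-right 8 × 8 quadrant of the prediction.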