Review — Zhong ELECGJ’21: A GAN-Based Video Intra Coding (HEVC Intra)

Outperforms IPFCN, IPCNN, and Spatial RNN. Lower Complexity Than Zhu TMM’20

Sik-Ho Tsang
Published in Nerd For Tech
5 min read · Mar 6, 2021


In this story, A GAN-Based Video Intra Coding, (Zhong ELECGJ’21), by Sun Yat-Sen University, Southern Marine Science and Engineering Guangdong Laboratory, and Peng Cheng Laboratory, is briefly reviewed. In this paper:

  • GAN is used as a mapping from the adjacent reconstructed signals to the prediction unit, to enhance intra prediction accuracy.

This is a paper in 2021 ELECGJ, MDPI Journal of Electronics and Its Applications, with impact factor of 2.412 (2019). (Sik-Ho Tsang @ Medium)


  1. Proposed GAN
  2. Experimental Results

1. Proposed GAN

Proposed generative adversarial network framework
  • The generator G is used for predicting the coding block while the discriminator D is a critic to distinguish whether the generated unit is genuine or artificial.
  • The input is a 24 × 24 picture in which the bottom-right 16 × 16 region is the block to be predicted, while the remaining pixels are original pixels.
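As a rough illustration of this input layout, here is a minimal NumPy sketch that blanks the bottom-right block of a 24 × 24 region. The function name `build_generator_input`, the zero fill value, and returning a mask alongside the picture are assumptions for illustration, not details from the paper. The default block size of 16 follows the block size given for the local discriminator.

```python
import numpy as np

def build_generator_input(region, block=16, fill=0.0):
    """Blank the bottom-right block of a square region (hypothetical helper).

    Returns the masked picture and a mask that is 1 on known context
    pixels and 0 on the pixels to be predicted.
    """
    size = region.shape[0]
    assert region.shape == (size, size) and block < size
    mask = np.ones((size, size), dtype=np.float32)
    mask[size - block:, size - block:] = 0.0      # block to be predicted
    masked = region.astype(np.float32) * mask + fill * (1.0 - mask)
    return masked, mask

# Example: a random 24 x 24 luma patch already scaled to [-1, 1].
region = np.random.default_rng(0).uniform(-1.0, 1.0, size=(24, 24))
x, m = build_generator_input(region)
```

Only the top 8 rows and the left 8 columns carry information in `x`; the generator has to fill in the rest.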

1.1. Generator

  • A 2-stage coarse-to-fine generator is used.
  • The coarse one shares the same parameters with the refinement network.
  • Compared to “Generative image inpainting with contextual attention”, some downsampling and dilated convolutions are removed because the input is a small patch rather than a whole picture.
  • The context attention layer is also removed.
  • Exponential Linear Unit (ELU) is used as the activation for each convolution, except the last layer.
  • The output of the last layer is clipped to [-1, 1].
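The activation and the output clipping mentioned above can be sketched in a few lines of NumPy. This is only an illustration of the two operations, not the authors' network code. The helper `to_8bit`, which maps the clipped output back to 8-bit sample values, is an assumption about how [-1, 1] relates to pixel values.

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential Linear Unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=np.float64)
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def output_layer(x):
    """Clip the last layer's output to the range [-1, 1]."""
    return np.clip(x, -1.0, 1.0)

def to_8bit(y):
    """Assumed mapping from [-1, 1] back to 8-bit samples in [0, 255]."""
    return np.round((np.asarray(y) + 1.0) * 127.5).astype(np.uint8)

activated = elu([-1.0, 0.0, 2.0])
clipped = output_layer(np.array([-3.0, 0.5, 3.0]))
```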

1.2. Discriminator

Local Discriminator
Global Discriminator
  • The discriminator consists of a global discriminator and a local discriminator.
  • The global discriminator adopts the whole 24 × 24 picture as input to determine the overall coherence of the completed image, while the local discriminator takes just the 16 × 16 block to be predicted as input to enhance the regional consistency.
  • All convolutions use a 5×5 kernel and a stride of 2.
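To get a feel for how quickly stride-2 convolutions shrink such small inputs, the sketch below tracks the spatial size through a stack of layers. It assumes "same" padding, so each layer outputs ceil(size / stride), and it uses three layers purely as an example. The padding scheme and the layer count are assumptions, not values stated in the review.

```python
import math

def strided_conv_sizes(size, num_layers, stride=2):
    """Spatial size after each stride-`stride` convolution with 'same' padding."""
    sizes = [size]
    for _ in range(num_layers):
        size = math.ceil(size / stride)
        sizes.append(size)
    return sizes

global_sizes = strided_conv_sizes(24, 3)   # global discriminator input: 24 x 24
local_sizes = strided_conv_sizes(16, 3)    # local discriminator input: 16 x 16
```

Under these assumptions the feature maps are already only a few pixels wide after three layers, which fits the earlier remark that some downsampling is unnecessary for small inputs.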

1.3. Loss Function

  • Pixel-wise l1 loss is used instead of Mean Square Error (MSE).
  • Considering the fact that closer pixels have stronger spatial correlation, spatially weighted l1 loss is introduced using a weight mask m.
  • Wasserstein GAN is considered for improving the GAN stability:
  • More specifically, Wasserstein GAN with Gradient Penalty (WGAN-GP) is used, an improved version of WGAN that adds a gradient penalty term:
  • Since only the coding block at the bottom-right corner is predicted, the gradient penalty term should be applied only to samples within the predicted block:
  • where m is a binary mask that takes the value 0 inside the bottom-right region, and ⊙ denotes pixel-wise multiplication.
  • The overall adversarial loss:
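The loss terms in this section can be sketched with NumPy as follows. This is a sketch under stated assumptions rather than the paper's formulation. The weight mask here decays as gamma raised to the distance from the nearest known pixel, which is the scheme used in the inpainting paper cited earlier; gamma = 0.99 and the normalisation by the weight sum are assumptions. The penalty follows the standard WGAN-GP form lam * (||grad||_2 - 1)^2 with the commonly used lam = 10, and it restricts the gradient to the predicted block via (1 - m), given that m is 0 inside that block. The gradient of the discriminator with respect to its input is assumed to come from an autodiff framework and is passed in as a plain array.

```python
import numpy as np

def distance_weight_mask(size=24, block=16, gamma=0.99):
    """Hypothetical weights: gamma ** (distance to the nearest known pixel)."""
    w = np.zeros((size, size))
    start = size - block
    steps = np.arange(1, block + 1)                 # 1 = adjacent to context
    dist = np.minimum(steps[:, None], steps[None, :])
    w[start:, start:] = gamma ** dist
    return w

def weighted_l1(pred, target, w):
    """Spatially weighted l1 loss, normalised by the total weight."""
    return np.sum(w * np.abs(pred - target)) / np.sum(w)

def masked_gradient_penalty(grad, m, lam=10.0):
    """lam * (||grad * (1 - m)||_2 - 1)^2 for a single sample.

    grad: gradient of D w.r.t. its input (assumed precomputed).
    m:    binary mask, 0 inside the predicted block, 1 elsewhere.
    """
    norm = np.sqrt(np.sum((grad * (1.0 - m)) ** 2))
    return lam * (norm - 1.0) ** 2

w = distance_weight_mask()
m = np.ones((24, 24))
m[8:, 8:] = 0.0
```

Pixels next to the reconstructed context get a weight of 0.99, while the bottom-right corner pixel gets 0.99 ** 16, so errors close to the known pixels count slightly more.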

1.4. Training Strategy

  • The training dataset is the New York city library dataset, which consists of a total of 2550 pictures of various sizes.
  • By traversing and cropping these pictures, a total of 2.4 million training images are obtained.
  • Unlike Zhu TMM’20, original pixels taken from the ground-truth images are used for training.
  • Only luminance is used.
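The "traversing and cropping" step can be pictured as a sliding window over each luma picture. The sketch below is a minimal version of that idea; the 24 × 24 patch size matches the generator input described earlier, while the stride and the scaling to [-1, 1] are assumptions, and `crop_patches` and `normalize` are hypothetical helper names.

```python
import numpy as np

def crop_patches(luma, size=24, stride=24):
    """Slide a size x size window over a luma picture and stack the crops."""
    h, w = luma.shape
    patches = [
        luma[y:y + size, x:x + size]
        for y in range(0, h - size + 1, stride)
        for x in range(0, w - size + 1, stride)
    ]
    if not patches:
        return np.empty((0, size, size), dtype=luma.dtype)
    return np.stack(patches)

def normalize(patches):
    """Assumed scaling of 8-bit samples to [-1, 1]."""
    return patches.astype(np.float32) / 127.5 - 1.0

# Example with a synthetic 48 x 72 luma picture.
luma = (np.arange(48 * 72) % 256).astype(np.uint8).reshape(48, 72)
patches = crop_patches(luma)
dense = crop_patches(luma, stride=8)
```

A smaller stride yields overlapping patches and therefore many more training samples from the same picture.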

1.5. Integration into HEVC

Illustration of the luma mode derivation.
  • The proposed mode is treated as an additional prediction mode alongside the 35 intra prediction modes within the CU intra mode.
Illustration of the mode signaling for the luma modes.
  • One signaling bit indicates whether a conventional intra mode or the proposed mode is used.
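The mode decision and signaling described above can be sketched as a simple rate-distortion comparison. This is a simplified illustration: the function names, the flag polarity (1 for the proposed mode), and the plain list of 35 costs are assumptions, and the actual coding of the conventional mode index in HEVC is abstracted away behind a callback.

```python
def choose_luma_mode(conventional_costs, gan_cost):
    """Pick the cheaper of the best conventional mode and the GAN mode.

    conventional_costs: RD costs of the 35 conventional intra modes.
    Returns (flag, mode): flag 1 means the proposed mode is used
    (assumed polarity), and mode is then None.
    """
    best = min(range(len(conventional_costs)),
               key=conventional_costs.__getitem__)
    if gan_cost < conventional_costs[best]:
        return 1, None
    return 0, best

def parse_luma_mode(flag, read_conventional_mode):
    """Decoder side: the flag decides whether a conventional mode follows."""
    if flag == 1:
        return "GAN"
    return read_conventional_mode()

costs = [10.0] * 35
costs[26] = 3.0
decision_gan = choose_luma_mode(costs, gan_cost=2.5)
decision_conv = choose_luma_mode(costs, gan_cost=4.0)
```

The extra bit is paid for every intra block, including those that keep a conventional mode, so the proposed mode has to win often enough to offset that overhead.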

2. Experimental Results

BD-Rate (%)
  • HM-16.15 is used with the all-intra configuration.
  • The proposed stage_2 strategy outperforms the stage_1 strategy in all test cases. The proposed stage_2 strategy achieves an average of 1.6% BD-rate reduction while the stage_1 strategy achieves an average of 1.2% BD-rate reduction on the luminance component.
  • This demonstrates the effectiveness of the two-stage coarse-to-fine generator network.
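For readers less familiar with the metric, BD-rate measures the average bitrate difference between two codecs at equal quality. The sketch below follows the usual Bjøntegaard procedure: fit a cubic polynomial of log-rate as a function of PSNR for each codec, integrate both over the overlapping PSNR range, and convert the average difference to a percentage. It is a generic illustration, not the evaluation script used in the paper, and the rate and PSNR points in the example are made up.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) of test vs anchor at equal PSNR."""
    log_a = np.log(np.asarray(rate_anchor, dtype=float))
    log_t = np.log(np.asarray(rate_test, dtype=float))
    # Cubic fit of log-rate as a function of PSNR for each codec.
    poly_a = np.polyfit(psnr_anchor, log_a, 3)
    poly_t = np.polyfit(psnr_test, log_t, 3)
    # Integrate over the PSNR interval covered by both curves.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(poly_a), hi) - np.polyval(np.polyint(poly_a), lo)
    int_t = np.polyval(np.polyint(poly_t), hi) - np.polyval(np.polyint(poly_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0

# Made-up example: the test codec needs 10% fewer bits at every PSNR point.
psnr = [32.0, 35.0, 38.0, 41.0]
anchor_rate = [1000.0, 2000.0, 4000.0, 8000.0]
test_rate = [0.9 * r for r in anchor_rate]
result = bd_rate(anchor_rate, psnr, test_rate, psnr)
```

A negative value means the test codec saves bitrate, so a "1.6% BD-rate reduction" corresponds to a result of about -1.6.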
Comparisons with SOTA approaches
  • The above SOTA approaches are dedicated to 8 × 8 block prediction.
  • For comparison with these approaches, the proposed approach is adapted: the GAN still predicts the 16 × 16 block, but only 8 × 8 blocks can use the GAN intra prediction. When it is used, the 8 × 8 block copies the pixels at the corresponding location from the predicted 16 × 16 block.
  • As shown above, the proposed approach achieves a better coding gain and outperforms previous similar works: IPFCN [15], IPCNN [17], and Spatial RNN [18–19].
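The copy step for the 8 × 8 comparison can be written as a simple slice. In this sketch, `x_off` and `y_off` are the position of the 8 × 8 block inside the predicted 16 × 16 block; how that position is derived in the codec is not spelled out in the review, so it is passed in directly as an assumption.

```python
import numpy as np

def copy_sub_block(pred16, x_off, y_off, size=8):
    """Copy the size x size pixels at (x_off, y_off) from a predicted block."""
    assert 0 <= x_off <= pred16.shape[1] - size
    assert 0 <= y_off <= pred16.shape[0] - size
    return pred16[y_off:y_off + size, x_off:x_off + size].copy()

# Example: take the top-right 8 x 8 quadrant of a dummy 16 x 16 prediction.
pred16 = np.arange(256).reshape(16, 16)
sub = copy_sub_block(pred16, x_off=8, y_off=0)
```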
Comparison with Zhu TMM’20
  • Although the BD-rate reduction of the proposed method is smaller than that of Zhu TMM’20, its encoder and decoder complexities are much lower.


