Review — Zhong ELECGJ’21: A GAN-Based Video Intra Coding (HEVC Intra)

Outperforms IPFCN, IPCNN, and Spatial RNN. Lower Complexity Than Zhu TMM’20

Sik-Ho Tsang
Mar 6 · 5 min read

In this story, A GAN-Based Video Intra Coding (Zhong ELECGJ’21), by Sun Yat-Sen University, Southern Marine Science and Engineering Guangdong Laboratory, and Peng Cheng Laboratory, is briefly reviewed. In this paper:

  • GAN is used as a mapping from the adjacent reconstructed signals to the prediction unit, to enhance intra prediction accuracy.

This is a paper in 2021 ELECGJ, MDPI Journal of Electronics and Its Applications, with impact factor of 2.412 (2019). (Sik-Ho Tsang @ Medium)


  1. Proposed GAN
  2. Experimental Results

1. Proposed GAN

Proposed generative adversarial network framework
  • The generator G is used for predicting the coding block while the discriminator D is a critic to distinguish whether the generated unit is genuine or artificial.
  • The input is a 24 × 24 picture in which the bottom-right 16 × 16 region is the block to be predicted, while the remaining pixels are the adjacent reconstructed pixels.
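
As a rough illustration of this input layout, the sketch below masks the bottom-right block of a 24 × 24 patch and builds the matching binary mask. It takes the 16 × 16 block size from the local discriminator description; the function name `make_generator_input`, the fill value, and the list-of-rows representation are assumptions made for this example, not details from the paper.

```python
# Sketch only: sizes follow the text (24x24 input, 16x16 block to predict).
# The fill value and the helper name are assumptions for illustration.
INPUT_SIZE = 24
BLOCK_SIZE = 16
CONTEXT = INPUT_SIZE - BLOCK_SIZE  # 8 rows above and 8 columns to the left


def make_generator_input(patch, fill=0.0):
    """Return (masked_patch, mask) for a 24x24 patch given as a list of rows.

    mask is 1 for known context pixels and 0 inside the block to predict.
    """
    if len(patch) != INPUT_SIZE or any(len(row) != INPUT_SIZE for row in patch):
        raise ValueError("expected a 24x24 patch")
    masked, mask = [], []
    for y, row in enumerate(patch):
        masked_row, mask_row = [], []
        for x, value in enumerate(row):
            unknown = y >= CONTEXT and x >= CONTEXT  # bottom-right block
            masked_row.append(fill if unknown else value)
            mask_row.append(0 if unknown else 1)
        masked.append(masked_row)
        mask.append(mask_row)
    return masked, mask


# Toy patch with a smooth gradient, values in [0, 1].
patch = [[(x + y) / 46.0 for x in range(INPUT_SIZE)] for y in range(INPUT_SIZE)]
masked, mask = make_generator_input(patch)
unknown_count = sum(row.count(0) for row in mask)
print(unknown_count)  # number of pixels the generator has to predict
```

The mask uses 0 for the unknown block, which matches the convention of the binary mask described later for the gradient penalty.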

1.1. Generator

  • A 2-stage coarse-to-fine generator is used.
  • The coarse one shares the same parameters with the refinement network.
  • Compared to “Generative image inpainting with contextual attention”, some downsampling and dilated convolutions are removed since the input size is small, not the whole picture.
  • The contextual attention layer is also removed.
  • Exponential Linear Unit (ELU) is used as the activation after each convolution, except the last layer.
  • At the last layer, the output is clipped to [-1, 1].
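
The activation and the output clipping can be sketched in a few lines. ELU and clipping to [-1, 1] come from the text; the helper `to_pixel` and its linear mapping from [-1, 1] to 8-bit sample values are assumptions added here to show how a clipped output might be turned back into a pixel.

```python
import math


def elu(x, alpha=1.0):
    """Exponential Linear Unit: identity for x > 0, alpha*(exp(x)-1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def clip_output(x, low=-1.0, high=1.0):
    """Clip the last-layer output to [-1, 1], as described in the text."""
    return max(low, min(high, x))


def to_pixel(value, bit_depth=8):
    """Assumed linear mapping from [-1, 1] to [0, 2**bit_depth - 1]."""
    max_val = (1 << bit_depth) - 1
    return int(round((clip_output(value) + 1.0) * 0.5 * max_val))


print(elu(2.0), elu(-1.0))
print(to_pixel(1.7), to_pixel(-3.0))  # out-of-range outputs saturate
```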

1.2. Discriminator

Local Discriminator
Global Discriminator
  • For discriminator, there is a global discriminator and a local discriminator.
  • The global discriminator adopts the whole 24 × 24 picture as input to determine the overall coherence of the completed image, while the local discriminator takes just the 16 × 16 block to be predicted as input to enhance the regional consistency.
  • All convolutions use a 5×5 kernel with a stride of 2.
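
To make the two discriminator inputs concrete, the sketch below crops the local 16 × 16 block from a 24 × 24 picture and tracks how the spatial size shrinks under stride-2 convolutions. The number of layers (3) and the "same"-style padding rule are assumptions for illustration; the review does not state them.

```python
import math


def crop_local(image, block=16):
    """Bottom-right block x block region, the local discriminator input."""
    return [row[-block:] for row in image[-block:]]


def conv_out_size(size, stride=2):
    """Output size of a strided convolution, assuming 'same'-style padding."""
    return math.ceil(size / stride)


def feature_sizes(size, num_layers):
    """Spatial size after each of num_layers stride-2 convolutions."""
    sizes = [size]
    for _ in range(num_layers):
        sizes.append(conv_out_size(sizes[-1]))
    return sizes


image = [[y * 24 + x for x in range(24)] for y in range(24)]
local = crop_local(image)
print(len(local), len(local[0]))
print(feature_sizes(24, 3))  # global branch, assumed 3 layers
print(feature_sizes(16, 3))  # local branch, assumed 3 layers
```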

1.3. Loss Function

  • Pixel-wise l1 loss is used instead of Mean Square Error (MSE).
  • Considering the fact that closer pixels have stronger spatial correlation, spatially weighted l1 loss is introduced using a weight mask m.
  • Wasserstein GAN (WGAN) is adopted to improve the training stability of the GAN.
  • More specifically, WGAN with Gradient Penalty (WGAN-GP) is used, which extends WGAN with a gradient penalty term.
  • Since only the coding block at the bottom-right corner is predicted, the gradient penalty is applied only to samples within the predicted block.
  • In this masked penalty, m is a binary mask that takes the value 0 inside the bottom-right region, and ⊙ denotes pixel-wise multiplication.
  • The overall adversarial loss combines the WGAN loss with this masked gradient penalty.
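
For reference, the losses named above can be written in their standard forms from the WGAN-GP and image-inpainting literature. This is a sketch under assumptions, not a transcription of the paper's equations: the symbols P_r (real distribution), P_g (generator distribution), x̂ (samples interpolated between real and generated pictures), and the penalty weight λ are the usual WGAN-GP notation and are not taken from the review.

```latex
% WGAN objective (standard form)
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim P_r}\left[ D(x) \right]
  - \mathbb{E}_{\tilde{x} \sim P_g}\left[ D(\tilde{x}) \right]

% WGAN-GP gradient penalty (standard form)
\lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
  \left( \left\| \nabla_{\hat{x}} D(\hat{x}) \right\|_2 - 1 \right)^2

% Penalty restricted to the predicted block, with m = 0 inside the
% bottom-right region so that (1 - m) selects only that region
\lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
  \left( \left\| \nabla_{\hat{x}} D(\hat{x}) \odot (1 - m) \right\|_2 - 1 \right)^2
```

With m equal to 0 inside the predicted block, the factor (1 − m) keeps only the gradients that fall inside that block, which is what restricting the penalty to the predicted region requires.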

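A spatially weighted l1 loss can be sketched as follows. The review does not give the weight mask, so the form used here (weight γ^d, where d is the distance in pixels to the nearest known context pixel, with γ = 0.99) is an assumption borrowed from the image-inpainting literature; the paper's actual mask may differ, and the function names are hypothetical.

```python
def weight_mask(block=16, gamma=0.99):
    """Assumed weights: gamma**d, d = distance to the nearest context pixel.

    The block sits at the bottom-right corner, so known pixels lie above and
    to the left; pixel (y, x) is min(y, x) + 1 pixels away from them.
    """
    return [[gamma ** (min(y, x) + 1) for x in range(block)] for y in range(block)]


def weighted_l1(pred, target, weights):
    """Mean of w * |pred - target| over all pixels of the block."""
    total, count = 0.0, 0
    for p_row, t_row, w_row in zip(pred, target, weights):
        for p, t, w in zip(p_row, t_row, w_row):
            total += w * abs(p - t)
            count += 1
    return total / count


weights = weight_mask()
pred = [[0.5] * 16 for _ in range(16)]
target = [[0.0] * 16 for _ in range(16)]
print(weighted_l1(pred, target, weights))
```

Pixels next to the context get the largest weight, reflecting the stronger spatial correlation of closer pixels.
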
1.4. Training Strategy

  • The training dataset is the New York city library dataset, which consists of 2550 pictures of various sizes.
  • By traversing and cropping these pictures, a total of 2.4 million training images are obtained.
  • Unlike Zhu TMM’20, the original pixels fetched from the ground truth images are used for training.
  • Only luminance is used.
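
The traversing-and-cropping step might look like the sketch below, which slides a 24 × 24 window over a luminance plane and normalizes each patch to [-1, 1] to match the clipped generator output. The stride, the normalization, and the function names are assumptions; the review only states that the pictures are traversed and cropped.

```python
def crop_patches(luma, size=24, stride=24):
    """Slide a size x size window over a luma plane (list of rows)."""
    height = len(luma)
    width = len(luma[0]) if height else 0
    patches = []
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            patches.append([row[left:left + size] for row in luma[top:top + size]])
    return patches


def normalize(patch, bit_depth=8):
    """Assumed mapping of sample values to [-1, 1]."""
    max_val = (1 << bit_depth) - 1
    return [[2.0 * v / max_val - 1.0 for v in row] for row in patch]


# Toy 48x72 luma plane (height x width) with 8-bit values.
luma = [[(x + y) % 256 for x in range(72)] for y in range(48)]
non_overlapping = crop_patches(luma)          # stride 24
overlapping = crop_patches(luma, stride=8)    # denser traversal
print(len(non_overlapping), len(overlapping))
```

A smaller stride yields many more overlapping patches from the same picture, which is one way a few thousand pictures could produce millions of training images.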

1.5. Integration into HEVC

Illustration of the luma mode derivation.
  • The proposed mode is treated as an additional prediction mode alongside the 35 HEVC intra prediction modes within the CU intra mode.
Illustration of the mode signaling for the luma modes.
  • One signaling bit is used to indicate the use of the conventional intra mode or the use of proposed mode.
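
The mode decision and the one-bit signaling can be sketched as below. Only the existence of a single flag bit comes from the text; the flag polarity (1 = proposed mode), its position before the conventional mode, and the 6-bit fixed-length code for the conventional mode are placeholders chosen for this example. Real HEVC signals the conventional luma mode through most-probable-mode syntax, which is not modeled here.

```python
def to_bits(value, width):
    """Fixed-length binary code, most significant bit first."""
    return [(value >> shift) & 1 for shift in range(width - 1, -1, -1)]


def choose_luma_mode(conventional_costs, gan_cost):
    """Pick the cheaper of the best conventional mode and the GAN mode.

    conventional_costs maps an intra mode index (0..34) to its RD cost.
    """
    best_mode = min(conventional_costs, key=conventional_costs.get)
    if gan_cost < conventional_costs[best_mode]:
        return {"gan_flag": 1, "mode": None}
    return {"gan_flag": 0, "mode": best_mode}


def write_luma_mode(decision):
    """Emit the flag bit, then a placeholder code for the conventional mode."""
    bits = [decision["gan_flag"]]
    if decision["gan_flag"] == 0:
        bits.extend(to_bits(decision["mode"], 6))  # placeholder, not HEVC syntax
    return bits


def read_luma_mode(bits):
    """Inverse of write_luma_mode; returns (decision, bits consumed)."""
    if bits[0] == 1:
        return {"gan_flag": 1, "mode": None}, 1
    mode = int("".join(str(b) for b in bits[1:7]), 2)
    return {"gan_flag": 0, "mode": mode}, 7


costs = {0: 120.0, 1: 100.0, 26: 90.0}
conventional = choose_luma_mode(costs, gan_cost=95.0)
proposed = choose_luma_mode(costs, gan_cost=80.0)
print(write_luma_mode(conventional), write_luma_mode(proposed))
```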

2. Experimental Results

BD-Rate (%)
  • HM-16.15 is used with the all-intra configuration.
  • The proposed stage_2 strategy outperforms stage_1 strategy in all test cases. The proposed stage_2 strategy achieves an average of 1.6% BD-rate reduction while the stage_1 strategy achieves an average of 1.2% BD-rate reduction on the luminance component.
  • It demonstrates the effectiveness of the two-stage coarse-to-fine generator network.
Comparisons with SOTA approaches
  • The above SOTA approaches are dedicated to 8 × 8 block prediction.
  • For this comparison, the proposed approach is adapted. The GAN still predicts the 16 × 16 block, but only 8 × 8 blocks can use the GAN intra prediction. When it is used, the 8 × 8 block copies the pixels at its own location from the 16 × 16 predicted block.
  • As shown above, the proposed approach achieves a better coding gain and outperforms previous similar works: IPFCN [15], IPCNN [17], and Spatial RNN [18–19].
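
The copy step for the 8 × 8 comparison can be sketched as follows. It assumes the 16 × 16 prediction is aligned to a 16-sample grid, so the offset of an 8 × 8 block inside it is its picture coordinate modulo 16; that alignment and the function names are assumptions for this example.

```python
def sub_block(pred16, x_offset, y_offset, size=8):
    """Copy a size x size region out of the 16x16 GAN prediction."""
    return [row[x_offset:x_offset + size] for row in pred16[y_offset:y_offset + size]]


def predict_8x8(pred16, block_x, block_y):
    """Prediction for the 8x8 block at luma position (block_x, block_y).

    Assumes the 16x16 prediction covers the 16-aligned area containing it.
    """
    return sub_block(pred16, block_x % 16, block_y % 16)


# Toy 16x16 prediction whose value encodes its own position.
pred16 = [[y * 16 + x for x in range(16)] for y in range(16)]
bottom_right = predict_8x8(pred16, block_x=24, block_y=8)
bottom_left = predict_8x8(pred16, block_x=32, block_y=40)
print(bottom_right[0][0], bottom_left[0][0])
```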
Comparison with Zhu TMM’20
  • Although the BD-rate reduction of the proposed method is smaller than that of Zhu TMM’20, its encoder and decoder complexities are much lower.
