Published in Nerd For Tech

Review — Zhong ELECGJ’21: A GAN-Based Video Intra Coding (HEVC Intra)

Outperforms IPFCN, IPCNN, and Spatial RNN. Lower Complexity Than Zhu TMM’20

In this story, A GAN-Based Video Intra Coding (Zhong ELECGJ’21), by Sun Yat-Sen University, Southern Marine Science and Engineering Guangdong Laboratory, and Peng Cheng Laboratory, is briefly reviewed. In this paper:

  • A GAN is used to map the adjacent reconstructed signals to the prediction unit, in order to improve intra prediction accuracy.

This is a paper in 2021 ELECGJ, MDPI Journal of Electronics and Its Applications, with impact factor of 2.412 (2019). (Sik-Ho Tsang @ Medium)


  1. Proposed GAN
  2. Experimental Results

1. Proposed GAN

Proposed generative adversarial network framework
  • The generator G is used for predicting the coding block while the discriminator D is a critic to distinguish whether the generated unit is genuine or artificial.
  • The input is a 24 × 24 picture in which the bottom-right 16 × 16 region is the block to be predicted, while the remaining pixels are the known neighboring context.
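
The input layout described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `make_generator_input` is hypothetical, the block size is left as a parameter (16 is used here to match the 16 × 16 block quoted for the local discriminator), and whether the mask itself is also fed to the generator is not stated in the text.

```python
import numpy as np

def make_generator_input(patch, block):
    """Blank out the bottom-right block x block region of a square patch.

    Returns the masked patch and a binary mask that is 1 on the known
    context and 0 on the region to be predicted.
    """
    size = patch.shape[0]
    assert patch.shape == (size, size) and 0 < block < size
    mask = np.ones((size, size), dtype=patch.dtype)
    mask[size - block:, size - block:] = 0  # region the generator must fill
    return patch * mask, mask

# Illustrative usage: a random 24 x 24 patch scaled to [-1, 1].
rng = np.random.default_rng(0)
patch = rng.uniform(-1.0, 1.0, size=(24, 24)).astype(np.float32)
masked, mask = make_generator_input(patch, block=16)
```

The mask uses 0 for the predicted region, which matches the convention used later for the gradient penalty mask.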

1.1. Generator

  • A 2-stage coarse-to-fine generator is used.
  • The coarse one shares the same parameters with the refinement network.
  • Compared to “Generative image inpainting with contextual attention”, some downsampling and dilated convolutions are removed since the input size is small, not the whole picture.
  • The contextual attention layer is also removed.
  • Exponential Linear Unit (ELU) is used as the activation after each convolution, except the last layer.
  • At the last output layer, the output is clipped to [-1, 1].
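
As a small illustration of the two element-wise operations mentioned above, here is a NumPy sketch of ELU and of the output clipping. It reads the clipping range as [-1, 1] and assumes the usual ELU slope `alpha = 1.0`, which the text does not specify. It does not reproduce the generator itself.

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential Linear Unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=np.float64)
    # np.minimum keeps expm1 from overflowing on large positive inputs.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def clip_output(y):
    """Clip the last-layer output to the [-1, 1] range."""
    return np.clip(y, -1.0, 1.0)

hidden = elu(np.array([-2.0, 0.0, 3.0]))
out = clip_output(np.array([-1.7, 0.25, 1.3]))
```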

1.2. Discriminator

Local Discriminator
Global Discriminator
  • Two discriminators are used: a global discriminator and a local discriminator.
  • The global discriminator takes the whole 24 × 24 picture as input to judge the overall coherence of the completed image, while the local discriminator takes only the 16 × 16 block to be predicted as input to enforce regional consistency.
  • All convolutions use a 5×5 kernel and a stride of 2.
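
To get a feel for how quickly 5×5, stride-2 convolutions shrink the two inputs, the standard output-size formula can be applied repeatedly. The padding of 2 and the depth of three layers are assumptions made for this sketch, since the text gives neither.

```python
def conv_out_size(n, kernel=5, stride=2, padding=2):
    """Spatial size after one convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

def feature_sizes(n, num_layers):
    """Input size followed by the size after each stacked convolution."""
    sizes = [n]
    for _ in range(num_layers):
        sizes.append(conv_out_size(sizes[-1]))
    return sizes

global_sizes = feature_sizes(24, 3)  # global discriminator input
local_sizes = feature_sizes(16, 3)   # local discriminator input
```

Under these assumptions each layer halves the spatial size, so only a few layers are needed before the feature map is a handful of pixels wide.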

1.3. Loss Function

  • Pixel-wise l1 loss is used instead of Mean Square Error (MSE).
  • Considering the fact that closer pixels have stronger spatial correlation, spatially weighted l1 loss is introduced using a weight mask m.
  • Wasserstein GAN (WGAN) is adopted to improve the stability of GAN training.
  • More specifically, Wasserstein GAN with Gradient Penalty (WGAN-GP) is used, which extends WGAN with a gradient penalty term.
  • Since only the coding block at the bottom-right corner is predicted, the gradient penalty is applied only to samples within the predicted block. This is done with a binary mask m that takes the value 0 inside the bottom-right region, combined with the gradient through pixel-wise multiplication (⊙).
  • The overall adversarial loss is then formed from these terms.
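
The two less standard ingredients above, the spatially weighted l1 loss and the masked gradient penalty, can be sketched in NumPy. Several things here are assumptions rather than facts from the paper: the weight mask is taken to be `gamma ** d`, with d the distance to the nearest known pixel (the form used in “Generative image inpainting with contextual attention”), `gamma = 0.99` and `lam = 10.0` are illustrative values, and the penalty is written as lam · (‖∇D(x̂) ⊙ (1 − m)‖₂ − 1)², which is what a mask equal to 0 inside the predicted block implies. The critic gradient is passed in as an array, so no autograd framework is needed.

```python
import numpy as np

def spatial_weight_mask(block, gamma=0.99):
    """Weights gamma**d, d = distance to the nearest known pixel (assumed form)."""
    idx = np.arange(block)
    # The known context lies above and to the left of the block.
    d = np.minimum(idx[:, None], idx[None, :]) + 1
    return gamma ** d

def weighted_l1(pred, target, weights):
    """Spatially weighted mean absolute error over the predicted block."""
    return float(np.mean(weights * np.abs(pred - target)))

def masked_gradient_penalty(grad, m, lam=10.0):
    """lam * (||grad * (1 - m)||_2 - 1)^2 for a single sample.

    grad: gradient of the critic with respect to the interpolated input
    m:    binary mask, 0 inside the predicted block and 1 elsewhere
    """
    norm = np.sqrt(np.sum((grad * (1.0 - m)) ** 2))
    return float(lam * (norm - 1.0) ** 2)

# Illustrative usage on a toy 8 x 8 input with a 4 x 4 predicted block.
w = spatial_weight_mask(4)
m = np.ones((8, 8))
m[4:, 4:] = 0
grad = np.zeros((8, 8))
grad[4, 4] = 1.0   # inside the predicted block: counted
grad[0, 0] = 5.0   # in the context: ignored by the mask
gp = masked_gradient_penalty(grad, m)
```

Note that the text uses the symbol m for both the l1 weight mask and the binary penalty mask; the sketch keeps them apart as `weights` and `m`.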

1.4. Training Strategy

  • The training dataset is the New York city library dataset, which consists of 2550 pictures of various sizes.
  • By traversing and cropping these pictures, a total of 2.4 million training images are obtained.
  • Unlike Zhu TMM’20, the original pixels fetched from the ground-truth images are used for training.
  • Only the luminance component is used.
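
The traversing-and-cropping step can be pictured as a sliding window over the luminance plane. The sketch below is a guess at the general idea only: the stride of 8, the BT.601 luma weights, and the function names are assumptions, not details from the paper.

```python
import numpy as np

def rgb_to_luma(rgb):
    """BT.601 luma from an H x W x 3 RGB array (assumed conversion)."""
    return rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114

def crop_patches(luma, patch=24, stride=8):
    """Slide a patch x patch window over the picture and collect the crops."""
    h, w = luma.shape
    crops = [luma[y:y + patch, x:x + patch]
             for y in range(0, h - patch + 1, stride)
             for x in range(0, w - patch + 1, stride)]
    if not crops:
        return np.empty((0, patch, patch), dtype=luma.dtype)
    return np.stack(crops)

# Illustrative usage on a small random "picture".
rng = np.random.default_rng(0)
picture = rng.integers(0, 256, size=(40, 56, 3)).astype(np.float32)
patches = crop_patches(rgb_to_luma(picture))
```

A small stride produces many overlapping crops per picture, which is one plausible way a few thousand pictures could yield millions of training images.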

1.5. Integration into HEVC

Illustration of the luma mode derivation.
  • The proposed mode is treated as an additional prediction mode alongside the 35 conventional intra prediction modes when the CU intra mode is chosen.
Illustration of the mode signaling for the luma modes.
  • One signaling bit indicates whether a conventional intra mode or the proposed mode is used.
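
One way to picture the extra signaling bit is as a flag chosen by rate-distortion cost. The sketch below is a simplification under stated assumptions: the function `choose_luma_mode`, the flag value 1 for the GAN mode, and the cost model J = D + λ·R are illustrative, and the real HM mode decision and entropy coding are far more involved.

```python
def choose_luma_mode(conv_cost, gan_cost, lam):
    """Pick the cheaper option by rate-distortion cost J = D + lam * R.

    conv_cost, gan_cost: (distortion, mode_bits) pairs for the best
    conventional intra mode and for the GAN mode. One flag bit is added
    to both candidates. Returns (flag, J), with flag 1 meaning GAN mode.
    """
    flag_bits = 1
    d_conv, r_conv = conv_cost
    d_gan, r_gan = gan_cost
    j_conv = d_conv + lam * (r_conv + flag_bits)
    j_gan = d_gan + lam * (r_gan + flag_bits)
    return (1, j_gan) if j_gan < j_conv else (0, j_conv)

# Illustrative usage: the GAN mode needs no further mode bits here.
decision = choose_luma_mode((100.0, 5), (90.0, 0), lam=2.0)
```

In this toy model the flag costs one bit even when a conventional mode wins, which is why the proposed mode has to pay for itself in reduced distortion or mode bits.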

2. Experimental Results

BD-Rate (%)
  • HM-16.15 is used with the all-intra configuration.
  • The proposed stage_2 strategy outperforms the stage_1 strategy in all test cases. On the luminance component, stage_2 achieves an average BD-rate reduction of 1.6%, while stage_1 achieves an average of 1.2%.
  • This demonstrates the effectiveness of the two-stage coarse-to-fine generator network.
Comparisons with SOTA approaches
  • The above SOTA approaches are dedicated to 8 × 8 block prediction.
  • For a fair comparison, the proposed approach is adapted: the GAN still predicts the 16 × 16 block, but only 8 × 8 blocks are allowed to use the GAN intra prediction. When it is used, the 8 × 8 block copies the pixels at its own location from the 16 × 16 prediction.
  • As shown above, the proposed approach achieves a better coding gain than previous similar works: IPFCN [15], IPCNN [17], and Spatial RNN [18–19].
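
The copy step used in this comparison, where an 8 × 8 block takes its pixels from the co-located part of the 16 × 16 prediction, amounts to a simple slice. The function name and the (row, col) indexing of the four sub-blocks are assumptions made for illustration.

```python
import numpy as np

def sub_block(pred16, row, col, size=8):
    """Copy the size x size sub-block at grid position (row, col)."""
    return pred16[row * size:(row + 1) * size,
                  col * size:(col + 1) * size].copy()

# Illustrative usage: a dummy 16 x 16 prediction with distinct values.
pred16 = np.arange(256).reshape(16, 16)
bottom_right = sub_block(pred16, 1, 1)
top_right = sub_block(pred16, 0, 1)
```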
Comparison with Zhu TMM’20
  • Although the BD-rate reduction of the proposed method is smaller than that of Zhu TMM’20, its encoder and decoder complexities are much lower.
