Review — Zhong ELECGJ’21: A GAN-Based Video Intra Coding (HEVC Intra)
Outperforms IPFCN, IPCNN, and Spatial RNN. Lower Complexity Than Zhu TMM’20
In this story, A GAN-Based Video Intra Coding, (Zhong ELECGJ’21), by Sun Yat-Sen University, Southern Marine Science and Engineering Guangdong Laboratory, and Peng Cheng Laboratory, is briefly reviewed. In this paper:
- GAN is used as a mapping from the adjacent reconstructed signals to the prediction unit, to enhance intra prediction accuracy.
This is a paper in 2021 ELECGJ, MDPI Journal of Electronics and Its Applications, with impact factor of 2.412 (2019). (Sik-Ho Tsang @ Medium)
Outline
- Proposed GAN
- Experimental Results
1. Proposed GAN
- The generator G predicts the coding block, while the discriminator D acts as a critic that distinguishes whether a block is genuine or generated.
- The input is a 24 × 24 picture in which the bottom-right 16 × 16 region is the block to be predicted, while the remaining pixels are original pixels.
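To make the input layout concrete, here is a minimal Python sketch that masks a 24 × 24 patch and builds the binary mask m used later in the loss (0 inside the block to be predicted, 1 on the context). It takes the predicted block as the bottom-right 16 × 16 region, which matches the 16 × 16 local-discriminator input in Section 1.2 and the GAN predicting 16 × 16 blocks in Section 2, leaving an 8-pixel context band above and to the left. Filling the unknown block with 0 and the helper name `build_input` are assumptions for illustration, not details from the paper.

```python
PATCH = 24           # full input size
BLOCK = 16           # bottom-right block to be predicted (assumed 16x16)
CTX = PATCH - BLOCK  # 8-pixel context band above and to the left

def build_input(patch, fill=0.0):
    """Hypothetical helper: return (masked_patch, m) for a 24x24 patch (list of rows).

    m is 0 inside the bottom-right block to be predicted and 1 on the known
    context pixels; unknown pixels are replaced by `fill` (an assumption).
    """
    if len(patch) != PATCH or any(len(row) != PATCH for row in patch):
        raise ValueError("expected a 24x24 patch")
    m = [[0 if (r >= CTX and c >= CTX) else 1 for c in range(PATCH)]
         for r in range(PATCH)]
    masked = [[patch[r][c] if m[r][c] else fill for c in range(PATCH)]
              for r in range(PATCH)]
    return masked, m

# Usage: a synthetic patch where every pixel equals its column index.
patch = [[float(c) for c in range(PATCH)] for _ in range(PATCH)]
masked, m = build_input(patch)
print(sum(v == 0 for row in m for v in row))  # 256 unknown pixels (16 * 16)
print(masked[0][23], masked[23][23])          # 23.0 0.0
```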
1.1. Generator
- A 2-stage coarse-to-fine generator is used.
- The coarse network shares the same parameters as the refinement network.
- Compared to “Generative image inpainting with contextual attention”, some downsampling layers and dilated convolutions are removed, since the input is a small patch rather than a whole picture.
- The contextual attention layer is also removed.
- Exponential Linear Unit (ELU) is used as the activation after each convolution layer, except the last one.
- The output of the last layer is clipped to [-1, 1].
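A small sketch of the two activation-related details above: ELU after the hidden convolutions and a plain clip to [-1, 1] at the output. The mapping between 8-bit luma and [-1, 1] (`to_unit_range` / `to_pixel`) is an assumption added only to show why the clip keeps predictions inside the valid pixel range; the paper's exact normalization is not stated here.

```python
import math

def elu(x, alpha=1.0):
    """Exponential Linear Unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def clip_output(x, lo=-1.0, hi=1.0):
    """Clip a raw last-layer value to [-1, 1] (no activation on the last layer)."""
    return max(lo, min(hi, x))

# Assumption: luma is mapped linearly to [-1, 1] for training,
# so a clipped output always maps back to a valid pixel value.
def to_unit_range(pixel, bit_depth=8):
    max_val = (1 << bit_depth) - 1
    return 2.0 * pixel / max_val - 1.0

def to_pixel(y, bit_depth=8):
    max_val = (1 << bit_depth) - 1
    return int(round((y + 1.0) / 2.0 * max_val))

print(elu(2.0), round(elu(-1.0), 4))              # 2.0 -0.6321
print(clip_output(1.7), clip_output(-3.0))        # 1.0 -1.0
print(to_pixel(clip_output(to_unit_range(128))))  # 128
```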
1.2. Discriminator
- Two discriminators are used: a global discriminator and a local discriminator.
- The global discriminator adopts the whole 24 × 24 picture as input to determine the overall coherence of the completed image, while the local discriminator takes just the 16 × 16 block to be predicted as input to enhance the regional consistency.
- All convolutions use a 5 × 5 kernel with a stride of 2.
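Each 5 × 5, stride-2 convolution roughly halves the spatial size, so the small 24 × 24 and 16 × 16 inputs only support a few such layers. A quick sketch of the arithmetic, assuming a padding of 2 and a hypothetical depth of three layers for both critics (neither is stated above):

```python
def conv_out(n, kernel=5, stride=2, pad=2):
    """Spatial output size of a convolution: floor((n + 2*pad - kernel) / stride) + 1."""
    return (n + 2 * pad - kernel) // stride + 1

def feature_sizes(n, num_layers):
    """Spatial sizes after each of num_layers 5x5 / stride-2 convolutions."""
    sizes = []
    for _ in range(num_layers):
        n = conv_out(n)
        sizes.append(n)
    return sizes

# Hypothetical depth of 3 layers; padding of 2 is also assumed.
print(feature_sizes(24, 3))  # global critic on 24x24 -> [12, 6, 3]
print(feature_sizes(16, 3))  # local critic on 16x16  -> [8, 4, 2]
```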
1.3. Loss Function
- A pixel-wise l1 loss is used instead of Mean Square Error (MSE).
- Since pixels closer to the known reference pixels have stronger spatial correlation with them, a spatially weighted l1 loss is introduced using a weight mask m.
- Wasserstein GAN (WGAN) is adopted to improve GAN training stability.
- More specifically, WGAN with Gradient Penalty (WGAN-GP) is used; WGAN-GP extends WGAN with a gradient penalty term.
- Since only the coding block at the bottom-right corner is predicted, the gradient penalty term is applied only to samples within the predicted block: the gradient is multiplied pixel-wise (⊙) by (1 − m) before its norm is taken, where m is a binary mask that takes the value 0 inside the bottom-right region.
- The overall adversarial loss is then built from the WGAN loss and the masked gradient penalty.
- (Please read Wasserstein GAN for more details.)
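For reference, the standard WGAN-GP critic loss is E[D(x̃)] − E[D(x)] + λ·E[(‖∇D(x̂)‖₂ − 1)²], with x̂ = εx + (1 − ε)x̃ sampled between real and generated samples; the masked variant described above uses ‖∇D(x̂) ⊙ (1 − m)‖₂ instead. The sketch below illustrates both loss ingredients under stated assumptions: the weight mask uses the γ^l spatial discount from “Generative image inpainting with contextual attention” (γ = 0.99, with l taken here as the distance to the nearest known row above or column to the left), λ = 10 (the WGAN-GP default), and a toy linear critic stands in for D so its gradient is known in closed form. None of these are confirmed as the paper's exact choices.

```python
import math

GAMMA = 0.99   # assumed discount factor (value used by Yu et al. for inpainting)
LAMBDA = 10.0  # assumed penalty weight (default in the WGAN-GP paper)
PATCH, BLOCK = 24, 16
CTX = PATCH - BLOCK

def discount_weights(block=BLOCK, gamma=GAMMA):
    """Hypothetical weight mask: gamma ** l, with l the distance from pixel (i, j)
    of the block to the nearest known row above or known column to the left."""
    return [[gamma ** min(i + 1, j + 1) for j in range(block)] for i in range(block)]

def weighted_l1(pred, target, weights):
    """Spatially weighted l1 loss, averaged over the predicted block."""
    total = sum(w * abs(p - t)
                for w_row, p_row, t_row in zip(weights, pred, target)
                for w, p, t in zip(w_row, p_row, t_row))
    return total / (len(weights) * len(weights[0]))

def masked_gradient_penalty(grad, m, lam=LAMBDA):
    """lam * (||grad * (1 - m)||_2 - 1) ** 2, where grad is dD/dx_hat at one
    interpolated sample and m is 0 inside the predicted block, 1 elsewhere."""
    sq = sum((g * (1 - mv)) ** 2
             for g_row, m_row in zip(grad, m)
             for g, mv in zip(g_row, m_row))
    return lam * (math.sqrt(sq) - 1.0) ** 2

# Usage with a toy linear critic D(x) = sum(c * x): its gradient is c at every
# x_hat, so the interpolation x_hat = eps * x + (1 - eps) * x_tilde drops out.
m = [[0 if r >= CTX and c >= CTX else 1 for c in range(PATCH)] for r in range(PATCH)]
critic = [[1.0 / 16] * PATCH for _ in range(PATCH)]
print(masked_gradient_penalty(critic, m))       # 0.0: masked gradient norm is exactly 1
w = discount_weights()
print(round(w[0][0], 4), round(w[15][15], 4))  # 0.99 0.8515
```

Only the 16 × 16 block contributes to the penalty here; the context pixels are zeroed by (1 − m), which is exactly the restriction described above.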
1.4. Training Strategy
- The training dataset is the New York City Library dataset, which consists of 2550 pictures of various sizes.
- By traversing and cropping these pictures, a total of 2.4 million training samples are obtained.
- Unlike Zhu TMM’20, the original pixels taken from the ground-truth images are used for training.
- Only the luminance component is used.
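A rough sketch of how 24 × 24 luma training patches could be produced by traversing and cropping each picture. The BT.601 luma weights (for RGB source pictures), the stride of 8, and the helper names are assumptions for illustration only; the review does not say how the 2.4 million patches were sampled.

```python
PATCH = 24

def rgb_to_luma(r, g, b):
    """BT.601 luma (Y') from full-range 8-bit R'G'B' values (assumed conversion)."""
    return 0.299 * r + 0.587 * g + 0.114 * b

def crop_positions(height, width, patch=PATCH, stride=8):
    """Top-left corners of all patch x patch windows that fit inside the image."""
    return [(y, x)
            for y in range(0, height - patch + 1, stride)
            for x in range(0, width - patch + 1, stride)]

def crop(luma, y, x, patch=PATCH):
    """Cut one patch x patch window out of a 2-D luma plane (list of rows)."""
    return [row[x:x + patch] for row in luma[y:y + patch]]

# Usage: a 40x48 synthetic white image gives 3 x 4 = 12 windows at stride 8.
h, w = 40, 48
luma = [[rgb_to_luma(255, 255, 255)] * w for _ in range(h)]
pos = crop_positions(h, w)
print(len(pos), len(crop(luma, *pos[-1])))  # 12 24
```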
1.5. Integration into HEVC
- The proposed mode is treated as an additional prediction mode alongside the 35 conventional intra prediction modes at the CU level.
- One signaling bit indicates whether a conventional intra mode or the proposed mode is used.
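To show what the extra mode means at the encoder, here is a hypothetical rate-distortion decision in Python: the best conventional candidate competes with the GAN prediction using the usual cost J = D + λ·R, and both paths pay the one signaling bit. Function names, the fixed one-bit flag cost, and all numbers are illustrative assumptions, not HM code.

```python
def rd_cost(distortion, bits, lam):
    """Lagrangian rate-distortion cost J = D + lambda * R."""
    return distortion + lam * bits

def choose_intra_mode(conv_candidates, gan_distortion, gan_bits, lam, flag_bits=1):
    """Pick between the best conventional intra mode and the GAN mode.

    conv_candidates: list of (mode_index, distortion, bits) tuples for the
    conventional modes (up to 35 in HEVC). Every candidate, including the GAN
    mode, is charged flag_bits for the hypothetical conventional/GAN flag.
    Returns ((kind, mode_index), cost).
    """
    best, best_cost = None, float("inf")
    for mode, dist, bits in conv_candidates:
        cost = rd_cost(dist, bits + flag_bits, lam)
        if cost < best_cost:
            best, best_cost = ("conv", mode), cost
    gan_cost = rd_cost(gan_distortion, gan_bits + flag_bits, lam)
    if gan_cost < best_cost:
        return ("gan", None), gan_cost
    return best, best_cost

# Usage with made-up numbers (HEVC numbering: 0 = planar, 1 = DC, 26 = vertical).
candidates = [(0, 1000.0, 20), (1, 950.0, 24), (26, 900.0, 30)]
print(choose_intra_mode(candidates, 850.0, 28, lam=10.0))  # (('gan', None), 1140.0)
```

Because the flag is charged on both paths in this model, it does not change which intra mode wins inside a CU; it raises every intra CU's cost by λ, which is the signaling overhead the GAN mode has to earn back.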
2. Experimental Results
- HM-16.15 is used with the All Intra configuration.
- The proposed stage_2 strategy outperforms the stage_1 strategy in all test cases, achieving an average BD-rate reduction of 1.6% on the luminance component versus 1.2% for stage_1.
- This demonstrates the effectiveness of the two-stage coarse-to-fine generator network.
- The state-of-the-art approaches compared here are dedicated to 8 × 8 block prediction.
- The proposed approach is therefore adapted: the GAN still predicts a 16 × 16 block, but only 8 × 8 blocks can use the GAN intra prediction. When it is used, the 8 × 8 block copies the co-located pixels from the predicted 16 × 16 block.
- The proposed approach achieves a better coding gain than previous similar works: IPFCN [15], IPCNN [17], and Spatial RNN [18–19].
- Though its BD-rate reduction is smaller than that of Zhu TMM’20, the proposed method has much lower encoder and decoder complexity.
Reference
[2021 ELECGJ] [Zhong ELECGJ’21]
A GAN-Based Video Intra Coding
Generative Adversarial Network (GAN)
Image Synthesis [GAN] [CGAN] [LAPGAN] [DCGAN] [Pix2Pix]
Super Resolution [SRGAN & SRResNet] [EnhanceNet] [ESRGAN]
Blur Detection [DMENet]
Camera Tampering Detection [Mantini’s VISAPP’19]
Video Coding [VC-LAPGAN] [Zhu TMM’20] [Zhong ELECGJ’21]
Codec Intra Prediction
JPEG [MS-ROI] [Baig JVICU’17]
HEVC [Xu VCIP’17] [Song VCIP’17] [Li VCIP’17] [Puri EUSIPCO’17] [IPCNN] [IPFCN] [HybridNN, Li ICIP’18] [Liu MMM’18] [CNNAC] [Li TCSVT’18] [Spatial RNN] [PS-RNN] [AP-CNN] [MIP] [Wang VCIP’19] [IntraNN] [CNNAC TCSVT’19] [CNN-CR] [CNNMC Yokoyama ICCE’20] [PNNS] [CNNCP] [Zhu TMM’20] [Zhong ELECGJ’21]
VVC [CNNIF & CNNMC] [Brand PCS’19] [Bonnineau ICASSP’20] [Santamaria ICMEW’20] [Zhu TMM’20]