Review — SEFCNN: A Switchable Deep Learning Approach for In-loop Filtering in Video Coding (HEVC Filtering)
Using Network Concepts of VDSR and SENet, Outperforms VRCNN, RHCNN, and MLSDRN
In this paper, A Switchable Deep Learning Approach for In-loop Filtering in Video Coding (SEFCNN), by Hangzhou Normal University, Visionular Inc., and Chang’an University, is reviewed. In this paper:
- A Squeeze-and-Excitation Filtering CNN (SEFCNN) is designed with two subnets: a Feature EXtracting (FEX) subnet and a Feature ENhancing (FEN) subnet.
- Different models, built from the FEX and FEN subnets, are trained for different QPs and frame types.
- Finally, an adaptive enhancing mechanism is proposed which is switchable between the CNN-based and the conventional methods.
This is a paper in 2020 TCSVT where TCSVT has a high impact factor of 4.133. (Sik-Ho Tsang @ Medium)
Outline
- SEFCNN: Network Architecture
- Model Training Strategy
- Experimental Results
1. SEFCNN: Network Architecture
- SEFCNN acts as an optional in-loop filter in H.265/HEVC.
- SEFCNN is comprised of two subnets, i.e., the low-level Feature EXtracting (FEX) net and the high-level Feature ENhancing (FEN) net.
- Each subnet can be invoked and trained individually.
- The long identity skip connection is then directly added to the residuals to generate the output image.
1.1. Subnet FEX
- The network is inspired by the success of VDSR.
- The input data go through N stacked convolutional layers and are transformed into high-level features.
- For each convolutional layer, the kernel size is set to 3×3 and 64 filters are used (a minimal sketch is given after this list).
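Based on the description above, the FEX subnet is essentially a VDSR-style stack of 3×3 convolutions. Below is a minimal PyTorch sketch, assuming ReLU activations and an illustrative depth; the class and argument names are mine, not from the paper's code.

```python
import torch
import torch.nn as nn

class FEX(nn.Module):
    """Low-level Feature EXtracting subnet: a VDSR-style stack of 3x3 convs, 64 filters each."""
    def __init__(self, num_layers=8, in_channels=1, channels=64):
        super().__init__()
        layers = [nn.Conv2d(in_channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(num_layers - 1):
            layers += [nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        # x: reconstructed (unfiltered) frame; output: high-level feature maps
        return self.body(x)
```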
1.2. Subnet FEN
- In FEN, the left branch is for residual learning and the right branch serves as skip connection.
- At the beginning of the left branch, 3 convolutional layers are cascaded to obtain advanced channel features.
- Afterwards, the Squeeze-and-Excitation (SE) block, originating from SENet, is applied to further boost the representational power of the network.
- Accordingly, it consists of two steps, squeeze and excitation.
- The squeeze step is performed by applying Global Average Pooling (GAP) on the input U, i.e., each channel u_k is reduced to a single value z_k by averaging over its H×W spatial positions.
- Next, the excitation process is designed to emphasize the useful channels by adjusting their corresponding weight parameters. 2 convolutional layers are first introduced for non-linear mapping.
- Finally, each channel of input U is weighted and recalibrated by s_k.
The SE block can efficiently sort out and strengthen the informative feature maps.
- (If interested, please feel free to visit SENet.)
- Finally, the recalibrated features are added to the right branch, where b_k is calculated by a 1×1 convolution (see the sketch after this list).
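Continuing the sketch above, here is how the FEN subnet and the overall SEFCNN composition might look in PyTorch: an SE block (GAP squeeze, two-layer excitation with sigmoid gating, channel recalibration), a residual left branch, a 1×1-conv right branch, and a long identity skip from the input frame. The reduction ratio, exact layer counts, and the final 3×3 reconstruction layer are my assumptions for illustration only.

```python
class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: GAP squeeze, two-layer excitation, channel recalibration."""
    def __init__(self, channels=64, reduction=4):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, kernel_size=1)

    def forward(self, u):
        z = u.mean(dim=(2, 3), keepdim=True)                   # squeeze: GAP over H x W
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # excitation: channel weights s_k
        return u * s                                           # recalibrate each channel of U by s_k

class FEN(nn.Module):
    """High-level Feature ENhancing subnet: residual left branch + 1x1-conv right branch."""
    def __init__(self, channels=64):
        super().__init__()
        self.left = nn.Sequential(                             # left branch: residual learning
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            SEBlock(channels),
        )
        self.right = nn.Conv2d(channels, channels, kernel_size=1)  # right branch: b_k via 1x1 conv

    def forward(self, f):
        return self.left(f) + self.right(f)

class SEFCNN(nn.Module):
    """FEX followed by FEN, a reconstruction conv, and a long identity skip from the input."""
    def __init__(self, channels=64):
        super().__init__()
        self.fex = FEX(channels=channels)
        self.fen = FEN(channels=channels)
        self.tail = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x):
        residual = self.tail(self.fen(self.fex(x)))
        return x + residual                                    # long identity skip connection
```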
2. Model Training Strategy
2.1. Specific Models for Different QPs
- The reconstructed frame with higher QP is usually of lower quality and contains more artifacts.
- Therefore, different models are trained for different QP levels.
- Global Model: One model for all QPs.
- Separate Model: One model for each QP, which outperforms the global model.
2.2. Hierarchical CNN Structures for Different QPs
- In brief, according to the above results, in high-bitrate scenarios where QP is equal to 22 or 27, only subnet FEX is involved.
- In low-bitrate scenarios where QP is equal to 32 or 37, the entire SEFCNN is launched.
- Relative to the full SEFCNN, subnet FEX has fewer layers and parameters, which effectively reduces the computational burden (a selection sketch is given after this list).
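A minimal sketch of this QP-dependent structure selection, assuming the QP thresholds stated above; the standalone FEX head (a reconstruction conv plus long skip) is my assumption for how FEX would be used on its own.

```python
class FEXFilter(nn.Module):
    """FEX used alone as a filter (assumed head): FEX features -> one plane + long skip."""
    def __init__(self, channels=64):
        super().__init__()
        self.fex = FEX(channels=channels)
        self.tail = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x):
        return x + self.tail(self.fex(x))

def build_filter_for_qp(qp):
    # QP 22 or 27 (high bitrate): only subnet FEX is involved.
    # QP 32 or 37 (low bitrate): the entire SEFCNN is launched.
    return FEXFilter() if qp <= 27 else SEFCNN()
```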
2.3. Hierarchical CNN Models for Different Frame Types
- The predicted samples of an I frame are obtained from intra prediction, which is characterized by textures and directions.
- In contrast, the predicted samples of P and B frames are obtained from motion estimation.
- In addition to predicted values, the residual values of I, P, and B frames also carry different characteristics because different frame types employ different coding tools in the forward/inverse transform and quantization processes. It is apparent that their fitting functions on the in-loop filtering problem are different.
- Training a separate model for the P frame achieves an average 0.036 dB PSNR gain and the corresponding bitrate is reduced by 1.851%, whereas for the B frame, the gain is 0.055 dB and the bitrate also declines slightly.
- At QP = 27, sharing a single model is better.
When QP is 37 or 32, an individual model is trained for each frame type in pursuit of higher performance. When QP is 27 or 22, the I-frame model is shared among I, P, and B frames (a selection sketch is given below).
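The frame-type sharing rule above could be summarized as a small lookup; the key names below are hypothetical and only illustrate which trained weights would be loaded.

```python
def model_key(qp, frame_type):
    """Pick which trained model to load for a frame (illustrative; keys are hypothetical)."""
    if qp >= 32:
        # QP 37 / 32: an individual SEFCNN model per frame type (I, P or B).
        return f"sefcnn_qp{qp}_{frame_type}"
    # QP 27 / 22: the I-frame model (FEX only) is shared among I, P and B frames.
    return f"fex_qp{qp}_I"
```

For example, model_key(37, "B") maps to a B-frame-specific model, while model_key(22, "P") falls back to the shared I-frame model.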
2.4. Switchable Enhancing at CU Level
- The essential idea is to enhance only the frames that will seldom be referred to by future frames, or the regions within a frame whose predicted samples are not enhanced.
- The use of SEFCNN depends on the relative temporal position in the frame order under various configurations; whether it is enabled is decided based on extensive experimental results. Please refer to the paper for much more detail (a simplified decision sketch is given below).
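As a rough illustration of a CU-level switch, the sketch below enables CNN enhancement for a block only when it brings the reconstruction closer to the original; the paper's actual decision mechanism and signaling are more elaborate than this.

```python
import torch.nn.functional as F

def cu_enhance_flag(cnn_block, conventional_block, original_block):
    """Encoder-side decision (illustrative): enable CNN enhancement for this CU
    only if it yields lower distortion than the conventional in-loop filters."""
    cnn_mse = F.mse_loss(cnn_block, original_block)
    conv_mse = F.mse_loss(conventional_block, original_block)
    return bool(cnn_mse < conv_mse)   # the flag would then be signaled in the bitstream
```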
3. Experimental Results
3.1. Comparison with VRCNN
- In AI configuration, SEFCNN obtains 9.96% BD-rate reduction, whereas there is only 3.03% gain for VRCNN.
- For inter coding, 8.04% and 7.60% BD-rate reductions are achieved by SEFCNN in LDP and RA respectively, whereas the corresponding values for VRCNN are 3.38% and 4.85% (the BD-rate metric itself is sketched after this list).
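For readers unfamiliar with the metric, BD-rate numbers like these come from the Bjøntegaard delta measurement; a minimal NumPy sketch (not the paper's tooling) is shown below.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate in %: average bitrate difference at equal PSNR,
    from four (bitrate, PSNR) points per codec (e.g. QP 22/27/32/37)."""
    p_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)   # log-rate as a cubic in PSNR
    p_t = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))               # overlapping PSNR range
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1) * 100.0                # negative means bitrate saving
```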
3.2. Comparison with RHCNN
- SEFCNN surpasses RHCNN by 0.306 dB, 0.154 dB, and 0.181 dB, let alone the bitrate reduction.
- Again, the proposed SEFCNN outperforms RHCNN [39].
3.3. Comparison with MLSDRN
- Again, the proposed SEFCNN outperforms MLSDRN [40].
3.4. Complexity Analysis
- The encoding time of SEFCNN is no more than twice that of HM16.9 at low video resolutions such as 416×240.
- As the video resolution increases to 1280×720, the extra running time is no more than 38% and 19% in intra and inter coding, respectively.
- Decoding time measurements are also reported in the paper.
There are many more experiments in the paper. If interested, please feel free to read the paper.
Reference
[2020 TCSVT] [SEFCNN]
A Switchable Deep Learning Approach for In-loop Filtering in Video Coding
Codec Filtering
JPEG [ARCNN] [RED-Net] [DnCNN] [Li ICME’17] [MemNet] [MWCNN] [CAR-DRN]
JPEG-HDR [Han VCIP’20]
HEVC [Lin DCC’16] [IFCNN] [VRCNN] [DCAD] [MMS-net] [DRN] [Lee ICCE’18] [DS-CNN] [CNNF] [RHCNN] [VRCNN-ext] [S-CNN & C-CNN] [MLSDRN] [ARTN] [Double-Input CNN] [CNNIF & CNNMC] [B-DRRN] [Residual-VRN] [Liu PCS’19] [DIA_Net] [RRCNN] [QE-CNN] [MRRN] [Jia TIP’19] [EDCNN] [VRCNN-BN] [MACNN] [Yue VCIP’20] [SEFCNN]
3D-HEVC [RSVE+POST]
AVS3 [Lin PCS’19] [CNNLF]
VVC [AResNet] [Lu CVPRW’19] [Wang APSIPA ASC’19] [ADCNN] [PRN] [DRCNN] [Zhang ICME’20] [MGNLF] [RCAN+PRN+] [Nasiri VCIP’20]