Published in The Startup

Review — DeFusionNET: Defocus Blur Detection via Recurrently Fusing and Refining Discriminative Multi-scale Deep Features (Blur Detection)

Extension of DeFusionNet (CVPR’19). Outperforms DHDE, DBM, DMENet, BTBNet, and DeFusionNet (CVPR’19)

In this story, DeFusionNET: Defocus Blur Detection via Recurrently Fusing and Refining Discriminative Multi-scale Deep Features, DeFusionNET, by China University of Geosciences, National University of Defense Technology, Peng Cheng Laboratory, University of Wollongong, Southwestern University of Finance and Economics, University of Sydney, and University of Salento, is reviewed.

  • This paper is the extension of DeFusionNet in CVPR’19. Since it is an extension, I will mainly describe the new contributions in this paper.
  • An enhanced feature fusing and refining module (FFRM) is proposed, in which a Channel Attention Module (CAM) and a Feature Adaptation Module (FAM) are used.
  • A more challenging new dataset is also proposed.

This is a 2020 TPAMI paper. TPAMI has a high impact factor of 17.861. (Sik-Ho Tsang @ Medium)

Outline

  1. Brief Review of DeFusionNET: Network Architecture
  2. Feature Fusing and Refining Module (FFRM)
  3. Channel Attention Module (CAM)
  4. Feature Adaptation Module (FAM)
  5. A New Dataset (CTCUG)
  6. Ablation Study
  7. Comparison with SOTA Approaches

1. Brief Review of DeFusionNET: Network Architecture

DeFusionNET: Network Architecture
  • Since the overall architecture is quite close to that of DeFusionNet in CVPR’19, only a brief review is given here.
  • The dark gray block represents the proposed FFRM module. For a given image, its multi-scale features are first extracted using the basic VGG network. Then the features from shallow layers and deep layers are fused as FSHF and FSEF, respectively.
  • Considering the complementary information between FSHF and FSEF, the features of deep and shallow layers are refined in a cross-layer manner.
  • Feature fusion and refinement are performed step by step in a recurrent manner to alternately refine FSHF, FSEF, and the features at each layer (the number of recurrent steps is empirically set to 3).
  • In addition, a deep supervision mechanism is imposed at each step, and the prediction results of each layer are fused to obtain the final defocus blur map.
  • The loss function is the same as the DefusionNet in CVPR’19.
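The cross-layer recurrence described above can be sketched in a few lines of PyTorch. This is a minimal illustration of the idea only, assuming all features have already been fused to a common shape; the actual network uses learned convolutional fusion (the FFRM) rather than plain averaging and addition.

```python
import torch

def recurrent_refine(shallow_feats, deep_feats, steps=3):
    """Sketch of the recurrent fuse-and-refine loop (illustrative, not the
    paper's exact operations). At each step, the fused shallow features (FSHF)
    and fused deep features (FSEF) are recomputed, then each group of layers
    is refined with the fusion from the *other* group (cross-layer manner)."""
    for _ in range(steps):
        fshf = torch.stack(shallow_feats).mean(0)  # fused shallow (detail) features
        fsef = torch.stack(deep_feats).mean(0)     # fused deep (semantic) features
        # Shallow layers absorb semantics; deep layers absorb fine detail.
        shallow_feats = [f + fsef for f in shallow_feats]
        deep_feats = [f + fshf for f in deep_feats]
    return shallow_feats, deep_feats
```

The loop makes the complementarity explicit: each recurrent step pushes semantic context down into the shallow layers and detail up into the deep layers.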

2. Feature Fusing and Refining Module (FFRM)

Feature Fusing and Refining Module (FFRM)
  • This module is enhanced compared to the one in DeFusionNet (CVPR’19).
  • A Channel Attention Module (CAM) and a Feature Adaptation Module (FAM) are added to further enhance FFRM.
  • Since different scales of receptive views produce the features with different extents of discrimination, a channel attention module (CAM) is added after the concatenated feature maps to select more discriminative features.
  • Then a convolution layer with a 1×1 kernel is applied to the discriminative concatenated feature maps to generate FSHF. The same applies to FSEF.
  • In order to narrow the gap between shallow and deep layers, the FSEF and FSHF are passed to a feature adaptation module (FAM) at each recurrent step.

3. Channel Attention Module (CAM)

Channel Attention Module (CAM)
  • Both global average pooling (GAP) and global maximum pooling (GMP) are used to aggregate global information, in a dual manner.
  • Firstly, GAP and GMP are leveraged to convert channel-wise global spatial features into vector descriptors, respectively.
  • The GAP captures the size of blurry regions, while GMP focuses on the defocus intensity.
  • The two attention vectors are merged using element-wise summation and passed through a simple gating mechanism with a sigmoid function to produce the final channel weights.
  • The final channel-wise feature maps are then obtained by weighting the input features with these channel weights.
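The CAM steps above can be sketched as a small PyTorch module. The reduction ratio and the shared MLP structure are assumptions (a common channel-attention design); the paper's exact layer sizes may differ.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention with dual pooling: GAP captures the size of blurry
    regions, GMP focuses on the defocus intensity. A sketch, not the paper's
    exact configuration."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared gating MLP applied to both pooled vector descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        gap = self.mlp(x.mean(dim=(2, 3), keepdim=True))  # global average pooling
        gmp = self.mlp(x.amax(dim=(2, 3), keepdim=True))  # global maximum pooling
        w = self.sigmoid(gap + gmp)                        # element-wise sum + gate
        return x * w                                       # re-weighted channel-wise maps
```

Because the channel weights lie in (0, 1), the module can only attenuate channels, never amplify them, which is how less discriminative channels are suppressed.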

4. Feature Adaptation Module (FAM)

Feature Adaptation Module (FAM)
  • There are some contradictory responses across different layers, which would dilute the semantic information when adding FSHF to deep layers, and damage the details when adding FSEF to shallow layers.
  • A FAM is designed to adjust FSEF and FSHF before feature refining.
  • Two convolution layers marked in the light green box are used to learn the feature weight of each position, and the FSEF/FSHF are weighted by the learned weight maps.
  • The first convolution layer is used for feature extraction.
  • The second convolution layer is used for learning the feature weight of each position.
  • In such a manner, FSEF/FSHF can be re-scaled by the element-wise product, i.e., the complementary information between different feature layers is enhanced while the contradictory information is effectively reduced.
  • After that, the adjusted features are added to original FSEF/FSHF for generating the output of FAM, which is used for cross-layer feature refining.
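A minimal sketch of the FAM, following the two-convolution description above. Kernel sizes and the sigmoid on the weight branch are assumptions; the paper may use different hyperparameters.

```python
import torch
import torch.nn as nn

class FeatureAdaptation(nn.Module):
    """Feature adaptation sketch: one conv extracts features, a second conv
    learns a per-position weight map; the input is re-scaled by element-wise
    product and added back residually."""
    def __init__(self, channels):
        super().__init__()
        self.extract = nn.Conv2d(channels, channels, 3, padding=1)  # feature extraction
        self.weight = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),            # per-position weights
            nn.Sigmoid(),
        )

    def forward(self, f):
        w = self.weight(self.extract(f))  # learned weight map in (0, 1)
        # Element-wise product suppresses contradictory responses; the
        # residual add preserves the original FSEF/FSHF content.
        return f + f * w
```

The residual formulation means the FAM defaults to passing features through unchanged and only modulates positions where the learned weights are large.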

5. A New Dataset (CTCUG)

  • In most images of the previous two datasets, the foreground objects are in focus while the background is blurry.
  • This may bias detectors toward object regions and reduce the task to foreground/background segmentation.
  • Also, the images have nearly no complex background or foreground.
Some example images and their annotated ground truths of the CTCUG dataset
  • A new dataset is collected, which contains 150 images with manual pixel-wise annotations, namely CTCUG.
  • (I guess CTCUG should stand for Chang Tang, China University of Geosciences.)
  • The background is in-focus while the foreground regions are blurry.
  • For some scenes, a pair of images is taken with different defocus areas.
  • For the same class of objects, some of them are in-focus while the others are out-of-focus.
  • The images are with complex background and the in-focus area has low contrast.

6. Ablation Study

6.1. Effectiveness of FFRM

Visual comparison of detected defocus blur maps
  • As can be seen, with CAM and FAM, DeFusionNet can focus on the most discriminative features and weaken the influence of noisy features, which produces cleaner detection results.
Ablation analysis using F-measure, MAE and AUC scores
  • Without CAM or FAM, i.e. the wFAMwoCAM and wCAMwoFAM variants, the performance drops compared to the full DeFusionNet.

6.2. Effectiveness of GAP and GMP

Visual comparison of detected defocus blur maps generated by DeFusionNet with/without GMP/GAP
  • DeFusionNet without GAP (denoted as noGAP) can suppress some noisy regions, but the detected defocus blur regions are incomplete.
  • Conversely, DeFusionNet without GMP (denoted as noGMP) can detect the complete blurry regions, but the results are mixed with some noisy regions.

6.3. Effectiveness of the Final Defocus Maps Fusion

  • The final outputs of the six layers are denoted as DeFusionNet_O1, DeFusionNet_O2, DeFusionNet_O3, DeFusionNet_O4, DeFusionNet_O5, and DeFusionNet_O6.
  • The fused result is much better than any single-layer output.
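The fusion of the six side outputs can be illustrated as below. The paper states that the per-layer predictions are fused into the final defocus blur map; a simple average is used here as an assumption, since the exact fusion operator is not restated in this review.

```python
import torch

def fuse_defocus_maps(side_outputs):
    """Fuse the per-layer side outputs (O1..O6) into a final defocus blur map.
    Plain averaging is an assumption; a learned 1x1 conv over the stacked
    maps would be an equally plausible fusion."""
    return torch.stack(side_outputs, dim=0).mean(dim=0)
```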

6.4. Effectiveness of the Times of Recurrent Steps

Visual results at different time steps
Ablation analysis of the times of recurrent steps
  • Similar to DeFusionNet in CVPR’19, DeFusionNet obtains relatively stable results when the number of recurrent steps is 3.
  • Thus, the number of recurrent steps is set to 3 as a tradeoff between effectiveness and efficiency.

7. Comparison with SOTA Approaches

7.1. Quantitative Comparison

Quantitative comparison of F-measure, MAE and AUC scores
  • DeFusionNet consistently performs favorably against other methods, such as DHDE, DBM, DMENet, BTBNet, and DeFusionNet (CVPR’19), on the three datasets.
Precision-recall curves, F-measure curves and ROC curves on Shi’s dataset
Precision-recall curves, F-measure curves and ROC curves on DUT dataset
Precision-recall curves, F-measure curves and ROC curves on CTCUG dataset
  • For PR curves, F-measure curves and ROC curves on three datasets, DeFusionNet also consistently outperforms other counterparts.

7.2. Qualitative Comparison

Visual comparison of detected defocus blur maps generated from different methods
  • Since both DBM and BTBNet rely heavily on high-level semantic information, their results lose a large amount of fine detail at region boundaries.
  • As for DeFusionNet, both high-level semantic information and low-level details are fully captured. Therefore, it obtains better results with complete blur regions.

7.3. Running Efficiency

  • DeFusionNet’s average running time per image on the three datasets is 0.097s, 0.059s, and 0.068s, respectively, while BTBNet needs about 25s to generate the defocus blur map for a testing image with 320×320 pixels.

7.4. Convergence Property of the Training Process

Training loss with different iteration times
  • The training loss becomes stable after about 9000 iterations.
  • Therefore, the learning process of the network is stopped after 10k iterations for reliable estimation.
