Review — DeFusionNET: Defocus Blur Detection via Recurrently Fusing and Refining Multi-scale Deep Features (Blur Detection)

Outperforms Park CVPR’17 / DHCF / DHDE and BTBNet

Sik-Ho Tsang
The Startup
6 min read · Jan 9, 2021


Some challenging cases for defocus blur detection

In this story, DeFusionNET: Defocus Blur Detection via Recurrently Fusing and Refining Multi-scale Deep Features (DeFusionNET), by China University of Geosciences, Zhejiang Normal University, National University of Defense Technology, and the University of Sydney, is reviewed. In this paper:

  • A fully convolutional network is used to extract multi-scale deep features. The features from different layers are fused into shallow features and semantic features.
  • The feature fusing and refining are carried out in a recurrent manner. Finally, the outputs of all layers at the last recurrent step are fused to obtain the final defocus blur map.

This is a paper in 2019 CVPR with 20 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. DeFusionNET: Network Architecture
  2. Feature Fusing and Refining Module (FFRM)
  3. Defocus Maps Fusing
  4. Ablation Analysis
  5. Experimental Results

1. DeFusionNET: Network Architecture

DeFusionNET: Network Architecture
  • For a given image, its multi-scale features are first extracted using the basic VGG network.
  • Then the features from the shallow layers and the deep layers are fused into FSHF and FSEF, respectively. Considering the complementary information between FSHF and FSEF, they are used to refine the features of the deep and shallow layers in a cross-layer manner.
  • The feature fusion and refinement are performed step by step in a recurrent manner to alternately refine FSHF, FSEF, and the features at each layer. The number of recurrent steps is 3.
  • In addition, the deep supervision mechanism is imposed at each step, and the prediction results of each layer are fused to obtain the final defocus blur map. A sketch of this loop follows below.
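
Before diving into the components, here is a minimal PyTorch-style sketch of how the recurrent fuse-and-refine loop fits together. The module names and exact wiring are my assumptions for illustration, not the authors' released code:

```python
def defusionnet_forward(features, fuse_shallow, fuse_deep, refiners, predictors,
                        m=3, steps=3):
    """High-level sketch of one DeFusionNET forward pass (assumed wiring).

    features   : list of n multi-scale VGG feature maps, shallow -> deep
    fuse_shallow, fuse_deep : modules producing FSHF / FSEF from a feature group
    refiners   : per-layer modules refining a layer with the cross-layer fused features
    predictors : per-layer 1x1 convs emitting a side defocus map (deep supervision)
    """
    side_maps = []
    for _ in range(steps):                       # recurrent fusing and refining
        fshf = fuse_shallow(features[:m])        # fused shallow features (FSHF)
        fsef = fuse_deep(features[m:])           # fused semantic features (FSEF)
        # cross-layer refinement: shallow layers take FSEF, deep layers take FSHF
        features = [r(f, fsef if i < m else fshf)
                    for i, (r, f) in enumerate(zip(refiners, features))]
        side_maps = [p(f) for p, f in zip(predictors, features)]
    return side_maps                             # fused into the final map (Section 3)
```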

2. Feature Fusing and Refining Module (FFRM)

Feature fusing and refining module (FFRM)
  • The pre-trained VGG16 model is used as backbone feature extraction network.
  • There are n layers in total; the first m layers are regarded as shallow layers and the remaining ones as deep layers.
  • Specifically, conv1_2, conv2_2, conv3_3, conv4_3, conv5_3 and pool5 of the VGG network are used. n=6 and m=3. In order to enhance the discrimination capability of feature maps at each layer, two more convolutional layers are appended.
  • The feature maps at the shallow layers are first upsampled to the size of the input image by using the deconvolution operation, and are concatenated together.
  • Then, a convolution layer with a 1×1 kernel is used to generate the fused shallow features (FSHF): FSHF = Conv_{1×1}(Concat(up(F^1), …, up(F^m))).
  • Similarly, the high-level semantic features at the deep layers are fused to form the fused semantic features (FSEF): FSEF = Conv_{1×1}(Concat(up(F^{m+1}), …, up(F^n))). A code sketch of this fusion is given below.
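
Under these definitions, FSHF (and, symmetrically, FSEF) could be formed roughly as below. This is a minimal sketch: the channel counts are assumptions, and bilinear upsampling stands in for the paper's deconvolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFuser(nn.Module):
    """Sketch of FSHF/FSEF generation: upsample each layer's features to the
    input-image size, concatenate them, and fuse with a 1x1 convolution."""

    def __init__(self, in_channels, out_channels=128):  # channel sizes assumed
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, feats, size):
        # resize every feature map to the input-image resolution
        up = [F.interpolate(f, size=size, mode='bilinear', align_corners=False)
              for f in feats]
        return self.conv1x1(torch.cat(up, dim=1))   # fuse via 1x1 convolution
```

With m = 3, one such fuser would take the conv1_2 / conv2_2 / conv3_3 features to produce FSHF, and a second one would take the conv4_3 / conv5_3 / pool5 features to produce FSEF.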

FSHF encodes the fine details while FSEF captures more semantic information of image contents.

As the features of shallow layers contain fine details but lack semantic information, FSEF can be used to help them better locate semantic defocus blur regions. Similarly, as the features of deep layers capture semantic information but lack fine details, FSHF can be used to promote the preservation of fine details.

  • However, directly fusing FSHF and FSEF would not only provide wrong guidance for defocus blur region detection, but also harm the useful information in each.
  • Thus, they are recurrently fused and refined in a cross-layer manner.
  • Yet, before the next aggregation, the number of feature channels is first reduced back to the original number by a convolution layer. The refined feature maps of each layer at the j-th recurrent step are: F_i^j = Conv(Concat(F_i^{j−1}, FSEF^j)) for the shallow layers (i ≤ m), and F_i^j = Conv(Concat(F_i^{j−1}, FSHF^j)) for the deep layers (i > m),
  • where F_i^j represents the feature maps of the i-th layer at the j-th recurrent step, and FSEF^j and FSHF^j represent the FSEF and FSHF at the j-th recurrent step, respectively. One refinement step is sketched below.
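
Following the cross-layer rule above, one refinement step might look like the following sketch; the channel-reducing convolution is the one mentioned in the first bullet, and its 1×1 kernel size is my assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerRefiner(nn.Module):
    """Sketch of one FFRM refinement: concatenate a layer's features with the
    cross-layer fused features (FSEF for shallow layers, FSHF for deep layers),
    then reduce the channels back to the layer's original number."""

    def __init__(self, feat_ch, fused_ch):
        super().__init__()
        # reduce the concatenated channels back to this layer's original count
        self.reduce = nn.Conv2d(feat_ch + fused_ch, feat_ch, kernel_size=1)

    def forward(self, feat, fused):
        # match the fused features to this layer's spatial resolution
        fused = F.interpolate(fused, size=feat.shape[2:], mode='bilinear',
                              align_corners=False)
        return self.reduce(torch.cat([feat, fused], dim=1))
```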

3. Defocus Maps Fusing

  • BTBNet proposed to use a multi-stream strategy to fuse the detection results from different image scales. However, this inevitably increases the computational burden.
  • Here, a supervision signal is imposed on each layer by using the deeply supervised mechanism at each recurrent step; then the output score maps of all the layers at the last step are fused to generate the final defocus blur map.
  • Specifically, the defocus blur maps predicted from the n different layers are first concatenated, then a convolution layer is imposed on the concatenated maps to obtain the final output defocus blur map B: B = Conv_{1×1}(Concat(B_t^1, …, B_t^n)),
  • where t denotes the last recurrent step, and B_t^i denotes the predicted defocus blur map from the i-th layer at the t-th step.
  • For the i-th layer at the j-th recurrent step, the pixel-wise cross-entropy loss between B_i^j and the ground-truth blur mask G is calculated: ℓ_i^j = −Σ_x [1(G(x) = 1) log Pr(B_i^j(x) = 1) + 1(G(x) = 0) log Pr(B_i^j(x) = 0)],
  • where 1(·) is the indicator function.
  • The final loss function is defined as the summation of the losses of all intermediate predictions: L = Σ_j Σ_i w_i^j ℓ_i^j.
  • All the weights w_i^j are set to 1 empirically. A sketch of the fusion and loss follows this list.
  • Training takes 11.6 hours for 10k iterations.
  • Only approximately 0.056s is needed to infer a 320×320 image using a single Nvidia GTX Titan Xp GPU.
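
A minimal sketch of the final fusion and the deeply supervised loss, assuming 1-channel side maps already upsampled to a common resolution; in the paper the loss is summed over all recurrent steps, while only one step is shown here, with all weights set to 1:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MapFusionLoss(nn.Module):
    """Sketch of the final map fusion and deeply supervised loss at one step."""

    def __init__(self, n_layers=6):
        super().__init__()
        self.fuse = nn.Conv2d(n_layers, 1, kernel_size=1)   # 1x1 fusion conv

    def forward(self, side_maps, gt):
        # concatenate the n per-layer maps and fuse into the final blur map B
        final_map = self.fuse(torch.cat(side_maps, dim=1))
        # pixel-wise cross entropy on every side output plus the fused map
        loss = sum(F.binary_cross_entropy_with_logits(b, gt) for b in side_maps)
        loss = loss + F.binary_cross_entropy_with_logits(final_map, gt)
        return final_map, loss
```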

4. Ablation Analysis

4.1. Effectiveness of FFRM

Ablation analysis using F-measure and MAE scores
  • DeFusionNet with the FFRM module performs better than DeFusionNet_noFFRM, which demonstrates that the cross-layer feature fusion effectively captures the complementary information between shallow features and deep semantic features.
  • In addition, DeFusionNet_noFFRM also performs better than previous methods.

4.2. Effectiveness of the Final Defocus Maps Fusion

  • The outputs of the different layers at the last recurrent step are fused to form the final result.
  • The final outputs of all the layers are represented as DeFusionNet_O1, DeFusionNet_O2, DeFusionNet_O3, DeFusionNet_O4, DeFusionNet_O5, DeFusionNet_O6.
  • It can be seen in the above table that the fusing mechanism effectively improves the final results.

4.3. Effectiveness of the Times of Recurrent Steps

Ablation analysis of the times of recurrent steps
  • The more recurrent steps, the better the results obtained.
  • DeFusionNet obtains relatively stable results when the number of recurrent steps is 3.

5. Experimental Results

5.1. Quantitative Comparison

Quantitative comparison of F-measure and MAE scores
Precision-recall curves and F-measure curves on Shi’s dataset
Precision-recall curves and F-measure curves on DUT dataset
  • DeFusionNet also consistently outperforms other counterparts for PR curves and F-measure curves on two datasets.

5.2. Qualitative Comparison

Visual comparison of detected defocus blur maps generated from different methods
  • DeFusionNet generates more accurate defocus blur maps when the input image contains in-focus smooth regions and background clutter.
  • In addition, the boundary information of the in-focus objects can be well preserved.

5.3. Running Efficiency Comparison

Average running time (seconds) for an image of different methods on different datasets
  • DeFusionNet is faster than all other methods.

The authors also extended DeFusionNET into a TPAMI paper. Hope I can review it later.
