Review — MsFEN+MsBEN: Defocus Blur Detection by Fusing Multiscale Deep Features With Conv-LSTM (Blur Detection)

Using VGGNet+Conv-LSTM, Outperforms BTBCRL, DeFusionNet & DHDE

Sik-Ho Tsang
The Startup
7 min read · Jan 16, 2021


An example of defocus blur detection

In this story, Defocus Blur Detection by Fusing Multiscale Deep Features With Conv-LSTM, MsFEN+MsBEN, by Civil Aviation University of China, is reviewed. In this paper:

  • Multiscale convolutional features are extracted from the same image resized to different sizes.
  • Conv-LSTMs are used to integrate the fused features gradually from top-to-bottom layers, and to generate multiscale blur estimations.

This is a paper in 2020 IEEE ACCESS, where ACCESS is an open access journal with a high impact factor of 3.745. (Sik-Ho Tsang @ Medium)

Outline

  1. Multiscale Feature Extraction Sub-Network (MsFEN)
  2. Multiscale Blur Detection Sub-Network (MsBEN)
  3. Experimental Results

1. Multiscale Feature Extraction Sub-Network (MsFEN)

Multiscale Feature Extraction Sub-Network (MsFEN) + Multiscale Blur Detection Sub-Network (MsBEN)
  • VGG16 is used as the basic feature extractor.
  • The last pooling layer and last two fully connected layers are removed.
  • The ImageNet-pretrained VGG16 is used for weight initialization.
  • The remaining five convolutional blocks (conv1-5) are used to extract multiscale convolutional features from the input images, shown as the blue boxes in the figure above. The table below shows the details:
Details of the feature extraction network
  • The features extracted from three scaled images have different spatial resolutions.
  • Let I denote an input image. I is firstly resized into three different scales, denoted as I1, I2, and I3 with sizes of 320×320, 256×256 and 192×192, respectively.
  • MsFEN is used to extract multiscale convolutional features Fˢconvl from Iₛ, where s indexes the scale and l the conv block.
  • Bilinear interpolation (BI) is used to upsample to the same size before concatenation. The resized feature can be formulated as:
  • where a 1×1 convolution is applied before upsampling to change the channel number of the convolutional features to 64.
  • Batch normalization (BN) and ReLU are applied following the convolutional layers.
  • F¹convl, F²upl, and F³upl are concatenated and then fused with a convolutional layer to obtain the fused feature Fconvl:

By concatenating the multiscale convolutional features, more robust features are obtained to overcome the blur-scale ambiguity.
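Below is a minimal PyTorch sketch of this extraction-and-fusion pipeline. It is not the authors' code: the VGG16 block slicing, the 3×3 fusion kernel, and all module and variable names are assumptions of this sketch.

```python
# Minimal MsFEN sketch in PyTorch (illustrative only, see note above).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class MsFEN(nn.Module):
    def __init__(self):
        super().__init__()
        # VGG16 feature extractor; the last pooling layer (index 30) and the
        # fully connected layers are not used. Requires torchvision >= 0.13.
        feats = vgg16(weights="IMAGENET1K_V1").features
        self.blocks = nn.ModuleList([feats[0:4], feats[4:9], feats[9:16],
                                     feats[16:23], feats[23:30]])  # conv1-5
        chans = [64, 128, 256, 512, 512]
        # 1x1 conv + BN + ReLU to change every feature map to 64 channels
        self.reduce = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, 64, 1), nn.BatchNorm2d(64),
                          nn.ReLU(inplace=True)) for c in chans])
        # per-level fusion of the three concatenated scales (3x3 kernel assumed)
        self.fuse = nn.ModuleList([
            nn.Sequential(nn.Conv2d(64 * 3, 64, 3, padding=1),
                          nn.BatchNorm2d(64), nn.ReLU(inplace=True))
            for _ in chans])

    def forward(self, image):
        sizes = [320, 256, 192]              # the three scaled inputs I1, I2, I3
        per_scale = []                       # per_scale[s][l] = feature of scale s at level l
        for s in sizes:
            x = F.interpolate(image, size=(s, s), mode='bilinear',
                              align_corners=False)
            level_feats = []
            for block, red in zip(self.blocks, self.reduce):
                x = block(x)                 # raw conv-block feature
                level_feats.append(red(x))   # reduced to 64 channels
            per_scale.append(level_feats)
        fused = []
        for l in range(5):
            target = per_scale[0][l].shape[-2:]   # resolution of scale 1 at level l
            cat = [per_scale[0][l]] + [
                F.interpolate(per_scale[s][l], size=target, mode='bilinear',
                              align_corners=False) for s in (1, 2)]
            fused.append(self.fuse[l](torch.cat(cat, dim=1)))   # fused feature Fconvl
        return fused                         # [Fconv1, ..., Fconv5]
```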

2. Multiscale Blur Detection Sub-Network (MsBEN)

2.1. MsBEN using Conv-LSTM

  • With the fused features Fconv1-5, top-to-bottom integration is conducted by feeding them to Conv-LSTMs.
  • To gradually refine the blur maps, the upsampled blur map from the (l+1)-th layer is concatenated with the fused feature Fconvl, because features of the bottom layers contain more fine structure information.
  • Then the concatenated features are input into Conv-LSTM to estimate blur map Bl.
  • where σ is a sigmoid activation to normalize Fconvl into [0,1].
  • With Conv-LSTM layers, the blur information of the top layers is gradually integrated with features of bottom layers to robustly generate accurate blur detection result.
Blur maps estimated by Conv-LSTM5 (a) to Conv-LSTM1 (e)

We can see that the blur maps are refined progressively.
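Below is a minimal PyTorch sketch of this top-to-bottom refinement, assuming 64-channel fused features from MsFEN. It is not the authors' code: the module names, the compact ConvLSTMCell (which folds the four gates into one 3×3 convolution and omits the peephole terms discussed in Section 2.2), and the zero initialization of the hidden and cell states are assumptions of this sketch.

```python
# Minimal MsBEN sketch in PyTorch (illustrative only, see note above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvLSTMCell(nn.Module):
    """Compact Conv-LSTM cell: one 3x3 conv produces all four gate maps
    (peephole connections omitted in this sketch)."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)    # update cell state C_t
        h = o * torch.tanh(c)            # update hidden state H_t
        return h, c

class MsBEN(nn.Module):
    def __init__(self, channels=64, steps=3):
        super().__init__()
        self.channels, self.steps = channels, steps
        # Conv-LSTM5 sees only Fconv5; Conv-LSTM4..1 also see the upsampled
        # blur map from the layer above (hence one extra input channel).
        self.cells = nn.ModuleList(
            [ConvLSTMCell(channels, channels)] +
            [ConvLSTMCell(channels + 1, channels) for _ in range(4)])
        # 1x1 conv on the last hidden state -> 1-channel blur map per level
        self.heads = nn.ModuleList([nn.Conv2d(channels, 1, 1) for _ in range(5)])

    def forward(self, fused):                  # fused = [Fconv1, ..., Fconv5]
        blur_maps, prev = [None] * 5, None
        for l in range(4, -1, -1):             # from conv5 (top) down to conv1
            x = fused[l]
            if prev is not None:               # concat upsampled coarser map B_(l+1)
                up = F.interpolate(prev, size=x.shape[-2:], mode='bilinear',
                                   align_corners=False)
                x = torch.cat([x, up], dim=1)
            h = c = x.new_zeros(x.size(0), self.channels, *x.shape[-2:])
            for _ in range(self.steps):        # max time step = 3
                h, c = self.cells[4 - l](x, h, c)
            prev = torch.sigmoid(self.heads[l](h))   # B_l, normalized into [0, 1]
            blur_maps[l] = prev
        return blur_maps                       # B_1 ... B_5, each supervised in training
```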

2.2. Conv-LSTM

Details of the Conv-LSTM. (a) is the procedure of a Conv-LSTM. (b) is the internal computation of Conv-LSTM
  • Conv-LSTM is developed from the traditional fully connected LSTM for capturing spatially-related features.
  • This layer uses convolutional operations instead of dot products to code spatial information.
  • Similar to LSTM, Conv-LSTM has three gates, namely the input gate iₜ, output gate oₜ, and forget gate fₜ, to control the information flow.
  • The cell state Cₜ and hidden state Hₜ are updated according to the values of iₜ, oₜ, and fₜ.
  • Let Xₜ denote the input of the Conv-LSTM layer at each time step.
  • If iₜ is activated, the input is accumulated into the cell Cₜ.
  • If fₜ is activated, the past cell state Cₜ₋₁ is neglected.
  • The propagation of information from Cₜ to Hₜ is controlled by the output gate oₜ.
  • The update process at time step t is as follows (a reference formulation is reproduced after this list):
  • where * is convolution. The max time step is set to 3 in the experiments.
  • A convolutional layer with a 1×1×1 filter is applied on the hidden state of the last time step to obtain the estimated blur map.
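For reference, a minimal sketch of the standard Conv-LSTM update (Shi et al., 2015), which matches the gating described above; the peephole terms (the W∘C products) belong to this reference formulation and may or may not be kept in the paper's variant:

```latex
\begin{aligned}
i_t &= \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i)\\
f_t &= \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f)\\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c)\\
o_t &= \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o)\\
H_t &= o_t \circ \tanh(C_t)
\end{aligned}
```

Here ∘ denotes the Hadamard (element-wise) product.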

2.3. Multi-Layer Losses

  • The blur map at each scale is supervised.
  • Cross-entropy loss function LC is used.
  • Precision LP, Recall LR, F-measure LFβ, and MAE LMAE are also used as part of the loss function:
  • where α1, α2, α3, and α4 are empirically set to 0.1.
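Below is a hedged PyTorch sketch of such a multi-layer loss. Only the cross-entropy term, the MAE term, the per-scale supervision, and the α weights of 0.1 come from the text; the soft precision/recall/F-measure definitions and the weighted-sum combination are assumptions of this sketch.

```python
# Hedged sketch of the multi-layer loss (assumed form; the paper's exact
# definitions of L_P, L_R, L_Fbeta and their combination are not shown here).
import torch
import torch.nn.functional as F

def blur_loss(pred, gt, beta2=0.3, alphas=(0.1, 0.1, 0.1, 0.1), eps=1e-6):
    """pred: predicted blur map in [0, 1]; gt: binary ground-truth blur mask."""
    l_c = F.binary_cross_entropy(pred, gt)           # cross-entropy term L_C
    tp = (pred * gt).sum()
    prec = tp / (pred.sum() + eps)                    # soft precision
    rec = tp / (gt.sum() + eps)                       # soft recall
    fbeta = (1 + beta2) * prec * rec / (beta2 * prec + rec + eps)
    l_mae = torch.abs(pred - gt).mean()               # MAE term L_MAE
    a1, a2, a3, a4 = alphas
    # higher precision/recall/F-measure should lower the loss, hence (1 - x)
    return l_c + a1 * (1 - prec) + a2 * (1 - rec) + a3 * (1 - fbeta) + a4 * l_mae

def total_loss(blur_maps, gt):
    """Supervise the blur map at every scale, as described above."""
    total = 0.0
    for b in blur_maps:                               # B_1 ... B_5 from MsBEN
        g = F.interpolate(gt, size=b.shape[-2:], mode='nearest')
        total = total + blur_loss(b, g)
    return total
```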

3. Experimental Results

3.1. Synthesized Blur Data

Examples of synthesized defocus blur images and their corresponding ground truths
  • 2000 images from the Berkeley segmentation dataset (BSDS), the uncompressed color image dataset (UCID), and the PASCAL 2008 dataset are used.
  • Gaussian blur with a 7×7 kernel and σ=2 is applied five times to blur the top-, bottom-, left-, and right-half image regions (a sketch of this synthesis is given at the end of this subsection).
  • Thus, each image can generate 20 blurred images.

A total of 40,000 synthetic blur images are obtained for pre-training.

  • After pre-training, fine-tuning is used on real dataset.
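Below is a hedged OpenCV sketch of this synthesis. The 7×7 kernel and σ=2 come from the text; interpreting "five times" as one to five repeated blur passes per half-region (so that 4 halves × 5 variants = 20 images per source image), as well as the function name, are assumptions of this sketch.

```python
# Hedged sketch of the synthetic-blur generation (illustrative only).
import cv2
import numpy as np

def synthesize(image):
    """Return 20 partially blurred images and their binary blur masks."""
    h, w = image.shape[:2]
    halves = {                                   # slices selecting each half-region
        'top': (slice(0, h // 2), slice(None)),
        'bottom': (slice(h // 2, h), slice(None)),
        'left': (slice(None), slice(0, w // 2)),
        'right': (slice(None), slice(w // 2, w)),
    }
    samples = []
    for rows, cols in halves.values():
        for repeats in range(1, 6):              # five blurred variants per half
            blurred = image.copy()
            region = np.ascontiguousarray(blurred[rows, cols])
            for _ in range(repeats):             # repeated 7x7 Gaussian blur, sigma = 2
                region = cv2.GaussianBlur(region, (7, 7), 2)
            blurred[rows, cols] = region
            mask = np.zeros((h, w), np.uint8)
            mask[rows, cols] = 1                 # 1 marks the blurred region
            samples.append((blurred, mask))
    return samples                               # 4 halves x 5 variants = 20 images
```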

3.2. Qualitative Analysis

Visual comparison of defocus blur detection results of different blur detectors.
  • Scene1 contains a large homogeneous region.
  • SS and DeFusionNet fail to detect the large homogeneous regions in the first and third images, respectively.
  • DHDE cannot suppress the values of the blurred backgrounds in all images.
  • HiFST blurs the boundaries.
  • The blur detection results of the proposed method are accurate in homogeneous regions.
  • Scene2 contains a cluttered background.
  • Almost all compared blur detectors treat the bright spots as sharp regions, whereas the proposed method accurately detects the sharp regions.
  • In the second image, the compared methods fail to detect the doll’s body, while the proposed method detects the entire sharp body region.
  • Scene3 has images with similar foreground and background.
  • BTBCRL fails to detect the real sharp regions from the blurred backgrounds.
  • HiFST treats the blurred background as sharp.
  • SS and DHDE misclassify the blurred background into a sharp region.
  • The proposed method achieves correct blur detection results on two representative images.

3.3. Quantitative Analysis

Quantitative comparison of F0.3, F1 and MAE scores of different blur detectors
  • The first partition, CUHK1, is the same as that used by DeFusionNet.
  • The second partition, CUHK2, is the one proposed by BTBCRL.
  • The proposed method achieves the highest F-measure values and the lowest MAE values on the DUT and CUHK2 datasets.
  • On CUHK1, the proposed method ranks second, while DeFusionNet ranks first.
Comparison of PR curves, Precision, Recall and F-measure
  • The Precision, Recall, and F-measure values of the proposed method are the highest among the compared methods on both datasets, because its results are close to a binary blur map, whose Precision and Recall are insensitive to the threshold.

3.4. Running Time

Comparison of running time on an image with size of 320×320
  • The proposed method is a bit slower than DeFusionNet.
  • The proposed method takes 0.06 s to process an image, which is still faster than the other compared methods.

3.5. Ablation Study

Performance of Different VGG
  • Using deeper networks can improve the performance of defocus blur detection, but also consumes more computational resources.
  • It is found that VGG16 achieves similar performance to VGG19 but takes less running time.
  • Thus, VGG16 is used as the basic feature extractor.
Performance of different combinations of image scales
  • Using only two scales as inputs, i.e. Net-S², does not improve the blur detection performance on the DUT dataset.
  • Meanwhile, Net-S³, i.e. using three scales, performs better than Net-S¹ and Net-S².
Details of Conv-LSTM, Single-Conv and Multi-Conv
  • The first variant replaces the Conv-LSTM with a single convolutional layer, named Single-Conv.
  • The second variant replaces the Conv-LSTM with three convolutional layers that have the same number of parameters as the corresponding Conv-LSTM, denoted as Multi-Conv.
Performance of Conv-LSTM, Single-Conv and Multi-Conv
  • It is found that all evaluation criteria using Conv-LSTM layers are better than those using convolutional layers.
