Review — DFN: Discriminative Feature Network (Semantic Segmentation)

With Smooth Network & Border Network, Outperforms DeepLabv3+, PSPNet, ResNet-38, RefineNet, GCN, DUC, DeepLabv2, ParseNet, DPN, FCN.

Sik-Ho Tsang
Nerd For Tech
6 min read · Apr 25, 2021

Hard examples in semantic segmentation

In this story, Learning a Discriminative Feature Network for Semantic Segmentation (DFN), by Huazhong University of Science and Technology, Peking University, and Megvii Inc. (Face++), is reviewed. In this paper:

  • The Discriminative Feature Network (DFN) has two sub-networks.
  • One is the Smooth Network, which handles the intra-class inconsistency problem using Channel Attention Blocks and global average pooling to select the more discriminative features.
  • The other is the Border Network, which makes the bilateral features of the boundary distinguishable with deep semantic boundary supervision.

This is a paper in 2018 CVPR with over 360 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. DFN: Network Architecture
  2. Smooth Network
  3. Border Network
  4. Ablation Study
  5. Experimental Results

1. DFN: Network Architecture

DFN: Overall Network Architecture
  • ImageNet pre-trained ResNet-101 is used. FCN4 is used as the base segmentation framework.
  • ResNet has multiple stages, corresponding to the different feature map sizes.
  • The loss function is composed of the segmentation loss ℓ_s from the Smooth Network and the boundary loss ℓ_b from the Border Network, balanced by a weight λ:

L = ℓ_s + λ·ℓ_b

  • These two networks and their losses are detailed below.
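As a quick preview, here is a minimal PyTorch sketch of how the two losses could be combined. The function and argument names (dfn_loss, seg_logits, border_logits, …) are hypothetical, and the focal-loss form anticipates Section 3; only the structure L = ℓ_s + λ·ℓ_b and the value λ = 0.1 (selected in the ablation study later) come from the paper.

```python
import torch
import torch.nn.functional as F

def dfn_loss(seg_logits, seg_labels, border_logits, border_labels,
             lam=0.1, gamma=2.0):
    """Total DFN loss: segmentation loss (Smooth Network) plus a
    lambda-weighted boundary loss (Border Network)."""
    # l_s: per-pixel cross-entropy on the segmentation output
    l_s = F.cross_entropy(seg_logits, seg_labels)
    # l_b: focal loss on the binary boundary output (see Section 3)
    p = torch.sigmoid(border_logits)
    p_t = torch.where(border_labels > 0.5, p, 1.0 - p)
    l_b = (-(1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-6))).mean()
    return l_s + lam * l_b
```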

2. Smooth Network

  • As shown in the figure of the Overall Network Architecture, the Smooth Network is composed of Channel Attention Blocks (CABs) and Refinement Residual Blocks (RRBs).

2.1. Channel Attention Blocks (CABs)

Channel Attention Block (CAB)
  • In the FCN architecture, the convolution operator outputs a score map, which gives the probability of each class at each pixel.
  • The final score on the score map is simply summed over all channels of the feature maps:

y_k = F(x; w) = Σ_{i=1}^{K} w_i · x_i

  • where x is the output feature of the network, w represents the convolution kernel, K is the number of channels, and D is the set of pixel positions at which the score is computed.

However, the above equation implicitly indicates that the weights of different channels are equal.
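A tiny NumPy check makes this explicit: the score of one class produced by a 1×1 convolution is a weighted sum across the K channels, with the same weight w_i applied at every pixel (all names here are illustrative, not from the paper).

```python
import numpy as np

K, H, W = 4, 3, 3                       # channels and spatial size
x = np.random.randn(K, H, W)            # output feature of the network
w = np.random.randn(K)                  # 1x1 conv kernel for one class

# Score map for this class: sum over channels, same w[i] everywhere
y = sum(w[i] * x[i] for i in range(K))  # shape (H, W)

# Equivalent einsum form: every pixel shares the channel weights
assert np.allclose(y, np.einsum('k,khw->hw', w, x))
```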

  • With CAB, α is a vector of channel weights estimated by the CAB from the feature map responses, as shown in the figure above:

ȳ = α · y = α · F(x; w)

  • where α is passed through a sigmoid, so that each channel weight lies in (0, 1), before being multiplied with the input:

α = Sigmoid(x; w)
  • This idea originates from SENet. (Please feel free to read SENet if interested.)

By using CAB, intra-class consistent prediction is obtained: the discriminative features are emphasized while the indiscriminative features are inhibited.
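A minimal PyTorch sketch of a CAB, following the figure: the concatenated low- and high-stage features are globally pooled, passed through two 1×1 convolutions, and the sigmoid output α re-weights the low-stage channels. The exact layer stack here is an assumption, not the authors' released code.

```python
import torch
import torch.nn as nn

class ChannelAttentionBlock(nn.Module):
    """CAB: use both stages to predict a channel attention vector
    alpha that re-weights the lower-stage features."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),           # global average pooling
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),                      # alpha_k in (0, 1) per channel
        )

    def forward(self, low, high):
        # Estimate alpha from both stages, then re-weight `low`
        alpha = self.attn(torch.cat([low, high], dim=1))
        return low * alpha + high
```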

2.2. Refinement Residual Blocks (RRBs)

Refinement Residual Block (RRB)
  • The first component of the RRB is a 1×1 convolution layer. The number of channels is unified to 512. Meanwhile, it can combine the information across all channels.
  • Then the following is a basic residual block, which can refine the feature map.

Thus, this block can strengthen the recognition ability of each stage, inspired by the architecture of ResNet.
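The RRB is simple enough to sketch directly: a 1×1 convolution unifying the channels to 512, followed by a basic residual block. The 3×3 kernels and BatchNorm placement below are assumptions modeled on ResNet's basic block.

```python
import torch.nn as nn

class RefinementResidualBlock(nn.Module):
    """RRB: 1x1 conv to unify channels, then a basic residual block."""
    def __init__(self, in_channels, channels=512):
        super().__init__()
        # Unify the channel count and mix information across channels
        self.reduce = nn.Conv2d(in_channels, channels, kernel_size=1)
        # Basic residual branch, as in ResNet
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.reduce(x)
        return self.relu(x + self.residual(x))
```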

3. Border Network

  • Border Network, as shown in the figure of Overall Network Architecture, is used to enlarge the inter-class distinction of features.
  • To extract an accurate semantic boundary, explicit supervision of the semantic boundary is applied, making the network learn features with a strong inter-class distinctive ability.
  • This network can simultaneously get accurate edge information from the low stages and semantic information from the high stages, which eliminates some original edges that lack semantic information.
  • The semantic information from the high stages refines the detailed edge information from the low stages, stage-wise.
  • The supervisory signal of this network is obtained from the semantic segmentation ground truth with a traditional image processing method, such as the Canny operator (see the sketch after this list).
  • To remedy the imbalance of the positive and negative samples, the focal loss from RetinaNet is used:

FL(p_k) = −(1 − p_k)^γ · log(p_k)

  • where p_k is the estimated probability for class k and γ is the focusing parameter.
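As a rough illustration of this boundary supervision, the binary boundary map could be derived from the segmentation label map with OpenCV's Canny operator, as the paper describes. The thresholds and the optional dilation below are placeholder choices, not values from the paper.

```python
import cv2
import numpy as np

def boundary_groundtruth(seg_label, dilate=2):
    """Derive a binary semantic-boundary map from an (H, W) label
    map of integer class ids, using the Canny operator."""
    edges = cv2.Canny(seg_label.astype(np.uint8), 1, 2)
    if dilate > 0:
        # Thicken the boundary so thin edges survive downsampling
        kernel = np.ones((3, 3), np.uint8)
        edges = cv2.dilate(edges, kernel, iterations=dilate)
    return (edges > 0).astype(np.float32)  # 1 = boundary, 0 = background
```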

4. Ablation Study

4.1. Smooth Network

Baseline Performance on PASCAL VOC 2012
  • The performance of the base ResNet-101 is as shown above.
  • With 5 scales {0.5, 0.75, 1, 1.5, 1.75}, 72.86% mIOU is achieved on PASCAL VOC 2012.
Detailed performance comparison of the proposed Smooth Network. RRB: refinement residual block. GP: global pooling branch. CAB: channel attention block. DS: deep supervision.
  • RRB: The proposed RRB improves the performance from 72.86% to 76.65%.
  • GP: The global average pooling introduces the strongest consistency to guide other stages. This improves the performance from 76.65% to 78.20%, which is an obvious improvement.
  • DS: With deep supervision, the performance further improves from 78.20% to 78.51%.
  • CAB: CAB utilizes the high stage to guide the low stage with a channel attention vector to enhance consistency, which improves the performance from 78.51% to 79.54%.
Results of Smooth Network on PASCAL VOC 2012

4.2. Border Network

Combining the Border Network and Smooth Network as Discriminative Feature Network. SN: Smooth Network. BN: Border Network. MS Flip: Adding multi-scale inputs and left-right flipped inputs.
  • With Border Network integrated into the Smooth Network, this improves the performance from 79.54% to 79.67%.
  • The Border Network optimizes the semantic boundary, which is a comparatively small part of the whole image, so this design brings only a minor improvement.
The boundary on prediction is refined by the Border Network.
  • As shown above, the Border Network refines not only the boundaries but also the predictions.
The boundary prediction of Border Network on PASCAL VOC 2012 dataset
  • As shown above, the third column is the semantic boundary extracted from the ground truth by the Canny operator.
  • The last column is the prediction results of Border Network.

4.3. Discriminative Feature Network (DFN)

  • Different balance values λ are tested for the total loss.
  • λ = 0.1 is selected as it gives the best results.
Stage-wise refinement process on PASCAL VOC 2012 dataset.
  • The segmentation prediction in the lower stages is spatially coarser, while that in the higher stages is finer.
  • While the boundary prediction in the lower stages contains many edges that do not belong to the semantic boundary, the semantic boundary in the higher stages is purer.

5. Experimental Results

5.1. PASCAL VOC 2012

Validation strategy on PASCAL VOC 2012 dataset. MS Flip: Multi-scale and flip evaluation.
  • Train_data: The model is further fine-tuned on the PASCAL VOC 2012 train set.
  • MS_Flip: The multi-scale inputs {0.5, 0.75, 1, 1.5, 1.75} and horizontal flipping are applied at evaluation, as sketched below.
  • 80.6% mIOU is obtained on the validation set for DFN.
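The multi-scale, flipped evaluation can be sketched as follows: run the model on each rescaled (and mirrored) input, map the logits back to the original resolution, and average the softmax probabilities. The model interface and interpolation settings are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ms_flip_predict(model, image, scales=(0.5, 0.75, 1.0, 1.5, 1.75)):
    """Average predictions over scales and horizontal flips.
    `image`: (1, 3, H, W) tensor; `model` returns (1, C, h, w) logits."""
    _, _, H, W = image.shape
    prob_sum = 0.0
    for s in scales:
        x = F.interpolate(image, scale_factor=s, mode='bilinear',
                          align_corners=False)
        for flip in (False, True):
            inp = torch.flip(x, dims=[3]) if flip else x
            logits = model(inp)
            if flip:
                logits = torch.flip(logits, dims=[3])
            # Back to the original resolution, then accumulate softmax
            logits = F.interpolate(logits, size=(H, W), mode='bilinear',
                                   align_corners=False)
            prob_sum = prob_sum + logits.softmax(dim=1)
    return prob_sum.argmax(dim=1)  # (1, H, W) predicted class ids
```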
Performance on PASCAL VOC 2012 test set
  • The PASCAL VOC 2012 trainval set is used to further fine-tune the proposed method. In the end, DFN achieves 82.7% and 86.2% without and with MS-COCO fine-tuning, respectively, as shown above.
  • Note that the DenseCRF post-processing used in DeepLabv1 is not applied for DFN.
  • Finally, DFN outperforms DeepLabv3+, PSPNet, ResNet-38, RefineNet, GCN, DUC, DeepLabv2, ParseNet, DPN, FCN.

5.2. Cityscapes

Performance on Cityscapes test set
Example results of DFN on Cityscapes dataset.
