Learning Day 68: Semantic segmentation 2 — DeepLab, atrous/dilated convolution

De Jun Huang
Jun 23, 2021

Background

  • The FCN from Day 67 still suffers from large-step upsampling from small feature maps to the final output, which blurs fine details
  • DeepLab aims to solve this problem and make object boundaries more accurate

DeepLab v1

  • CNN + CRF
  • Use atrous/dilated convolution in the deeper layers of the CNN

Atrous/Dilated convolution

  • Compared to a standard 3x3 filter, holes are inserted between the filter weights so that the same 3x3 filter covers a 5x5 area
Atrous/Dilated convolution using a dilated 3x3 filter (effective coverage is 5x5) (ref)
  • With a similar number of weights, the field of view is larger
  • Use it with an appropriate stride to replace the downsample-then-upsample (deconvolution) pipeline. The comparison below shows that the resulting feature map retains more detail.
Compare standard convolution with down- and up-sampling (top) with atrous convolution (bottom) (ref)
  • A concept called dilation rate: the number of holes inserted between weights = rate − 1. E.g. rate = 2 means 1 hole is inserted. However, the rate cannot be too big: if rate ≥ the input size, the convolution degenerates into a 1x1 filter, since only one weight lands on the input. See the sketch below.
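
To make this concrete, here is a minimal PyTorch sketch (my own illustration, not from the paper; layer sizes are arbitrary): a 3x3 convolution with dilation = 2 has the same 9 weights per channel as a standard 3x3 convolution but covers a 5x5 area.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

# Standard 3x3 convolution: 3x3 field of view.
standard = nn.Conv2d(3, 8, kernel_size=3, padding=1)

# Dilated 3x3 convolution with rate 2: same 9 weights per channel,
# but effective coverage = k + (k - 1) * (rate - 1) = 3 + 2 * 1 = 5.
dilated = nn.Conv2d(3, 8, kernel_size=3, padding=2, dilation=2)

# With matching padding, both preserve the 32x32 spatial size.
print(standard(x).shape)  # torch.Size([1, 8, 32, 32])
print(dilated(x).shape)   # torch.Size([1, 8, 32, 32])
```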

CRF (Conditional Random Field)

  • Take the rough segmentation result from the CNN and refine the boundaries using a fully connected CRF
  • I don’t quite understand the theory of CRF

DeepLab v2

  • The base model can be VGG16 or ResNet101
  • Introduce Atrous Spatial Pyramid Pooling (ASPP)

Atrous Spatial Pyramid Pooling (ASPP)

  • Apply atrous convolutions with different dilation rates in parallel to capture features at different scales (see the sketch after the figures below)
ASPP (ref)
An illustration of ASPP at Conv6 layer (right after Pool5) (ref)
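
A minimal sketch of the ASPP idea (my own illustration; the rates 6/12/18/24 follow v2's ASPP-L variant, and the channel sizes are assumed): parallel atrous branches over the same feature map, fused into one multi-scale output.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    # Parallel 3x3 atrous convolutions with different dilation rates;
    # each branch sees the input at a different scale. DeepLab v2 fuses
    # the branch outputs by summation (v3 concatenates instead).
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)

feat = torch.randn(1, 512, 32, 32)  # e.g. a Pool5-like feature map
print(ASPP(512, 256)(feat).shape)   # torch.Size([1, 256, 32, 32])
```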

DeepLab v3

  • More universal, as it can use any CNN structure as the backbone
  • Uses batch normalization inside ASPP
  • Drops the CRF post-processing
  • Uses atrous convolution layers in “series” (cascade) and in “parallel” (ASPP) connections; a cascade sketch follows the figures below
Deep CNN with atrous convolution in “series” connection or in cascade (ref)
ASPP in “parallel” connection (ref)
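
A minimal sketch of the “series” (cascade) connection (my own illustration; channel sizes and rates are assumed): stacking atrous convolutions with growing dilation rates keeps enlarging the receptive field without any further downsampling.

```python
import torch
import torch.nn as nn

# Atrous convolutions in "series": later layers use larger dilation
# rates, so the receptive field keeps growing while the feature map
# stays at the same resolution.
cascade = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1, dilation=1), nn.BatchNorm2d(256), nn.ReLU(),
    nn.Conv2d(256, 256, 3, padding=2, dilation=2), nn.BatchNorm2d(256), nn.ReLU(),
    nn.Conv2d(256, 256, 3, padding=4, dilation=4), nn.BatchNorm2d(256), nn.ReLU(),
)

x = torch.randn(1, 256, 32, 32)
print(cascade(x).shape)  # spatial size preserved: torch.Size([1, 256, 32, 32])
```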

DeepLab v3+

  • Expands on v3
  • Adds an encoder-decoder structure to preserve boundary information
  • The original DeepLab v3 is used as the encoder, applying atrous convolution at multiple scales
  • The decoder takes low-level features from the backbone and concatenates them with the upsampled encoder output, then applies a few conv layers and further upsampling (see the sketch below)
Encoder-decoder structure in DeepLab v3+ (ref)
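
A minimal sketch of the decoder idea (my own illustration; the shapes, the 48-channel reduction, and the 21-class head are assumptions for the example): compress the low-level features with a 1x1 conv, upsample the encoder output to match, concatenate, and refine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Low-level backbone features (fine spatial detail) and the coarse
# encoder (ASPP) output; shapes here are illustrative.
low_level = torch.randn(1, 256, 64, 64)
encoder_out = torch.randn(1, 256, 16, 16)

reduce_ch = nn.Conv2d(256, 48, kernel_size=1)  # compress low-level features
refine = nn.Sequential(
    nn.Conv2d(256 + 48, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 21, kernel_size=3, padding=1),  # 21 classes, e.g. PASCAL VOC
)

# Upsample the encoder output to the low-level resolution, concatenate, refine.
up = F.interpolate(encoder_out, size=low_level.shape[-2:], mode="bilinear",
                   align_corners=False)
out = refine(torch.cat([reduce_ch(low_level), up], dim=1))
print(out.shape)  # torch.Size([1, 21, 64, 64]); one more 4x upsample gives full size
```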

Reference

link1
