Review — EncNet: Context Encoding for Semantic Segmentation (Semantic Segmentation)

Using the Context Encoding Module, EncNet outperforms PSPNet, DeepLabv3, FCN, DilatedNet, DeepLabv2, CRF-RNN, DeconvNet, DPN, RefineNet & ResNet-38


Narrowing the list of probable categories based on scene context makes labeling much easier.

In this story, Context Encoding for Semantic Segmentation (EncNet), by Rutgers University, Amazon Inc., SenseTime, and The Chinese University of Hong Kong, is reviewed. In this paper:

  • Context Encoding Module is introduced, which captures the semantic context of scenes and selectively highlights class-dependent feature maps.
  • For example, in the above figure, a suite room scene will seldom contain a horse, but will more likely contain a chair, bed, curtain, etc. In this case, this module helps to highlight chair, bed, and curtain.

This is a paper in 2018 CVPR with over 500 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Context Encoding Module
  2. Semantic Encoding Losses
  3. Semantic Segmentation Results
  4. Image Classification Results

1. Context Encoding Module

Context Encoding Module & Semantic Encoding Losses (SE-loss)

1.1. Overall Architecture

  • Given an input image, a pre-trained ResNet is used to extract dense convolutional feature maps with the size of C×H×W.
  • The proposed Context Encoding Module is built on top, including an Encoding Layer designed to capture the encoded semantics and to predict scaling factors conditioned on these encoded semantics.
  • These learned factors selectively highlight class-dependent feature maps (visualized in colors).
  • In another branch, a Semantic Encoding Loss (SE-loss) regularizes the training by letting the Context Encoding Module predict the presence of the categories in the scene.
  • Finally, the representation of the Context Encoding Module is fed into the last convolutional layer to make the per-pixel prediction.

1.2. Encoding Layer (Proposed by Deep TEN)

Encoding Layer (Deep TEN). In this paper, the descriptors are the input feature maps.
  • The Encoding Layer considers an input feature map of shape C×H×W as a set of C-dimensional input descriptors X = {x1, …, xN}, where N = H×W is the total number of descriptors. It learns an inherent codebook D = {d1, …, dK} containing K codewords (visual centers) and a set of smoothing factors of the visual centers S = {s1, …, sK}. (A PyTorch sketch is given after this list.)
  • In this paper, the number of codewords K is 32 in the Encoding Layers.
  • First, the residual r_ik = x_i − d_k is obtained by subtracting each codeword d_k from each descriptor x_i.
  • Next, consider the weights for assigning the descriptors to the codewords. Hard-assignment provides a single non-zero assigning weight for each descriptor x_i, which corresponds to the nearest codeword.
  • Hard-assignment does not consider the codeword ambiguity and also makes the model non-differentiable.
  • Soft-weight assignment addresses this issue by assigning each descriptor to every codeword with a weight w_ik = exp(−β‖r_ik‖²) / Σ_j exp(−β‖r_ij‖²), where β is the smoothing factor.
  • Indeed, the smoothing factor can be made learnable per codeword, i.e. s_k: w_ik = exp(−s_k‖r_ik‖²) / Σ_j exp(−s_j‖r_ij‖²).
  • By aggregation, e_k is obtained: e_k = Σ_{i=1}^{N} w_ik r_ik, where N = H×W is the total number of descriptors, as mentioned above.
  • Then, the codeword-wise results are aggregated: e = Σ_{k=1}^{K} φ(e_k), where φ denotes batch normalization with ReLU.
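
Below is a minimal PyTorch sketch of the Encoding Layer as described above. This is my own illustration rather than the authors' code; the class name `EncodingLayer` and the initialization details are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncodingLayer(nn.Module):
    """Minimal sketch of the Encoding Layer (Deep TEN) with K learnable codewords."""

    def __init__(self, channels, num_codes=32):
        super().__init__()
        # Codebook D = {d1, ..., dK} and smoothing factors S = {s1, ..., sK}
        self.codewords = nn.Parameter(torch.randn(num_codes, channels) * 0.1)
        self.smoothing = nn.Parameter(torch.ones(num_codes))
        self.bn = nn.BatchNorm1d(num_codes)  # part of phi, applied per codeword

    def forward(self, x):
        # x: (B, C, H, W) viewed as N = H*W descriptors of dimension C
        b, c, _, _ = x.shape
        descriptors = x.view(b, c, -1).permute(0, 2, 1)                  # (B, N, C)
        # Residuals r_ik = x_i - d_k for every descriptor/codeword pair
        r = descriptors.unsqueeze(2) - self.codewords.view(1, 1, -1, c)  # (B, N, K, C)
        # Soft-assignment weights with learnable smoothing factors s_k
        w = F.softmax(-self.smoothing * r.pow(2).sum(dim=-1), dim=2)     # (B, N, K)
        # Aggregate over descriptors: e_k = sum_i w_ik * r_ik
        e = (w.unsqueeze(-1) * r).sum(dim=1)                             # (B, K, C)
        e = F.relu(self.bn(e))                                           # phi = BN + ReLU
        return e.sum(dim=1)                                              # e = sum_k phi(e_k): (B, C)
```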

1.3. Feature Map Attention

  • A fully connected layer on top of the Encoding Layer and a sigmoid activation function are used to output the predicted feature-map scaling factors γ = δ(W·e), where W denotes the layer weights and δ is the sigmoid function.
  • A channel-wise multiplication ⊗ is applied between the input feature maps X and the scaling factors γ to obtain the module output Y = X ⊗ γ.
  • The output predictions are upsampled 8 times using bilinear interpolation for calculating the loss.

As an intuitive example of the utility of the proposed approach, consider emphasizing the probability of an airplane in a sky scene, but de-emphasizing that of a vehicle.
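
As a concrete illustration of this channel-wise attention, here is a hedged PyTorch sketch building on the `EncodingLayer` sketch above; the class name `ContextEncodingAttention` and the exact wiring are my assumptions, not the official implementation:

```python
import torch
import torch.nn as nn

class ContextEncodingAttention(nn.Module):
    """Sketch of the feature-map attention in the Context Encoding Module."""

    def __init__(self, channels, num_codes=32):
        super().__init__()
        self.encoding = EncodingLayer(channels, num_codes)  # sketch from Section 1.2
        self.fc = nn.Linear(channels, channels)             # layer weights W

    def forward(self, x):
        e = self.encoding(x)                        # encoded semantics e: (B, C)
        gamma = torch.sigmoid(self.fc(e))           # gamma = delta(W e): (B, C)
        # Channel-wise multiplication: Y = X (x) gamma
        return x * gamma.view(x.size(0), -1, 1, 1)
```

For example, feeding a (2, 512, 60, 60) feature map returns a map of the same shape with class-dependent channels re-weighted.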

2. Semantic Encoding Losses

Dilation strategy and losses
  • The network may have difficulty understanding context without global information.

Semantic Encoding Loss (SE-loss) forces the network to understand the global semantic information with very small extra computation cost.

  • An additional fully connected layer with a sigmoid activation function is built on top of the Encoding Layer to make individual predictions for the presence of object categories in the scene, learned with a binary cross-entropy loss. (A sketch of the target construction is given after this list.)
  • Unlike the per-pixel loss, the SE-loss considers big and small objects equally. Thus, the segmentation of small objects is often improved.
  • The SE-losses are added at both stage 3 and stage 4 of the base network.
  • The ground truths of SE-loss are directly generated from the ground-truth segmentation mask without any additional annotations.
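
A minimal sketch of how such SE-loss targets could be derived from the segmentation mask in PyTorch; the function name `se_loss` and the ignore-label handling are my assumptions:

```python
import torch
import torch.nn.functional as F

def se_loss(category_logits, seg_mask, num_classes, ignore_index=255):
    """Sketch of the SE-loss: per-image category-presence BCE.

    category_logits: (B, num_classes) raw scores from the extra FC layer
    seg_mask:        (B, H, W) ground-truth segmentation labels
    """
    b = seg_mask.size(0)
    # Build presence targets directly from the ground-truth mask
    target = torch.zeros(b, num_classes, device=seg_mask.device)
    for i in range(b):
        present = seg_mask[i].unique()
        present = present[present != ignore_index]   # drop the ignore label
        target[i, present.long()] = 1.0              # category appears in the scene
    # Sigmoid + binary cross entropy (fused here for numerical stability)
    return F.binary_cross_entropy_with_logits(category_logits, target)
```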

3. Semantic Segmentation Results

3.1. Ablation Study on PASCAL-Context

Ablation Study on PASCAL-Context dataset
  • Compared to the baseline FCN, simply adding a Context Encoding Module on top yields 78.1% pixAcc and 47.6% mIoU.
The effect of weights of SE-loss α
  • To study the effect of the SE-loss, different weights α ∈ {0.0, 0.1, 0.2, 0.4, 0.8} are tested, and α = 0.2 yields the best performance.
The effect of number of codewords K
  • Also, beyond K = 32 codewords the improvement saturates (K = 0 means using global average pooling instead).
  • A deeper pre-trained network provides better feature representations: EncNet gains an additional 2.5% mIoU by employing ResNet-101.
  • Finally, multi-size evaluation yields the final scores of 81.2% pixAcc and 52.6% mIoU, which is 51.7% mIoU when including the background.

3.2. Results on PASCAL-Context

Segmentation results on PASCAL-Context dataset
  • The proposed EncNet outperforms previous state-of-the-art approaches without using COCO pre-training or a deeper model (ResNet-152).

3.3. Results on PASCAL VOC 2012

Results on PASCAL VOC 2012 testing set

3.4. Results on ADE20K

Segmentation results on ADE20K validation set
  • EncNet-101 achieves results comparable with the state-of-the-art PSPNet-269 while using a much shallower base network.
Result on ADE20K test set
  • The EncNet achieves a final score of 0.55675, surpassing PSPNet-269 (1st place in 2016) and all entries in the COCO Place Challenge 2017.

4. Image Classification Results

Comparison of model depth, number of parameters (M), test errors (%) on CIFAR-10
  • The Context Encoding Module can also be plugged into an image classification network.
  • A shallow 14-layer ResNet is used as baseline.
  • The SE module from SENet is added on top of each Resblock for comparison.
  • Similarly, the proposed Context Encoding Module can also be added on top of each Resblock, as sketched after this list.
  • A shallow 14-layer network with the Context Encoding Module achieves a 3.45% error rate on the CIFAR-10 dataset, as shown in the above table, which is comparable with state-of-the-art approaches such as WRN, ResNeXt, and DenseNet.
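
A rough sketch of how the module might be attached to a basic Resblock for classification, reusing the `ContextEncodingAttention` sketch above; the block structure here is an assumption for illustration, not the paper's exact CIFAR architecture:

```python
import torch.nn as nn

class EncResBlock(nn.Module):
    """Sketch: a basic Resblock with a Context Encoding Module on top."""

    def __init__(self, channels, num_codes=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.attention = ContextEncodingAttention(channels, num_codes)  # sketch above
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.attention(self.body(x))  # rescale channels by encoded context
        return self.relu(out + x)           # residual connection
```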
