Published in Nerd For Tech

Review — EncNet: Context Encoding for Semantic Segmentation (Semantic Segmentation)

Using a Context Encoding Module, EncNet Outperforms PSPNet, DeepLabv3, FCN, DilatedNet, DeepLabv2, CRF-RNN, DeconvNet, DPN, RefineNet & ResNet-38

Narrowing the list of probable categories based on scene context makes labeling much easier.

In this story, Context Encoding for Semantic Segmentation (EncNet), by Rutgers University, Amazon Inc, SenseTime, and The Chinese University of Hong Kong, is reviewed. In this paper:

  • Context Encoding Module is introduced, which captures the semantic context of scenes and selectively highlights class-dependent feature maps.
  • For example, in the above figure, a suite room scene will seldom contain a horse, but is much more likely to contain a chair, bed and curtain. In this case, the module helps to highlight chair, bed and curtain.

This is a paper in 2018 CVPR with over 500 citations. (Sik-Ho Tsang @ Medium)


  1. Context Encoding Module
  2. Semantic Encoding Losses
  3. Semantic Segmentation Results
  4. Image Classification Results

1. Context Encoding Module

Context Encoding Module & Semantic Encoding Losses (SE-loss)

1.1. Overall Architecture

  • Given an input image, a pre-trained ResNet is used to extract dense convolutional feature maps with the size of C×H×W.
  • The proposed Context Encoding Module on top, which includes an Encoding Layer, is designed to capture the encoded semantics and predict scaling factors that are conditional on these encoded semantics.
  • These learned factors selectively highlight class-dependent feature maps (visualized in colors).
  • In another branch, a Semantic Encoding Loss (SE-loss) is used to regularize the training, which lets the Context Encoding Module predict the presence of the categories in the scene.
  • Finally, the representation of the Context Encoding Module is fed into the last convolutional layer to make per-pixel predictions.

1.2. Encoding Layer (Proposed by Deep TEN)

Encoding Layer (Deep TEN). In this paper, descriptors are the input feature maps.
  • The Encoding Layer treats an input feature map of shape C×H×W as a set of C-dimensional input features X = {x1, …, xN}, where N = H×W is the total number of features. It learns an inherent codebook D = {d1, …, dK} containing K codewords (visual centers) and a set of smoothing factors S = {s1, …, sK}, one per visual center.
  • In this paper, the number of codewords K is 32 in Encoding Layers.
  • First, the residual rik is obtained by subtracting each codeword dk from each descriptor xi:
      rik = xi − dk
  • Consider the assigning weights for assigning the descriptors to the codewords. Hard-assignment provides a single non-zero assigning weight for each descriptor xi, which corresponds to the nearest codeword.
  • Hard-assignment doesn’t consider the codeword ambiguity and also makes the model non-differentiable.
  • Soft-weight assignment addresses this issue by assigning a descriptor to each codeword with a weight aik:
      aik = exp(−β‖rik‖²) / Σj exp(−β‖rij‖²), with j = 1, …, K
  • where β is the smoothing factor.
  • Indeed, the smoothing factor can be learnable, i.e. sk:
      aik = exp(−sk‖rik‖²) / Σj exp(−sj‖rij‖²), with j = 1, …, K
  • By aggregation as below, ek is obtained:
      ek = Σi aik rik, with i = 1, …, N
  • where N is total number of features given by H×W, as mentioned above.
  • Then, the K encoders are aggregated into a single vector e:
      e = Σk ϕ(ek), with k = 1, …, K
  • where ϕ is batch normalization and ReLU.
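
The Encoding Layer steps above can be sketched in NumPy as follows. This is a minimal forward-pass sketch under stated assumptions, not the authors' implementation: the function name `encoding_layer` and the toy shapes are illustrative, the soft assignment is assumed to be a softmax over the codewords of −sk‖rik‖² (as in Deep TEN), and ϕ is reduced to a plain ReLU because batch normalization needs batch statistics.

```python
import numpy as np

def encoding_layer(fmap, D, s):
    """Forward pass of an Encoding Layer (illustrative sketch).

    fmap: (C, H, W) feature maps; D: (K, C) codewords; s: (K,) smoothing factors.
    Returns e: (C,) encoded vector and A: (N, K) soft-assignment weights.
    """
    C = fmap.shape[0]
    X = fmap.reshape(C, -1).T                  # (N, C) descriptors, N = H*W
    R = X[:, None, :] - D[None, :, :]          # residuals r_ik = x_i - d_k, (N, K, C)
    sq_dist = np.sum(R ** 2, axis=-1)          # ||r_ik||^2, (N, K)
    logits = -s[None, :] * sq_dist             # -s_k * ||r_ik||^2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(logits)
    A = A / A.sum(axis=1, keepdims=True)       # softmax over the K codewords
    E = np.einsum("nk,nkc->kc", A, R)          # e_k = sum_i a_ik * r_ik, (K, C)
    # Assumption: phi is simplified to ReLU only (batch normalization omitted).
    e = np.maximum(E, 0.0).sum(axis=0)         # e = sum_k phi(e_k), (C,)
    return e, A

rng = np.random.default_rng(0)
fmap = rng.normal(size=(8, 4, 4))   # toy sizes: C = 8, H = W = 4
D = rng.normal(size=(32, 8))        # K = 32 codewords, as used in the paper
s = np.ones(32)                     # smoothing factors (learnable in practice)
e, A = encoding_layer(fmap, D, s)
print(e.shape, A.shape)             # (8,) (16, 32)
```

Note that the output e has a fixed length C regardless of H and W, which is what lets a fully connected layer sit on top of it.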

1.3. Feature Map Attention

  • A fully connected layer on top of the Encoding Layer, with a sigmoid as the activation function, outputs the predicted feature map scaling factors:
      γ = δ(We)
  • where W denotes the layer weights and δ is the sigmoid function.
  • A channel-wise multiplication ⊗ is applied between the input feature maps X and the scaling factors γ to obtain the module output:
      Y = X ⊗ γ
  • The output predictions are upsampled 8 times using bilinear interpolation for calculating the loss.
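
The scaling step described above can be sketched in NumPy. This is only an illustration under assumptions: the function name `feature_map_attention`, the hand-picked weights, and the toy shapes are hypothetical, and the fully connected layer is written without a bias term.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_map_attention(fmap, e, W):
    """Scale each channel of fmap by a factor predicted from the encoded vector.

    fmap: (C, H, W) feature maps; e: (C_e,) encoded semantics; W: (C, C_e) FC weights.
    Returns the rescaled feature maps and the per-channel factors gamma.
    """
    gamma = sigmoid(W @ e)                      # one factor in (0, 1) per channel
    return fmap * gamma[:, None, None], gamma   # broadcast over H and W

fmap = np.ones((3, 2, 2))
e = np.array([1.0, -1.0])
# Hypothetical weights chosen so the three channels behave differently.
W = np.array([[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]])
Y, gamma = feature_map_attention(fmap, e, W)
# gamma is close to [1, 0, 0.5]: channel 0 is kept, channel 1 is suppressed,
# and channel 2 is halved.
```

Because γ depends only on the encoded vector and not on spatial position, every pixel of a given channel is scaled by the same factor.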

As an intuitive example of the utility of the proposed approach, consider emphasizing the probability of an airplane in a sky scene, but de-emphasizing that of a vehicle.

2. Semantic Encoding Losses

Dilation strategy and losses
  • The network may have difficulty understanding context without global information.

Semantic Encoding Loss (SE-loss) forces the network to understand the global semantic information with very small extra computation cost.

  • An additional fully connected layer with a sigmoid activation function is built on top of the Encoding Layer to make individual predictions for the presence of object categories in the scene, and is trained with a binary cross entropy loss.
  • Unlike per-pixel loss, SE-loss considers big and small objects equally. Thus, the segmentation of small objects is often improved.
  • The SE-losses are added to both stage 3 and 4 of the base network.
  • The ground truths of SE-loss are directly generated from the ground-truth segmentation mask without any additional annotations.
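
Generating SE-loss targets from a segmentation mask, as described above, can be sketched as follows. The function names `se_targets` and `bce`, and the `ignore_index=255` convention for unlabeled pixels, are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def se_targets(mask, num_classes, ignore_index=255):
    """Build a 0/1 class-presence vector from a (H, W) integer label mask."""
    present = np.unique(mask)
    # Assumption: ignore_index marks unlabeled pixels and is not a class.
    present = present[(present != ignore_index) & (present >= 0) & (present < num_classes)]
    target = np.zeros(num_classes, dtype=np.float64)
    target[present] = 1.0
    return target

def bce(probs, target, eps=1e-7):
    """Mean binary cross entropy between predicted probabilities and 0/1 targets."""
    p = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))

# Class 0 covers two pixels and class 2 only one, yet both get target 1.
mask = np.array([[0, 0], [2, 255]])
target = se_targets(mask, num_classes=4)
print(target.tolist())                   # [1.0, 0.0, 1.0, 0.0]
loss = bce(np.full(4, 0.5), target)      # an uninformative prediction gives ln 2
```

The toy mask shows why small objects benefit: a class that occupies a single pixel contributes to the target exactly as much as a class that fills most of the image.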

3. Semantic Segmentation Results

3.1. Ablation Study on PASCAL-Context

Ablation Study on PASCAL-Context dataset
  • Compared with the baseline FCN, simply adding a Context Encoding Module on top yields results of 78.1/47.6 (pixAcc/mIoU).
The effect of weights of SE-loss α
  • To study the effect of SE-loss, different SE-loss weights α = {0.0, 0.1, 0.2, 0.4, 0.8} are tested, and α = 0.2 yields the best performance.
The effect of number of codewords K
  • Also, beyond K = 32 codewords, the improvement saturates (K = 0 means using global average pooling instead).
  • A deeper pre-trained network provides better feature representations: EncNet gains an additional 2.5% in mIoU when employing ResNet-101.
  • Finally, multi-size evaluation yields the final scores of 81.2% pixAcc and 52.6% mIoU, which is 51.7% mIoU when the background is included.
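
For readers unfamiliar with the two metrics reported above, here is a minimal single-image sketch of pixel accuracy (pixAcc) and mean intersection-over-union (mIoU). The function name and the `ignore_index=255` convention are assumptions; benchmark toolkits typically accumulate intersections and unions over the whole dataset rather than averaging per image.

```python
import numpy as np

def pix_acc_and_miou(pred, gt, num_classes, ignore_index=255):
    """Pixel accuracy and mean IoU for one pair of (H, W) integer label maps."""
    valid = gt != ignore_index                  # skip unlabeled pixels
    pix_acc = float((pred[valid] == gt[valid]).mean())
    ious = []
    for c in range(num_classes):
        p = (pred == c) & valid
        g = (gt == c) & valid
        union = (p | g).sum()
        if union == 0:
            continue                            # class absent from both maps
        ious.append((p & g).sum() / union)
    return pix_acc, float(np.mean(ious))

pred = np.array([[0, 1], [1, 1]])
gt = np.array([[0, 1], [0, 1]])
acc, miou = pix_acc_and_miou(pred, gt, num_classes=2)
# acc is 3/4; the IoU is 1/2 for class 0 and 2/3 for class 1, so mIoU is 7/12.
```

pixAcc is dominated by large regions, while mIoU weights every class equally, which is why the two numbers can move differently across ablation settings.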

3.2. Results on PASCAL-Context

Segmentation results on PASCAL-Context dataset
  • The proposed EncNet outperforms previous state-of-the-art approaches without using COCO pre-training or a deeper model (ResNet-152).

3.3. Results on PASCAL VOC 2012

Results on PASCAL VOC 2012 testing set

3.4. Results on ADE20K

Segmentation results on ADE20K validation set
  • EncNet-101 achieves results comparable to the state-of-the-art PSPNet-269 while using a much shallower base network.
Result on ADE20K test set
  • EncNet achieves a final score of 0.55675, which surpasses PSPNet-269 (1st place in 2016) and all entries in COCO Place Challenge 2017.

4. Image Classification Results

Comparison of model depth, number of parameters (M), test errors (%) on CIFAR-10
  • The Context Encoding Module can also be plugged into an image classification network.
  • A shallow 14-layer ResNet is used as the baseline.
  • The SE module from SENet is added on top of each Resblock.
  • Similarly, the proposed Context Encoding Module can also be added on top of each Resblock.
  • A shallow 14-layer network with the Context Encoding Module achieves a 3.45% error rate on the CIFAR-10 dataset, as shown in the above table, which is comparable to state-of-the-art approaches such as WRN, ResNeXt, and DenseNet.
