Review: ParseNet — Looking Wider to See Better (Semantic Segmentation)

In this story, ParseNet is briefly reviewed. ParseNet adds global context to fully convolutional networks, which improves segmentation accuracy. This is a 2016 ICLR paper with more than 200 citations when I was writing this story. (Sik-Ho Tsang @ Medium)

Example of ParseNet

By using ParseNet, the cat in the above image is no longer wrongly classified as bird, dog, or sheep. The global context helps to classify the local patches correctly. Let’s see how it works.

What Are Covered

  1. ParseNet Module
  2. Results

1. ParseNet Module

ParseNet Module

ParseNet is simple, as shown in the figure above.

Normalization Using L2 Norm for Each Channel

In the lower path, at a certain conv layer, normalization using the L2 norm is performed for each channel.

In the upper path, at that same conv layer, we perform global average pooling of the feature maps, then normalize the pooled vector using the L2 norm. Unpooling simply replicates the values of the globally average-pooled vector to the same spatial size as the lower path, so that the two can be concatenated.
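The two paths above can be sketched in a few lines of NumPy. This is a minimal illustration, not the original Caffe implementation; here the L2 norm is taken across channels at each spatial position (as in the paper), and "unpooling" is just broadcasting the pooled vector over the spatial grid:

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """L2-normalize the channel vector at each spatial position.
    x: feature map of shape (C, H, W)."""
    norm = np.sqrt((x ** 2).sum(axis=0, keepdims=True)) + eps
    return x / norm

def parsenet_module(x):
    """Sketch of the ParseNet module for one layer's feature map x of shape (C, H, W)."""
    C, H, W = x.shape
    # Lower path: L2-normalize the local features.
    local = l2_normalize(x)
    # Upper path: global average pooling -> one (C,) context vector.
    g = x.mean(axis=(1, 2))
    # L2-normalize the pooled vector, then "unpool" by tiling it back to H x W.
    g = g / (np.linalg.norm(g) + 1e-12)
    context = np.broadcast_to(g[:, None, None], (C, H, W)).copy()
    # Concatenate local and global features along the channel axis -> (2C, H, W).
    return np.concatenate([local, context], axis=0)
```

Every spatial position in the output thus sees the same global context channels alongside its local features.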

Features are in different scale at different layers

The reason for the L2 norm is that the earlier layers usually have much larger feature values than the later layers.

The above example shows that the features at different layers have different scales of values. After normalization, all features have the same value range, and they are all concatenated together.

A learnable scaling factor γ for each channel is also introduced after normalization, so the network can recover an appropriate magnitude for each feature.
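A minimal NumPy sketch of this scaled normalization, assuming the across-channel L2 norm from above; `gamma` would be a learned parameter in practice (typically initialized to a constant, e.g. 10 or 20, and updated by backprop):

```python
import numpy as np

def scaled_l2_norm(x, gamma, eps=1e-12):
    """L2-normalize across channels at each spatial position,
    then scale channel i by the learnable factor gamma[i].
    x: feature map of shape (C, H, W); gamma: shape (C,)."""
    norm = np.sqrt((x ** 2).sum(axis=0, keepdims=True)) + eps
    return gamma[:, None, None] * (x / norm)
```

After this step, the channel vector at each position has norm `gamma` (when gamma is constant across channels), rather than the fixed unit norm.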

2. Results

2.1. SiftFlow

SiftFlow Dataset

By adding and normalizing pool6 + fc7 + conv5 + conv4 with the ParseNet module on top of FCN-32s, 40.4% mean IOU is obtained, which is better than FCN-16s.

2.2. PASCAL Context

PASCAL Context Dataset

By adding and normalizing pool6 + fc7 + conv5 + conv4 + conv3 with the ParseNet module on top of FCN-32s, 40.4% mean IOU is obtained, which is better than FCN-8s.

We can also see that, without normalization, the ParseNet module does not work well.

2.3. PASCAL VOC 2012

PASCAL VOC 2012 Dataset
  • ParseNet Baseline: It is DeepLabv1 without CRF, 67.3%
  • ParseNet: ParseNet Baseline with ParseNet module, 69.8%
  • DeepLab-CRF-LargeFOV: DeepLabv1, 70.3%

Though ParseNet performs slightly worse than DeepLab-CRF-LargeFOV, it is still competitive, and it is an end-to-end learning framework, whereas the CRF in DeepLabv1 is a post-processing step, which makes DeepLabv1 not end-to-end.

Some Qualitative Results
Failure Cases

We can see that the failure cases above tend to happen when the object consists of more than one color, or is occluded.

Though ParseNet has lower performance than DeepLabv1 (with CRF), its idea of pooling image-level features for global context is later adopted in DeepLabv3 and DeepLabv3+.


My Related Reviews

[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2]