Review: DeconvNet — Unpooling Layer (Semantic Segmentation)
In this story, DeconvNet is briefly reviewed. The deconvolution network (DeconvNet) is composed of deconvolution and unpooling layers.
In the conventional FCN, the output is obtained by high-ratio (32×, 16× and 8×) upsampling, which can induce a rough segmentation output (label map). In DeconvNet, the output label map is instead obtained by gradual deconvolution and unpooling. It is a 2015 ICCV paper with more than 1000 citations at the time of writing this story. (Sik-Ho Tsang @ Medium)
What Are Covered
- Unpooling and Deconvolution
- Instance-wise Segmentation
- Two-Stage Training
- Results
1. Unpooling and Deconvolution
The following is the overall architecture of DeconvNet:
As we can see, it uses VGGNet as the backbone. The first part is a convolution network with conv and pooling layers, as in FCN. The second part is the deconvolution network, which is the novel part of this paper.
To perform unpooling, the position of each maximum activation value is remembered during max pooling, as shown above. These remembered positions (the "switches") are then used to place activations back at their original locations during unpooling.
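The switch mechanism can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: a 2×2 max pooling that records where each maximum came from, and an unpooling step that scatters the pooled values back to those positions, leaving zeros elsewhere (a sparse, enlarged activation map).

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """k x k max pooling that also records the flat index ("switch")
    of each maximum in the input map."""
    h, w = x.shape
    out = np.zeros((h // k, w // k))
    switches = np.zeros((h // k, w // k), dtype=int)
    for i in range(h // k):
        for j in range(w // k):
            win = x[i*k:(i+1)*k, j*k:(j+1)*k]
            idx = np.argmax(win)
            out[i, j] = win.flat[idx]
            # remember where the max sat in the original map
            r, c = divmod(idx, k)
            switches[i, j] = (i*k + r) * w + (j*k + c)
    return out, switches

def unpool(pooled, switches, shape):
    """Place each pooled value back at its remembered position;
    every other location stays zero."""
    out = np.zeros(shape)
    out.flat[switches.ravel()] = pooled.ravel()
    return out

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 1., 3., 2.],
              [2., 0., 1., 4.]])
p, s = max_pool_with_switches(x)
u = unpool(p, s, x.shape)
```

Note that, unlike bilinear upsampling, the unpooled map is sparse: the deconv layers that follow are what densify it again.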
Deconvolution (transposed convolution) convolves the input back to a larger size. (If interested, please read my FCN review for details.)
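A 1-D sketch makes the "conv back to larger size" idea concrete (this is generic transposed convolution, not code from the paper): each input value scatters a scaled copy of the kernel into the output, overlapping contributions are summed, and a stride greater than 1 enlarges the signal.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride=2):
    """Transposed convolution ("deconvolution"): each input value
    scatters a scaled copy of the kernel into the output; overlaps
    are summed, so a stride-2 call roughly doubles the length."""
    kernel = np.asarray(kernel, dtype=float)
    k = len(kernel)
    out = np.zeros(stride * (len(x) - 1) + k)
    for i, v in enumerate(x):
        out[i*stride : i*stride + k] += v * kernel
    return out

y = transposed_conv1d(np.array([1., 2., 3.]), [1., 2., 1.], stride=2)
```

This is why transposed conv can both upsample and densify the sparse maps produced by unpooling.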
The above figure is an example: (b) is the output of the 14×14 deconv layer, (c) is the output after unpooling, and so on. In (j), the bicycle is reconstructed at the last 224×224 deconv layer, which shows that the learned filters capture class-specific shape information.
The other examples shown above demonstrate that DeconvNet can reconstruct shapes more faithfully than FCN-8s.
2. Instance-wise Segmentation
As shown above, an object that is substantially larger or smaller than the receptive field may be fragmented or mislabeled. Small objects are often ignored and classified as background.
Semantic segmentation is posed as an instance-wise segmentation problem. First, the top 50 of 2000 region proposals (bounding boxes) are detected by an object detection approach, EdgeBoxes. DeconvNet is then applied to each proposal, and the outputs of all proposals are aggregated back onto the original image. By using proposals, various object scales can be handled effectively.
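The aggregation step can be sketched as a pixel-wise maximum over the per-proposal score maps — a simplified stand-in for the paper's aggregation, with a hypothetical `(x, y, score_map)` layout for proposals and the assumption that each proposal fits inside the image:

```python
import numpy as np

def aggregate_proposals(image_shape, proposals):
    """Combine per-proposal class score maps into one full-image map
    by taking a pixel-wise maximum.  `proposals` is a list of
    (x0, y0, score_map) tuples: top-left corner plus the network's
    output for that cropped region (hypothetical layout)."""
    full = np.zeros(image_shape)
    for x0, y0, score in proposals:
        h, w = score.shape
        region = full[y0:y0+h, x0:x0+w]   # view into the full map
        np.maximum(region, score, out=region)
    return full

full = aggregate_proposals(
    (3, 3),
    [(0, 0, np.full((2, 2), 0.5)),   # proposal covering top-left
     (1, 1, np.full((2, 2), 0.8))])  # overlapping proposal
```

Where proposals overlap, the stronger response wins, so confident small-object detections are not washed out by weaker full-image context.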
3. Two-Stage Training
First-Stage Training
Object instances are cropped using the ground-truth annotations so that each object is centered in the cropped bounding box, and training is performed on these crops. This helps to reduce the variation in object location and size.
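A rough sketch of this centered cropping, assuming `(x0, y0, x1, y1)` boxes; the margin factor here is an illustrative assumption, not a value from the paper:

```python
def centered_crop(box, margin=1.2):
    """Expand a ground-truth box (x0, y0, x1, y1) by a margin while
    keeping the object centered.  margin=1.2 is a hypothetical
    choice for illustration, not taken from the paper."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2   # box center stays fixed
    hw = (x1 - x0) * margin / 2
    hh = (y1 - y0) * margin / 2
    return (cx - hw, cy - hh, cx + hw, cy + hh)

crop = centered_crop((0, 0, 10, 10))
```

Because every training crop has the object at its center, the network can first learn shape without also fighting location variance.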
Second-Stage Training
More challenging examples are used. These examples are generated by cropping the proposals that overlap the ground-truth segmentation.
Some Other Details
- Batch Normalization is used.
- The conv part is initialized using the weights in VGGNet.
- The deconv part is initialized with zero-mean Gaussians.
- 64 samples per batch.
4. Results
- FCN-8s: only 64.4% mean IoU.
- DeconvNet: 69.6%
- DeconvNet+CRF: 70.5% (where CRF is just a post-processing step)
- EDeconvNet: 71.5% (EDeconvNet means the results ensembled with FCN-8s)
- EDeconvNet+CRF: 72.5%, the highest mean IoU.
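For reference, the mean IoU behind these numbers can be sketched as below. This is a simplified per-image version; the official PASCAL VOC benchmark accumulates intersections and unions over the whole dataset before dividing:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes (simplified,
    per-image; classes absent from both maps are skipped)."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)

pred = np.array([0, 0, 1, 1])
gt   = np.array([0, 1, 1, 1])
miou = mean_iou(pred, gt, num_classes=2)
```

So the roughly 8-point jump from FCN-8s (64.4%) to EDeconvNet+CRF (72.5%) is measured on exactly this kind of per-class overlap average.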
From the above figure, instance-wise segmentation lets the segmentation be built up gradually, instance by instance, rather than segmenting all instances at once.
It should be noted that the gain of DeconvNet does not come only from the gradual deconv and unpooling, but perhaps also from the instance-wise segmentation and two-stage training.
EDeconvNet+CRF usually produces good results, even in cases where it is worse than FCN.