Team Members: Ahmet Tarık KAYA, Ayça Meriç ÇELİK, Kaan MERSİN

Kaan Mersin
bbm406f18
Dec 16, 2018


Last week, we talked about the Pyramid Scene Parsing Network (PSPNet), a very successful model for the scene parsing challenge.

This week, we did detailed research on PSPNet and tried to gain a strong understanding of it. We also searched for alternative implementations other than the official one.

An Overview of PSPNet


Image 1: ResNet-101

Image 1: Fully convolutional networks with average pooling are considered state-of-the-art image classifiers, while spatial pyramid matching is the classical method for scene understanding. PSPNet combines fully convolutional networks with spatial pyramid pooling for the sake of better scene recognition.

Pixel-level prediction tasks like scene parsing and semantic segmentation have made great progress since the fully-connected layer of classification networks was replaced with convolution layers. To enlarge the receptive field of neural networks, several methods use dilated convolution. The baseline of PSPNet is a Fully Convolutional Network (FCN) with dilated convolutions.
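
To make this concrete, here is a minimal PyTorch sketch (the framework is our choice for illustration; the official code is in Caffe) showing that a dilated 3×3 convolution enlarges the receptive field without shrinking the feature map:

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation=2 covers a 5x5 receptive field
# with the same number of weights; setting padding=dilation keeps
# the spatial size unchanged, as in dilated FCN backbones.
conv_standard = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
conv_dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 64, 60, 60)
print(conv_standard(x).shape)  # torch.Size([1, 64, 60, 60])
print(conv_dilated(x).shape)   # torch.Size([1, 64, 60, 60])
```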

Other work mainly proceeds in two directions. One line is multi-scale feature ensembling. The other is based on structure prediction; the pioneering work used conditional random fields (CRFs) as post-processing to refine segmentation results.

There are three common issues in complex-scene parsing:

  1. Mismatched Relationship
  2. Confusion Categories
  3. Inconspicuous Classes

They concluded that many errors are partially or completely related to contextual relationships and global information for different receptive fields. Thus, a deep network with a suitable global scene-level prior can greatly improve scene parsing performance.

Image 2: Bathroom Parsing
Image 3: Hallway Parsing
Image 4: Bedroom Parsing

Pyramid Pooling Module

Based on the above analysis, we now introduce the pyramid pooling module, which has empirically proven to be an effective global contextual prior.

Global average pooling is a good baseline as a global contextual prior, and it is commonly used in image classification tasks. But for the complex-scene images in ADE20K, this strategy alone does not cover the necessary information. Pixels in these scenes are annotated with many stuff and object categories, and directly fusing them into a single vector may lose spatial relations and cause ambiguity. Global context information together with sub-region context helps to distinguish among the various categories. A more powerful representation fuses information from different sub-regions with different receptive fields; a similar conclusion was drawn in classical work on scene/image classification.
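
A small sketch of this point (our illustration, not code from the paper): pooling the backbone features into a single global bin discards where things are in the scene, while pooling into a few sub-regions keeps a coarse spatial layout:

```python
import torch
import torch.nn.functional as F

feat = torch.randn(1, 512, 60, 60)  # feature map from the backbone

# Global average pooling: one 512-d vector for the whole image;
# all spatial relations between objects are collapsed away.
global_prior = F.adaptive_avg_pool2d(feat, output_size=1)     # (1, 512, 1, 1)

# Pooling into a 2x2 grid keeps a coarse notion of location
# (e.g. top-left vs. bottom-right sub-regions).
subregion_prior = F.adaptive_avg_pool2d(feat, output_size=2)  # (1, 512, 2, 2)
```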

Image 5: Overview of PSPNet

Image 5: Given an input image (a), we first use a CNN to get the feature map of the last convolutional layer (b). A pyramid parsing module is then applied to harvest different sub-region representations, followed by upsampling and concatenation layers to form the final feature representation, which carries both local and global context information (c). Finally, the representation is fed into a convolution layer to get the final per-pixel prediction (d).
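
The caption maps directly onto a forward pass. Below is a hypothetical skeleton (class and attribute names are ours, not from the official code), with the pyramid module itself sketched further below:

```python
import torch.nn as nn
import torch.nn.functional as F

class PSPNetSketch(nn.Module):
    """Hypothetical skeleton following stages (a)-(d) of Image 5."""
    def __init__(self, backbone, pyramid_pooling, classifier):
        super().__init__()
        self.backbone = backbone                # (a) -> (b): CNN feature map
        self.pyramid_pooling = pyramid_pooling  # (b) -> (c): pyramid module
        self.classifier = classifier            # (c) -> (d): per-pixel scores

    def forward(self, image):
        feat = self.backbone(image)
        feat = self.pyramid_pooling(feat)
        logits = self.classifier(feat)
        # Upsample the class scores back to the input resolution.
        return F.interpolate(logits, size=image.shape[2:],
                             mode="bilinear", align_corners=False)
```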

The pyramid pooling module fuses features under four different pyramid scales. The coarsest level, highlighted in red, is global pooling that generates a single-bin output. The following pyramid levels separate the feature map into different sub-regions and form pooled representations for different locations. The outputs of the different levels contain feature maps of varied sizes. To maintain the weight of the global feature, a 1×1 convolution layer is used after each pyramid level to reduce the dimension of the context representation to 1/N of the original, where N is the number of pyramid levels. The low-dimension feature maps are then directly upsampled via bilinear interpolation to the same size as the original feature map. Finally, the features of the different levels are concatenated as the final pyramid pooling global feature.
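
Putting the description above into code, here is a minimal PyTorch sketch of the module (bin sizes 1, 2, 3 and 6 as in the paper; layer details such as batch normalization are simplifying assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_channels, bin_sizes=(1, 2, 3, 6)):
        super().__init__()
        # 1x1 convolutions reduce each level to in_channels / N channels,
        # where N is the number of pyramid levels.
        out_channels = in_channels // len(bin_sizes)
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(bin_size),  # pool to bin_size x bin_size
                nn.Conv2d(in_channels, out_channels, 1, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
            for bin_size in bin_sizes
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        # Upsample each pooled level back to the input size via bilinear
        # interpolation, then concatenate with the original feature map.
        priors = [
            F.interpolate(stage(x), size=(h, w),
                          mode="bilinear", align_corners=False)
            for stage in self.stages
        ]
        return torch.cat([x] + priors, dim=1)

# A 2048-channel ResNet feature map becomes 2048 + 4*512 = 4096 channels.
module = PyramidPooling(2048)
y = module(torch.randn(1, 2048, 60, 60))
print(y.shape)  # torch.Size([1, 4096, 60, 60])
```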

Deep Supervision for ResNet-Based FCN

They instead propose generating initial results by supervision with an additional loss, and learning the residue afterwards with the final loss. Optimization of the deep network is thus decomposed into two parts, each of which is simpler to solve.
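
In code, this deep supervision amounts to a weighted sum of two cross-entropy losses (a sketch with names of our choosing; the paper sets the auxiliary weight to 0.4 and discards the auxiliary branch at test time):

```python
import torch.nn.functional as F

def pspnet_loss(main_logits, aux_logits, target, aux_weight=0.4):
    # The main loss supervises the final prediction; the auxiliary loss
    # supervises an intermediate ResNet stage so gradients reach earlier
    # layers directly. Label 255 marks unlabeled pixels (a common
    # convention; assumption on our part).
    main_loss = F.cross_entropy(main_logits, target, ignore_index=255)
    aux_loss = F.cross_entropy(aux_logits, target, ignore_index=255)
    return main_loss + aux_weight * aux_loss
```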

Table 1

Table 1: A better and deeper pre-trained model improves the result consistently.

Table 2

Table 2: PSPNet performance on the ADE20K validation set. The number in brackets refers to the depth of the pre-trained ResNet, and ‘MS’ denotes multi-scale testing. In the table, Fully Convolutional Networks (FCN) with average pooling represent the state-of-the-art image classification method, and Spatial Pyramid Matching the classical scene understanding method.

Other Implementations

PSPNet is officially built on the Caffe framework. However, unofficial implementations are also available.

Image 6: Keras and TensorFlow
  • We found a Keras implementation, which gives remarkable results as well.
  • We also found a couple of TensorFlow implementations.
  • Lastly, there are several PyTorch implementations as well.

Our Implementation

First, we are going to review the pre-trained models for these implementations. Then we will train our models on these different implementations using a subset of the ADE20K dataset that contains only indoor images of houses. Finally, we will analyze the results of each and compare their performances.

That’s all from us for this week. Stay tuned for updates!
