Image Segmentation — Semantic Segmentation(1)

Published in 謦伊的閱讀筆記, Feb 21, 2022

This article is the English version of the reference: 影像分割 Image Segmentation — 語義分割 Semantic Segmentation(1)

Image Classification, Object Detection and Image Segmentation are among the most important deep learning tasks in Computer Vision. Image Segmentation performs detection and classification at the pixel level, and can be applied to tasks such as beauty makeup, portrait photography, autonomous driving, biomedicine, animal husbandry, and so forth.

There are three types of Image Segmentation: Semantic Segmentation, Instance Segmentation and Panoptic Segmentation.

Semantic Segmentation: Every pixel in the image is assigned a class.
Instance Segmentation: A combination of object detection and semantic segmentation, which makes it relatively difficult. The pixels of interest are classified and the location of each object is marked with a bounding box. Note that objects of the same category are separated into different instances. The difference between semantic segmentation and instance segmentation can be clearly seen in the figure.
Panoptic Segmentation: A combination of semantic segmentation and instance segmentation. Every pixel is assigned a class and separated into individual objects, and the background is taken into account as well.
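
To make the three kinds of output concrete, here is a minimal sketch on a toy label map. The image content (two cats side by side on grass), the class ids and the variable names are all made up for illustration; real datasets use their own encodings.

```python
# Toy 2x6 image: two cats standing side by side on a grass background.
# Hypothetical class ids: 0 = grass (background), 1 = cat.

# First task: one class id per pixel. Both cats share the same label.
semantic = [
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
]

# Second task: one object id per pixel of interest (0 = not an object).
# The two cats belong to the same class but get different ids.
instance = [
    [0, 1, 1, 2, 2, 0],
    [0, 1, 1, 2, 2, 0],
]

# Third task: every pixel, background included, carries a
# (class id, object id) pair that combines the two maps above.
per_pixel_pairs = [
    [(s, i) for s, i in zip(sem_row, inst_row)]
    for sem_row, inst_row in zip(semantic, instance)
]


def distinct_labels(label_map):
    """Collect the set of labels that appear anywhere in a 2D label map."""
    return {label for row in label_map for label in row}


print(sorted(distinct_labels(semantic)))         # [0, 1]
print(sorted(distinct_labels(instance)))         # [0, 1, 2]
print(sorted(distinct_labels(per_pixel_pairs)))  # [(0, 0), (1, 1), (1, 2)]
```

The semantic map cannot tell the two cats apart, the instance map can, and the combined map additionally keeps a label for the background pixels.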

Next, I will introduce the algorithms, papers and code for semantic segmentation.

Semantic segmentation

The representative algorithms fall into the following categories, and some of the better-known models are introduced below.

Based on Convolutional Neural Network (CNN): FCN, DeconvNet, U-Net, SegNet, DeepLab, RefineNet, PSPNet, GSCNN, etc.
Based on Recurrent Neural Network (RNN): ReNet, ReSeg, etc.
Based on Generative Adversarial Network (GAN): pix2pix, Probabilistic U-Net, etc.
Based on Transformer: HRNet, OCRNet, HRNet-OCR, Point Transformer, SETR, etc.

Fully Convolutional Networks (FCN, 2014)

FCN is one of the first fully convolutional models developed for image segmentation, and it lays an important foundation for semantic segmentation tasks.

For details, please refer to this article: Understanding and implementing a fully convolutional network (FCN)

DeconvNet (2015)

🔖 Github: https://github.com/HyeonwooNoh/DeconvNet

DeconvNet improves on FCN. The model is composed of a convolution network and a deconvolution network (also called an Encoder-Decoder structure). As in FCN, the convolution network consists of convolutional layers and pooling operations. The deconvolution network is a mirrored version of the convolution network and consists of deconvolution layers and unpooling operations, with 1x1 convolutional layers between the two networks.

To recover the details that FCN handles poorly, the position of each maximum value is saved when a max pooling operation is performed, so that the unpooling operation can return the value to its original position. The remaining non-maximum positions are filled with 0, and the upsampling is then completed by the deconvolution layers.
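
The save-and-restore idea behind max pooling and unpooling can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: a single-channel feature map stored as nested lists, a 2x2 window with stride 2, and even height and width. The function names are mine and do not come from the DeconvNet code.

```python
def max_pool_2x2_with_indices(fmap):
    """2x2 max pooling with stride 2 that also records where each max came from."""
    pooled, indices = [], []
    for r in range(0, len(fmap), 2):
        pooled_row, index_row = [], []
        for c in range(0, len(fmap[0]), 2):
            # Gather (value, position) for the four cells of the window.
            window = [(fmap[r + dr][c + dc], (r + dr, c + dc))
                      for dr in range(2) for dc in range(2)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def unpool(pooled, indices, height, width):
    """Put each pooled value back at its saved position; every other cell is 0."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 1],
]
pooled, indices = max_pool_2x2_with_indices(fmap)
restored = unpool(pooled, indices, 4, 4)

print(pooled)    # [[4, 5], [6, 7]]
for row in restored:
    print(row)
# [0, 0, 0, 0]
# [4, 0, 0, 5]
# [0, 0, 7, 0]
# [6, 0, 0, 0]
```

The restored map is sparse, which is why learned layers are applied afterwards to fill in the zeros.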

U-Net (2015)

🔖 Keras github: https://github.com/zhixuhao/unet

U-Net is based on the Encoder-Decoder structure and was developed for biomedical image segmentation tasks. The Encoder (also called the contracting path) is responsible for extracting features. The Decoder (also called the expansive path) performs the upsampling, and together the two paths give the network its U shape. In addition, each upsampled feature map is concatenated with the corresponding feature map from the contracting path. The feature map from the contracting path has to be cropped first because it is larger.
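
The crop-and-concatenate step can be sketched as follows. This is a toy version under my own assumptions: a feature map is a list of 2D channels (nested lists), the crop is taken from the center, and the helper names `center_crop` and `crop_and_concat` are hypothetical. It does not reproduce the Keras implementation linked above.

```python
def center_crop(channel, target_h, target_w):
    """Cut a target_h x target_w window out of the middle of one 2D channel."""
    top = (len(channel) - target_h) // 2
    left = (len(channel[0]) - target_w) // 2
    return [row[left:left + target_w] for row in channel[top:top + target_h]]


def crop_and_concat(encoder_maps, decoder_maps):
    """Crop the encoder channels to the decoder size, then stack all channels."""
    target_h, target_w = len(decoder_maps[0]), len(decoder_maps[0][0])
    cropped = [center_crop(ch, target_h, target_w) for ch in encoder_maps]
    return cropped + decoder_maps  # concatenation along the channel axis


# One 6x6 channel from the contracting path, one 4x4 channel after upsampling.
encoder_maps = [[[r * 6 + c for c in range(6)] for r in range(6)]]
decoder_maps = [[[0] * 4 for _ in range(4)]]

merged = crop_and_concat(encoder_maps, decoder_maps)
print(len(merged), len(merged[0]), len(merged[0][0]))  # 2 4 4
print(merged[0][0])                                    # [7, 8, 9, 10]
```

The merged map has the spatial size of the decoder map and the channels of both inputs, which the following convolutional layers then mix.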

SegNet (2015)

🔖 Github: https://github.com/alexgkendall/SegNet-Tutorial

The structure of SegNet is similar to that of DeconvNet. The difference is that the middle 1x1 convolutional layer is removed to reduce memory usage and improve inference speed. SegNet is mainly used for scene understanding.

The following figure shows the model architecture of SegNet. The Encoder is responsible for extracting features, and the Decoder upsamples the feature map.

📚 The difference between SegNet and U-Net

In SegNet, upsampling is performed in the decoder by unpooling with the saved max pooling positions, followed by convolutional layers. In U-Net, upsampling is performed by deconvolution, followed by a concatenation with the correspondingly cropped feature map from the contracting path and two 3x3 convolutional layers.

📚 The difference between SegNet and FCN

The max pooling operation in SegNet saves the position of each maximum value. The unpooling operation then returns each value to its original position and fills the remaining non-maximum positions with 0.

In FCN, the feature map upsampled by deconvolution is fused with another feature map of the same size (by element-wise addition in the FCN paper), and the result is then upsampled again.

DeepLab v1 to v3, DeepLab v3+

🔖 Github: https://github.com/tensorflow/models/tree/master/research/deeplab

For details, please refer to this article: Witnessing the Progression in Semantic Segmentation: DeepLab Series from V1 to V3+

RefineNet (CVPR 2017)

🔖 Github: https://github.com/guosheng/refinenet

Earlier semantic segmentation methods mainly rely on deconvolution networks and Atrous Convolution, but both have disadvantages.

Deconvolution networks cannot recover low-level features, so fine details are predicted poorly.
Atrous Convolution requires a large amount of GPU memory and computation.

The authors argue that features from all levels are useful, so they propose RefineNet, a multi-stage network architecture. Its purpose is to merge the features from all levels to generate high-resolution predictions.

The network architecture is based on ResNet, which is divided into four parts according to the resolution of the feature maps: 1/4, 1/8, 1/16 and 1/32 of the original image. These feature maps are then fed into the corresponding RefineNet blocks for fusion and refinement. Note that RefineNet-4 has only one input, while the other blocks have two inputs. Furthermore, the parameters of the four RefineNet blocks are not shared.
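
The wiring of the four blocks can be written down as a small table of inputs. The sketch below assumes a square input image and follows my reading of the paper that the second input of each block is the output of the previous, coarser block. The "ResNet stage" labels and the function name are hypothetical.

```python
# Downsampling factor of the ResNet feature map fed to each RefineNet block.
DOWNSAMPLING = {1: 4, 2: 8, 3: 16, 4: 32}


def cascade_plan(image_size):
    """Return {block name: [(input source, spatial size), ...]} for a square image."""
    plan = {}
    for block in (4, 3, 2, 1):  # the cascade starts from the coarsest block
        inputs = [(f"ResNet stage {block}", image_size // DOWNSAMPLING[block])]
        if block < 4:
            # Assumption: the second input is the output of the coarser block,
            # which has the resolution of that block's own ResNet input.
            inputs.append((f"RefineNet-{block + 1} output",
                           image_size // DOWNSAMPLING[block + 1]))
        plan[f"RefineNet-{block}"] = inputs
    return plan


plan = cascade_plan(512)
for name, inputs in plan.items():
    print(name, inputs)
# RefineNet-4 [('ResNet stage 4', 16)]
# RefineNet-3 [('ResNet stage 3', 32), ('RefineNet-4 output', 16)]
# RefineNet-2 [('ResNet stage 2', 64), ('RefineNet-3 output', 32)]
# RefineNet-1 [('ResNet stage 1', 128), ('RefineNet-2 output', 64)]
```

For a 512x512 image the final block therefore works at 128x128, a quarter of the input resolution.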

Each RefineNet block consists of Residual convolution units (RCU), Multi-resolution fusion, Chained residual pooling and Output convolutions.

Residual convolution unit (RCU): A simplified version of the ResNet block. It is responsible for extracting features.
Multi-resolution fusion: Every block except RefineNet-4 has two inputs, so this unit fuses the two inputs (features) of different resolutions. Each input passes through a 3x3 convolutional layer, then the lower-resolution feature map is upsampled to the size of the larger one and the two are added together.
Chained residual pooling: This unit captures background context from a large image region. After a ReLU, the feature map is split into two branches: one branch passes through several pooling operations and convolutional layers, and its result is then added to the other branch.
Output convolutions: This unit applies an RCU before the output.
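
The upsample-and-add step of Multi-resolution fusion can be sketched as below. Several things are simplified and should be read as assumptions: the 3x3 convolutional layers are left out, nearest-neighbour upsampling stands in for whatever interpolation the original code uses, and each feature map is a single-channel nested list whose size is an exact multiple of the smaller one.

```python
def upsample_nearest(fmap, factor):
    """Repeat every value `factor` times along both axes."""
    return [[value for value in row for _ in range(factor)]
            for row in fmap for _ in range(factor)]


def fuse(high_res, low_res):
    """Upsample the smaller map to the larger size and add them element-wise."""
    factor = len(high_res) // len(low_res)
    upsampled = upsample_nearest(low_res, factor)
    return [[a + b for a, b in zip(row_h, row_u)]
            for row_h, row_u in zip(high_res, upsampled)]


high_res = [[1] * 4 for _ in range(4)]  # 4x4 map from the finer input
low_res = [[10, 20],
           [30, 40]]                    # 2x2 map from the coarser input

fused = fuse(high_res, low_res)
for row in fused:
    print(row)
# [11, 11, 21, 21]
# [11, 11, 21, 21]
# [31, 31, 41, 41]
# [31, 31, 41, 41]
```

Because the fusion is an addition, the two inputs must have the same number of channels after their convolutions, which is one job of the convolutional layers omitted here.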

PSPNet (CVPR 2017)

🔖 Github: https://github.com/hszhao/PSPNet

PSPNet is a pyramid-structured network built on the fusion of global context information. It won first place in the scene parsing task of the ImageNet competition in 2016 and was published at CVPR 2017.

The backbone is a pretrained ResNet with dilated convolutions, which enlarge the receptive field. The features are then processed by the Pyramid Pooling Module and passed through a convolutional layer to obtain the final result.

The Pyramid Pooling Module applies average pooling at several output sizes, such as 1x1, 2x2, 3x3 and 6x6, to extract global context information (the 1x1 level corresponds to global average pooling). Each pooled map is reduced in dimensionality by a 1x1 convolutional layer, then upsampled to the size of the input feature map, and finally all maps are concatenated to fuse the global and detailed features.
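
The pooling, upsampling and concatenation steps can be sketched on a single-channel feature map. This is a simplified illustration, not the PSPNet code: the 1x1 convolutional layers are omitted, nearest-neighbour upsampling is swapped in for the interpolation used in practice, the map is assumed to be square with a side divisible by every bin size, and the function names are my own.

```python
def avg_pool_to_bins(fmap, bins):
    """Average-pool a square map into a bins x bins grid of equal cells."""
    cell = len(fmap) // bins
    return [[sum(fmap[r][c]
                 for r in range(i * cell, (i + 1) * cell)
                 for c in range(j * cell, (j + 1) * cell)) / (cell * cell)
             for j in range(bins)]
            for i in range(bins)]


def upsample_nearest(fmap, factor):
    """Repeat every value `factor` times along both axes."""
    return [[value for value in row for _ in range(factor)]
            for row in fmap for _ in range(factor)]


def pyramid_pooling(fmap, bin_sizes=(1, 2, 3, 6)):
    """Return the input plus one upsampled context map per bin size."""
    size = len(fmap)
    branches = [fmap]
    for bins in bin_sizes:
        pooled = avg_pool_to_bins(fmap, bins)
        branches.append(upsample_nearest(pooled, size // bins))
    return branches  # stands for concatenation along the channel axis


fmap = [[r * 6 + c for c in range(6)] for r in range(6)]  # 6x6, values 0..35
out = pyramid_pooling(fmap)

print(len(out))      # 5 channels: the input and four context maps
print(out[1][0][0])  # 17.5, the global average, repeated at every position
print(out[2][0][:])  # [7.0, 7.0, 7.0, 10.0, 10.0, 10.0]
```

The coarsest level tells every pixel about the whole image, while the finer levels keep progressively more local detail.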