Comparative Study of Image Segmentation Architectures Using Deep Learning

Suvodeep Sinha
Published in DataX Journal · 5 min read · Sep 28, 2021

Before we start looking at different techniques for semantic segmentation and object detection using deep learning, we must first understand what these terms mean.

What exactly is semantic segmentation?

Semantic segmentation is understanding an image at the pixel level, i.e., we want to assign an object class to each pixel in the image. For example, check out the following images.

Left: Input image. Right: Its semantic segmentation. (Source)

Apart from recognizing the bike and the person riding it, we also have to delineate the boundaries of each object. Therefore, unlike classification, we need dense pixel-wise predictions from our models.

PASCAL VOC2012 and MS COCO are among the most important benchmark datasets for semantic segmentation.

What are the different approaches?

Before deep learning took over computer vision, people used approaches like TextonForest and Random Forest-based classifiers for semantic segmentation. As with image classification, convolutional neural networks (CNNs) have had enormous success on segmentation problems.

One of the popular initial deep learning approaches was patch classification, where each pixel was separately classified using a patch of the image around it. The main reason to use patches was that classification networks usually had fully connected layers and therefore required fixed-size inputs.
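To make this concrete, here is a minimal sketch of patch classification in PyTorch. The network and all sizes are hypothetical, purely for illustration; the point is that the fully connected head pins the input to a fixed patch size, and every pixel needs its own forward pass.

```python
import torch
import torch.nn as nn

# Hypothetical patch classifier (not from a specific paper): a small CNN
# with a fully connected head labels the patch centered on each pixel.
class PatchClassifier(nn.Module):
    def __init__(self, num_classes, patch=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # The fully connected layer is what forces a fixed patch size.
        self.fc = nn.Linear(32 * (patch // 4) ** 2, num_classes)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

net = PatchClassifier(num_classes=21)
image = torch.randn(3, 256, 256)
patch = image[:, 0:32, 0:32].unsqueeze(0)  # patch around one pixel
label = net(patch).argmax(1)               # class for that single pixel
```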

The proposed network architectures: (a) Fully convolutional networks (FCN), (b) Autoencoder networks (AEN), and (c) UNet.

In 2014, Fully Convolutional Networks (FCN) by Long et al. from Berkeley popularized CNN architectures for dense prediction without any fully connected layers. This allowed segmentation maps to be generated for images of any size and was also much faster than the patch classification approach. Almost all subsequent state-of-the-art approaches to semantic segmentation adopted this paradigm.
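The core idea is easy to sketch in PyTorch. The toy network below uses illustrative layer sizes, not Long et al.'s actual architecture; what matters is that the classifier head is a 1×1 convolution instead of a fully connected layer, so any input size works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy fully convolutional network (illustrative sizes only).
class TinyFCN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # stride 2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # total stride 4
        )
        # A 1x1 convolution replaces the fully connected classifier head.
        self.classifier = nn.Conv2d(128, num_classes, kernel_size=1)

    def forward(self, x):
        score = self.classifier(self.backbone(x))
        # Upsample the coarse score map back to the input resolution.
        return F.interpolate(score, size=x.shape[2:], mode="bilinear",
                             align_corners=False)

fcn = TinyFCN(num_classes=21)
print(fcn(torch.randn(1, 3, 224, 224)).shape)  # (1, 21, 224, 224)
print(fcn(torch.randn(1, 3, 300, 500)).shape)  # any size works: (1, 21, 300, 500)
```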

Apart from fully connected layers, one of the main problems with using CNNs for segmentation is pooling layers. Pooling layers increase the field of view and are able to aggregate the context while discarding the ‘where’ information. However, semantic segmentation requires the exact alignment of class maps and thus, needs the ‘where’ information to be preserved. Two different classes of architectures evolved in the literature to tackle this issue.

Summaries

The following papers are summarized:

  • U-Net
  • SegNet
  • DeepLab

U-Net

U-Net is a U-shaped encoder-decoder network architecture consisting of four encoder blocks and four decoder blocks connected via a bridge. The encoder network (contracting path) halves the spatial dimensions and doubles the number of filters (feature channels) at each encoder block. Conversely, the decoder network doubles the spatial dimensions and halves the number of feature channels.

U-Net architecture

Encoder Block

From the original paper

The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3×3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) and a 2×2 max pooling operation with stride 2 for downsampling. At each downsampling step we double the number of feature channels.
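Here is a minimal PyTorch sketch of one such encoder block. The code is my own illustration of the description above; the 1 → 64 channel sizes match the paper's first block.

```python
import torch
import torch.nn as nn

# One U-Net encoder block: two unpadded 3x3 convolutions, each followed by
# ReLU, then 2x2 max pooling with stride 2 for downsampling.
class EncoderBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3),  # unpadded
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        skip = self.conv(x)       # kept for the skip connection
        return self.pool(skip), skip

block = EncoderBlock(1, 64)       # the paper's first block: 1 -> 64 channels
down, skip = block(torch.randn(1, 1, 572, 572))
print(down.shape, skip.shape)     # (1, 64, 284, 284) (1, 64, 568, 568)
```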

Decoder Block

From the original paper

Every step in the expansive path consists of an upsampling of the feature map followed by a 2×2 convolution (“up-convolution”) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3×3 convolutions, each followed by a ReLU.
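A matching PyTorch sketch of one decoder block, again my own illustration of the paper's description, including the center-cropping of the encoder feature map before concatenation:

```python
import torch
import torch.nn as nn

# One U-Net decoder block: a 2x2 up-convolution that halves the channel
# count, concatenation with the cropped encoder feature map, then two
# unpadded 3x3 convolutions with ReLU.
class DecoderBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3), nn.ReLU(inplace=True),
        )

    @staticmethod
    def center_crop(skip, target):
        # Crop the encoder map to match the upsampled map's spatial size.
        _, _, h, w = target.shape
        _, _, H, W = skip.shape
        dh, dw = (H - h) // 2, (W - w) // 2
        return skip[:, :, dh:dh + h, dw:dw + w]

    def forward(self, x, skip):
        x = self.up(x)                                # halves channels
        skip = self.center_crop(skip, x)
        return self.conv(torch.cat([skip, x], dim=1))

dec = DecoderBlock(128, 64)
out = dec(torch.randn(1, 128, 56, 56), torch.randn(1, 64, 120, 120))
print(out.shape)  # (1, 64, 108, 108)
```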

SegNet

SegNet uses a novel technique to upsample the encoder output: the max-pooling indices used in the pooling layers are stored and reused in the decoder. This gives reasonably good performance and is space-efficient. VGG16 with only forward connections is used as the encoder (the fully connected layers are discarded), which leads to far fewer parameters.

SegNet architecture

Encoder

  • At the encoder, convolutions and max pooling are performed.
  • There are 13 convolutional layers from VGG-16. (The original fully connected layers are discarded.)
  • While doing 2×2 max pooling, the corresponding max pooling indices (locations) are stored.

Decoder

  • At the decoder, upsampling and convolutions are performed.
  • During upsampling, the max-pooling indices from the corresponding encoder layer are recalled to place values back at their original locations, as shown in the architecture figure above (a code sketch follows this list).
  • Finally, a K-class softmax classifier predicts the class of each pixel.
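The index-based upsampling is easy to demonstrate, since PyTorch exposes it directly via return_indices and MaxUnpool2d. This is a minimal sketch of the mechanism, not the full SegNet.

```python
import torch
import torch.nn as nn

# SegNet's trick: the encoder's max pooling returns the argmax locations,
# and the decoder's MaxUnpool2d places values back at exactly those
# locations (everything else is zero).
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 224, 224)      # an encoder feature map
pooled, indices = pool(x)             # the indices are what gets stored
upsampled = unpool(pooled, indices)   # sparse map at the old resolution

print(pooled.shape, upsampled.shape)  # (1, 64, 112, 112) (1, 64, 224, 224)
# In SegNet, trainable convolutions after each unpooling densify this
# sparse map before the final per-pixel softmax classifier.
```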

DeepLab (v1 and v2)

Architectures in this second class use what are called dilated/atrous convolutions and do away with pooling layers.

Conditional Random Field (CRF) post-processing is usually used to improve the segmentation. CRFs are graphical models that ‘smooth’ the segmentation based on the underlying image intensities; they rely on the observation that pixels with similar intensities tend to be labeled as the same class. CRFs can boost scores by 1–2%.

Atrous/dilated convolutions increase the field of view without increasing the number of parameters. The network is modified as in the dilated convolutions paper.
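In PyTorch the dilation rate is a single argument, which makes the trade-off easy to see (a minimal sketch):

```python
import torch.nn as nn

# With dilation rate r, a 3x3 kernel samples inputs r pixels apart, so the
# effective field of view grows without adding parameters or reducing
# resolution (padding = dilation keeps the spatial size for a 3x3 kernel).
conv_standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)              # 3x3 field
conv_atrous2  = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)  # 5x5 field
conv_atrous4  = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)  # 9x9 field

for c in (conv_standard, conv_atrous2, conv_atrous4):
    print(sum(p.numel() for p in c.parameters()))  # identical parameter counts
```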

Multiscale processing is achieved either by passing multiple rescaled versions of the original image to parallel CNN branches (image pyramid) and/or by using multiple parallel atrous convolutional layers with different sampling rates (ASPP).
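A minimal sketch of the ASPP idea in PyTorch. This is simplified relative to the paper, which places further layers inside each branch; the rates (6, 12, 18, 24) follow DeepLab-v2's ASPP-L setting.

```python
import torch
import torch.nn as nn

# Parallel 3x3 atrous convolutions with different rates view the same
# features at multiple scales; DeepLab-v2 fuses the branch scores by summing.
class ASPP(nn.Module):
    def __init__(self, in_ch, num_classes, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, num_classes, kernel_size=3,
                      padding=r, dilation=r)
            for r in rates
        ])

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)

aspp = ASPP(in_ch=512, num_classes=21)
print(aspp(torch.randn(1, 512, 28, 28)).shape)  # (1, 21, 28, 28)
```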

Structured prediction is done by a fully connected CRF, which is trained/tuned separately as a post-processing step.
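For reference, here is a hedged sketch of that post-processing step using the third-party pydensecrf package, a common implementation of the fully connected CRF. The kernel parameters below are commonly used defaults, not values from the DeepLab paper, and image/probs are placeholders.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, probs, iters=5):
    """image: contiguous HxWx3 uint8 array; probs: (K, H, W) softmax output."""
    K, H, W = probs.shape
    d = dcrf.DenseCRF2D(W, H, K)
    d.setUnaryEnergy(unary_from_softmax(probs))  # -log(p) unary potentials
    # Smoothness kernel: nearby pixels prefer the same label.
    d.addPairwiseGaussian(sxy=3, compat=3)
    # Appearance kernel: nearby pixels with similar color prefer the same label.
    d.addPairwiseBilateral(sxy=80, srgb=13, rgbim=image, compat=10)
    q = d.inference(iters)                       # approximate mean-field inference
    return np.argmax(q, axis=0).reshape(H, W)    # refined per-pixel labels
```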

Key Contributions:

  • Use atrous/dilated convolutions.
  • Propose atrous spatial pyramid pooling (ASPP).
  • Use a fully connected CRF.

The Future

There are dozens of different architectures for the problem at hand, and each is useful in its own way. I am sure there will be dozens of new methods in the future too. This blog was meant to introduce the most popular and widely used networks.

Until then, I would love to connect with you on Twitter as well as Github!

