Introduction to Semantic Image Segmentation

Vidit Jain · Published in Analytics Vidhya · 8 min read · Apr 10, 2020

The aim of semantic image segmentation is to classify every pixel of an image. Image segmentation is a computer vision task that involves labelling regions of an image according to the objects present in them. In this post, we will discuss how to use deep convolutional neural networks (CNNs) for the task of semantic image segmentation.

Convolutional neural networks are now ubiquitous in deep learning, powering computer vision tasks such as image classification, object detection and image generation. As in those tasks, deep learning has been shown to outperform earlier approaches to image segmentation.

What is semantic image segmentation?

More precisely, semantic image segmentation is the task of assigning each pixel of an image a label from a predefined set of classes.

Segmentation of images (Source)

For example, in the above image, objects such as cars, trees, people and road signs can serve as classes for semantic segmentation. The task is then to take an image (RGB or grayscale) and output a W x H x 1 matrix, where W and H are the width and height of the image. Each cell of this matrix contains the predicted class ID of the corresponding pixel.

Class labels for input image (Source)

Note: the semantic label matrix above is shown at a low resolution for illustration. In practice, the output label matrix has the same resolution as the input image.

In deep learning, we express categorical class labels as one-hot encoded vectors. Similarly, in semantic segmentation, we can express the output matrix with a one-hot encoding scheme: we create one channel per class label, set a cell to 1 if its pixel belongs to that class, and set it to 0 otherwise.

One-hot encoded segmentation map (Source)

Each channel in this representation is called a mask, as it highlights the regions of the image where pixels of the corresponding class are present.
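As a quick sketch, the conversion between the class-ID matrix and the one-hot masks takes only a couple of lines of NumPy; the label map and class count below are made up for illustration:

```python
import numpy as np

# Hypothetical example: a 4 x 4 label map with 3 classes (IDs 0, 1, 2).
num_classes = 3
label_map = np.array([
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [0, 2, 2, 1],
    [0, 0, 1, 1],
])

# One channel per class; a cell is 1 where the pixel belongs to that class.
one_hot = np.eye(num_classes, dtype=np.uint8)[label_map]

print(one_hot.shape)     # (4, 4, 3): one binary mask per class
print(one_hot[..., 2])   # the mask for class 2

# Taking an argmax over the channel axis recovers the class-ID matrix.
assert (one_hot.argmax(axis=-1) == label_map).all()
```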

How is it different from object detection?

To some, semantic image segmentation and object detection may appear to be the same, as both involve finding objects in an image and classifying them. However, the two tasks are not the same.

To give a quick overview without going into the details, object detection involves localising all objects in the image, enclosing each detected object in a bounding box and assigning it a label. The image below is a sample output of a state-of-the-art object detection algorithm.

Object detection (Source)

Semantic segmentation, on the other hand, works at the pixel level, assigning a class to each pixel. In other words, semantic segmentation labels every region of the image.

Object detection vs semantic segmentation

Read this blog post for further information on object detection and segmentation: https://towardsdatascience.com/a-hitchhikers-guide-to-object-detection-and-instance-segmentation-ac0146fe8e11

CNNs for semantic segmentation

As with other computer vision tasks, a CNN is the obvious choice for semantic segmentation. Here, however, the output is a map with the same resolution as the input, unlike the fixed-length vector produced in image classification.

The general architecture consists of a series of convolutional layers, together with pooling or strided-convolution layers for downsampling. Non-linear activations and batch normalisation layers are added to improve training.
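For instance, one downsampling block of such a model might look like the following PyTorch sketch (the channel counts and kernel sizes are illustrative choices, not taken from any particular paper):

```python
import torch
import torch.nn as nn

# One encoder block: convolution + batch normalisation + non-linearity,
# followed by max pooling to halve the spatial resolution.
encoder_block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2),   # downsampling: H x W -> H/2 x W/2
)

x = torch.randn(1, 3, 224, 224)    # a dummy RGB image
print(encoder_block(x).shape)      # torch.Size([1, 64, 112, 112])
```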

The initial layers of a convolutional neural network learn low-level features such as lines, edges and colours, while the deeper layers learn high-level features such as faces or objects.

Learned feature maps from various layers of a CNN

Shallower convolutional layers contain more information about smaller regions of the image. This is because, when dealing with high-dimensional inputs such as images, it is impractical to connect each neuron to every neuron in the previous volume. Instead, we connect each neuron only to a local region of the input volume; the spatial extent of this connectivity is called the receptive field of the neuron. Thus, as we add more layers, the spatial size of the feature maps keeps decreasing while the number of channels keeps increasing, with the downsampling itself done by the pooling layers.

For image classification, we reduce the size of the input image with a recurring set of convolutional and pooling layers, and once the feature map is small enough we flatten it and feed it into fully-connected layers for classification. In essence, we map the input image to a fixed-size vector.

CNN for image classification (Source)
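In code, that final mapping is just a flatten followed by a fully-connected layer; the sketch below uses made-up layer sizes:

```python
import torch
import torch.nn as nn

# The classification head: flatten the last feature map and map it to a
# fixed-length vector of class scores (sizes are illustrative).
feature_map = torch.randn(1, 512, 7, 7)    # output of the final conv/pool stage
head = nn.Sequential(
    nn.Flatten(),                  # (1, 512, 7, 7) -> (1, 25088)
    nn.Linear(512 * 7 * 7, 1000),  # fixed-size output, e.g. 1000 classes
)
print(head(feature_map).shape)     # torch.Size([1, 1000])
```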

Flattening the feature maps, however, destroys spatial information, which is essential for semantic segmentation. To retain it, no fully connected layers are used in the network. Instead, convolutional layers are coupled with downsampling layers to produce a low-resolution spatial tensor. This tensor holds high-level information about the input in its many channels; in most implementations it is the point of lowest resolution and largest channel count in the network.

Having obtained this low-resolution tensor, we must raise its resolution back to that of the original image to complete the segmentation. We feed the low-resolution feature map through upsampling layers followed by further convolutional layers to create higher-resolution feature maps. As the resolution increases, we simultaneously decrease the number of channels.

This kind of architecture is known as an encoder-decoder architecture: the downsampling phase is the encoder and the upsampling phase is the decoder.

Encoder-decoder architecture (Source)

The downsampling phase is called the encoder because it converts the input image into a low-resolution spatial tensor; in other words, it encodes the input into a compact, low-resolution representation. The upsampling phase is called the decoder because it decodes the information contained in that tensor to produce a high-resolution segmentation map.

Note that the architecture described above makes its predictions without any fully connected layers. Models of this kind are therefore known as Fully Convolutional Networks (FCNs).
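To make the idea concrete, here is a toy FCN-style encoder-decoder in PyTorch; the depths and channel counts are arbitrary choices for illustration, not a reference implementation:

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """A toy fully convolutional encoder-decoder for segmentation."""

    def __init__(self, num_classes):
        super().__init__()
        # Encoder: downsample while increasing the number of channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                       # H/2 x W/2
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                       # H/4 x W/4
        )
        # Decoder: upsample back to the input resolution, decreasing channels.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2),  # H/2 x W/2
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2),   # H x W
            nn.ReLU(inplace=True),
            nn.Conv2d(32, num_classes, kernel_size=1),             # per-pixel scores
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TinyFCN(num_classes=21)
out = model(torch.randn(1, 3, 224, 224))
print(out.shape)   # torch.Size([1, 21, 224, 224]): same spatial size as the input
```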

Loss Function

The last layer of the decoder applies a softmax function at each pixel to make predictions, so the resulting map contains values ranging from 0 to 1. The output map has the same spatial dimensions as the input image, with the number of channels equal to the number of classes. We can therefore compare the output map with the one-hot encoded ground truth to compute the loss; with this formulation, the standard pixel-wise categorical cross-entropy loss can be used.
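In PyTorch, for instance, nn.CrossEntropyLoss fuses the per-pixel softmax and the cross-entropy computation, and takes the raw class-ID matrix (rather than its one-hot form) as the target; the shapes below are made up:

```python
import torch
import torch.nn as nn

num_classes = 21
# Raw decoder output (logits): one channel per class, same H x W as the input.
logits = torch.randn(1, num_classes, 224, 224)
# Ground truth: one class ID per pixel (the non-one-hot form of the label map).
target = torch.randint(0, num_classes, (1, 224, 224))

# CrossEntropyLoss applies the per-pixel softmax internally and averages the
# cross-entropy over every pixel in the batch.
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, target)
print(loss.item())
```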

Tips for improving training

Batch Normalisation: Using batch normalisation is known to speed up training, to have a regularising effect (it adds some noise to the network) and to allow higher learning rates. Read this post for more information on batch normalisation.

Using skip-connections: Fully convolutional networks can be very deep, and deep models can suffer from the vanishing gradient problem: after successive applications of the chain rule, the gradients from the loss function shrink towards 0 by the time they reach the layers farthest from the output, so the gradient-descent updates in those layers become negligible and learning there is very slow. Adding skip-connections from earlier layers to later layers provides an alternate path for gradients to flow, preventing them from shrinking to 0, as in the residual-style sketch below.
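The block below is a generic residual-style sketch, not a fragment of any specific architecture:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Skip connection that adds the block input to its output, giving
    gradients a direct path around the convolutions."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.conv(x) + x)  # the "+ x" is the skip connection

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])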

Skip connections between the encoder and decoder: Instead of adding skip-connections between layers within the network, they can be added between the encoder and decoder at matching resolutions. If we simply stack the encoder and decoder, low-level information is lost, and the boundaries in the segmentation map produced by the decoder may be inaccurate. To make up for this, we can let the decoder access the low-level features extracted by the encoder: the intermediate encoder feature maps are concatenated with, or added to, the corresponding decoder layers, restoring the lost detail and also enlarging the effective receptive field of the decoder. A concatenation sketch follows the figure below.

Encoder-decoder with skip connections (Source)
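Concretely, the concatenation inside the decoder can be sketched as follows; the feature-map shapes and channel counts are made up for illustration:

```python
import torch
import torch.nn as nn

# Suppose we saved an early, high-resolution feature map from the encoder
# and have just upsampled the deep features in the decoder to the same size.
encoder_features = torch.randn(1, 64, 112, 112)
decoder_features = torch.randn(1, 64, 112, 112)

# Concatenate along the channel axis so the decoder sees both the
# high-level semantics and the low-level boundary detail.
merged = torch.cat([decoder_features, encoder_features], dim=1)  # (1, 128, 112, 112)

# A convolution then fuses the two sources of information.
fuse = nn.Conv2d(128, 64, kernel_size=3, padding=1)
print(fuse(merged).shape)   # torch.Size([1, 64, 112, 112])
```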

Applications

Bio-medical image analysis: To make a medical diagnosis, radiologists must interpret a wide variety of scans and images, and the complexity of medical images, with many overlapping structures, can make diagnosis difficult even for qualified specialists. Systems using semantic segmentation can help delineate the relevant regions of an image, making diagnostic tests easier and faster. For example, models can segment CT scans to detect tumours or, more recently, to help detect signs of COVID-19 in lung CT scans.

Left: CT scan of the brain. Centre: Ground truth segmented image. Right: Segmented image by FCN.

Autonomous Vehicles: Autonomous driving is an immensely complex task, involving perception, interpretation and adjustment in real time. Semantic segmentation is used to classify objects such as other vehicles and road signs, and regions such as road lanes and sidewalks. Instance segmentation is also used in autonomous driving, as it is important to track individual vehicles, pedestrians, signs, etc.

Segmentation of a road scene

Satellite image processing: Aerial and satellite images cover vast areas of land and contain large numbers of objects of interest. Sophisticated image annotation is required to perform accurate analyses of such images. Semantic segmentation has applications in precision farming and geo-sensing.

Segmentation of a satellite image

Facial segmentation: Semantic segmentation can help computer vision systems perform tasks such as recognising gestures, estimating age, and predicting individuals' gender or ethnicity. It enables these tasks by dividing the face into essential regions such as the mouth, chin, nose, eyes and hair. Effective face segmentation must contend with factors such as image resolution, lighting conditions, feature occlusion and orientation.

Segmentation of a human face (Source)

Some good reads for semantic segmentation

That’s all for this post. If you like this post feel free to give a clap and do share it with others. You can find me on LinkedIn or view some of my work on GitHub. If you have any questions or thoughts please drop a comment below. Thank you for reading this post!! Have a nice day…🎉
