A Simple Guide to Semantic Segmentation

A comprehensive review of Classical and Deep Learning methods for Semantic Segmentation

Bharath Raj
Mar 4, 2019 · 10 min read

Written by Bharath Raj with feedback from Noy Shulman and Rotem Alaluf.

Photo by JFL on Unsplash

Semantic Segmentation is the process of assigning a label to every pixel in the image. This is in stark contrast to classification, where a single label is assigned to the entire picture. Semantic segmentation treats multiple objects of the same class as a single entity. On the other hand, instance segmentation treats multiple objects of the same class as distinct individual objects (or instances). Typically, instance segmentation is harder than semantic segmentation.

Comparison between semantic and instance segmentation. (Source)

This post explores classical and deep-learning-based methods for semantic segmentation. It also discusses popular loss functions and applications.

Classical Methods

Gray Level Segmentation

The problem with this method is that rules must be hard-coded. Moreover, it is extremely difficult to represent complex classes such as humans with just gray level information. Hence, feature extraction and optimization techniques are needed to properly learn the representations required for such complex classes.
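To make the "hard-coded rules" point concrete, here is a minimal sketch of a gray-level rule: a fixed threshold applied to every pixel. The function name `threshold_segment`, the tiny example image, and the threshold of 128 are all illustrative assumptions, not part of any particular library.

```python
def threshold_segment(image, threshold=128):
    """Label a pixel 1 if its gray level is at least `threshold`, else 0."""
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in image]


# A hypothetical 3x4 grayscale image with values in 0..255.
image = [
    [12, 40, 200, 220],
    [30, 180, 210, 190],
    [25, 35, 50, 240],
]

mask = threshold_segment(image)
for row in mask:
    print(row)
```

The threshold has to be picked by hand for each kind of image, which is exactly the limitation described above.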

Conditional Random Fields

Pixels with label dog mixed with pixels with label cat (image c). A more realistic segmentation is shown in image d. (Source)

These can be avoided by considering a prior relationship among pixels, such as the fact that objects are continuous and hence nearby pixels tend to have the same label. To model these relationships, we use Conditional Random Fields (CRFs).

CRFs are a class of statistical modelling methods used for structured prediction. Unlike discrete classifiers, CRFs can consider “neighboring context”, such as the relationships between pixels, before making predictions. This makes them an ideal candidate for semantic segmentation. This section explores the usage of CRFs for semantic segmentation.

Each pixel in the image is associated with a finite set of possible states. In our case, the target labels are the set of possible states. The cost of assigning a state (or label, u) to a single pixel (x) is known as its unary cost. To model relationships between pixels, we also consider the cost of assigning a pair of labels (u,v) to a pair of pixels (x,y), known as the pairwise cost. We can consider only pairs of pixels that are immediate neighbors (Grid CRF), or we can consider all pairs of pixels in the image (Dense CRF).

Dense vs Grid CRF. (Source)

The sum of the unary and pairwise cost of all pixels is known as the energy (or cost/loss) of the CRF. This value can be minimized to obtain a good segmentation output.
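As a rough illustration of the energy, the sketch below scores a labeling on a Grid CRF with 4-connected neighbors. The pairwise cost used here is a Potts-style penalty (a fixed cost whenever two neighboring labels differ), which is an assumption made for simplicity. The function name and the example costs are hypothetical.

```python
def grid_crf_energy(labels, unary, pairwise_weight=1.0):
    """Energy of a labeling: unary costs plus a penalty per disagreeing neighbor pair.

    labels: [H][W] integer labels.
    unary:  [H][W][num_labels] cost of assigning each label to each pixel.
    """
    height, width = len(labels), len(labels[0])
    energy = 0.0
    for y in range(height):
        for x in range(width):
            energy += unary[y][x][labels[y][x]]
            # Look only right and down so that each neighbor pair is counted once.
            if x + 1 < width and labels[y][x] != labels[y][x + 1]:
                energy += pairwise_weight
            if y + 1 < height and labels[y][x] != labels[y + 1][x]:
                energy += pairwise_weight
    return energy


# Hypothetical unary costs for a 2x3 image with two labels.
unary = [
    [[0.1, 2.0], [0.2, 1.5], [1.8, 0.3]],
    [[0.2, 1.9], [0.9, 0.8], [2.0, 0.1]],
]

per_pixel_best = [[0, 0, 1], [0, 1, 1]]  # cheapest label for each pixel on its own
smoother = [[0, 0, 1], [0, 0, 1]]        # one pixel flipped to agree with its neighbors

print(grid_crf_energy(per_pixel_best, unary))
print(grid_crf_energy(smoother, unary))
```

In this example the smoother labeling pays slightly more unary cost but has fewer disagreeing neighbor pairs, so its total energy is lower.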

Deep Learning Methods

1. Model Architectures

Downsampling and Upsampling in an FCN. (Source)

This basic architecture, despite being effective, has a number of drawbacks. One such drawback is the presence of checkerboard artifacts due to uneven overlap of the output of the transpose-convolution (or deconvolution) operation.
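The uneven overlap can be seen by counting how many kernel positions contribute to each output position of a transpose convolution. The sketch below does this in 1D with no padding, which is a simplifying assumption, and the function name is hypothetical.

```python
def transpose_conv_overlap(input_length, kernel_size, stride):
    """Count how many input positions write to each output position (1D, no padding)."""
    output_length = (input_length - 1) * stride + kernel_size
    counts = [0] * output_length
    for i in range(input_length):
        for k in range(kernel_size):
            # Input position i places a copy of the kernel starting at i * stride.
            counts[i * stride + k] += 1
    return counts


uneven = transpose_conv_overlap(4, kernel_size=3, stride=2)
even = transpose_conv_overlap(4, kernel_size=4, stride=2)
print(uneven)
print(even)
```

With kernel size 3 and stride 2, the counts alternate away from the borders, which is the pattern that shows up as a checkerboard in 2D. With kernel size 4 and stride 2, the interior counts are uniform.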

Formation of Checkerboard Artifacts. (Source)

Another drawback is poor resolution at the boundaries due to loss of information from the process of encoding.

Several solutions were proposed to improve the performance quality of the basic FCN model. Below are some of the popular solutions that proved to be effective:

U-Net


U-Net. (Source)

These skip connections allow gradients to flow better and provide information from multiple scales of the image. High-resolution features from the shallower layers can help the model segment/localize better. Coarser features from the deeper layers carry more context and can help the model classify better.

Tiramisu Model

Tiramisu Network. (Source)

A downside of this method is that due to the nature of the concatenation operations in several ML frameworks, it is not very memory efficient (requires a large GPU to run).

MultiScale methods

PSPNet. (Source)

Atrous (Dilated) Convolutions present an efficient method to combine features from multiple scales without increasing the number of parameters by a large amount. By adjusting the dilation rate, the same filter has its weight values spread out farther in space. This enables it to learn more global context.
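A minimal 1D sketch of the idea, assuming a "valid" convolution with no padding and a hypothetical helper name: the same three weights cover a wider span of the input as the dilation rate grows.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid 1D cross-correlation with the kernel taps spaced `dilation` apart."""
    span = (len(kernel) - 1) * dilation + 1  # input positions covered by one output
    outputs = []
    for start in range(len(signal) - span + 1):
        outputs.append(
            sum(w * signal[start + i * dilation] for i, w in enumerate(kernel))
        )
    return outputs


signal = [1, 2, 3, 4, 5, 6, 7, 8, 9]
kernel = [1, 0, -1]  # three weights in both cases

dense = dilated_conv1d(signal, kernel, dilation=1)   # each output sees 3 inputs
atrous = dilated_conv1d(signal, kernel, dilation=3)  # each output sees a span of 7
print(dense)
print(atrous)
```

The number of weights stays at three, but with a dilation rate of 3 each output depends on inputs that are six positions apart.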

Cascaded Atrous Convolutions. (Source)

The DeepLabv3 paper uses Atrous Convolutions with different dilation rates to capture information from multiple scales, without significant loss in image size. They experiment with using Atrous convolutions in a cascaded manner (as shown above) and also in a parallel manner in the form of Atrous Spatial Pyramid Pooling (as shown below).

Parallel Atrous Convolutions. (Source)

Hybrid CNN-CRF methods

Methods using combinations of CNN and CRF. (Source)

Certain methods incorporate the CRF within the neural network itself, as presented in CRF-as-RNN where the Dense CRF is modelled as a Recurrent Neural Network. This enables end-to-end training, as illustrated in the above image.

2. Loss Functions

Pixel-wise Softmax with Cross Entropy

One-Hot format for semantic segmentation. (Source)

Since the label is in a convenient one-hot form, it can be directly used as the ground truth (target) for calculating cross-entropy. However, softmax must be applied pixel-wise on the predicted output before applying cross-entropy, as each pixel can belong to any one of our target classes.
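The two steps can be sketched with plain Python lists, assuming the prediction and the one-hot label are both laid out as [height][width][classes]. The function names are hypothetical. The softmax is computed in log space for numerical stability, which is a design choice of this sketch.

```python
import math


def log_softmax(logits):
    """Log of the softmax over one pixel's class scores, computed stably."""
    largest = max(logits)
    log_sum = math.log(sum(math.exp(v - largest) for v in logits))
    return [v - largest - log_sum for v in logits]


def pixelwise_cross_entropy(logits, one_hot):
    """Mean cross-entropy over all pixels; both inputs are [H][W][C] nested lists."""
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, one_hot):
        for pixel_logits, pixel_target in zip(logit_row, target_row):
            log_probs = log_softmax(pixel_logits)
            total += -sum(t * lp for t, lp in zip(pixel_target, log_probs))
            count += 1
    return total / count


# A hypothetical 1x2 image with 3 classes.
logits = [[[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]]]
one_hot = [[[1, 0, 0], [0, 1, 0]]]
print(pixelwise_cross_entropy(logits, one_hot))
```

A pixel whose scores are all equal, like the second one here, contributes a loss of log(number of classes).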

Focal Loss

Consider the plot of the standard cross entropy loss equation as shown below (Blue color). Even in the case where our model is pretty confident about a pixel’s class (say 80%), it has a tangible loss value (here, around 0.3). On the other hand, Focal Loss (Purple color, with gamma=2) does not penalize the model to such a large extent when the model is confident about a class (i.e. loss is nearly 0 for 80% confidence).

Standard Cross Entropy (Blue) vs Focal Loss with various values of gamma. (Source)

Let us explore why this is significant with an intuitive example. Assume we have an image with 10000 pixels, with only two classes: Background class (0 in one-hot form) and Target class (1 in one-hot form). Let us assume 97% of the image is the background and 3% of the image is the target. Now, say our model is 80% sure about pixels that are background, but only 30% sure about pixels that are the target class.

With cross-entropy, the loss due to background pixels is roughly (97% of 10000) * 0.3, which equals 2910, and the loss due to target pixels is roughly (3% of 10000) * 1.2, which equals 360. Clearly, the loss due to the more confident class dominates, and there is very little incentive for the model to learn the target class. With focal loss, by comparison, the loss due to background pixels is roughly (97% of 10000) * 0, which is 0, so the target pixels dominate the total loss. This allows the model to learn the target class better.
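The same comparison can be worked through numerically. The sketch below assumes the natural logarithm and the focal loss form (1 - p)^gamma * (-log p), without the optional alpha weighting term. The per-pixel values quoted in the text are rounded readings from the plot, so the exact totals differ somewhat, but the conclusion is the same.

```python
import math


def cross_entropy(p_true):
    """Cross-entropy for one pixel, given the probability of its true class."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0):
    """Focal loss without the alpha weighting term (an assumption of this sketch)."""
    return (1.0 - p_true) ** gamma * cross_entropy(p_true)


num_background, num_target = 9700, 300  # 97% and 3% of 10000 pixels
p_background, p_target = 0.8, 0.3       # model confidence in the true class

ce_background = num_background * cross_entropy(p_background)
ce_target = num_target * cross_entropy(p_target)
fl_background = num_background * focal_loss(p_background)
fl_target = num_target * focal_loss(p_target)

print(f"cross-entropy: background={ce_background:.1f}, target={ce_target:.1f}")
print(f"focal loss:    background={fl_background:.1f}, target={fl_target:.1f}")
```

Under cross-entropy the background total is several times the target total. Under focal loss with gamma=2 the ordering flips and the target pixels contribute the larger share.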

Dice Loss

Dice Coefficient. (Source)

Our objective is to maximize the overlap between the predicted and ground truth class (i.e. to maximize the Dice Coefficient). Hence, we generally minimize (1-D) instead to obtain the same objective, as most ML libraries provide options for minimization only.
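A minimal sketch of the Dice loss for a single class, on flattened per-pixel probabilities. It assumes the soft variant with squared sums in the denominator (one common formulation, as used in the V-Net paper). The small `eps` term that keeps the ratio defined when both masks are empty is also an assumption, and the function names are hypothetical.

```python
def dice_coefficient(pred, target, eps=1e-7):
    """Soft Dice coefficient between predicted probabilities and a binary mask."""
    intersection = sum(p * g for p, g in zip(pred, target))
    denominator = sum(p * p for p in pred) + sum(g * g for g in target)
    return (2.0 * intersection + eps) / (denominator + eps)


def dice_loss(pred, target, eps=1e-7):
    """Quantity to minimize: 1 - D."""
    return 1.0 - dice_coefficient(pred, target, eps)


# Hypothetical flattened prediction and ground truth for four pixels.
pred = [0.9, 0.1, 0.8, 0.2]
target = [1, 0, 1, 0]
print(dice_loss(pred, target))
```

A perfect prediction gives a loss of 0, and a prediction with no overlap gives a loss close to 1.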

Derivative of Dice Coefficient. (Source)

Even though Dice Loss works well for samples with class imbalance, the formula for calculating its derivative (shown above) has squared terms in the denominator. When those values are small, we could get large gradients, leading to training instability.

3. Applications


Autonomous Driving

Semantic segmentation for autonomous vehicles. (Source)

One constraint on autonomous vehicles is that inference must run in real time. One solution is to integrate a GPU locally within the vehicle. To improve performance further, lighter neural networks (with fewer parameters) can be used, or techniques for fitting neural networks on edge devices can be applied.

Medical Image Segmentation

Segmentation of medical scans. (Source)

We can also automate less critical operations such as estimating the volume of organs from 3D semantically segmented scans.

Scene Understanding

Scene Understanding in action. (Source)

Fashion Industry

Semantic segmentation used as an intermediate step to redress a human based on text input. (Source)

Satellite (Or Aerial) Image Processing

Semantic segmentation of satellite/aerial images. (Source)



Reinventing Enterprise AI

Thanks to Noy Shulman

Bharath Raj

Written by

Exploring Computer Vision and Machine Learning | https://thatbrguy.github.io


BeyondMinds is an Enterprise AI software provider with the mission to bridge the gap between academic research and maximal-value, enterprise-scale AI Applications
