Semantic Segmentation: U-Net

Syed Nauyan Rashid
Published in Red Buffer · Nov 5, 2021

Computer Vision is a vast domain comprising various tasks, most commonly Image Classification, Object Detection, Semantic Segmentation, and Instance Segmentation. Each of these tasks has its own challenges and its own techniques for solving them.

Computer Vision Tasks

The focus of this article is Semantic Segmentation and U-Net, a model widely used to carry it out.

To put it in simple terms:

Semantic Segmentation is the task of classifying each pixel in an image from a predefined set of classes.

Or

Semantic Segmentation is a pixel-wise classification of a given image.

Semantic Segmentation Example
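
To make the idea concrete, here is a toy sketch (the shapes and class count are assumptions for illustration, not tied to any particular model): a network outputs a score for every class at every pixel, and taking the argmax over the class axis yields the segmentation mask.

import numpy as np

# Toy example: a 256x256 image and 5 hypothetical classes.
num_classes = 5
per_pixel_scores = np.random.rand(256, 256, num_classes)  # model output (H, W, C)
segmentation_mask = per_pixel_scores.argmax(axis=-1)      # (H, W) class-index mask
print(segmentation_mask.shape)  # (256, 256): one class label per pixel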

Semantic Segmentation can be done with various deep learning and traditional computer vision approaches, each with its own pros and cons. In this article, however, we are going to talk about U-Net.

U-Net was initially designed for biomedical image segmentation, specifically for the ISBI 2015 cell tracking challenge. However, the model was widely adopted because it performed exceptionally well even beyond the biomedical imaging domain.

U-Net Model Architecture

U-Net belongs to the class of Encoder-Decoder models. The encoder performs feature extraction on the given input, and the decoder upsamples the reduced feature maps back to the original resolution of the image.

Encoder Model

The Encoder model is similar to a deep learning classification model which performs feature extraction on the given input image. The features extracted at each block of the classification network are passed to the corresponding block of the decoder model.

The original U-Net used a plain, VGG-style convolutional stack as its encoder. In recent years, however, we have seen people use various CNN backbones as encoders, e.g., ResNet, EfficientNet, etc., as sketched below.
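
Here is a sketch of how a pretrained backbone can serve as the encoder (the layer names below are assumptions for the tf.keras ResNet50, not code from the original paper): we collect the intermediate feature maps that the decoder's skip connections will later consume.

from tensorflow.keras.applications import ResNet50
from tensorflow.keras.models import Model

backbone = ResNet50(include_top=False, weights='imagenet',
                    input_shape=(256, 256, 3))
# Assumed stage-output layer names for the tf.keras ResNet50.
skip_names = ['conv1_relu', 'conv2_block3_out',
              'conv3_block4_out', 'conv4_block6_out']
skips = [backbone.get_layer(name).output for name in skip_names]
# The encoder exposes the skip features plus the deepest feature map.
encoder = Model(inputs=backbone.input, outputs=skips + [backbone.output])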

Decoder Model

The task of the decoder model is to upsample the feature maps and merge them with the features acquired from the encoder. The original U-Net used Transpose Convolutions for upsampling, but these layers can produce a checkerboard effect and carry learnable parameters, which adds computation and memory cost. For this reason, many later implementations replace the Transpose Convolutions with Bilinear upsampling layers.
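
A minimal sketch of such a decoder block in Keras (names and filter counts are illustrative, not the authors' code):

from tensorflow.keras import layers

def decoder_block(x, skip, filters):
    # Bilinear upsampling has no learnable parameters and avoids the
    # checkerboard artifacts associated with transpose convolutions.
    x = layers.UpSampling2D(size=(2, 2), interpolation='bilinear')(x)
    # Merge with the matching encoder feature map via the skip connection.
    x = layers.Concatenate()([x, skip])
    # Two convolutions refine the merged features, as in the original U-Net.
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    return x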

In the U-Net architecture diagram above, the left half is the encoder and the right half is the decoder.

Evaluation Metrics

To evaluate the performance of a semantic segmentation model such as U-Net, the commonly used metrics are:

Pixel Accuracy

Pixel accuracy is the fraction of pixels in the generated segmentation mask that are classified correctly. It may be the simplest metric to compute, but it does not always reflect the actual performance of the model.

The problem with the pixel accuracy metric is that it becomes misleading whenever there is an extreme class imbalance in the dataset: you could be getting ~90% accuracy while the qualitative results are poor, e.g., when 90% of the pixels belong to the background class.
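
For reference, here is a minimal pixel-accuracy sketch in Keras (my own illustration, not from the article's original code; it presumes one-hot masks of shape (batch, height, width, classes)):

from keras import backend as K

def pixel_accuracy(y_true, y_pred):
    # Fraction of pixels whose predicted class matches the ground truth.
    matches = K.cast(K.equal(K.argmax(y_true, axis=-1),
                             K.argmax(y_pred, axis=-1)), 'float32')
    return K.mean(matches)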

Intersection Over Union (IOU)

Intersection over Union (IoU), also known as the Jaccard index, is one of the most commonly used metrics for evaluating semantic segmentation models.

IoU calculation visualized. Source: Wikipedia

IoU is the area of overlap between the predicted segmentation and the ground truth, divided by the area of union between them, as illustrated above. The metric ranges from 0 to 1 (0-100%), with 0 signifying no overlap and 1 signifying a perfectly overlapping segmentation.

Below is an implementation of IoU in Keras:

from keras import backend as K

def iou_coef(y_true, y_pred, smooth=1):
    # Overlap between prediction and ground truth (the element-wise
    # product works for binary / one-hot masks).
    intersection = K.sum(K.abs(y_true * y_pred), axis=[1, 2, 3])
    # Union = total area of both masks minus the double-counted overlap.
    union = K.sum(y_true, axis=[1, 2, 3]) + K.sum(y_pred, axis=[1, 2, 3]) - intersection
    # Smoothing avoids division by zero on empty masks; average over the batch.
    iou = K.mean((intersection + smooth) / (union + smooth), axis=0)
    return iou

Dice Coefficient (F1 Score)

To put it simply, the Dice Coefficient is 2 × the area of overlap divided by the total number of pixels in both images.

Illustration of Dice Coefficient. 2xOverlap/Total number of pixels

Implementation of the Dice Coefficient in Keras:

def dice_coef(y_true, y_pred, smooth=1):
    # Overlap between prediction and ground truth.
    intersection = K.sum(y_true * y_pred, axis=[1, 2, 3])
    # Total size of both masks (the Dice denominator), not the set union.
    union = K.sum(y_true, axis=[1, 2, 3]) + K.sum(y_pred, axis=[1, 2, 3])
    dice = K.mean((2. * intersection + smooth) / (union + smooth), axis=0)
    return dice
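
In practice, both metrics can be plugged straight into training. A common pattern (a sketch; the model variable is assumed to be a Keras segmentation model built elsewhere) is to turn the Dice coefficient into a loss by minimizing 1 - Dice:

def dice_loss(y_true, y_pred):
    # Minimizing (1 - Dice) maximizes overlap with the ground-truth mask.
    return 1.0 - dice_coef(y_true, y_pred)

model.compile(optimizer='adam', loss=dice_loss,
              metrics=[iou_coef, dice_coef])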

Datasets

We have previously looked at some evaluation metrics for semantic segmentation models. Below is a list of publicly available semantic segmentation datasets on which various models are benchmarked:

Publicly Available Implementations

U-Net is half a decade old now and has seen a lot of variants and advancements over the years. It is therefore wise to use off-the-shelf implementations instead of writing the model from scratch, if your focus is on using the models rather than beating a benchmark. Some of the publicly available implementations of semantic segmentation models that I personally use are mentioned below:
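
For example, the popular segmentation_models package (API assumed from its documentation; check the project's README for the current version) lets you build a U-Net with a pretrained encoder in a couple of lines:

# pip install segmentation-models
import segmentation_models as sm

# ResNet34 encoder with ImageNet weights; single-class (binary) mask output.
model = sm.Unet('resnet34', encoder_weights='imagenet',
                classes=1, activation='sigmoid')
model.compile('Adam', loss=sm.losses.dice_loss,
              metrics=[sm.metrics.iou_score])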
