Semantic Segmentation: U-Net
Computer Vision is a vast domain comprising various tasks, most commonly Image Classification, Object Detection, Semantic Segmentation, and Instance Segmentation. Each of these tasks has its own challenges and its own techniques for solving them.
This article focuses on Semantic Segmentation and on U-Net, a model widely used to carry it out.
To put it in simple terms:
Semantic Segmentation is the task of classifying each pixel in an image from a predefined set of classes.
Or
Semantic Segmentation is a pixel-wise classification of a given image.
Semantic Segmentation can be done with various deep learning and traditional computer vision approaches, each with its own pros and cons. In this article, however, we are going to talk about U-Net.
U-Net was initially designed for biomedical image segmentation, specifically for the ISBI 2015 challenges. However, the model was widely adopted because it performed exceptionally well even beyond the biomedical imaging domain.
U-Net belongs to the class of Encoder-Decoder models. The encoder performs feature extraction on the given input, and the decoder upsamples the reduced feature maps back to the original image resolution.
Encoder Model
The encoder is similar to a deep learning classification network: it performs feature extraction on the given input image. The features extracted at each block of the encoder are passed to the corresponding block of the decoder (these are the skip connections).
The original U-Net used a VGG-style stack of convolutional blocks as its encoder. In recent years, however, various CNN backbones have been used as encoders, e.g., ResNet, EfficientNet, etc.
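To make this concrete, below is a minimal sketch of a single encoder block in Keras. The tensorflow.keras import, layer choices, and filter counts are illustrative assumptions, not the paper's exact configuration:

from tensorflow.keras import layers

def encoder_block(x, filters):
    # Two 3x3 convolutions extract features at the current resolution
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    skip = x  # saved and passed to the matching decoder block
    # 2x2 max pooling halves the spatial resolution
    x = layers.MaxPooling2D(2)(x)
    return x, skip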
Decoder Model
The task of the decoder is to upsample the feature maps and merge them with the features acquired from the encoder. The original U-Net used Transpose Convolutions for upsampling, but these layers can produce checkerboard artifacts and carry learnable parameters, which adds computation and memory cost. Many later implementations therefore replace the Transpose Convolutions with bilinear upsampling layers.
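A matching decoder block, again as an illustrative sketch (the names and filter counts are assumptions), using bilinear upsampling instead of a transposed convolution:

def decoder_block(x, skip, filters):
    # Bilinear upsampling doubles the resolution with no learnable weights
    x = layers.UpSampling2D(2, interpolation='bilinear')(x)
    # Merge in the encoder's skip feature map at the same resolution
    x = layers.Concatenate()([x, skip])
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    return x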
In the U-Net architecture diagram above, the left half is the encoder and the right half is the decoder.
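Wiring the two block sketches above together gives a minimal, hypothetical U-Net. The depth, filter counts, and input shape are arbitrary choices for illustration:

from tensorflow.keras import Input, Model

def build_unet(input_shape=(256, 256, 3), num_classes=1):
    inputs = Input(input_shape)
    # Encoder (left half): resolution shrinks, channel count grows
    x, s1 = encoder_block(inputs, 64)
    x, s2 = encoder_block(x, 128)
    x, s3 = encoder_block(x, 256)
    # Bottleneck at the lowest resolution
    x = layers.Conv2D(512, 3, padding='same', activation='relu')(x)
    # Decoder (right half): resolution grows, skip features merged back in
    x = decoder_block(x, s3, 256)
    x = decoder_block(x, s2, 128)
    x = decoder_block(x, s1, 64)
    # 1x1 convolution maps features to per-pixel class scores
    # (sigmoid for binary masks; use softmax for multi-class)
    outputs = layers.Conv2D(num_classes, 1, activation='sigmoid')(x)
    return Model(inputs, outputs)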
Evaluation Metrics
To evaluate the performance of a segmentation model, we need suitable metrics. Commonly used evaluation metrics for U-Net (and semantic segmentation in general) are:
Pixel Accuracy
Pixel accuracy is simply the percentage of pixels that are classified correctly in the generated segmentation mask. It may be the simplest metric to compute, but it does not always reflect the actual quality of the model.
The problem with pixel accuracy is that it becomes misleading when there is an extreme class imbalance in the dataset. For example, if 90% of the pixels belong to the background, a model that predicts background everywhere scores ~90% accuracy while producing a useless mask.
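For completeness, here is a minimal sketch of pixel accuracy in the same Keras-backend style as the metrics below, assuming binary masks and sigmoid outputs:

from keras import backend as K

def pixel_accuracy(y_true, y_pred):
    # Round sigmoid outputs to 0/1 and count pixels that match the mask
    correct = K.equal(y_true, K.round(y_pred))
    return K.mean(K.cast(correct, 'float32'))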
Intersection Over Union (IOU)
Intersection over Union, also known as the Jaccard Index, is one of the most commonly used metrics for evaluating semantic segmentation models.
IoU is the area of overlap between the predicted segmentation and the ground truth, divided by the area of union between the predicted segmentation and the ground truth: IoU = Area of Overlap / Area of Union. This metric ranges from 0 to 1 (0-100%), with 0 signifying no overlap and 1 signifying a perfectly overlapping segmentation.
Below is an implementation of IoU in Keras:
from keras import backend as K

def iou_coef(y_true, y_pred, smooth=1):
    # Overlap: pixels where the prediction and the ground truth agree
    intersection = K.sum(K.abs(y_true * y_pred), axis=[1, 2, 3])
    # Union: total area of both masks minus the double-counted overlap
    union = K.sum(y_true, axis=[1, 2, 3]) + K.sum(y_pred, axis=[1, 2, 3]) - intersection
    # The smooth term avoids division by zero when a class is absent
    iou = K.mean((intersection + smooth) / (union + smooth), axis=0)
    return iou
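The metric can then be passed straight to model.compile, assuming a binary segmentation model such as the sketch above:

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[iou_coef])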
Dice Coefficient (F1 Score)
To put it simply, the Dice Coefficient is 2 × the area of overlap divided by the total number of pixels in both masks: Dice = 2 × Area of Overlap / (pixels in prediction + pixels in ground truth).
An implementation of the Dice Coefficient in Keras:
def dice_coef(y_true, y_pred, smooth=1):
    # Overlap between the predicted mask and the ground truth
    intersection = K.sum(y_true * y_pred, axis=[1, 2, 3])
    # Sum of both mask areas (not a set union, despite the name)
    union = K.sum(y_true, axis=[1, 2, 3]) + K.sum(y_pred, axis=[1, 2, 3])
    dice = K.mean((2. * intersection + smooth) / (union + smooth), axis=0)
    return dice
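Because the Dice Coefficient is differentiable, it is also commonly turned into a training loss. A minimal sketch:

def dice_loss(y_true, y_pred):
    # Minimizing (1 - Dice) directly maximizes mask overlap
    return 1 - dice_coef(y_true, y_pred)

model.compile(optimizer='adam', loss=dice_loss, metrics=[dice_coef])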
Datasets
We have looked at some evaluation metrics for semantic segmentation models. Below is a list of publicly available semantic segmentation datasets on which various models are benchmarked:
- PASCAL VOC 2012 Segmentation Competition
- COCO 2018 Stuff Segmentation Task
- BDD100K: A Large-scale Diverse Driving Video Database
- Cambridge-driving Labeled Video Database (CamVid)
- Cityscapes Dataset
- Mapillary Vistas Dataset
- ApolloScape Scene Parsing
Publicly Available Implementations
U-Net is half a decade old now and has seen a lot of variants and advancements over the years. It is therefore wise to use off-the-shelf implementations instead of writing the model from scratch, if your focus is on using the models rather than beating a benchmark. Some publicly available implementations of semantic segmentation models that I personally use are listed below (a short usage sketch follows the list):
- PyTorch implementations
- TensorFlow Segmentation Models (I personally use it)
- PyTorch Segmentation Models (I personally use it)
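As an example, the TensorFlow Segmentation Models library builds a U-Net with a pretrained encoder in a few lines. This sketch follows the library's documented API; the backbone, loss, and metric choices are illustrative:

import segmentation_models as sm

# U-Net with a pretrained ResNet-34 encoder
model = sm.Unet('resnet34', encoder_weights='imagenet',
                classes=1, activation='sigmoid')
model.compile('adam', loss=sm.losses.dice_loss,
              metrics=[sm.metrics.iou_score])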
References
- https://link.springer.com/chapter/10.1007/978-3-319-24574-4_28
- https://towardsdatascience.com/metrics-to-evaluate-your-semantic-segmentation-model-6bcb99639aa2
- https://www.jeremyjordan.me/semantic-segmentation/
- https://github.com/meetps/pytorch-semseg
- https://github.com/qubvel/segmentation_models
- https://github.com/qubvel/segmentation_models.pytorch
About the Author
- LinkedIn: https://pk.linkedin.com/in/nauyan
- GitHub: https://github.com/nauyan