Vision beyond classification: Task II: Image Segmentation

Nghi Huynh
5 min read · Apr 19, 2022


Lesson 4 notes from the DeepMind lecture series in 2020

Image segmentation is a computer vision task in which we label specific regions of pixels in an image with their corresponding classes. Since we predict every pixel in the image, this task is commonly referred to as a dense prediction problem, whereas classification is a sparse prediction problem. There are two types of image segmentation: Semantic segmentation and Instance segmentation.

Now, let’s explore it!

What is semantic segmentation?

Semantic segmentation is the process of labeling one or more specific regions of interest in an image. This process treats multiple objects within a single category as one entity. For example: in Figure 1, semantic segmentation gives the same label to all the pixels of the sheep.

Input: An image with one or more objects (an RGB image of shape H x W x 3)

Output: A class label for every pixel

Figure 1: Left: Input: an RGB image; Right: Output: class label for every pixel

Unlike sparse prediction problems, a dense prediction problem requires an output at the same resolution as the input. So, how do we generate an output at that resolution?

In previous notes, we encountered pooling, a technique that reduces resolution. Now, let’s unpool (upsample) to increase resolution! (Figure 2)

Figure 2: Unpooling: upsample to increase resolution; here we use 2x2 kernel (Source: DeepMind Lecture 4)
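As a minimal sketch of the unpooling step in Figure 2 (nearest-neighbour unpooling, one of several variants — max-unpooling with remembered indices is another), each value in the feature map is simply repeated over a 2x2 block:

```python
import numpy as np

def unpool_nearest(x, k=2):
    """Upsample a 2-D feature map by repeating each value over a k x k block."""
    return np.repeat(np.repeat(x, k, axis=0), k, axis=1)

fmap = np.array([[1.0, 2.0],
                 [3.0, 4.0]])
up = unpool_nearest(fmap)  # 2x2 -> 4x4; each value fills a 2x2 block
```

Here `up` is a 4x4 map in which, for example, the top-left 2x2 block is all 1.0.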

Now, let’s see how to incorporate the unpooling technique into a model by exploring the architecture of U-NET, a state-of-the-art model for semantic segmentation!

U-NET:

U-NET is an encoder-decoder model where skip connections are added to preserve details. This network architecture (Figure 3) consists of a contracting path (an encoder on the left side) and an expansive path (a decoder on the right side).

The encoder path follows the typical architecture of a convolutional network. This path consists of the repeated application of two 3x3 convolutions, each followed by a rectified linear unit (ReLU) and a 2x2 max pooling operation with stride 2 for downsampling.

The decoder path has a similar structure to the encoder path. However, it replaces the 2x2 max pooling operations with 2x2 upsampling operations. This path gives us an output at the same resolution as the input.

However, upsampling can produce a blobby feature map. So, to preserve details, we add long skip connections from the encoder layers at the same resolution level. These skip connections add back high-frequency details that would otherwise be lost during the pooling and unpooling operations.

Figure 3: U-NET architecture
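To make the contracting path, expansive path, and long skip connection concrete, here is a minimal NumPy sketch of the data flow only (the 3x3 convolutions and ReLUs are omitted for brevity; this is not the lecture's implementation):

```python
import numpy as np

def avg_pool2x2(x):
    """2x2 pooling with stride 2: the encoder's downsampling step
    (average pooling here for simplicity; U-NET uses max pooling)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2x2(x):
    """2x2 nearest-neighbour upsampling: the decoder's unpooling step."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

x = np.arange(16, dtype=float).reshape(4, 4)  # stand-in encoder feature map
enc = avg_pool2x2(x)                          # contracting path: 4x4 -> 2x2
dec = upsample2x2(enc)                        # expansive path:   2x2 -> 4x4
# long skip connection: concatenate the encoder map with the decoder map
merged = np.stack([x, dec], axis=0)           # shape (2, 4, 4): two "channels"
```

The decoder output `dec` is blobby (each 2x2 block holds one repeated value); stacking the original encoder map `x` alongside it is what restores the fine detail for the next convolution.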

How do we train this system?

Output: H x W x N_classes

Loss: pixel-wise cross entropy (Figure 4).

The output gives a probability distribution over the possible classes for every pixel. To train, we use the same cross-entropy loss as for classification, but now averaged over all the pixels in the image.

Figure 4: pixel-wise cross entropy
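The pixel-wise cross entropy of Figure 4 can be sketched in NumPy as follows (a minimal version, assuming raw logits of shape H x W x N_classes and integer labels of shape H x W):

```python
import numpy as np

def pixelwise_cross_entropy(logits, labels):
    """Mean cross-entropy over all pixels.

    logits: (H, W, N_classes) raw scores; labels: (H, W) integer class ids.
    """
    # numerically stable log-softmax over the class axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    h, w = labels.shape
    # pick the log-probability of the true class at every pixel, then average
    picked = log_probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -picked.mean()

logits = np.zeros((2, 2, 3))          # uniform predictions over 3 classes
labels = np.array([[0, 1], [2, 0]])
loss = pixelwise_cross_entropy(logits, labels)  # ln(3) for a uniform prediction
```

Note how this is exactly the classification loss, just computed at every pixel and averaged.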

What is instance segmentation?

Instance segmentation is the process of detecting and delineating each object of interest in an image. This process is a combination of object detection and semantic segmentation. However, it differs from semantic segmentation because it gives a unique label to every instance of a particular object in the image. For example: in Figure 5 (right), instance segmentation assigns different colors (labels) for each sheep, whereas semantic segmentation only assigns the same color (label) for all of those sheep (Figure 5, left).

Figure 5: Left: semantic segmentation; Right: instance segmentation

Mask R-CNN:

Mask R-CNN (Region-based Convolutional Neural Network) is a state-of-the-art model for instance segmentation. Mask R-CNN was built on top of Faster R-CNN, a popular framework for object detection. There are two stages in this framework:

  • The first stage, a Region Proposal Network (RPN), proposes candidate object bounding boxes.
  • The second stage extracts features from each candidate box and performs classification, bounding-box regression, and binary mask prediction.

Metrics and benchmarks:

Evaluation metrics:

1. Classification:

The simplest metric to evaluate the performance of classification models is the accuracy: the percentage of correct predictions (Figure 6).

Figure 6: Accuracy

Where:

  • TP: true positive
  • TN: true negative
  • FP: false positive
  • FN: false negative

Generally, we use:

  • Top-1 accuracy: the top prediction is the correct class.
  • Top-5 accuracy: the correct class is among the top-5 predictions.
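Top-1 and top-5 accuracy can both be computed with one small helper; a minimal NumPy sketch (names are illustrative, not from the lecture):

```python
import numpy as np

def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true class is among the k highest scores.

    scores: (N, N_classes) prediction scores; labels: (N,) integer class ids.
    """
    topk = np.argsort(scores, axis=1)[:, -k:]      # indices of the k largest scores
    hits = (topk == labels[:, None]).any(axis=1)   # is the true class among them?
    return hits.mean()

scores = np.array([[0.1, 0.7, 0.2],   # top prediction: class 1
                   [0.5, 0.3, 0.2]])  # top prediction: class 0
labels = np.array([1, 1])
top1 = top_k_accuracy(scores, labels, k=1)  # 0.5: only the first sample hits
top2 = top_k_accuracy(scores, labels, k=2)  # 1.0: class 1 is in both top-2 sets
```

With k=1 this reduces to the plain accuracy of Figure 6 for the multi-class case.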

2. Object detection and segmentation:

However, we use intersection-over-union (IoU) to evaluate object detection and segmentation (Figure 7). Since IoU is non-differentiable, we can only use it for evaluation, not as a training loss.

Figure 7: Intersection over union
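For two binary segmentation masks, the IoU of Figure 7 is just a ratio of pixel counts; a minimal NumPy sketch:

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union

a = np.array([[1, 1, 0],
              [1, 1, 0]], dtype=bool)   # predicted mask
b = np.array([[0, 1, 1],
              [0, 1, 1]], dtype=bool)   # ground-truth mask
score = iou(a, b)  # intersection = 2 pixels, union = 6 pixels -> 1/3
```

For bounding boxes the same formula applies, with areas in place of pixel counts.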

Benchmarks:

  • ImageNet: a major computer vision benchmark that evaluates algorithms for object detection and image classification at large scale.
  • Cityscapes: a benchmark suite and evaluation server for pixel-level, instance-level, and panoptic semantic labeling at large scale.
  • COCO: a large-scale object detection, segmentation, and captioning dataset.

Summary:

To sum up, we explored the second task beyond classification: image segmentation.

Specifically, we addressed:

  • Semantic segmentation: the process of labeling one or more specific regions of interest in an image while treating multiple objects within a single category as one entity.
  • U-NET: an encoder-decoder model for semantic segmentation.
  • Instance segmentation: the process of combining object detection and semantic segmentation where each instance in the same category is differentiated.
  • Mask R-CNN: an extended model of Faster R-CNN for object instance segmentation.
  • Metrics and benchmarks: well-known metrics and benchmarks for classification, object detection, and image segmentation.

Coming next are notes from Lecture 4 in DeepMind’s deep learning series: Vision beyond classification: Beyond single image input.


I’m enthusiastic about the application of deep learning in medical domain. Here, I want to share the journey of learning with you. ^^