Semantic Segmentation using U-NETs

Manasa R · Published in Analytics Vidhya · 8 min read · Sep 22, 2020
Semantic Segmentation (Source)

As the name suggests, image segmentation is the process of dividing an image into multiple segments. It is subdivided into two parts, namely instance segmentation and semantic segmentation. In instance segmentation, each instance of an object in an image is assigned a different label. In semantic segmentation, all instances belonging to the same object class are assigned the same label. For example, in a picture that contains five people and a table, an instance segmentation algorithm will assign each of the five people (and the table) its own distinct label, whereas a semantic segmentation algorithm will assign one label to all five people and a different one to the table.

Comparison of Semantic Segmentation and Instance Segmentation (Source)

In this blog, we will go deep into semantic segmentation using U-NETs.

Difference between Semantic Segmentation and Object Detection

Semantic segmentation should not be confused with object detection. Object detection is the process of creating a bounding box around the object of interest, whereas Semantic segmentation colour codes each pixel of different object classes in different colours.

Let us consider some examples.

Consider the following image:

Object Detection (Source)

The above image is an example of Object Detection. It is basically the combination of object localization and classification.

Now consider this image:

Semantic segmentation (Source)

This is an example of semantic segmentation. As you can see, all the cars are highlighted with blue, the people with red, etc. The objects that fall under the same object class are highlighted in the same colour.

Performance Metrics

Before going into the architecture of U-NETs, let us understand some commonly used performance metrics used to evaluate semantic segmentation algorithms.

Average Precision and Recall per class

This is what precision tells us: Of all the points the model declared to be positive, what per cent of them are actually positive?

Precision can be calculated using the following formula:

Precision = TP / (TP + FP)

TP = True Positives

FP = False Positives

This is what recall tells us: Of all the actually positive points, how many are predicted positive?

Recall can be calculated using the following formula:

Recall = TP / P

P = Total number of actually positive points.
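As a concrete sketch, per-class precision and recall can be computed directly from predicted and ground-truth label maps. The helper below and the tiny 2x2 maps are made-up toy data, not part of any library:

```python
import numpy as np

def precision_recall(pred, truth, cls):
    """Pixel-level precision and recall for one class."""
    tp = np.sum((pred == cls) & (truth == cls))  # true positives
    fp = np.sum((pred == cls) & (truth != cls))  # false positives
    p = np.sum(truth == cls)                     # all actually positive points
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / p if p else 0.0
    return precision, recall

# Toy 2x2 label maps with classes {0, 1}
truth = np.array([[1, 1], [0, 0]])
pred = np.array([[1, 0], [1, 0]])
print(precision_recall(pred, truth, cls=1))  # (0.5, 0.5)
```

For the metric described above, these values would be averaged over all classes.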

Intersection over Union (IoU)

This is also known as Jaccard Similarity. As the name suggests, the area of intersection of the ground truth and predicted output is divided by the union of the area of both. It is a measure of similarity between the ground truth and the predicted values.

Let Region A = Ground Truth

Region B = Predicted Output

The IoU value can be calculated using the following formula:

IoU = Area(A ∩ B) / Area(A ∪ B)

The union value can be calculated by the following formula:

Area(A ∪ B) = Area(A) + Area(B) − Area(A ∩ B)

The best value for IoU is 1 and the worst is 0. The greater the IoU values, the more the predicted output coincides with the ground truth.

For the purpose of semantic segmentation, an average of the IoU values can be taken over the multiple prevailing classes to measure the performance of the algorithm.
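This per-class averaging can be sketched in a few lines of NumPy. The label maps below are toy data, and classes absent from both maps are skipped to avoid division by zero:

```python
import numpy as np

def mean_iou(pred, truth, num_classes):
    """Average Jaccard similarity (IoU) over the classes present in the maps."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (truth == c))
        union = np.sum((pred == c) | (truth == c))
        if union:                       # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

truth = np.array([[1, 1], [0, 0]])
pred = np.array([[1, 0], [0, 0]])
print(mean_iou(pred, truth, num_classes=2))  # (2/3 + 1/2) / 2 ≈ 0.583
```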

U-NETs

Let us dive into the architecture of U-NETs.

U-NET Architecture (Source)

To gain a better understanding of the entire architecture, let us break it down into steps.

STEP 1: Consider an input image of 572x572x1 (a grayscale image).

STEP 2: From the above image, the next step is 3x3 Conv, ReLU.

If we have an image of size NxN and a filter/kernel of size KxK, then the output of this convolution layer will be (N-K+1)x(N-K+1).

The size of our input image is 572x572. Therefore here, N=572.

Our kernel size is 3x3. Hence, K=3

Size of the output image will be 570x570

The depth of the output image will be equal to the number of kernels.

Since there are 64 kernels, the output image of step 2 is of dimensions 570x570x64.
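The size rule above can be wrapped in a tiny helper for sanity-checking the dimensions in the steps that follow (a hypothetical function, not part of any framework):

```python
def conv_out(n, k, num_kernels):
    """Output dims of a 'valid' KxK convolution (no padding, stride 1)
    on an NxN input: side length shrinks to N-K+1, depth = number of kernels."""
    return (n - k + 1, n - k + 1, num_kernels)

print(conv_out(572, 3, 64))  # (570, 570, 64) — matches step 2
print(conv_out(570, 3, 64))  # (568, 568, 64) — matches step 3
```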

STEP 3: 3x3 Conv, ReLU

If we follow the same steps as done in step 2, the dimension of the output image of step 3 is 568x568x64. Let this output be named as A.

STEP 4: Maxpool 2x2, stride=2

If the input size is NxN, the filter size is KxK and the stride is S, the size of the max-pool output will be (⌊(N-K)∕S⌋+1)x(⌊(N-K)∕S⌋+1), where ⌊ ⌋ is the floor function. No padding is used here.

With N=568, K=2 and S=2, the output side length is ⌊(568-2)/2⌋+1 = 284. Max-pooling does not change the depth, so the dimension of the output image of step 4 is 284x284x64.
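The max-pool size rule can likewise be expressed as a one-line helper (hypothetical, for checking the pooling steps below):

```python
def pool_out(n, k=2, s=2):
    """Output side length of a KxK max-pool with stride S and no padding:
    floor((N - K) / S) + 1."""
    return (n - k) // s + 1

print(pool_out(568))  # 284 — matches step 4
print(pool_out(280))  # 140 — matches step 7
```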

These three layers in steps 2, 3 and 4 will repeat four more times. If you have understood the pattern, you can skip to STEP 16. Otherwise, no worries! I have run through all the steps below.

STEP 5: 3x3 Conv, ReLU , 128 kernels

If we follow the same steps as done in step 2, the dimension of the output image of step 5 is 282x282x128.

STEP 6: 3x3 Conv, ReLU , 128 kernels

If we follow the same steps as done in step 2, the dimension of the output image of step 6 is 280x280x128. Let this output be named as B.

STEP 7: Maxpool 2x2, stride=2

If we follow the same steps as in step 4, the output size after step 7 is 140x140x128

STEP 8: 3x3 Conv, ReLU , 256 kernels

If we follow the same steps as done in step 2, the dimension of the output image of step 8 is 138x138x256.

STEP 9: 3x3 Conv, ReLU , 256 kernels

If we follow the same steps as done in step 2, the dimension of the output image of step 9 is 136x136x256. Let this output be named as C.

STEP 10: Maxpool 2x2, stride=2

If we follow the same steps as in step 4, the output size after step 10 is 68x68x256

STEP 11: 3x3 Conv, ReLU , 512 kernels

If we follow the same steps as done in step 2, the dimension of the output image of step 11 is 66x66x512.

STEP 12: 3x3 Conv, ReLU , 512 kernels

If we follow the same steps as done in step 2, the dimension of the output image of step 12 is 64x64x512. Let this output be named as D.

STEP 13: Maxpool 2x2, stride=2

If we follow the same steps as in step 4, the output size after step 13 is 32x32x512

STEP 14: 3x3 Conv, ReLU , 1024 kernels

If we follow the same steps as done in step 2, the dimension of the output image of step 14 is 30x30x1024.

STEP 15: 3x3 Conv, ReLU , 1024 kernels

If we follow the same steps as done in step 2, the dimension of the output image of step 15 is 28x28x1024. Let this output be named as E.

STEP 16: Copy and Crop on ‘D’ and up-conv 2x2 on ‘E’

This is where it gets interesting.

When we use Up-conv (a 2x2 up-convolution), every row and column in the input image gets duplicated, doubling the height and width. The depth will be reduced to half.

When we use Crop and Copy, the central pixels (pixels excluding those near the boundaries) of the earlier output are cropped out and copied across. Here, ‘D’ (64x64x512) is cropped to 56x56x512 and concatenated along the depth with the 56x56x512 up-conv output of ‘E’.

Therefore, the output of step 16 has a dimension of 56x56x1024
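The two operations of step 16 can be sketched with NumPy. Nearest-neighbour upsampling stands in here for the learned 2x2 up-convolution, and the zero arrays are toy stand-ins for single-channel slices of ‘E’ and ‘D’:

```python
import numpy as np

def upsample2x(x):
    """Duplicate every row and column: a simple stand-in for the
    2x2 up-conv, which doubles height and width."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def center_crop(x, size):
    """Crop and Copy: keep only the central size x size region."""
    off = (x.shape[0] - size) // 2
    return x[off:off + size, off:off + size]

e = np.zeros((28, 28))               # spatial slice of E (28x28)
d = np.zeros((64, 64))               # spatial slice of D (64x64)
up = upsample2x(e)                   # 28x28 -> 56x56
skip = center_crop(d, up.shape[0])   # 64x64 -> 56x56
print(up.shape, skip.shape)          # (56, 56) (56, 56)
```

Concatenating the two 56x56 stacks depth-wise gives the 56x56x1024 result of step 16.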

STEP 17: 3x3 Conv, ReLU , 512 kernels

If we follow the same steps as done in step 2, the dimension of the output image of step 17 is 54x54x512.

STEP 18: 3x3 Conv, ReLU , 512 kernels

If we follow the same steps as done in step 2, the dimension of the output image of step 18 is 52x52x512. Let this output be named as F.

STEP 19: Up-conv on ‘F’ and Crop and Copy on ‘C’

The output of step 19 has a dimension of 104x104x512

STEP 20: 3x3 Conv, ReLU , 256 kernels

If we follow the same steps as done in step 2, the dimension of the output image of step 20 is 102x102x256.

STEP 21: 3x3 Conv, ReLU , 256 kernels

If we follow the same steps as done in step 2, the dimension of the output image of step 21 is 100x100x256. Let this output be named as G.

STEP 22: Up-conv on ‘G’ and Crop and Copy on ‘B’

The output of step 22 has a dimension of 200x200x256

STEP 23: 3x3 Conv, ReLU , 128 kernels

If we follow the same steps as done in step 2, the dimension of the output image of step 23 is 198x198x128.

STEP 24: 3x3 Conv, ReLU , 128 kernels

If we follow the same steps as done in step 2, the dimension of the output image of step 24 is 196x196x128. Let this output be named as H.

STEP 25: Up-conv on ‘H’ and Crop and Copy on ‘A’

The output of step 25 has a dimension of 392x392x128

STEP 26: 3x3 Conv, ReLU , 64 kernels

If we follow the same steps as done in step 2, the dimension of the output image of step 26 is 390x390x64.

STEP 27: 3x3 Conv, ReLU , 64 kernels

If we follow the same steps as done in step 2, the dimension of the output image of step 27 is 388x388x64.

STEP 28: 1x1 Conv, 2 kernels

Output dimension of step 28 is 388x388x2

This is the output image.

Steps 1–15 form the contracting path, as the spatial size of the image keeps reducing. Steps 16–28 form the expanding path, as the spatial size gradually increases.
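The size bookkeeping of steps 1–28 can be verified with a short trace. This hypothetical helper tracks only the spatial side length, using the conv, pool and up-conv rules from the walkthrough above:

```python
def unet_trace(n=572):
    """Trace the spatial side length through the U-NET walkthrough."""
    skips = []
    for _ in range(4):    # contracting path: conv, conv, max-pool (x4)
        n -= 4            # two valid 3x3 convs, each shrinks the side by 2
        skips.append(n)   # sizes of A, B, C, D, saved for Crop and Copy
        n //= 2           # 2x2 max-pool with stride 2 halves the side
    n -= 4                # bottleneck convs (steps 14-15): E ends at 28
    for _ in range(4):    # expanding path: up-conv, conv, conv (x4)
        n = n * 2 - 4     # 2x2 up-conv doubles, two valid convs shrink by 4
    return n, skips

print(unet_trace())  # (388, [568, 280, 136, 64])
```

The final side length of 388 matches the 388x388x2 output of step 28, and the saved skip sizes match A, B, C and D.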

Reference: https://arxiv.org/abs/1505.04597
