Semantic Segmentation


What is Semantic Segmentation?

Semantic image segmentation is the task of classifying each pixel in an image into one of a predefined set of classes. For example, in a street scene, the pixels belonging to a vehicle are classified as “vehicle”, the pixels corresponding to the road are labeled as “road”, and so on.

Usually, in an image with various entities, we want to know which pixel belongs to which entity. For example, in an outdoor image, we can segment the roads, pedestrians, street lights, etc.

Semantic segmentation is different from object detection, as it does not predict bounding boxes around objects. We also do not distinguish between different instances of the same class. For example, there could be multiple vehicles (cars, trucks, bicycles, etc.) in the scene, and all of them would receive the same label.

Semantic segmentation is also different from instance segmentation, in which different objects of the same class receive distinct labels (person1, person2, and so on) and hence different colors.

In general, we want to take as input an image of size (HEIGHT × WIDTH × 3) and output a matrix of size (HEIGHT × WIDTH) containing the predicted class ID for each pixel.
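As a toy illustration of these shapes (the sizes and class count below are arbitrary assumptions for the example):

import numpy as np

# Hypothetical sizes, for illustration only
HEIGHT, WIDTH, N_CLASSES = 224, 224, 5

image = np.zeros((HEIGHT, WIDTH, 3), dtype=np.float32)  # input: an RGB image
class_map = np.zeros((HEIGHT, WIDTH), dtype=np.int32)   # output: one class ID per pixel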

Applications:

There are several applications for which semantic segmentation is very useful.

1. Medical Images: Automated segmentation of body scans can help doctors to perform diagnostic tests.

2. Autonomous Systems: Autonomous vehicles such as self-driving cars and drones can benefit from automated segmentation. For example, self-driving cars can detect drivable regions.

3. Geographical Image Analysis: Aerial images can be used to segment different types of land. Automated land mapping can also be done.

Semantic Segmentation Using Convolutional Neural Networks:

Deep learning and convolutional neural networks (CNNs) have become ubiquitous in the field of computer vision. CNNs are popular for several computer vision tasks such as image classification, object detection, image generation, etc. As with other computer vision tasks, deep learning has surpassed other approaches to image segmentation.

The typical architecture of a convolutional neural network contains several convolutional layers with non-linear activations and pooling layers.

The initial layers learn the low-level features such as colors and edges and the later layers learn high-level features such as different objects, complex curves, etc.

The early layers contain information for a small region of the image, whereas the later layers contain information for a large region of the image. Thus, as we add more layers, the spatial size of the feature maps keeps decreasing and the number of channels keeps increasing. The down-sampling is due to the pooling layers. The convolutional layers coupled with down-sampling layers produce a low-resolution tensor containing the high-level information.

Taking the low-resolution spatial tensor, which contains high-level information, we have to produce high-resolution segmentation outputs. To do that, we add more convolutional layers coupled with up-sampling layers, which increase the size of the spatial tensor. As we increase the resolution, we decrease the number of channels as we get back to low-level information. This is called an encoder-decoder architecture.
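A minimal shape check of this down-/up-sampling behavior in Keras (the 128 × 128 × 32 input is an arbitrary assumption):

from keras.layers import Input, MaxPooling2D, UpSampling2D

x = Input(shape=(128, 128, 32))
down = MaxPooling2D((2, 2))(x)    # shape (64, 64, 32): spatial size halves
up = UpSampling2D((2, 2))(down)   # shape (128, 128, 32): spatial size doubles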

If we simply stack the encoder and decoder layers, low-level information can be lost. Hence, the boundaries in the segmentation maps produced by the decoder could be inaccurate. To make up for the information lost, we let the decoder access the low-level features produced by the encoder layers. This is accomplished by skip connections.

Transfer Learning in Segmentation:

A model trained for an image classification task contains meaningful information that can be used for segmentation as well. We can reuse the convolutional layers of a pre-trained model in the encoder layers of the segmentation model. Using ResNet or VGG pre-trained on the ImageNet dataset is a popular choice.
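As a minimal sketch of this idea (the choice of VGG16 and of the block3_pool cut-off layer are assumptions, not a prescription):

from keras.applications import VGG16

# Load an ImageNet-pretrained backbone without its classification head
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
encoder_output = base.get_layer('block3_pool').output  # an intermediate feature map
# Decoder layers with up-sampling and skip connections would be built on top of encoder_output.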

Building the Model:

We will be using the Keras API to build the segmentation model with skip connections.

from keras.layers import Input, Conv2D, MaxPooling2D, Dropout, UpSampling2D, concatenate

img_input = Input(shape=(input_height, input_width, 3))  # input_height and input_width are set by the user

Encoder Layers:

Here, each block contains two convolution layers and one max pooling layer which would down-sample the image by a factor of two.

# Block 1: two 3x3 convolutions followed by 2x2 max pooling (halves the spatial size)
conv1 = Conv2D(32, (3, 3), activation='relu', padding='same')(img_input)
conv1 = Dropout(0.2)(conv1)
conv1 = Conv2D(32, (3, 3), activation='relu', padding='same')(conv1)
pool1 = MaxPooling2D((2, 2))(conv1)

# Block 2: the same pattern with twice the channels
conv2 = Conv2D(64, (3, 3), activation='relu', padding='same')(pool1)
conv2 = Dropout(0.2)(conv2)
conv2 = Conv2D(64, (3, 3), activation='relu', padding='same')(conv2)
pool2 = MaxPooling2D((2, 2))(conv2)

conv1 and conv2 contain the intermediate encoder outputs, which will be used by the decoder. pool2 is the final output of the encoder.

Decoder Layers:

We concatenate the intermediate encoder outputs with the intermediate decoder outputs; these concatenations are the skip connections.

# Bottleneck: lowest resolution, highest number of channels
conv3 = Conv2D(128, (3, 3), activation='relu', padding='same')(pool2)
conv3 = Dropout(0.2)(conv3)
conv3 = Conv2D(128, (3, 3), activation='relu', padding='same')(conv3)

# Up-sample and concatenate with conv2 (skip connection)
up1 = concatenate([UpSampling2D((2, 2))(conv3), conv2], axis=-1)
conv4 = Conv2D(64, (3, 3), activation='relu', padding='same')(up1)
conv4 = Dropout(0.2)(conv4)
conv4 = Conv2D(64, (3, 3), activation='relu', padding='same')(conv4)

# Up-sample and concatenate with conv1 (skip connection)
up2 = concatenate([UpSampling2D((2, 2))(conv4), conv1], axis=-1)
conv5 = Conv2D(32, (3, 3), activation='relu', padding='same')(up2)
conv5 = Dropout(0.2)(conv5)
conv5 = Conv2D(32, (3, 3), activation='relu', padding='same')(conv5)

Here conv1 is concatenated with the up-sampled conv4, and conv2 is concatenated with the up-sampled conv3.

out = Conv2D(n_classes, (1, 1), padding='same')(conv5)

To get the final outputs, we add a 1×1 convolution with the number of filters equal to the number of classes.
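To turn this into a trainable model, a per-pixel softmax and a compile step can be added (a minimal sketch; the optimizer and loss below are common choices, not mandated by the architecture):

from keras.layers import Activation
from keras.models import Model

out = Activation('softmax')(out)  # per-pixel class probabilities over the channel axis
model = Model(img_input, out)
model.compile(optimizer='adam', loss='categorical_crossentropy')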

Base Model for Segmentation:

Usually, deep-learning-based segmentation models are built upon a base CNN. A standard model such as ResNet, VGG, or MobileNet is typically chosen as the base network. Some initial layers of the base network are used in the encoder, and the rest of the segmentation network is built on top of that. For most segmentation models, any base network can be used.

1. Fully Convolutional Network (FCN): This model uses various blocks of convolution and max-pooling layers to first down-sample an image to 1/32nd of its original size. It then makes a class prediction at this level of granularity. Finally, it uses up-sampling and deconvolution layers to resize the prediction to the image's original dimensions. The three variants are FCN8, FCN16, and FCN32. In FCN8 and FCN16, skip connections are used.

2. UNet: The UNet architecture adopts an encoder-decoder framework with skip connections. The encoder and decoder layers are symmetrical to each other.

3. PSPNet: The Pyramid Scene Parsing Network is optimized to learn a better global context representation of a scene. First, the image is passed to the base network to get a feature map. The feature map is down-sampled to different scales, and convolution is applied to the pooled feature maps. After that, all the feature maps are up-sampled to a common scale and concatenated together. Finally, another convolution layer is used to produce the final segmentation outputs. Here, smaller objects are captured well by the features pooled at high resolution, whereas large objects are captured by the features pooled at a smaller size. A minimal sketch of this pyramid pooling step is given below.
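The following sketch assumes the base network outputs a 24 × 24 feature map and uses hypothetical bin sizes; it illustrates the pool-convolve-upsample-concatenate pattern, not the official PSPNet implementation:

from keras.layers import AveragePooling2D, Conv2D, UpSampling2D, concatenate

def pyramid_pooling(features, grid=24, bin_sizes=(1, 2, 3, 6)):
    # features: a (grid x grid x C) feature map from the base network
    pooled = [features]
    for b in bin_sizes:
        x = AveragePooling2D(pool_size=grid // b)(features)  # pool down to a b x b grid
        x = Conv2D(64, (1, 1), activation='relu')(x)         # reduce channels
        x = UpSampling2D(grid // b)(x)                       # up-sample back to grid x grid
        pooled.append(x)
    return concatenate(pooled, axis=-1)  # fuse context captured at multiple scales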

Conclusion:

In this post, we discussed the concepts of deep-learning-based segmentation and various popular models. Using Keras, we implemented the architecture of a basic segmentation model.

Feel free to contact me.


Written By-

Abbas Ismail (abbas.tel2342@gmail.com)

Birla Institute of Technology, Mesra
