Semantic Segmentation on Indian Driving Dataset

UNet and PSPNet Implementation

Yash Marathe
Analytics Vidhya
14 min read · May 31, 2020


Indian Driving Environment (Image Source-Page)

Table of Contents

  1. What is Image Semantic Segmentation?
  2. Indian Driving Dataset Introduction
  3. Dataset Overview
  4. Performance metric
  5. Data Visualization
  6. Data Preparation
  7. Model Building
  8. Conclusion
  9. Further Improvements
  10. References

What is Semantic Segmentation?

Semantic segmentation is a computer vision task that partitions a digital image into multiple meaningful regions. As cameras and other devices increasingly need to see and interpret their surroundings, image segmentation has become an indispensable technique for teaching machines how to understand the world around them.

It differs from image recognition, which assigns one or more labels to an entire image, and from object detection, which localizes objects within an image by drawing bounding boxes around them. Semantic segmentation provides more fine-grained information about the contents of an image.

We can think of image semantic segmentation as image classification at the pixel level. For example, in an image containing many cars, segmentation will label every pixel belonging to a car with the same "car" class. Every photo is made up of many individual pixels, and the goal of image segmentation is to assign each of those pixels to the object to which it belongs. Segmenting an image allows us to separate the foreground from the background, identify the precise location of a bus or a building, and clearly mark the boundaries that separate a tree from the sky. For a clearer and more detailed explanation, visit Source Page.

Semantic Segmentation Example (Image Source — Page)

Indian Driving Dataset Introduction

Most datasets for autonomous navigation tend to focus on structured driving environments. This usually corresponds to well-delineated infrastructure such as lanes, a small number of well-defined categories of traffic participants, low variation in object or background appearance, and strong adherence to traffic rules.

An Image from Cityscapes dataset containing structured driving environment:

Sample Image from Cityscapes dataset

And this is what a typical Indian Driving Environment looks like:

An Indian “Road”

The Indian Driving Dataset (IDD) is a collection of annotated street-level images captured on Indian roads, formatted to facilitate the training of AI systems and neural networks for autonomous driving. It consists of images from unstructured environments where the above assumptions are largely not satisfied. Its label distributions of road scenes differ significantly from those of existing datasets, with most classes displaying greater within-class diversity. Consistent with real driving behaviors, it also identifies new classes such as drivable areas besides the road.

It is difficult to completely avoid ambiguity between some labels. For example, labels like parking, caravan, or trailer cannot be precisely defined due to the diversity of the scenes and vehicles in the data collected. To resolve this issue, the dataset is designed as a 4-level label hierarchy with 7 (level 1), 16 (level 2), 26 (level 3), and 30 (level 4) labels. The idd20k_lite dataset has 7 classes: Drivable, Non-Drivable, Living things, Vehicles, Road-side objects, Far-objects, and Sky. The segmentation challenge is the pixel-level prediction of all 7 classes at level 1 of the label hierarchy.

The images were obtained from a front-facing camera attached to a car, which was driven around the cities of Hyderabad and Bangalore and their outskirts.

(Challenge Source — Page)

Dataset Overview

  • There are 1403 train images, 204 validation images, and 404 test images
  • The shape of the input image and segmentation masks is [227,320,3].
  • The shape of the output segmentation mask expected is [256,128].
  • For each training and validation image, we have its corresponding annotated image containing the labels of each pixel.

The dataset can be downloaded from this Page.

Performance metric

The performance metric is Mean-Intersection-Over-Union (mIoU).

mIoU is a common evaluation metric for semantic image segmentation: it first computes the IoU for each semantic class and then averages over the classes.

Intersection over Union formula visualization (Source-Page)

Examining this equation, you can see that Intersection over Union is simply a ratio. In the numerator, we compute the area of overlap between the predicted region and the ground-truth region. The denominator is the area of the union, i.e., the total area covered by both the prediction and the ground truth.

Dividing the area of overlap by the area of union yields our final score — the Intersection over Union. (Source-Page)
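As a quick illustration, here is a minimal NumPy sketch of how per-class IoU and mIoU can be computed from two integer label maps (this helper is for illustration only and is not part of the original code):

```python
import numpy as np

def mean_iou(y_true: np.ndarray, y_pred: np.ndarray, n_classes: int) -> float:
    """Per-class IoU averaged over the classes present in either label map."""
    ious = []
    for c in range(n_classes):
        intersection = np.logical_and(y_true == c, y_pred == c).sum()
        union = np.logical_or(y_true == c, y_pred == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(intersection / union)
    return float(np.mean(ious))
```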

Data Visualization

Let’s take a sample image and analyze it along with its annotated image from the dataset.

Here’s the image:

A sample image from idd20k_lite dataset

Now, let’s plot a histogram to find the frequency of pixel intensity values.
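Here is a minimal Matplotlib sketch of this step ("sample_image.jpg" is a placeholder path):

```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

# Load the sample image and plot the histogram of its pixel intensities
img = np.array(Image.open("sample_image.jpg"))
plt.hist(img.ravel(), bins=256, range=(0, 255))
plt.xlabel("Pixel intensity")
plt.ylabel("Frequency")
plt.show()
```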

Histogram plot to find the frequency of pixel intensity values of the above sample image

Let’s plot the annotated image now.

First of all, What is Image Annotation?

It is the process of labeling data (here, images) so that the objects in an image become recognizable to machines through computer vision. It is mainly used to detect, classify, and group objects during machine learning training. In an annotated image, each pixel is assigned a class label.

Here’s the annotated image:

Annotated Image for the sample image

Doesn’t make any sense, right?

That's because each pixel is labeled 0–6 for the 7 classes to be predicted in the dataset, so almost all pixels have intensity values in the range [0, 6]. Since an intensity of 0 is typically rendered as black and 255 as white, the image appears mostly black. The few white dots in between are pixels with a value of 255.

Now, let’s draw the histogram plot of the annotated image to get the distribution of the pixel intensity values.

Histogram Plot for annotated image

We can see that almost all the pixel intensity values lie in the range 0–6.

To get a clearer picture of how many pixels belong to each class label, let's count the values belonging to each class.

Zoomed-in Histogram plot of values from 0 to 6
Count of pixel intensity values for each class
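These counts can be obtained with np.unique and return_counts=True ("annotation.png" is a placeholder path):

```python
import numpy as np
from PIL import Image

mask = np.array(Image.open("annotation.png"))
values, counts = np.unique(mask, return_counts=True)
for value, count in zip(values, counts):
    print(f"label {value}: {count} pixels")
```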

We can observe that class label 0 has the highest pixel count and class 2 the lowest. We can simply treat the value 255 as class label 7, which makes prediction easier.

To visualize the annotated image and get a clearer picture, we can use the following method.

In the given annotated image, all the pixel values (excluding 255) lie in the range 0–6 (for 7 classes). We can intensify these pixels to obtain clearer annotations: with a wider range of values, the classes are easier to tell apart because the colors differ more. We will multiply every pixel value other than 0 and 255 by 40, so each class gets a distinctly different shade.
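A minimal sketch of this intensification step (pixels equal to 0 or 255 are left untouched):

```python
import numpy as np
from PIL import Image

mask = np.array(Image.open("annotation.png"))
vis = mask.copy()
# Labels 1-6 become 40, 80, ..., 240, which are easy to tell apart
vis[(vis > 0) & (vis < 255)] *= 40
Image.fromarray(vis).show()
```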

Processed Annotation Image

We can see that it is easy to distinguish between different classes after intensifying the pixels.

Data Preparation

First, let’s define the image size, the number of channels, and the number of classes to be predicted.

Now, we are going to define a function that, given an image path, loads the image along with its corresponding annotation image and returns both as a dictionary.
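Here is a sketch of these two steps. The resize target IMG_SIZE, the dictionary keys, and the file-naming convention used to go from an image path to its annotation path are assumptions about the idd20k_lite layout:

```python
import tensorflow as tf

IMG_SIZE = 128    # images and masks are resized to IMG_SIZE x IMG_SIZE (assumption)
N_CHANNELS = 3    # RGB input
N_CLASSES = 8     # 7 level-1 classes + the 255 "void" pixels mapped to class 7

def parse_image(img_path: str) -> dict:
    """Load an image and its annotation mask and return them as a dictionary."""
    image = tf.io.read_file(img_path)
    image = tf.image.decode_jpeg(image, channels=3)

    # Assumed naming convention: .../leftImg8bit/..._image.jpg -> .../gtFine/..._label.png
    mask_path = tf.strings.regex_replace(img_path, "leftImg8bit", "gtFine")
    mask_path = tf.strings.regex_replace(mask_path, "_image.jpg", "_label.png")
    mask = tf.io.read_file(mask_path)
    mask = tf.image.decode_png(mask, channels=1)

    # Treat the 255 "void" pixels as class label 7, as discussed earlier
    mask = tf.where(mask == 255, tf.ones_like(mask) * 7, mask)
    return {'image': image, 'segmentation_mask': mask}
```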

We need to get all the filenames of the train and validation images. For that, we can use tensorflow.data.Dataset.list_files. This method returns a dataset of all files matching one or more glob patterns, i.e., a dataset of strings, one per filename matching the pattern passed to the method.
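A sketch of this step; the glob patterns are an assumption about the idd20k_lite folder structure:

```python
train_dataset = tf.data.Dataset.list_files(
    "idd20k_lite/leftImg8bit/train/*/*_image.jpg", seed=42)
train_dataset = train_dataset.map(parse_image)

val_dataset = tf.data.Dataset.list_files(
    "idd20k_lite/leftImg8bit/val/*/*_image.jpg", seed=42)
val_dataset = val_dataset.map(parse_image)
```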

We need to load the train and validation images now. But before loading a training image, we are going to apply a simple transformation. It helps increase the amount of relevant data by introducing variations into the dataset.
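Here is a sketch of the loading functions with a random horizontal flip as the augmentation; the normalization to [0, 1] and the nearest-neighbour resizing of the masks are assumptions:

```python
def normalize(image, mask):
    return tf.cast(image, tf.float32) / 255.0, mask

@tf.function
def load_image_train(datapoint: dict):
    image = tf.image.resize(datapoint['image'], (IMG_SIZE, IMG_SIZE))
    mask = tf.image.resize(datapoint['segmentation_mask'], (IMG_SIZE, IMG_SIZE),
                           method='nearest')
    if tf.random.uniform(()) > 0.5:  # simple augmentation: random horizontal flip
        image = tf.image.flip_left_right(image)
        mask = tf.image.flip_left_right(mask)
    return normalize(image, mask)

@tf.function
def load_image_test(datapoint: dict):
    image = tf.image.resize(datapoint['image'], (IMG_SIZE, IMG_SIZE))
    mask = tf.image.resize(datapoint['segmentation_mask'], (IMG_SIZE, IMG_SIZE),
                           method='nearest')
    return normalize(image, mask)
```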

For a detailed explanation of data augmentation techniques, visit Page

Before applying the final transformations, let’s define the batch size, buffer size, and the final dataset that is going to be transformed. The final dataset is going to be a dictionary with ‘train’ as key with train_dataset as its value and ‘val’ as key with val_dataset as its value.
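For example (the batch and buffer sizes here are assumptions; pick values that fit your GPU memory):

```python
BATCH_SIZE = 32     # assumption
BUFFER_SIZE = 1000  # shuffle buffer size (assumption)

dataset = {'train': train_dataset, 'val': val_dataset}
```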

We are going to apply the following transformations on the dataset.

  1. Map function

Maps map_func across all the elements of the dataset.

This transformation applies map_func to each element of the dataset and returns a new dataset containing the transformed elements, in the same order as they appeared in the input. map_func can be used to change both the values and the structure of a dataset’s elements.

In our case, we are going to map the load_image_train function on each element of the Train dataset and load_image_test function on each element of the Validation dataset.

2. Shuffle

Randomly shuffles the elements of this dataset.

This dataset fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required.

We are going to use this method only on the train dataset.

3. Repeat

Repeats this dataset so each original value is seen count times. The default behavior (if the count is None or -1) is for the dataset to be repeated indefinitely.

4. Batch

Combines consecutive elements of this dataset into batches.

5. Prefetch

Creates a Dataset that prefetches elements from this dataset.

Most dataset input pipelines should end with a call to prefetch. This allows later elements to be prepared while the current element is being processed. This often improves latency and throughput, at the cost of using additional memory to store prefetched elements.

We can apply all the above transformations on our dataset using the following code.
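A sketch of the complete input pipeline (map, shuffle, repeat, batch, and prefetch for training; the same without shuffling for validation):

```python
AUTOTUNE = tf.data.experimental.AUTOTUNE

dataset['train'] = dataset['train'].map(load_image_train, num_parallel_calls=AUTOTUNE)
dataset['train'] = dataset['train'].shuffle(buffer_size=BUFFER_SIZE, seed=42)
dataset['train'] = dataset['train'].repeat()
dataset['train'] = dataset['train'].batch(BATCH_SIZE)
dataset['train'] = dataset['train'].prefetch(buffer_size=AUTOTUNE)

dataset['val'] = dataset['val'].map(load_image_test)
dataset['val'] = dataset['val'].repeat()
dataset['val'] = dataset['val'].batch(BATCH_SIZE)
dataset['val'] = dataset['val'].prefetch(buffer_size=AUTOTUNE)
```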

So, we are done with the data preparation now.

Now, let’s take a sample image from the dataset and visualize it.

Visualizing a sample image from the Dataset
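A sketch of a display_sample helper and how to call it on one training batch:

```python
import matplotlib.pyplot as plt

def display_sample(display_list):
    """Show an image, its true mask, and (optionally) a predicted mask side by side."""
    plt.figure(figsize=(12, 6))
    titles = ['Input Image', 'True Mask', 'Predicted Mask']
    for i, item in enumerate(display_list):
        plt.subplot(1, len(display_list), i + 1)
        plt.title(titles[i])
        plt.imshow(tf.keras.preprocessing.image.array_to_img(item))
        plt.axis('off')
    plt.show()

for image, mask in dataset['train'].take(1):
    sample_image, sample_mask = image[0], mask[0]

display_sample([sample_image, sample_mask])
```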

The output of display_sample function

Now that we are done with the Data Preparation, let’s move on to Model Building.

Model Building

Most segmentation models consist of two parts. The first is the encoder, in which we downsample the spatial resolution of the input, producing lower-resolution feature maps that learn to discriminate efficiently between classes.

The decoder then upsamples these feature representations into a full-resolution segmentation map.

I have implemented and trained two models, U-Net and PSPNet, on this dataset.

U-Net Model

U-Net is a convolutional network architecture for fast and precise segmentation of images. At the time of its publication, it outperformed the prior best method (a sliding-window convolutional network) on the ISBI challenge for the segmentation of neuronal structures in electron microscopic stacks.

The main idea behind this architecture is to supplement the usual contracting network with successive layers that increase the resolution of the output. In order to localize, high-resolution features from the contracting path are combined with the upsampled output. A successive convolution layer can then learn to assemble a more precise output based on this information.

In the upsampling part, there are a large number of feature channels, which allow the network to propagate context information to higher resolution layers. As a consequence, the expansive path is more or less symmetric to the contracting path and yields a u-shaped architecture. The network does not have any fully connected layers and only uses the valid part of each convolution, i.e., the segmentation map only contains the pixels, for which the full context is available in the input image.

U-Net Architecture

U-Net Architecture

The U-Net architecture is symmetric and consists of two major parts: the contracting path (encoder), built from standard convolution operations, and the expansive path (decoder), built from upsampling operations.

Encoder

The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) and a 2x2 max pooling operation with stride 2 for downsampling. At each downsampling step, we double the number of feature channels.

Encoder Code
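A sketch of one contracting block using the Keras functional API. Note that 'same' padding is used here instead of the paper's unpadded convolutions, so no cropping is needed in the decoder:

```python
from tensorflow.keras import layers

def encoder_block(inputs, n_filters):
    """Two 3x3 convolutions followed by 2x2 max pooling with stride 2."""
    c = layers.Conv2D(n_filters, 3, padding='same', activation='relu')(inputs)
    c = layers.Conv2D(n_filters, 3, padding='same', activation='relu')(c)
    p = layers.MaxPooling2D(pool_size=2, strides=2)(c)
    return c, p   # c is kept for the skip connection, p feeds the next block
```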

Decoder

Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution (“up-convolution”) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss of border pixels in every convolution.

Decoder Code
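A matching sketch of one expansive block (since the encoder above uses 'same' padding, the skip features are concatenated directly instead of being cropped):

```python
def decoder_block(inputs, skip_features, n_filters):
    """Upsample, concatenate the encoder skip connection, then two 3x3 convolutions."""
    u = layers.Conv2DTranspose(n_filters, 2, strides=2, padding='same')(inputs)
    u = layers.Concatenate()([u, skip_features])
    c = layers.Conv2D(n_filters, 3, padding='same', activation='relu')(u)
    c = layers.Conv2D(n_filters, 3, padding='same', activation='relu')(c)
    return c
```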

Now, we will stack the Encoder and Decoder blocks with different filter sizes together into a single model.
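A sketch of the assembled model; the filter sizes (64 up to 1024) follow the original U-Net paper, and the softmax output has one channel per class:

```python
def build_unet(input_shape=(IMG_SIZE, IMG_SIZE, N_CHANNELS), n_classes=N_CLASSES):
    inputs = layers.Input(input_shape)

    # Contracting path
    s1, p1 = encoder_block(inputs, 64)
    s2, p2 = encoder_block(p1, 128)
    s3, p3 = encoder_block(p2, 256)
    s4, p4 = encoder_block(p3, 512)

    # Bottleneck
    b = layers.Conv2D(1024, 3, padding='same', activation='relu')(p4)
    b = layers.Conv2D(1024, 3, padding='same', activation='relu')(b)

    # Expansive path
    d1 = decoder_block(b, s4, 512)
    d2 = decoder_block(d1, s3, 256)
    d3 = decoder_block(d2, s2, 128)
    d4 = decoder_block(d3, s1, 64)

    outputs = layers.Conv2D(n_classes, 1, activation='softmax')(d4)
    return tf.keras.Model(inputs, outputs, name='U-Net')

model = build_unet()
```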

Training the model

To train the model, we first need to define a loss object, an optimizer, and some metrics to monitor the model's performance.
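For instance (sparse categorical cross-entropy is used because the masks contain integer labels; the Adam learning rate is an assumption):

```python
loss_object = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
val_loss = tf.keras.metrics.Mean(name='val_loss')
val_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='val_accuracy')
```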

Now, we are going to use TensorFlow's GradientTape to create a custom training loop. Let's briefly discuss how gradients are computed with TensorFlow in eager execution.

With eager execution enabled, TensorFlow calculates the values of tensors as they occur in your code, rather than precomputing a static graph whose inputs are fed in through placeholders. This means that, to backpropagate errors, we have to keep track of the gradients of the computation ourselves and then apply these gradients through an optimizer.

TensorFlow provides the tf.GradientTape API for automatic differentiation; (automatic differentiation is useful for implementing machine learning algorithms such as backpropagation for training neural networks) that is, computing the gradient of a computation with respect to its input variables. TensorFlow “records” all operations executed inside the context of a tf.GradientTape onto a “tape”. It then uses that tape and the gradients associated with each recorded operation to compute the gradients of a “recorded” computation using reverse mode differentiation.

Gradient tapes use memory to store intermediate results, including inputs and outputs, for use during the backward pass. For efficiency, some ops (like ReLU) don’t need to keep their intermediate results and they are pruned during the forward pass. However, if you use persistent=True on your tape, nothing is discarded and your peak memory usage will be higher.

To learn more about GradientTape in TensorFlow 2.0, you can visit this Page.

Training Code
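A sketch of the training and validation steps built on tf.GradientTape:

```python
@tf.function
def train_step(images, masks):
    with tf.GradientTape() as tape:
        predictions = model(images, training=True)
        loss = loss_object(masks, predictions)
    # Backpropagate: compute gradients and let the optimizer apply them
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss(loss)
    train_accuracy(masks, predictions)

@tf.function
def val_step(images, masks):
    predictions = model(images, training=False)
    val_loss(loss_object(masks, predictions))
    val_accuracy(masks, predictions)
```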

The following code trains the model for a given number of epochs. First, we call the train_and_checkpoint function, which checks for an existing checkpoint; if one exists, training resumes from it. Then, it loops through the entire dataset and records the metrics. We use tf.summary.scalar to save the loss and accuracy values for TensorBoard visualization. We then check the performance of the model on the validation data, and if the validation accuracy has improved, we save the model weights to disk. Finally, we print the metrics for that epoch and reset the loss and metric objects.

Code
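A simplified sketch of this loop. It restores the latest checkpoint inline instead of through a separate train_and_checkpoint helper, and the step counts, epoch count, and paths are assumptions:

```python
STEPS_PER_EPOCH = 1403 // BATCH_SIZE   # training batches per epoch
VAL_STEPS = 204 // BATCH_SIZE          # validation batches per epoch
EPOCHS = 100

ckpt = tf.train.Checkpoint(optimizer=optimizer, model=model)
manager = tf.train.CheckpointManager(ckpt, './checkpoints', max_to_keep=3)
ckpt.restore(manager.latest_checkpoint)   # resume if a checkpoint already exists

writer = tf.summary.create_file_writer('./logs')
best_val_accuracy = 0.0

for epoch in range(EPOCHS):
    for images, masks in dataset['train'].take(STEPS_PER_EPOCH):
        train_step(images, masks)
    for images, masks in dataset['val'].take(VAL_STEPS):
        val_step(images, masks)

    with writer.as_default():   # scalars for TensorBoard
        tf.summary.scalar('train_loss', train_loss.result(), step=epoch)
        tf.summary.scalar('train_accuracy', train_accuracy.result(), step=epoch)
        tf.summary.scalar('val_loss', val_loss.result(), step=epoch)
        tf.summary.scalar('val_accuracy', val_accuracy.result(), step=epoch)

    if val_accuracy.result() > best_val_accuracy:   # keep only the best weights
        best_val_accuracy = float(val_accuracy.result())
        model.save_weights('./best_weights/unet')

    manager.save()
    print(f"Epoch {epoch + 1}: "
          f"loss={float(train_loss.result()):.4f}, acc={float(train_accuracy.result()):.4f}, "
          f"val_loss={float(val_loss.result()):.4f}, val_acc={float(val_accuracy.result()):.4f}")

    for metric in (train_loss, train_accuracy, val_loss, val_accuracy):
        metric.reset_states()
```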

U-Net Performance on Indian Driving Dataset

We can track and visualize metrics such as loss and accuracy using TensorBoard to monitor the model's performance.

After training the model for 100 epochs, the graph looks like this:

Accuracy and Loss graphs of Training and Validation data from IDD

Now, let’s predict a single image after loading the best weights and analyze the prediction.
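A sketch of this step, reusing the display_sample helper from earlier (the weights path matches the one used in the training-loop sketch):

```python
model.load_weights('./best_weights/unet')

for image, mask in dataset['val'].take(1):
    prediction = model(image, training=False)
    pred_mask = tf.argmax(prediction, axis=-1)[..., tf.newaxis]
    display_sample([image[0], mask[0], pred_mask[0]])
```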

Prediction of a single image

We can also get the True Positive, False Positive, False Negative, and IoU values for each class in the predicted image.

Now, we can calculate the mIoU for each validation image and then take the mean over all images.
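A sketch of this evaluation, computing a per-image mIoU with tf.keras.metrics.MeanIoU and averaging it over the validation set:

```python
import numpy as np

ious = []
for images, masks in dataset['val'].take(VAL_STEPS):
    pred_masks = tf.argmax(model(images, training=False), axis=-1)
    for true_mask, pred_mask in zip(masks, pred_masks):
        m = tf.keras.metrics.MeanIoU(num_classes=N_CLASSES)
        m.update_state(true_mask[..., 0], pred_mask)
        ious.append(m.result().numpy())

print('Validation mIoU = ', np.mean(ious))
```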

Output for U-Net Model:

Validation mIoU =  0.44561768240488786

PSPNet Model

The PSPNet architecture takes the global context of the image into account when making local, pixel-level predictions, and hence performs better on benchmark datasets like PASCAL VOC 2012 and Cityscapes. The model was motivated by the observation that FCN-based pixel classifiers are not able to capture the context of the whole image.

The model introduces the pyramid pooling module, which empirically proves to be an effective global contextual prior. Global average pooling, commonly used in image classification tasks, is a good baseline for such a global contextual prior.

PSPNet Architecture

The pyramid pooling module fuses features under four different pyramid scales. The coarsest level, highlighted in red, is global pooling that generates a single-bin output. The following pyramid levels separate the feature map into different sub-regions and form a pooled representation for different locations. The outputs of the different levels in the pyramid pooling module are feature maps of varied sizes. To maintain the weight of the global feature, a 1×1 convolution layer is used after each pyramid level to reduce the channel dimension of the context representation to 1/N of the original, where N is the number of pyramid levels. The low-dimensional feature maps are then directly upsampled via bilinear interpolation to match the size of the original feature map. Finally, the features from the different levels are concatenated as the final pyramid pooling global feature.

The number of pyramid levels and size of each level can be modified. They are related to the size of the feature map that is fed into the pyramid pooling layer. The structure abstracts different sub-regions by adopting varying-size pooling kernels in a few strides. Thus the multi-stage kernels should maintain a reasonable gap in representation.

Now, we can implement PSPNet using the same Encoder-Decoder Approach we used in U-Net.

The Encoder contains Convolution blocks that will generate Feature Maps till Step (b) in the PSPNet Architecture.

Encoder Code
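A simplified sketch of the encoder. The original paper uses a dilated ResNet backbone; plain convolution blocks are used here, mirroring the simpler encoder described above, and they downsample the input by a factor of 8:

```python
def conv_block(inputs, n_filters):
    """Convolution + batch norm + ReLU, followed by 2x2 max pooling."""
    x = layers.Conv2D(n_filters, 3, padding='same')(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    return layers.MaxPooling2D(2)(x)

def pspnet_encoder(inputs):
    x = conv_block(inputs, 64)
    x = conv_block(x, 128)
    x = conv_block(x, 256)
    return x   # feature map at 1/8 of the input resolution
```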

For the decoder block, we need two helper classes, PyramidFeatureMap and PyramidPoolingModule; the latter implements the global contextual prior described above. Sub-region average pooling is performed on the feature map inside PyramidPoolingModule.

Decoder Code
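A sketch of the pyramid pooling module and decoder, written as functions rather than the two helper classes mentioned above. The bin sizes follow the paper (1, 2, 3, 6); when the feature map is not exactly divisible by a bin size, the pooled grid is only approximately that size:

```python
def pyramid_pooling_module(feature_map, bin_sizes=(1, 2, 3, 6)):
    """Average-pool the feature map at several scales, reduce channels with 1x1
    convolutions, upsample back, and concatenate everything."""
    h, w = feature_map.shape[1], feature_map.shape[2]
    n_filters = feature_map.shape[-1] // len(bin_sizes)   # 1/N of the input channels
    branches = [feature_map]
    for bin_size in bin_sizes:
        x = layers.AveragePooling2D(pool_size=(h // bin_size, w // bin_size))(feature_map)
        x = layers.Conv2D(n_filters, 1, padding='same', activation='relu')(x)
        # Bilinear upsampling back to the original feature-map size
        x = layers.Lambda(lambda t: tf.image.resize(t, (h, w)))(x)
        branches.append(x)
    return layers.Concatenate()(branches)

def pspnet_decoder(feature_map, n_classes):
    x = pyramid_pooling_module(feature_map)
    x = layers.Conv2D(256, 3, padding='same', activation='relu')(x)
    x = layers.UpSampling2D(size=8, interpolation='bilinear')(x)   # back to input resolution
    return layers.Conv2D(n_classes, 1, activation='softmax')(x)
```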

Now, we can stack all blocks together and build the Segmentation Model.

Segmentation Model Code
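A sketch of the assembled model (reusing the same constants as the U-Net):

```python
def build_pspnet(input_shape=(IMG_SIZE, IMG_SIZE, N_CHANNELS), n_classes=N_CLASSES):
    inputs = layers.Input(input_shape)
    features = pspnet_encoder(inputs)
    outputs = pspnet_decoder(features, n_classes)
    return tf.keras.Model(inputs, outputs, name='PSPNet')

model = build_pspnet()
```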

The training phase will be the same as the U-Net model.

PSPNet Performance on Indian Driving Dataset

After training for 50 epochs, the graph looks like this:

We can predict the same sample image using the PSPNet model and analyze its performance.

Prediction of Sample Image

We can also get the True Positive, False Positive, False Negative, and IoU values for each class in the predicted image.

The average mIoU for all the validation images for PSPNet predictions is:

Validation mIoU =  0.4333882146774802

So, we can observe that PSPNet does not perform as well as the U-Net model on this dataset.

Conclusion

  • Two different models (UNet and PSPNet) were implemented.
  • Both these models have performed well in the past for Semantic Segmentation Challenges.
  • For this dataset, UNet gives a validation mIoU of 0.44561 while PSPNet gives a validation mIoU of 0.43338.

Further Improvements

  • Variations of U-Net can be tried on this dataset.
  • We can add or decrease the number of Convolution blocks in both the models to see if the performance increases or decreases.
  • Different loss functions that are defined specifically for segmentation models can also be tried.

You can find my complete solution in my GitHub repository, and if you have any suggestions, please comment or contact me via LinkedIn.

References

  1. https://arxiv.org/pdf/1505.04597.pdf
  2. https://arxiv.org/pdf/1612.01105.pdf
  3. https://yann-leguilly.gitlab.io/post/2019-12-14-tensorflow-tfdata-segmentation/
  4. https://medium.com/analytics-vidhya/semantic-segmentation-in-pspnet-with-implementation-in-keras-4843d05fc025
  5. https://github.com/junhoning/machine_learning_tutorial/

Thank you for reading!!

👏 Your 3 claps mean the world to me! If you found value in this article, a simple tap on those clapping hands would make my day.

🚀 Please consider following for more tech-related content.

🌟 If you found this blog helpful and would like to stay updated with more content, feel free to connect with me on LinkedIn.
