Semantic Segmentation in Self-driving Cars

A bare-bones intro to distinguishing objects in a road scene for autonomous vehicles, using PyTorch 🔥

4 min read · Jul 1, 2020

Driving a car is often taken for granted. Little do we know that this seemingly effortless task is the result of millions of years of evolution. We’re able to identify the objects on the road and estimate their speeds and positions, while keeping an intuitive, predictive sense of what these objects are going to do next. All of this happens within a split second.

Can we give computers the same ability?

A video input is a sequence of images displayed at a rapid pace. The images themselves are matrices of integer values representing the intensities of different colors. The advent of Deep Learning algorithms and greater compute power has finally enabled us to take a shot at these previously untouchable problems.

Introduction

In semantic segmentation, the model is tasked with partitioning the image into several components by assigning a class label to every pixel, so that each component represents a different entity.

We input the image on the left, and the model identifies the cat, the grass, the mountain, and the sky in the output on the right.

We are about to attempt the same problem but in a complex road scenario. We use a subset of the Cityscapes Dataset, which has image pairs of a road scene and its segmented counterpart. The dataset can be downloaded here. The dataset contains images taken from vehicles in Germany.

This dataset has 2975 training image files and 500 validation image files. Each image file is 256 x 512 pixels, and each file is a composite with the original photo on the left half of the image and the labeled image (the output of semantic segmentation) on the right half.
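To make the composite layout concrete, here is a minimal sketch of how one file could be cut into its two halves. It assumes the image is already loaded as a NumPy array of shape (height, width, channels) with height 256 and width 512; the helper name split_composite and the synthetic array are made up for illustration and are not taken from the original notebook.

```python
import numpy as np


def split_composite(image):
    """Split an (H, 2*W, C) composite into its left (photo) and right (label) halves."""
    half = image.shape[1] // 2
    photo = image[:, :half]   # left half: the original road scene
    label = image[:, half:]   # right half: the color-coded segmentation
    return photo, label


# Synthetic stand-in for one dataset file: black photo half, white label half.
composite = np.zeros((256, 512, 3), dtype=np.uint8)
composite[:, 256:] = 255

photo, label = split_composite(composite)
print(photo.shape, label.shape)
```

Both halves come out with the same height and half the width of the composite.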

Step 1. Building the Model

Setup

Before we proceed, we have to import the necessary libraries.

Import libraries

Configure GPU

Secondly, let’s configure our GPU options. If there is a GPU available, this snippet detects it and sets the device to GPU.
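A minimal sketch of this kind of device selection is shown below. The variable name device and the "cuda:0" index are assumptions; the original snippet may differ in detail.

```python
import torch

# Use the first CUDA device if one is visible, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Tensors (and later the model) are moved to the chosen device with .to(device).
x = torch.zeros(2, 3).to(device)
print(device, x.device)
```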

Enable GPU if it is available

File System

Now, let’s proceed to tidy up our file system locations.

Sample Image Retrieval

Now that we’ve established a file path, let’s fetch a sample image and see if the retrieval process works fine.

Fetching a sample input & output pair

Define Output Labels

We now define the labels in the output image. Ideally, every pixel of the input image is annotated and assigned labels by hand. Since we already have the color-coded output image, we just need to segregate these colors as different labels.

I have used a simple K-means algorithm to group these label colors into 10 different classes. This will be the ground truth by proxy for our model.

To train our clustering model, we first create color_array, a randomly initialized matrix, and fit the model on it.

Initialize the model and train the model

Here, we take the output image (which is color-coded for different objects) and apply our K-means clustering model on it. Let’s see how it turns out.
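The clustering and labeling steps described above can be sketched roughly as follows, using scikit-learn's KMeans. The number of random colors (1000), the n_init and random_state settings, and the random stand-in for a real label image are all assumptions made for illustration; only color_array and the 10 classes come from the article.

```python
import numpy as np
from sklearn.cluster import KMeans

num_classes = 10
rng = np.random.default_rng(0)

# Randomly initialized matrix of RGB colors, one color per row (size is assumed).
color_array = rng.integers(0, 256, size=(1000, 3)).astype(np.float64)

# Fit K-means so that every RGB color maps to one of num_classes clusters.
label_model = KMeans(n_clusters=num_classes, n_init=10, random_state=0)
label_model.fit(color_array)

# Stand-in for the color-coded label half of a sample image.
label_rgb = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)

# Flatten to a list of pixels, predict a cluster per pixel, restore the image shape.
pixels = label_rgb.reshape(-1, 3).astype(np.float64)
label_class = label_model.predict(pixels).reshape(label_rgb.shape[:2])
print(label_class.shape, label_class.min(), label_class.max())
```

Each pixel of label_class now holds an integer class index instead of a color, which is the form a classification loss expects.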

Split the sample image into its input and output components

Apply the clustering algorithm on a sample image

Define the Dataset

We define the class CityscapeDataset, which acts as an iterator that returns a single input road scene image X and its corresponding label image Y. This is handled by the __getitem__() method. Furthermore, the image is normalized using the transform() function. This reduces the impact that any single channel (R, G, B) can have on the activations and, eventually, the outputs.
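Here is a hedged sketch of what such a dataset class might look like. To stay self-contained it takes in-memory composite arrays rather than reading files from disk, and it receives the labeling step as a callable (label_fn); both are assumptions. The normalization constants (mean 0.5, std 0.5) and the helper toy_label_fn are also made up for illustration.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class CityscapeDataset(Dataset):
    def __init__(self, composites, label_fn, mean=0.5, std=0.5):
        self.composites = composites  # list of (H, 2*W, 3) uint8 arrays (assumed input)
        self.label_fn = label_fn      # maps (H, W, 3) colors to (H, W) class ids
        self.mean, self.std = mean, std

    def __len__(self):
        return len(self.composites)

    def transform(self, photo):
        # HWC uint8 -> CHW float in [0, 1], then normalize every channel the same way.
        x = torch.from_numpy(photo.copy()).permute(2, 0, 1).float() / 255.0
        return (x - self.mean) / self.std

    def __getitem__(self, index):
        image = self.composites[index]
        half = image.shape[1] // 2
        photo, label_rgb = image[:, :half], image[:, half:]
        X = self.transform(photo)
        Y = torch.from_numpy(np.asarray(self.label_fn(label_rgb))).long()
        return X, Y


def toy_label_fn(label_rgb):
    # Hypothetical stand-in for the K-means model: two classes split on brightness.
    return (label_rgb.mean(axis=2) > 127).astype(np.int64)


rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(256, 512, 3), dtype=np.uint8) for _ in range(4)]
dataset = CityscapeDataset(images, toy_label_fn)
X, Y = dataset[0]
print(len(dataset), X.shape, Y.shape)
```

In practice label_fn would wrap the fitted clustering model, so that Y contains one of the 10 class indices per pixel.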

Let’s instantiate this class and verify the function.

The UNet Model

The end-to-end model we are going to use is the UNet model. In a naive image classification task, the primary objective is to convert the input feature map into a vector. In semantic segmentation, however, we also have to reconstruct an image from this vector. That’s where UNet comes into the picture: the feature maps that helped us convert the input into the vector are reused to perform the reverse operation. Thus, we have the contracting block and the expansive block.
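To illustrate the contracting and expansive blocks, here is a deliberately small UNet-style sketch. It is not the model from the original notebook: the class name TinyUNet, the two-level depth, the channel widths, and the use of batch normalization are all assumptions chosen to keep the example short.

```python
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by batch norm and ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class TinyUNet(nn.Module):
    def __init__(self, in_channels=3, num_classes=10, base=16):
        super().__init__()
        # Contracting path: convolve, then halve the resolution.
        self.down1 = conv_block(in_channels, base)
        self.down2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(base * 2, base * 4)
        # Expansive path: upsample, concatenate the matching encoder features, convolve.
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, kernel_size=2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, kernel_size=2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        # 1x1 convolution producing one score per class for every pixel.
        self.head = nn.Conv2d(base, num_classes, kernel_size=1)

    def forward(self, x):
        d1 = self.down1(x)                   # (N, base, H, W)
        d2 = self.down2(self.pool(d1))       # (N, 2*base, H/2, W/2)
        b = self.bottleneck(self.pool(d2))   # (N, 4*base, H/4, W/4)
        u2 = self.dec2(torch.cat([self.up2(b), d2], dim=1))
        u1 = self.dec1(torch.cat([self.up1(u2), d1], dim=1))
        return self.head(u1)                 # (N, num_classes, H, W)


model = TinyUNet()
logits = model(torch.randn(1, 3, 64, 64))
print(logits.shape)
```

The torch.cat calls are the skip connections: they hand the encoder's feature maps to the decoder at the same resolution, which is what lets the output keep sharp object boundaries.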

The UNet model

Voila! The model has been created. Let’s proceed to train the model. To get a more comprehensive view of the model, check out this article.

Step 2. Training the model

We train the model for 10 epochs using a learning_rate of 0.01. Further, we use CrossEntropyLoss as the loss function to evaluate the model, and the Adam optimizer to update the weights based on the gradients.
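The training setup can be sketched as below. The 10 epochs, the learning rate of 0.01, CrossEntropyLoss, and Adam come from the text; the batch size, the random tensors standing in for the dataset, and the single convolution standing in for the UNet are assumptions that keep the example small enough to run anywhere.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Parameters (batch_size is assumed; the rest follows the article).
num_classes, epochs, learning_rate, batch_size = 10, 10, 0.01, 4

# Random stand-in data: images (N, 3, H, W) and per-pixel class ids (N, H, W).
X = torch.randn(8, 3, 32, 32)
Y = torch.randint(0, num_classes, (8, 32, 32))
loader = DataLoader(TensorDataset(X, Y), batch_size=batch_size, shuffle=True)

# Stand-in model: any module mapping (N, 3, H, W) to (N, num_classes, H, W) fits here.
model = nn.Conv2d(3, num_classes, kernel_size=3, padding=1).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

epoch_losses = []
for epoch in range(epochs):
    model.train()
    running = 0.0
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        logits = model(xb)            # (N, num_classes, H, W)
        loss = criterion(logits, yb)  # targets are class indices of shape (N, H, W)
        loss.backward()
        optimizer.step()
        running += loss.item()
    epoch_losses.append(running / len(loader))

print(epoch_losses)
```

Note that CrossEntropyLoss takes raw logits and integer class indices, so no softmax or one-hot encoding is needed before the loss.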

Define parameters

Load the data

Instantiate the model

Define the Loss function and the Optimizer

Train the model using the training data

Let’s plot the training losses to observe the trend.

Let’s save the model and move on to the model predictions section.

Step 3. Check model predictions

We load the saved model and fetch the test set. We then iterate through the set and generate predictions for these new, unseen images.
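A minimal sketch of the save, load, and predict cycle is shown below. The file name model.pth, the helper make_model, and the single-convolution stand-in for the trained network are hypothetical; the pattern of saving the state_dict and taking the argmax over the class dimension is the part that carries over.

```python
import os
import tempfile

import torch
import torch.nn as nn


def make_model(num_classes=10):
    # Hypothetical stand-in for the trained segmentation network.
    return nn.Conv2d(3, num_classes, kernel_size=3, padding=1)


# Save only the learned weights (the state_dict) to disk.
model = make_model()
path = os.path.join(tempfile.mkdtemp(), "model.pth")
torch.save(model.state_dict(), path)

# Rebuild the same architecture, load the weights, and switch to evaluation mode.
restored = make_model()
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()

# Predict: the class with the highest score at each pixel becomes the label.
batch = torch.randn(2, 3, 32, 32)
with torch.no_grad():
    logits = restored(batch)      # (N, num_classes, H, W)
    pred = logits.argmax(dim=1)   # (N, H, W) integer class ids

print(pred.shape, pred.dtype)
```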

Load the saved model

Step 4. IOU Score

Although we can see visually that the model has done alright, we need to quantify this with a metric score. There are several metrics for doing just that; I have chosen the mean overall IOU (intersection over union), which we calculate as the final step. Read more on IOU here.
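For reference, one common way to compute a mean IOU over classes is sketched below. The article's "mean overall IOU" may be defined or averaged differently, so treat the function mean_iou and the tiny example arrays as illustrative assumptions.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Average the per-class intersection-over-union, skipping absent classes."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class c appears in neither the prediction nor the target
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))


# Tiny 2 x 4 example with two classes; one pixel is predicted incorrectly.
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred = np.array([[0, 0, 0, 1],
                 [0, 0, 1, 1]])

# Class 0: intersection 4, union 5 -> 0.8; class 1: intersection 3, union 4 -> 0.75.
score = mean_iou(pred, target, num_classes=2)
print(score)
```

A perfect prediction gives 1.0, and every misclassified pixel lowers the score of both the class it was assigned to and the class it belongs to.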

This shows the overlap between the actual labels and the predicted outputs. In our case, we get around 97%. Not bad for a first try!

Final thoughts

The sensitivity of the application domain means that we require much greater accuracy than this. To improve the model further (or to get the whole source code given above), you can take this starter notebook and build upon it.

Do check out Gokul Karthik’s implementation here. Give his repo a visit for some cool Deep Learning projects. This project implements an array of models that tackle semantic segmentation.

A big shoutout to Jovian.ml for the platform to save and embed the notebooks.

Show some love by clapping, folks!