Driving a car is often taken for granted. Little do we know that this seemingly effortless task is the result of millions of years of evolution. We can identify the objects on the road and estimate their speeds and positions, all while having an intuitive sense of what those objects are going to do next. All of this happens within a split second.
Can we give computers the same ability?
A video is a sequence of images displayed in rapid succession. Images themselves are matrices of integer values representing the intensities of different colors. The advent of deep learning algorithms and greater compute power has finally enabled us to take a shot at these previously untouchable problems.
In semantic segmentation, the model is tasked with partitioning the image into regions, each representing a different entity, by assigning a class label to every pixel.
We are about to attempt the same problem but in a complex road scenario. We use a subset of the Cityscapes Dataset, which has image pairs of a road scene and its segmented counterpart. The dataset can be downloaded here. The dataset contains images taken from vehicles in Germany.
This dataset has 2975 training image files and 500 validation image files. Each image file is 256 x 512 pixels, and each file is a composite with the original photo on the left half of the image, alongside the labeled image (output of semantic segmentation) on the right half.
Step 1. Building the Model
Before we proceed, we have to import the necessary libraries.
Secondly, let’s configure our GPU options. If there is a GPU available, this snippet detects it and sets the device to GPU.
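As a minimal sketch of the device selection described above (assuming PyTorch is installed; the variable names are illustrative, not necessarily those of the original notebook):

```python
import torch

# Pick the first CUDA GPU when one is visible to PyTorch, otherwise use the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Tensors (and later the model) are moved to the chosen device with .to(device).
x = torch.zeros(1, 3, 8, 8).to(device)
print(device, x.device)
```

Every tensor that is fed to the model has to live on the same device as the model itself, which is why the device is chosen once up front.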
Now, let’s proceed to set up the file paths to our data.
Sample Image Retrieval
Now that we’ve established a file path, let’s fetch a sample image and see if the retrieval process works fine.
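Since each file is a composite, the photo and its label image have to be separated after loading. The sketch below shows one way to do that with NumPy. It uses a synthetic array in place of a real file, and the helper name split_composite is hypothetical.

```python
import numpy as np

def split_composite(composite):
    """Split an (H, 2W, 3) composite into the left photo and the right label image."""
    width = composite.shape[1] // 2
    photo = composite[:, :width, :]
    label = composite[:, width:, :]
    return photo, label

# Synthetic stand-in for one 256 x 512 RGB file: black left half, white right half.
composite = np.zeros((256, 512, 3), dtype=np.uint8)
composite[:, 256:, :] = 255

photo, label = split_composite(composite)
print(photo.shape, label.shape)  # (256, 256, 3) (256, 256, 3)
```

With a real file, the array would come from an image loader instead of np.zeros; the slicing stays the same.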
Define Output Labels
We now define the labels in the output image. Ideally, every pixel of the input image is annotated and assigned labels by hand. Since we already have the color-coded output image, we just need to segregate these colors as different labels.
I have used a simple K-means algorithm to group these colors into 10 different classes. These cluster assignments serve as a proxy for the ground truth for our model.
To train our clustering model, we initialize color_array, which is a randomly initialized matrix.
Here, we take the output image (which is color-coded for different objects) and apply our K-means clustering model on it. Let’s see how it turns out.
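To make the idea concrete, here is a hand-rolled K-means (Lloyd's algorithm) in NumPy. It is a sketch, not the notebook's code, which may well rely on a library implementation. The function names, the 1000-row color_array and the synthetic label image are all assumptions made for illustration.

```python
import numpy as np

def assign_labels(points, centers):
    """Return, for each RGB point, the index of the nearest cluster center."""
    diffs = points[:, None, :].astype(np.float64) - centers[None, :, :]
    return (diffs ** 2).sum(axis=2).argmin(axis=1)

def fit_kmeans(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and mean updates."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(points), size=k, replace=False)
    centers = points[start].astype(np.float64)
    for _ in range(n_iter):
        labels = assign_labels(points, centers)
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:  # leave a center untouched if its cluster is empty
                centers[c] = members.mean(axis=0)
    return centers

def colors_to_classes(label_image, centers):
    """Map an (H, W, 3) color-coded image to an (H, W) array of class ids."""
    h, w, _ = label_image.shape
    return assign_labels(label_image.reshape(-1, 3), centers).reshape(h, w)

num_classes = 10
rng = np.random.default_rng(42)

# Randomly initialized matrix of RGB colors used to fit the clusters.
color_array = rng.integers(0, 256, size=(1000, 3))
centers = fit_kmeans(color_array, num_classes)

# Synthetic stand-in for the color-coded half of one file.
label_image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
class_map = colors_to_classes(label_image, centers)
print(class_map.shape)  # (256, 256)
```

Each pixel ends up with an integer from 0 to 9, which is the form a segmentation loss expects as its target.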
Define the Dataset
We define the class CityscapeDataset, which acts as an iterator that returns a single input road scene image X and its corresponding label image Y. This is handled by the __getitem__() method. Furthermore, the image is normalized using the transform() function. This reduces the impact that any single channel (R, G, B) can have on the activations and eventually the outputs.
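The structure of such a class can be sketched as follows. This is a plain Python class exposing __len__ and __getitem__, the same protocol a PyTorch map-style dataset uses, written with NumPy only so it runs on its own. The class name CityscapeDatasetSketch, the in-memory inputs and the normalization statistics (the commonly used ImageNet mean and standard deviation) are assumptions, not the author's code.

```python
import numpy as np

# Assumed per-channel statistics (the common ImageNet values);
# the original transform() may use different ones.
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

class CityscapeDatasetSketch:
    """Hypothetical stand-in for CityscapeDataset, working on in-memory arrays."""

    def __init__(self, composites, centers):
        self.composites = composites  # list of (H, 2W, 3) uint8 arrays
        self.centers = np.asarray(centers, dtype=np.float64)  # (K, 3) cluster colors

    def __len__(self):
        return len(self.composites)

    def __getitem__(self, index):
        composite = self.composites[index]
        width = composite.shape[1] // 2
        photo, label = composite[:, :width], composite[:, width:]
        return self.transform(photo), self.to_classes(label)

    def transform(self, photo):
        # Scale to [0, 1], normalize each channel, then move channels first.
        scaled = photo.astype(np.float64) / 255.0
        return ((scaled - MEAN) / STD).transpose(2, 0, 1)

    def to_classes(self, label):
        # Assign every label pixel to its nearest cluster color.
        flat = label.reshape(-1, 3).astype(np.float64)
        dists = ((flat[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1).reshape(label.shape[:2])

rng = np.random.default_rng(0)
composites = [rng.integers(0, 256, size=(256, 512, 3), dtype=np.uint8) for _ in range(2)]
centers = rng.integers(0, 256, size=(10, 3))

dataset = CityscapeDatasetSketch(composites, centers)
X, Y = dataset[0]
print(len(dataset), X.shape, Y.shape)  # 2 (3, 256, 256) (256, 256)
```

In a real pipeline the composites would be read from disk inside __getitem__ rather than held in memory, and X and Y would be converted to tensors.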
Let’s instantiate this class and verify the function.
The UNet Model
The end-to-end model which we are going to use is the UNet model. In a naive image classification task, the primary objective is to convert the input feature map into a vector. However, in semantic segmentation, we have to reconstruct an image from this vector as well. That’s where UNet comes into the picture. It reuses the feature maps computed while contracting the input to perform the reverse operation. Thus, we have the contracting block and the expansive block.
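The contracting and expansive blocks can be illustrated with a deliberately small UNet-style network in PyTorch. This is a sketch with only two downsampling levels; the class name TinyUNet, the channel widths and the use of batch normalization are assumptions and do not reproduce the model from the notebook.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions, each followed by batch norm and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Hypothetical two-level UNet: contract, bottleneck, expand with skip connections."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        # Contracting path
        self.down1 = conv_block(3, 16)
        self.down2 = conv_block(16, 32)
        self.bottleneck = conv_block(32, 64)
        # Expansive path (input channels double because of the concatenated skips)
        self.up2 = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec2 = conv_block(64, 32)
        self.up1 = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec1 = conv_block(32, 16)
        # 1x1 convolution producing one score per class for every pixel
        self.head = nn.Conv2d(16, num_classes, kernel_size=1)

    def forward(self, x):
        c1 = self.down1(x)
        c2 = self.down2(self.pool(c1))
        b = self.bottleneck(self.pool(c2))
        e2 = self.dec2(torch.cat([self.up2(b), c2], dim=1))
        e1 = self.dec1(torch.cat([self.up1(e2), c1], dim=1))
        return self.head(e1)

model = TinyUNet(num_classes=10)
model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 10, 64, 64])
```

The torch.cat calls are the skip connections: each expansive stage sees both the upsampled features and the matching contracting-stage features, which is what lets the network recover fine spatial detail.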
Voila! The model has been created. Let’s proceed to train the model. To get a more comprehensive view of the model, check out this article.
Step 2. Training the model
We train the model for 10 epochs using a learning rate of 0.01. Further, we use a loss function and the Adam optimizer to evaluate the model and guide the gradients through the state space respectively.
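A training loop of this shape might look like the sketch below. The 10 epochs, the 0.01 learning rate and the Adam optimizer come from the text. Everything else is an assumption: nn.CrossEntropyLoss is a common choice for per-pixel classification but is not confirmed by the article, the single convolution stands in for the UNet, and the batch is random data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_classes = 10

# Stand-in model and synthetic batch (assumptions, used only to keep the sketch short).
model = nn.Conv2d(3, num_classes, kernel_size=3, padding=1)
images = torch.randn(4, 3, 32, 32)                      # (N, C, H, W)
targets = torch.randint(0, num_classes, (4, 32, 32))    # (N, H, W) class ids

criterion = nn.CrossEntropyLoss()                        # assumed loss function
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(10):
    optimizer.zero_grad()               # clear gradients from the previous step
    logits = model(images)              # (N, num_classes, H, W)
    loss = criterion(logits, targets)   # per-pixel classification loss
    loss.backward()                     # backpropagate
    optimizer.step()                    # update the weights
    epoch_losses.append(loss.item())

print(len(epoch_losses))  # 10
```

With a real dataset, the body of the loop would run once per batch from a data loader, and the per-epoch loss would be the average over those batches.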
Let’s plot the training losses to observe the trend.
Let’s save the model and move on to the model predictions section.
Step 3. Check model predictions
We load the saved model and fetch the test set. We then iterate through the set and run the model on these new, unseen images.
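Saving, reloading and predicting can be sketched as follows, assuming the weights are stored as a state_dict. The file name, the make_model helper and the one-layer stand-in model are hypothetical, and the batch is random data rather than real test images.

```python
import os
import tempfile

import torch
import torch.nn as nn

num_classes = 10

def make_model():
    # Stand-in for the segmentation network (assumption).
    return nn.Conv2d(3, num_classes, kernel_size=1)

# Save the weights of a "trained" model to a temporary file.
trained = make_model()
path = os.path.join(tempfile.mkdtemp(), "model_sketch.pth")
torch.save(trained.state_dict(), path)

# Rebuild the same architecture and load the saved weights into it.
restored = make_model()
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()

# Predict: the class with the highest score at every pixel.
batch = torch.randn(2, 3, 16, 16)
with torch.no_grad():
    predictions = restored(batch).argmax(dim=1)  # (N, H, W) class ids
print(predictions.shape)  # torch.Size([2, 16, 16])
```

Calling eval() and wrapping inference in torch.no_grad() keeps layers such as batch normalization in inference mode and avoids tracking gradients that are not needed.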
Step 4. IOU Score
Although we can visually observe that the model has done alright, we need to quantify this with a metric score. Several metrics exist to do just that. I have chosen the mean IoU (intersection over union): the area where the prediction and the label agree, divided by the area covered by either. Finally, we calculate this value. Read more on IoU here.
This measures the overlap between the actual labels and the predicted outputs. In our case, we get around 97%. Not bad for a first try!
The sensitivity of the application domain means that we require much greater accuracy than this. To improve the model further (or to get the whole source code given above), you can take this starter notebook and build upon it.
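For reference, one common way to compute a mean IoU is to calculate the IoU per class and average over the classes that appear. The sketch below does that with NumPy on a tiny hand-made example. It is an assumption about the definition, and the article's own calculation may differ.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average per-class IoU, skipping classes absent from both arrays."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class c appears in neither prediction nor label
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Two classes on a 2 x 4 grid; the prediction gets one pixel wrong.
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 1]])

score = mean_iou(pred, target, num_classes=2)
print(round(score, 3))  # 0.775
```

Here class 0 scores 3/4 and class 1 scores 4/5, so the mean is 0.775 even though 7 of 8 pixels are correct, which shows that IoU penalizes errors more heavily than plain pixel accuracy.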
A big shoutout to Jovian.ml for the platform to save and embed the notebooks.
Show some love by clapping, folks!