Road Damage Detection for multiple countries using YOLOv3
As part of a self case study on Deep Learning, I selected a problem on detecting road damage across multiple countries, a major issue in almost every country around the globe.
In this blog, I will explain how I approached and solved this problem using the YOLOv3 architecture in order to detect road damage on Indian and Japanese roads.
Content:
- Business Problem
- Dataset
- Exploratory Data Analysis
- YOLOv3
- Data Preparation
- YOLO Boxes
- YOLO Loss
- First cut approach
- Inferences
- Generating Custom anchor boxes
- Model Training with augmentation and using custom anchor boxes
- Predictions Evaluation
- Final Model
- Model Quantization
- Streamlit App
- Future Work
- References
Business Problem:
Road infrastructure plays a crucial role in saving lives and in the economic development of a country. To reduce road accidents caused by potholes and damaged roads, it is important to manage and inspect roads on a timely basis, because roads deteriorate over time depending on various factors such as location, age and temperature. Visual inspection of roads by engineers is very time consuming given the extensive length of roads and highways. So an automated AI-based solution which can detect the type of damage can help and improve the way road conditions are monitored.
So the main agenda of this problem is to analyze how we can utilize the Japanese dataset to detect road damage in other countries by adding images from the country where the AI system is to be deployed.
Dataset:
For this problem, I collected the dataset from this link.
The dataset consists of three tar archives: one for training data and the other two for test data, containing images and XML files with annotations. For this problem:
- train.tar : comprises 26,620 images and their respective annotation XML files, collected using smartphones from three different countries, i.e. Japan, India and the Czech Republic. The Japan dataset has a larger number of images compared to India and the Czech Republic.
- test1.tar : comprises 2,282 images, but without annotation files, as I have to test model performance on these images.
Exploratory Data Analysis:
As part of EDA, I analysed the country-wise distribution of damage types in the given dataset, in order to check how my model would perform given the distribution of classes.
For this problem only 4 types of damage categories are considered:
- D00: Longitudinal Crack
- D10: Transverse Crack
- D20: Alligator Crack
- D40: Pothole
EDA for India dataset:
Observation:
- From the above plot, most of the road damages in India are of the D40 category, i.e. potholes.
- These are followed by the D20 category, i.e. alligator cracks, and the D00 category, i.e. longitudinal cracks.
- Damages of the D10 category, i.e. transverse cracks, are very rare in Indian images, so this can cause issues when evaluating performance on test data containing the D10 damage type if only Indian images are used as training data.
EDA for Japan dataset:
Observations:
- From the above plot, damage of the D20 category, i.e. alligator cracking, is the most common on Japanese roads.
- Categories D00 and D10 are roughly equally distributed in the Japan images.
- In Japan, pothole damage is much less common than alligator cracking.
EDA for Czech dataset:
Observations:
- In the Czech Republic, most road damages are of the D00 category, i.e. longitudinal cracks.
- These are followed by D10 (transverse cracks) and D40 (potholes).
- Road damage with alligator cracks is very rare in the Czech Republic compared to the other damage types.
YOLOv3:
To solve this road damage detection problem, I selected the YOLOv3 architecture, one of the state-of-the-art techniques for object detection, to train a model that detects road damage types.
I built and ported the YOLOv3 architecture into my Jupyter notebook. For the detailed architecture and an explanation of YOLOv3, you can refer to this blog.
And if you want to read the original research paper for YOLOv3, you can find it here.
Data Preparation:
The first step before feeding the data into the model is to convert the given dataset into a trainable format.
For data preparation, I used the imgaug library to augment the training images along with their bounding boxes.
As part of data augmentation, I used the FlipLR technique, which augments an image by flipping it horizontally.
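Here is a minimal sketch of that augmentation (note that imgaug names the augmenter Fliplr); the image and box coordinates are placeholders, not values from the actual dataset:

```python
import numpy as np
import imgaug.augmenters as iaa
from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

image = np.zeros((600, 600, 3), dtype=np.uint8)  # placeholder road image
bbs = BoundingBoxesOnImage(
    [BoundingBox(x1=100, y1=150, x2=250, y2=300)],  # placeholder damage box
    shape=image.shape)

# Fliplr(1.0) always flips horizontally; the bounding boxes are
# transformed together with the image.
seq = iaa.Sequential([iaa.Fliplr(1.0)])
image_aug, bbs_aug = seq(image=image, bounding_boxes=bbs)
```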
For feeding the data into the model, I prepared the train and validation datasets in the form of TFRecords.
TFRecord is a simple format for storing a sequence of binary records; it stores the data as a sequence of binary strings. For more explanation about TFRecord you can check out this blog.
But before creating the TFRecords, I parsed the annotation XML files using recursion.
After parsing the XML files and generating a dictionary for each tag, I implemented a function for creating the TFRecords.
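Below is a sketch of both steps, assuming Pascal VOC-style tags (object, bndbox, xmin and so on); parse_xml, build_example and the file names are illustrative, not the exact code from my notebook:

```python
import xml.etree.ElementTree as ET
import tensorflow as tf

def parse_xml(node):
    """Recursively convert an annotation XML tree into a nested dict."""
    if len(node) == 0:  # leaf tag, e.g. <xmin>10</xmin>
        return {node.tag: node.text}
    result = {}
    for child in node:
        parsed = parse_xml(child)
        if child.tag == 'object':
            # an image can contain several damage objects
            result.setdefault('object', []).append(parsed['object'])
        else:
            result[child.tag] = parsed[child.tag]
    return {node.tag: result}

def build_example(annotation, image_bytes):
    """Wrap one image and its parsed annotation into a tf.train.Example."""
    objects = annotation.get('object', [])
    feature = {
        'image/encoded': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image_bytes])),
        'image/object/bbox/xmin': tf.train.Feature(
            float_list=tf.train.FloatList(
                value=[float(o['bndbox']['xmin']) for o in objects])),
        # ymin / xmax / ymax and the class labels follow the same pattern
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# usage: parse one annotation file and write it into a TFRecord
annotation = parse_xml(ET.parse('Japan_000000.xml').getroot())['annotation']
with tf.io.TFRecordWriter('train.tfrecord') as writer:
    example = build_example(annotation, open('Japan_000000.jpg', 'rb').read())
    writer.write(example.SerializeToString())
```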
YOLO Boxes:
YOLOv3 outputs relative coordinates of the bounding boxes, not absolute coordinates, so in order to calculate absolute coordinates I implemented the yolo_boxes function.
The predictions of YOLOv3 have the shape (batch_size, grid, grid, anchors, (tx, ty, tw, th, obj, …classes)). To get the absolute box coordinates, as the author describes in the paper:
bx = sigmoid(tx) + Cx
by = sigmoid(ty) + Cy
Here bx and by are the absolute coordinates, usually used as the centroid coordinates of a box in the image, and Cx and Cy represent the absolute location of the top-left corner of the current grid cell. tx and ty are the centroid location relative to the grid cell. To learn more about grids, grid cells and anchor boxes, you can refer to this blog.
But applying sigmoid to the relative centroid coordinates and adding the grid cell coordinates isn't enough; I also normalized the result by dividing it by the grid size.
After getting the absolute centroid coordinates, I applied the sigmoid function to the objectness scores and the classes.
bw and bh in the paper are the width and height of a box, and pw and ph are the anchor box dimensions. So, according to the paper, to get the absolute height and width of a predicted bounding box:
bw = exp(tw) * pw
bh = exp(th) * ph
Here pw and ph are the width and height of the anchor boxes, and bw and bh are the absolute width and height.
Now we have calculated the position of the center of the bbox, but we need the corner positions, so I applied the code below and concatenated the outputs.
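Here is a sketch of the whole yolo_boxes decoding, adapted from the tf2 YOLOv3 implementation listed in the references; the anchors are assumed to be normalized to the input size:

```python
import tensorflow as tf

def yolo_boxes(pred, anchors, num_classes):
    """Decode raw predictions (batch, grid, grid, anchors, ...) into
    absolute corner coordinates, objectness and class probabilities."""
    grid_size = tf.shape(pred)[1]
    box_xy, box_wh, objectness, class_probs = tf.split(
        pred, (2, 2, 1, num_classes), axis=-1)

    box_xy = tf.sigmoid(box_xy)
    objectness = tf.sigmoid(objectness)
    class_probs = tf.sigmoid(class_probs)

    # bx = sigmoid(tx) + Cx, then normalized by the grid size
    grid = tf.meshgrid(tf.range(grid_size), tf.range(grid_size))
    grid = tf.expand_dims(tf.stack(grid, axis=-1), axis=2)  # (g, g, 1, 2)
    box_xy = (box_xy + tf.cast(grid, tf.float32)) / tf.cast(grid_size, tf.float32)

    # bw = exp(tw) * pw, bh = exp(th) * ph
    box_wh = tf.exp(box_wh) * anchors

    # convert centroid + width/height into corner coordinates
    box_x1y1 = box_xy - box_wh / 2
    box_x2y2 = box_xy + box_wh / 2
    bbox = tf.concat([box_x1y1, box_x2y2], axis=-1)
    return bbox, objectness, class_probs
```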
YOLO Loss:
The loss function for YOLO consists of four parts: the centroid loss (xy), the width-height loss (wh), the objectness loss and the classification loss.
Before calculating the YOLO loss, I computed the masks required for the loss function, i.e. obj_mask and ignore_mask.
obj_mask : specifies whether an object is present in a cell: 1 if present, 0 if not.
ignore_mask : masks out predicted boxes whose best IOU with the true boxes is above a threshold, so that reasonable detections without a matching groundtruth are not penalized as false positives in the objectness loss.
As the author mentions in the paper that YOLOv3 struggles with small objects, I gave a higher weight to small boxes using the code snippet below.
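A minimal sketch of that weighting, following the tf2 YOLOv3 implementation listed in the references; true_wh is assumed to be the groundtruth width and height normalized to [0, 1]:

```python
# Small boxes get a scale close to 2 and large boxes a scale close to 1,
# so errors on small damages such as potholes are penalized more heavily.
box_loss_scale = 2 - true_wh[..., 0] * true_wh[..., 1]
```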
In the YOLO loss function, the centroid loss and the width-height loss are posed as a regression problem, because YOLO predicts coordinates which are real-valued, so I used a squared loss, summed over all box coordinates. In addition, I multiplied the squared loss by box_loss_scale, which gives a higher weight to smaller boxes, and by obj_mask, which specifies whether an object is present.
After computing the squared losses, I computed a binary cross-entropy loss for the objectness scores.
While computing the objectness loss, there will be cases where the model predicts bounding boxes which are not in the groundtruth for an image. To handle this, I weighted the objectness loss by obj_mask where an object is present, and summed it with the objectness loss weighted by (1 - obj_mask) and ignore_mask.
After computing the objectness loss, I calculated a cross-entropy loss over the classes present in this problem; I used the sparse categorical cross-entropy loss.
After computing all 4 parts, the final loss for the model is the summation of xy_loss, wh_loss, obj_loss and class_loss, roughly as in the sketch below.
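A condensed sketch of how the four parts combine, following the description above; combine_yolo_loss and its arguments are illustrative names, the masks and box_loss_scale have shape (batch, grid, grid, anchors), and pred_obj and pred_class are assumed to already be probabilities:

```python
import tensorflow as tf

def combine_yolo_loss(true_xy, pred_xy, true_wh, pred_wh,
                      true_obj, pred_obj, true_class, pred_class,
                      obj_mask, ignore_mask, box_loss_scale):
    # squared loss on centroid and width/height, weighted as described above
    xy_loss = obj_mask * box_loss_scale * tf.reduce_sum(
        tf.square(true_xy - pred_xy), axis=-1)
    wh_loss = obj_mask * box_loss_scale * tf.reduce_sum(
        tf.square(true_wh - pred_wh), axis=-1)

    # binary cross entropy on objectness; cells without an object only
    # contribute where ignore_mask keeps them
    obj_entropy = tf.keras.losses.binary_crossentropy(true_obj, pred_obj)
    obj_loss = (obj_mask * obj_entropy
                + (1 - obj_mask) * ignore_mask * obj_entropy)

    # sparse categorical cross entropy over the 4 damage classes
    class_loss = obj_mask * tf.keras.losses.sparse_categorical_crossentropy(
        true_class, pred_class)

    # sum each part over the grid cells and anchors to get a per-image loss
    return tf.reduce_sum(xy_loss + wh_loss + obj_loss + class_loss,
                         axis=(1, 2, 3))
```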
First cut approach:
As a first cut approach, I trained the model using the India and Japan datasets, without augmentation and with precomputed anchor boxes.
I trained three models: one on the Japan dataset only, one on the Indian dataset only, and a third on the combined Japan and India datasets.
For model training, I first loaded into my model the pretrained weights of the Darknet layers, trained on the large COCO dataset of 80 classes, and froze the Darknet layers, i.e. their weights do not change throughout training.
The pretrained weights can be downloaded from this link.
After loading the weights and freezing the Darknet layers of my model, I used the Adam optimizer with a learning rate of 0.001. For a more detailed explanation of various optimization algorithms, you can refer to this blog.
After creating an instance of the Adam optimizer, I compiled the model with the YOLO loss, roughly as below.
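A sketch of the compile step; model is the YOLOv3 model built earlier, YoloLoss is the loss described in the YOLO Loss section, and the anchor values follow the tf2 YOLOv3 implementation listed in the references:

```python
import numpy as np
import tensorflow as tf

# 9 anchors precomputed with K-means on COCO, normalized by the 416 input size
yolo_anchors = np.array([(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
                         (59, 119), (116, 90), (156, 198), (373, 326)],
                        np.float32) / 416
yolo_anchor_masks = np.array([[6, 7, 8], [3, 4, 5], [0, 1, 2]])

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
# one loss per output scale, each with its own subset of anchors
loss = [YoloLoss(yolo_anchors[mask], classes=4, ignore_thresh=0.4)
        for mask in yolo_anchor_masks]
model.compile(optimizer=optimizer, loss=loss)
```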
In the above code, yolo_anchors holds the dimensions of 9 anchor boxes precomputed using the K-means clustering technique on the COCO dataset, and I trained the model keeping 0.4 as the IOU threshold.
yolo_anchor_masks : tells each output layer which of the anchor boxes it is responsible for predicting.
The first output layer predicts anchors 6, 7 and 8 because those are the largest boxes; the second layer predicts smaller ones, and so on.
After that, I trained the model with callbacks: ReduceLROnPlateau for reducing the learning rate, early stopping, TensorBoard logging and saving of weights, along the lines of the sketch below.
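A sketch using the standard Keras callbacks; the file paths and patience values here are placeholders:

```python
from tensorflow.keras.callbacks import (
    ReduceLROnPlateau, EarlyStopping, ModelCheckpoint, TensorBoard)

callbacks = [
    ReduceLROnPlateau(verbose=1),                     # lower LR on a val-loss plateau
    EarlyStopping(patience=3, verbose=1),             # stop when val loss stalls
    ModelCheckpoint('checkpoints/yolov3_{epoch}.tf',  # save weights each epoch
                    save_weights_only=True, verbose=1),
    TensorBoard(log_dir='logs'),                      # tensorboard logs
]
model.fit(train_dataset, epochs=30,
          validation_data=val_dataset, callbacks=callbacks)
```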
After training the model on the different combinations of datasets, I evaluated it by visualising plots of missed predictions and extra predictions, which I explain in the Predictions Evaluation section.
Inferences:
After training stopped early at the 10th epoch, I first ran inference with the trained model on the training dataset and compared the results with the groundtruth, to check whether the model is able to predict correctly. Then I ran inference on the test images.
Inferences of Model trained on only Japan dataset:
For Japan roads on train images:
For Japan roads on test images:
Inferences of Model trained on only Indian dataset:
For Indian roads on train data:
For Indian roads on test data:
Inferences of Model trained on combined dataset:
For Japan roads on train images:
For Japan roads on test images:
For Indian roads on train images:
For Indian roads on test images:
Comparing the inferences of the models trained separately on the Indian and Japan datasets against the model trained on the combined India and Japan dataset, the predictions of the combined model were clearly improved.
Generating Custom anchor boxes:
After the first cut approach, I computed anchor boxes specific to the given dataset for this problem.
Anchor boxes are computed using the K-means clustering algorithm, as the author describes in the paper. You can refer to this blog for a more detailed explanation of the K-means algorithm.
I used K = 9 because I have to generate the dimensions of 9 anchor boxes, and applied the K-means clustering technique with an IOU metric on the given dataset, roughly as in the sketch below.
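A minimal sketch of K-means on box dimensions with an IOU-based distance, along the lines of the kmeans-anchor-boxes repository listed in the references; boxes is assumed to be an (n, 2) array of groundtruth widths and heights:

```python
import numpy as np

def iou(boxes, clusters):
    """IOU between (n, 2) width-height boxes and (k, 2) cluster centers."""
    w = np.minimum(boxes[:, None, 0], clusters[None, :, 0])
    h = np.minimum(boxes[:, None, 1], clusters[None, :, 1])
    inter = w * h
    box_area = boxes[:, 0] * boxes[:, 1]
    cluster_area = clusters[:, 0] * clusters[:, 1]
    return inter / (box_area[:, None] + cluster_area[None, :] - inter)

def kmeans_anchors(boxes, k=9, seed=0):
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), k, replace=False)]
    while True:
        # distance = 1 - IOU, so assign each box to its highest-IOU cluster
        nearest = np.argmax(iou(boxes, clusters), axis=1)
        new_clusters = np.array([
            np.median(boxes[nearest == i], axis=0) if np.any(nearest == i)
            else clusters[i] for i in range(k)])
        if np.allclose(new_clusters, clusters):
            return clusters
        clusters = new_clusters
```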
Model Training with augmentation and using custom anchor boxes:
After computing anchor boxes for the given problem dataset, I trained the model on the augmented training data with the custom anchor boxes, on the Japan and Indian datasets separately and on the combined dataset.
For the rest, I used the same Adam optimizer with a learning rate of 0.001 and an IOU threshold of 0.4.
Inferences on Japan dataset:
On train images:
On test images:
Inferences on Indian dataset:
On train image:
On test image:
Predictions Evaluation:
To evaluate my models' predictions on the training data, I first ran the trained model on all training images and saved the predictions in a text file with the format:
<image_name> <class_name> <confidence_score> <bounding boxes coordinates>
I created another text file in the same format for the groundtruth, using the annotation files.
After generating the text files, for each class I plotted bar plots visualizing the missed predictions and extra predictions, to check by how much my models predict extra bounding boxes, if any; a rough sketch of this bookkeeping follows.
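This sketch only compares per-class counts per image; load_records and the file names are illustrative, and a stricter evaluation would match individual boxes by IOU:

```python
from collections import Counter, defaultdict

def load_records(path):
    """Each line: <image_name> <class_name> <confidence> x1 y1 x2 y2."""
    per_image = defaultdict(list)
    with open(path) as f:
        for line in f:
            image, cls, *rest = line.split()
            per_image[image].append(cls)
    return per_image

gt = load_records('groundtruth.txt')
pred = load_records('predictions.txt')

missed, extra = Counter(), Counter()
for image in set(gt) | set(pred):
    g, p = Counter(gt[image]), Counter(pred[image])
    for cls in set(g) | set(p):
        missed[cls] += max(0, g[cls] - p[cls])  # groundtruth boxes not predicted
        extra[cls] += max(0, p[cls] - g[cls])   # predictions with no groundtruth
```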
For Model 1:
Plots for the model trained on the combined dataset:
Plots for the model trained only on the Japan dataset:
Plots for the model trained only on the Indian dataset:
For Model 2, using data augmentation and custom anchor boxes:
Plots for the model trained only on the Japan dataset:
Plots for the model trained only on the Indian dataset:
Observations:
- From the above bar plots, when the model is trained on a separate dataset for each of the 2 countries, i.e. Japan and India, the missed predictions are very high for some damage types, but when the model is trained on the combined dataset it improves and is able to detect road damage better than the models trained on the individual datasets.
- For Model 1 on the Japan dataset, the model predicted extra road damage of category D40, i.e. potholes, compared to Model 2 and to the model trained on the combined dataset.
- The prediction misses for the D40 category are very high because YOLOv3 does not give good results on small objects, and potholes are generally considered small objects, so the model may have missed them.
Final Model:
I selected the model trained on the combined dataset as the final model for quantization and for deployment on Streamlit.
Model Quantization:
After selecting the final model, I tried post-training quantization.
Post-training quantization is a technique to reduce the model size while also improving CPU latency.
I applied post-training float16 quantization, which quantizes the weights to float16 using TFLiteConverter; this reduced the model size by 2x and also improved the latency on CPU compared to the unquantized model. A sketch is below.
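This uses the standard TensorFlow Lite converter API; model is the final Keras model and the output file name is a placeholder:

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]  # quantize weights to float16
tflite_model = converter.convert()

with open('yolov3_fp16.tflite', 'wb') as f:
    f.write(tflite_model)
```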
There are various other post-training quantization techniques; for example, you can quantize the model weights to integers, which gives a model 4x smaller and about 3x faster, and integer-quantized models work fairly well and fast not only on GPUs but also on CPUs, TPUs and microcontrollers. You can refer to this article for more post-training quantization techniques.
Model Size:
After quantizing the model using post-training float16 quantization, my model size was reduced by 2x.
I also measured the prediction rate of the quantized and unquantized models for each test image, on both CPU and GPU.
Prediction Rates on CPU:
Observations:
- From the above table, we can observe that in our experiments on CPU, the quantized model gave slightly faster results than the model without quantization.
- So if a machine has only a CPU, the quantized model can be useful for faster inference or predictions.
Prediction Rates on GPU:
Observations:
- From the above table, we can observe that in our experiments on GPU, the unquantized model gave faster results than the quantized model.
- So if the machine has a GPU, the unquantized model can run inference faster than the quantized model.
Streamlit App:
After training and selecting the final model, I created a Streamlit app for this problem.
Due to storage limitations, I limited the train and test data selectable in the app to 100 and 60 images respectively. But if you have images of damaged roads from India or Japan, I added an option to test on a captured image, where the model will predict the type of road damage, if any, in the uploaded image.
You can also play with the confidence threshold and the IOU threshold to check predictions at various thresholds on the images, roughly as in the sketch below.
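A minimal sketch of those controls; the widget labels and the predict helper are hypothetical, not the app's exact code:

```python
import streamlit as st

uploaded = st.file_uploader('Upload a road image', type=['jpg', 'png'])
conf_thresh = st.slider('Confidence threshold', 0.0, 1.0, 0.5)
iou_thresh = st.slider('IOU threshold', 0.0, 1.0, 0.4)

if uploaded is not None:
    # predict() is a hypothetical wrapper around the trained YOLOv3 model
    result = predict(uploaded, conf_thresh, iou_thresh)
    st.image(result, caption='Detected road damage')
```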
You can follow this link to try the app.
Future Work:
My work can be improved in the following ways as part of future work:
- We now have improved versions of YOLO, like YOLOv4 and YOLOv5, so road damage detection could be trained using these improved techniques.
- Various other augmentation techniques can be used to improve the model; as mentioned in the Data Preparation part, I used the FlipLR technique, but there are many other techniques for augmenting data as well as bounding boxes.
- One can also use other quantization techniques apart from the float16 quantization technique.
- One can also try training the model on the Czech dataset, which I did not use for training, alongside India and Japan.
- The above model can be extended to any other country apart from Japan and India: if you have images captured with a smartphone or any other device in the country where you want to detect road damage, my models can be fine-tuned by combining all the datasets, including the dataset for that country, or trained from scratch on those datasets.
You can find my code, trained models and IPython notebooks in my GitHub repo.
If you liked my blog, do not forget to clap for it and share it.
Thank you for reading.
References:
https://arxiv.org/pdf/1804.02767.pdf
https://www.tensorflow.org/model_optimization
https://github.com/zzh8829/yolov3-tf2
https://github.com/lars76/kmeans-anchor-boxes/blob/master/kmeans.py
https://towardsdatascience.com/anchor-boxes-the-key-to-quality-object-detection-ddf9d612d4f9
https://planspace.org/20170323-tfrecords_for_humans/
https://towardsdatascience.com/dive-really-deep-into-yolo-v3-a-beginners-guide-9e3d2666280e
https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b