Using YOLOv3 for detection of traffic signs

Vignesh B S
Published in Analytics Vidhya · 10 min read · Sep 20, 2020

Traffic signs provide valuable information to drivers and other road users. They represent rules that are in place to keep us safe, and help to communicate messages to drivers and pedestrians that can maintain order and reduce accidents. Neglecting them can be dangerous.

Most signs make use of pictures rather than words, so that they are easy to understand and can be interpreted by people who speak a variety of languages. Approximately 1.35 million people die each year as a result of road traffic crashes, and most governments around the world have made it mandatory to follow traffic rules and signs.

In this case study, we will learn how YOLOv3 can be used to detect traffic signs. The case study involves data pre-processing, data analysis and YOLO model creation, along with data augmentation. We will be using the German Traffic Sign Detection Benchmark (GTSDB) dataset from the INI Benchmark website.

Note: The full code is available on my GitHub repository.

Problem Statement

Traffic sign detection is a challenging real-world problem of high industrial relevance. Tesla has even recently rolled out an upgrade that lets its cars detect traffic lights and stop signs. The main objective of this case study is to detect traffic signs and classify them into one of four categories (prohibitory, mandatory, danger or other).

Data overview:

Get the data from: data

The dataset consists of 900 images in ppm format, along with a gt.txt file in CSV format that holds the ground truth for all traffic signs in the images. Each row of the ground truth file contains the image filename, the bounding box coordinates (left, top, right and bottom) of a traffic sign present in the image, and the class ID of that sign.

Index:

  • Step-1: Exploratory Data Analysis.
  • Step-2: Data pre-processing.
  • Step-3: YOLO v3.
  • Step-4: Predictions.
  • Step-5: Conclusion.

Step-1: Exploratory Data Analysis

There are 43 traffic sign classes in total, which are grouped into 4 categories, namely prohibitory, mandatory, danger and other.

Loading the data:

To load the data, we only need the txt file containing the ground truth values. We will load this into a pandas dataframe.

loading gt.txt into a dataframe
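A minimal sketch of that loading step, assuming the standard GTSDB gt.txt layout of semicolon-separated values (the column names here are my own):

```python
import pandas as pd

# gt.txt: one row per traffic sign, semicolon-separated
# filename;left;top;right;bottom;class_id (assumed layout)
columns = ['filename', 'x_min', 'y_min', 'x_max', 'y_max', 'class_id']
df = pd.read_csv('gt.txt', sep=';', names=columns)

print(df.shape)  # expect 852 rows, one per sign
df.head()        # inspect the top 5 rows
```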
top 5 rows of the dataset

There are a total of 600 images in the training data and 852 rows in the dataset. Multiple rows for one image mean there are multiple traffic signs in that image.

Analyzing data:

Let us visualize one of the images and check its size,

checking size of an image
visualizing an image
Visualization output
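A minimal sketch of this check (the filename is illustrative; PIL reads ppm files directly):

```python
from PIL import Image
import matplotlib.pyplot as plt

# Open one of the ppm images and inspect its size.
img = Image.open('00000.ppm')  # illustrative filename
print(img.size)                # (width, height), e.g. (1360, 800)

plt.imshow(img)
plt.axis('off')
plt.show()
```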

All images are of size 800×1360 with 3 channels.

Class id:

Let us now check the class ID distribution; at present there are a total of 43 classes of traffic signs.

Traffic sign ID distribution

We can clearly observe that,

  • The counts of IDs 1, 2, 4, 10, 12, 13 and 38 are high compared to the other sign IDs.
  • Sign IDs 0, 19, 31 and 38 occur the least.

Let us categorize the 43 class IDs into 4 main categories, namely prohibitory, mandatory, danger and other.
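A sketch of one common GTSDB grouping of the 43 classes (verify against the mapping used in your own code; it reuses the df dataframe from the loading sketch above):

```python
# One common grouping of the 43 GTSDB sign classes into 4 categories.
prohibitory = [0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 15, 16]
danger      = [11] + list(range(18, 32))
mandatory   = list(range(33, 41))
# every remaining class id falls into 'other'

def to_category(class_id):
    if class_id in prohibitory: return 0  # prohibitory
    if class_id in mandatory:   return 1  # mandatory
    if class_id in danger:      return 2  # danger
    return 3                              # other

df['category'] = df['class_id'].apply(to_category)
df['category'].value_counts()
```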

top 5 records of dataset
count of class id
  • We can observe that prohibitory class IDs are the most frequent in the dataset.

Size of bounding box:

Let us now check the distribution of the sizes of the traffic signs in the images,

traffic sign pixel distribution
  • We can observe that the bounding box sizes range from 32 to 248 pixels.
  • Most of the boxes lie in the 40 to 100 pixel range; very few boxes are larger than 150 pixels.

Step 2: Data pre-processing

Creating png images from ppm images:

Let us now create png images from ppm images,

png image creation

The above code creates png images from the ppm images and stores them in an images folder. By using the dataset dataframe, images that don't contain a single traffic sign are separated from the images that do.
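A minimal sketch of that conversion (paths are illustrative, and df is the dataframe from the loading sketch):

```python
import os
from PIL import Image

os.makedirs('images', exist_ok=True)

# Convert only the ppm files that appear in the ground truth dataframe,
# so images without a single traffic sign are left out.
for filename in df['filename'].unique():
    img = Image.open(filename)
    png_name = os.path.splitext(filename)[0] + '.png'
    img.save(os.path.join('images', png_name))
```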

Creating annotation files:

Annotation files are created for each image. These annotation files contain the image filepath, the ground truth values and the class IDs of the traffic signs in the image, and are stored with an xml extension.
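The exact schema used in the original code is not shown here; a minimal VOC-style sketch using the standard library could look like this:

```python
import xml.etree.ElementTree as ET

def write_annotation(image_path, boxes, xml_path):
    """Write a minimal VOC-style annotation file. `boxes` is a list of
    (x_min, y_min, x_max, y_max, class_id) tuples for one image."""
    root = ET.Element('annotation')
    ET.SubElement(root, 'filename').text = image_path
    for (x1, y1, x2, y2, cid) in boxes:
        obj = ET.SubElement(root, 'object')
        ET.SubElement(obj, 'name').text = str(cid)
        bb = ET.SubElement(obj, 'bndbox')
        for tag, val in zip(('xmin', 'ymin', 'xmax', 'ymax'), (x1, y1, x2, y2)):
            ET.SubElement(bb, tag).text = str(val)
    ET.ElementTree(root).write(xml_path)
```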

Tfrecord file:

The TFRecord format is a simple format for storing a sequence of binary records. Using TFRecords, data can be stored and read efficiently; more about the format can be found in the TensorFlow documentation.

We will create train and validation TFRecords containing the raw image data, the bounding box coordinates and the class IDs. The training data of 600 images is split into train and validation sets, and the annotation files are then used to create the TFRecord files.

The following snippet of code helps us create the TFRecord files.
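A condensed sketch of how such a file might be written (the feature keys follow a common TF object-detection convention, and the `samples` iterable of (path, boxes) pairs is hypothetical):

```python
import tensorflow as tf

def make_example(image_path, boxes):
    """Serialize one image and its boxes into a tf.train.Example.
    `boxes` is a list of (x_min, y_min, x_max, y_max, class_id);
    coordinates are kept in absolute pixels for simplicity."""
    with open(image_path, 'rb') as f:
        encoded = f.read()

    def floats(vals):
        return tf.train.Feature(float_list=tf.train.FloatList(value=vals))

    feature = {
        'image/encoded': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[encoded])),
        'image/object/bbox/xmin': floats([float(b[0]) for b in boxes]),
        'image/object/bbox/ymin': floats([float(b[1]) for b in boxes]),
        'image/object/bbox/xmax': floats([float(b[2]) for b in boxes]),
        'image/object/bbox/ymax': floats([float(b[3]) for b in boxes]),
        'image/object/class/label': tf.train.Feature(
            int64_list=tf.train.Int64List(value=[int(b[4]) for b in boxes])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

with tf.io.TFRecordWriter('train.tfrecord') as writer:
    for image_path, boxes in samples:  # hypothetical (path, boxes) pairs
        writer.write(make_example(image_path, boxes).SerializeToString())
```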

Let us visualize one of the images with bounding box,
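A minimal sketch of the visualization (the filename is illustrative; df is the dataframe from the loading sketch):

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

name = '00001.ppm'  # illustrative filename
img = Image.open(name)
fig, ax = plt.subplots()
ax.imshow(img)
# Draw every ground truth box recorded for this image.
for _, row in df[df['filename'] == name].iterrows():
    w = row['x_max'] - row['x_min']
    h = row['y_max'] - row['y_min']
    ax.add_patch(patches.Rectangle((row['x_min'], row['y_min']), w, h,
                                   fill=False, edgecolor='red', linewidth=2))
plt.show()
```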

image with ground truth bounding box

Note that instead of converting to xml and then to TFRecords, the dataframe could also be used directly in a tf.data pipeline.

Step 3: YOLO v3

YOLO v3 is an improvement over previous YOLO detection networks. It is a single-shot detector which also runs quite fast and makes real-time inference possible on GPU devices.

Let us go through some terms required to understand YOLO,

  • Grid Cells:

YOLOv3 divides the input image into an S×S grid. Each grid cell is responsible for predicting the objects whose centers fall inside it.

grid cell with S as 13
  • Anchor box:

An anchor box is a prior box with a pre-defined aspect ratio. These aspect ratios are determined before training by running K-means on the bounding boxes of the entire dataset. Anchor boxes are assigned to grid cells; for each ground truth box we measure how much it overlaps each anchor box and pick the one with the best IOU.

In YOLO v3, we have three anchor boxes per grid cell, and we have three scales of grids. Therefore, for a 416×416 input we have 52×52×3, 26×26×3 and 13×13×3 anchor boxes across the three scales. For each anchor box, we need to predict 3 things:

1. The location offset against the anchor box: tx, ty, tw, th. This has 4 values.
2. The objectness score to indicate if this box contains an object. This has 1 value.
3. The class probabilities to tell us which class this box belongs to. This has num_classes values.

In total, we are predicting 4 + 1 + num_classes values for one anchor box.
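As a quick worked example, for a 416×416 input and our 4 sign categories this amounts to:

```python
# Anchor boxes across the three grid scales, 3 anchors per cell.
num_classes = 4
total_boxes = (52 * 52 + 26 * 26 + 13 * 13) * 3   # 10647 anchor boxes
values_per_box = 4 + 1 + num_classes               # offsets + objectness + classes
print(total_boxes, values_per_box)                 # 10647 9
```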

  • Non-maximum Suppression:

YOLO can make duplicate detections for the same object. To fix this, YOLO applies non-maximum suppression to remove duplicates with lower confidence. The detection box with the highest score above a threshold value is retained and the other boxes are eliminated.

Photo by Python Lessons from Analytics Vidhya
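A minimal sketch of this step on already-decoded boxes, using TensorFlow's built-in NMS op (the threshold values are illustrative):

```python
import tensorflow as tf

def nms(boxes, scores, max_boxes=20, iou_threshold=0.5, score_threshold=0.3):
    """boxes: (N, 4) tensor as [y1, x1, y2, x2]; scores: (N,) tensor."""
    keep = tf.image.non_max_suppression(
        boxes, scores, max_output_size=max_boxes,
        iou_threshold=iou_threshold, score_threshold=score_threshold)
    return tf.gather(boxes, keep), tf.gather(scores, keep)
```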

Let's go through the YOLO v3 architecture,

Darknet-53

YOLO v3 uses Darknet-53 as its feature extractor (backbone). The idea of skip connections from ResNet, which helps activations propagate through deeper layers without vanishing gradients, is used here.

Diagram from YOLOv3: An Incremental Improvement

Since YOLO v3 is designed to be a multi-scale detector, we also need features from multiple scales. Therefore, features from the last three residual blocks are all used in the later detection.

The feature vectors obtained from Darknet-53 are fed into a multi-scale detector. The final outputs of the detectors will be of shapes (52, 52, 3, (4 + 1 + num_classes)), (26, 26, 3, (4 + 1 + num_classes)) and (13, 13, 3, (4 + 1 + num_classes)). The (4 + 1 + num_classes) term comes from the per-anchor-box predictions described above.

The Darknet-53 architecture consists of a conv layer (the darknetConv function) followed by residual blocks. The conv layer and residual blocks are combined to make a darknet block, and the residual blocks are repeated as specified in the architecture. Skip connections are used, and the x_36 layer, x_61 layer and the last layer are the feature maps returned as output by the darknet.
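A sketch of these two building blocks in Keras, in the style of the yolov3-tf2 implementation referenced below:

```python
import tensorflow as tf
from tensorflow.keras import layers

def darknet_conv(x, filters, kernel_size, strides=1):
    # The basic Darknet unit: Conv -> BatchNorm -> LeakyReLU.
    if strides == 2:
        x = layers.ZeroPadding2D(((1, 0), (1, 0)))(x)  # top-left pad when downsampling
        padding = 'valid'
    else:
        padding = 'same'
    x = layers.Conv2D(filters, kernel_size, strides=strides,
                      padding=padding, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(alpha=0.1)(x)

def darknet_residual(x, filters):
    # 1x1 bottleneck then 3x3 conv, added back to the input (skip connection).
    shortcut = x
    x = darknet_conv(x, filters // 2, 1)
    x = darknet_conv(x, filters, 3)
    return layers.Add()([shortcut, x])
```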

Loss function:

YOLO uses sum-squared error between the predictions and the ground truth to calculate loss. The loss function is composed of:

  • the localization/regression loss (errors between the predicted boundary box and the ground truth),
  • the confidence loss (the objectness of the box),
  • the classification loss.

The first part of the regression loss is the loss for the bounding box centroid. x and y are the relative centroid locations from the ground truth; x' and y' are the centroid predictions from the detector. The smaller this loss is, the closer the centroids of the prediction and the ground truth are. The obj score is 0 if there is no object in the ground truth for those cells and 1 if an object is present. lambda_coord is a weight that puts more emphasis on localization instead of classification.

The second part of the regression loss is the loss for bounding box width and height; it has the same form as the loss for the centroid.

The second loss is the confidence loss. C indicates how likely there is an object in the current cell. We use binary cross-entropy instead of mean squared error here. The second part of the confidence loss, noobj_loss, penalizes the network if it proposes objects everywhere, i.e. it punishes false positive proposals. Since there are far more noobj cells than obj cells in our ground truth, we also need lambda_noobj = 0.5 to make sure the network isn't dominated by cells that don't have objects.

The last loss is the classification loss. If there are 80 classes in total, class and class' will be one-hot encoded vectors with 80 values. In YOLO v3, this is changed to multi-label classification instead of multi-class classification, so each output cell can have more than one true class. Correspondingly, we apply binary cross-entropy for each class one by one and sum them up, because the classes are not mutually exclusive. And as with the other losses, we multiply by obj_mask so that we only count cells that have a ground truth object.

YOLO v3 loss implementation:

The predicted output is of shape (batch_size, grid, grid, anchors, (x, y, w, h, obj, …cls)). It is passed to the yolo_boxes function, which returns pred_box containing the x1, y1, x2, y2 coordinates, pred_obj containing the predicted objectness score, pred_class containing the predicted class, and pred_xywh consisting of x, y (the predicted center coordinates) and w, h (the predicted width and height).

The true output is of shape (batch_size, grid, grid, anchors, (x1, y1, x2, y2, obj, cls)). It is split into true_box containing x1, y1, x2 and y2, true_obj containing the objectness score, and true_class_idx containing the class. The center coordinates, width and height are derived from the x1, y1, x2, y2 values.

The box_loss_scale term gives more weight to small boxes; it plays the same role as lambda_coord.

obj_mask is calculated; it is either 1 or 0, indicating whether there is an object in the cell. The best_iou score is calculated using pred_box, true_box and obj_mask, and the ignore_mask is derived from it: predictions whose best IOU with a ground truth box exceeds a threshold are not penalized as false positives.

The localization loss, confidence loss and classification loss are calculated using the values computed in the previous steps. Finally, all these losses are added to obtain the YOLO loss.
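A condensed sketch of those pieces (the input tensors are assumed to have been computed exactly as described above; this is not the full implementation):

```python
import tensorflow as tf

def yolo_loss_parts(true_xy, pred_xy, true_wh, pred_wh,
                    true_obj, pred_obj, true_class, pred_class,
                    obj_mask, ignore_mask, box_loss_scale):
    # Localization: sum-squared error on centroids and on width/height,
    # scaled so that small boxes contribute more.
    xy_loss = obj_mask * box_loss_scale * tf.reduce_sum(
        tf.square(true_xy - pred_xy), axis=-1)
    wh_loss = obj_mask * box_loss_scale * tf.reduce_sum(
        tf.square(true_wh - pred_wh), axis=-1)

    # Confidence: binary cross-entropy; false positives are only penalized
    # where the best IOU with any ground truth is below the ignore threshold.
    obj_entropy = tf.keras.losses.binary_crossentropy(true_obj, pred_obj)
    obj_loss = obj_mask * obj_entropy + \
        (1 - obj_mask) * ignore_mask * obj_entropy

    # Classification: per-class binary cross-entropy (multi-label),
    # counted only in cells that contain a ground truth object.
    class_loss = obj_mask * tf.keras.losses.binary_crossentropy(
        true_class, pred_class)

    return xy_loss + wh_loss + obj_loss + class_loss
```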

Training:

Training is done using the train TFRecord file, and the validation TFRecord file is used for validation. We need to pass the necessary parameters as well; the training code can be seen in my GitHub repository.

Augmentation was also performed on the training images; the augmentations included horizontal flip, hue, saturation, shear, scale, translation and rotation. The bounding box values must also be updated with each augmentation.
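One possible implementation of such a box-aware pipeline uses the albumentations library (a subset of the listed augmentations; parameter values are illustrative):

```python
import albumentations as A

# bbox_params makes albumentations transform the box coordinates
# together with the image.
transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.HueSaturationValue(p=0.5),
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2,
                           rotate_limit=10, p=0.5),
    ],
    bbox_params=A.BboxParams(format='pascal_voc', label_fields=['class_ids']),
)

# image: numpy array; boxes: list of [x_min, y_min, x_max, y_max]
augmented = transform(image=image, bboxes=boxes, class_ids=labels)
```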

Step 4: Predictions

Prohibitory detection
Mandatory detection
Danger detection
Other detection

Further improvements:

We can also train with a loss function other than binary cross-entropy, such as focal loss, which might help in classifying the detected objects.

The default anchor boxes in YOLOv3 were computed with K-means on the COCO dataset. We can create custom anchor boxes for our dataset by running K-means on its ground truth boxes instead.
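A sketch of that idea, clustering the ground truth (width, height) pairs with an IOU-based K-means (the column names reuse the loading sketch above; k=9 matches YOLOv3's 3 anchors × 3 scales):

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100):
    """Cluster box (width, height) pairs to obtain custom anchors."""
    centers = wh[np.random.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(iters):
        # IOU between every box and every center, with boxes treated as
        # corner-aligned so only width and height matter.
        inter = (np.minimum(wh[:, None, 0], centers[None, :, 0]) *
                 np.minimum(wh[:, None, 1], centers[None, :, 1]))
        union = (wh[:, 0] * wh[:, 1])[:, None] + \
                (centers[:, 0] * centers[:, 1])[None, :] - inter
        assign = np.argmax(inter / union, axis=1)
        for i in range(k):
            if np.any(assign == i):
                centers[i] = wh[assign == i].mean(axis=0)
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]

wh = (df[['x_max', 'y_max']].values
      - df[['x_min', 'y_min']].values).astype(float)
anchors = kmeans_anchors(wh)
```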

Step 5: Conclusion

This was my first experience with computer vision and my first deep-learning case study; I hope you enjoyed reading through it. I got to learn a lot of techniques while working on it. I thank AppliedAI and my mentor who helped me throughout this case study.

This concludes my work. Thank you for reading!

References:

  1. https://github.com/zzh8829/yolov3-tf2
  2. https://towardsdatascience.com/dive-really-deep-into-yolo-v3-a-beginners-guide-9e3d2666280e
  3. https://medium.com/@jonathan_hui/real-time-object-detection-with-yolo-yolov2-28b1b93e2088
  4. https://www.appliedaicourse.com/

You can also find and connect with me on LinkedIn and GitHub.
