Object detection: using non-max suppression over YOLOv2

Sarang Zambare
Mar 8, 2019 · 7 min read
Sample output: Walking around in Berkeley, CA

github: https://github.com/sarangzambare/object-detection

YOLOv2 (You Only Look Once) is one of the most popular algorithms for object detection. As the name implies, the objects and their bounding boxes are predicted in a single forward pass through the convolutional neural network, which makes it suitable for real-time object detection.

In this repository, I write custom post-processing methods that operate on the output of YOLOv2 to detect common objects encountered while driving in an urban environment.

What objects are we detecting ?

This program uses the COCO (Common Objects in Context) class list, which has 80 object categories. For reference, these classes are given in the file “coco_classes.txt”, and a few of the entries are :

traffic light
fire hydrant
stop sign
parking meter


The training data consists of images labelled with the objects they contain, along with the bounding box for each object. Each label is a vector encoding 6 entities :

[Pc, Bx, By, Bh, Bw, c]

where Pc is the probability that an object is present, Bx and By represent the centre of the box, and Bh and Bw are the height and width of the box respectively. The class label c is actually a vector of length 80, representing all 80 classes.
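As a rough illustration of the label vector described above, here is a minimal Python sketch. The function name `make_label`, the one-hot encoding of the class vector, and the sample box values are my assumptions for illustration and are not taken from the repository.

```python
def make_label(pc, bx, by, bh, bw, class_index, num_classes=80):
    """Build one label vector: [Pc, Bx, By, Bh, Bw, c_1, ..., c_n].

    Assumption: the class vector c is one-hot, i.e. 1.0 at the index of
    the labelled class and 0.0 everywhere else.
    """
    if not 0 <= class_index < num_classes:
        raise ValueError("class_index out of range")
    one_hot = [0.0] * num_classes
    one_hot[class_index] = 1.0
    return [pc, bx, by, bh, bw] + one_hot


# Hypothetical example: an object of class index 2 centred in the image.
label = make_label(1.0, 0.5, 0.5, 0.3, 0.2, class_index=2)
print(len(label))  # 85
```

With 80 classes, the 5 box entries plus the class vector give 85 numbers per label.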

The YOLO model :

YOLO calculates the probabilities and bounding boxes for objects in a single forward pass.

This is done by splitting the image into a 19x19 grid (other grid sizes are possible too) and detecting objects for each of the 19x19 blocks. However, the blocks are not processed sequentially. The whole thing is implemented as a convolution operation, meaning that the filter shapes and depths are chosen in such a way that the output is 19x19xchannels.

Having done this, the value at position (i, j) of the output holds the information about objects detected in block (i, j) of the grid.

For example:


  • Input image size is (448,448,3)
  • Number of classes to predict = 25
  • Grid size = (7,7)

Then, each label vector will be of length 30 (Pc, Bx, By, Bh, Bw, + 25 classes).
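The arithmetic behind this example can be checked with a small sketch. The helper `output_shape` is a hypothetical name of mine, and it only does the bookkeeping of shapes; it says nothing about the actual network layers.

```python
def output_shape(grid_size, num_classes, num_anchors=1):
    """Return the per-image output shape for a YOLO-style grid.

    Each box needs 5 numbers (Pc, Bx, By, Bh, Bw) plus one number per class.
    """
    rows, cols = grid_size
    vector_length = 5 + num_classes
    if num_anchors == 1:
        return (rows, cols, vector_length)
    return (rows, cols, num_anchors, vector_length)


print(output_shape((7, 7), num_classes=25))  # (7, 7, 30)

# Each grid block covers 448 / 7 = 64 pixels of the input along each side.
print(448 // 7)  # 64
```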

To have such an output shape, an example architecture will look like :

Anchor boxes :

Anchor boxes are predefined boxes of fixed height and width. The idea is to use a finite number of anchor boxes, such that any object detected fits snugly inside at least one of them. This limits the possible values of Bh and Bw, which would otherwise be unbounded, to a few predefined values.

Anchor boxes are chosen based on application. The crux of the idea is that different objects fall in different ratios of height and width. For example, if the classes in this program only had “person” and “car” , we could’ve used only two anchor boxes :

This approach reduces computational cost, which is essential, particularly in real-time detection.
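To make the "person" versus "car" intuition concrete, here is a small sketch that picks the anchor whose shape best matches a given box. Comparing shapes with an IoU computed as if both boxes shared the same centre is a common convention that I am assuming here; the anchor sizes and the names `shape_iou` and `best_anchor` are made up for illustration and are not the values used in this program.

```python
def shape_iou(box_hw, anchor_hw):
    """IoU of two boxes that share the same centre, given as (height, width)."""
    bh, bw = box_hw
    ah, aw = anchor_hw
    inter = min(bh, ah) * min(bw, aw)
    union = bh * bw + ah * aw - inter
    return inter / union


def best_anchor(box_hw, anchors):
    """Return the index of the anchor whose shape overlaps the box the most."""
    return max(range(len(anchors)), key=lambda i: shape_iou(box_hw, anchors[i]))


# Hypothetical anchors as (height, width): a tall one and a wide one.
ANCHORS = [(2.0, 0.8), (1.0, 2.2)]

print(best_anchor((1.8, 0.7), ANCHORS))  # 0 (tall box, e.g. a person)
print(best_anchor((0.9, 2.0), ANCHORS))  # 1 (wide box, e.g. a car)
```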

Visualizing YOLO :

In this program, the input images are preprocessed to be 608x608 pixels. The grid size I am using is 19x19. Hence, the following specifications:

  • Input shape is (m,608,608,3), where m is the number of images per batch.
  • Number of anchor boxes = 5
  • Number of classes to predict = 80
  • Output shape is therefore (m,19,19,5,85)

If the output shape is confusing, think of it as returning 5 boxes per block, where each of the 5 boxes has a vector of length 85 consisting of [Pc,Bx,By,Bh,Bw, + 80 classes]
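A minimal sketch of how one such 85-length vector can be split into its parts, and how many boxes the model returns per image. The helper name `split_prediction` and the dummy vector are my own, for illustration only.

```python
GRID = 19       # grid is GRID x GRID blocks
ANCHORS = 5     # boxes predicted per block
CLASSES = 80    # COCO classes


def split_prediction(vec, num_classes=CLASSES):
    """Split one box vector into (Pc, (Bx, By, Bh, Bw), class probabilities)."""
    if len(vec) != 5 + num_classes:
        raise ValueError("unexpected vector length")
    pc = vec[0]
    box = tuple(vec[1:5])
    class_probs = list(vec[5:])
    return pc, box, class_probs


# Dummy vector standing in for one box of the model output.
dummy = [0.9, 0.5, 0.5, 0.2, 0.1] + [0.0] * CLASSES
pc, box, class_probs = split_prediction(dummy)
print(pc, box, len(class_probs))

# Total number of boxes predicted for a single image.
print(GRID * GRID * ANCHORS)  # 1805
```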

Graphically :

To predict objects and their bounding boxes,

  • For each of the 361 blocks (19 × 19), we calculate probability scores by multiplying Pc by the individual class probabilities.
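For a single box, the score computation above can be sketched in plain Python as follows. The function names and the three-class example are assumptions for illustration; the real program works on whole tensors rather than one box at a time.

```python
def class_scores(pc, class_probs):
    """Multiply the object probability Pc by each class probability."""
    return [pc * p for p in class_probs]


def best_class(pc, class_probs):
    """Return (class index, score) of the highest-scoring class for one box."""
    scores = class_scores(pc, class_probs)
    index = max(range(len(scores)), key=lambda i: scores[i])
    return index, scores[index]


# Hypothetical box with only three classes, to keep the example short.
index, score = best_class(0.8, [0.1, 0.7, 0.2])
print(index, round(score, 2))  # 1 0.56
```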

After doing this, if we colour each of the 361 blocks with its predicted class, it gives us a nice way to visualize what’s being predicted :

Another way to visualize what’s being predicted is to plot the bounding boxes themselves. Without any filtering, there is a large number of boxes, with many boxes pertaining to the same object :

Non-max suppression and threshold filtering:

As is evident from the above figure, the YOLO algorithm outputs many boxes, most of which are irrelevant or redundant. Hence we need a way to filter and chuck out the unneeded boxes.

The first step, quite naturally, is to get rid of all the boxes which have a low probability of an object being detected. This can be done by constructing a boolean mask (tf.boolean_mask in tensorflow), and only keeping the boxes which have a probability of more than a certain threshold.
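The masking idea can be sketched without TensorFlow. The version below is a plain-Python stand-in for what tf.boolean_mask does on tensors, and the threshold of 0.6 and the function name `filter_by_score` are assumptions of mine, not necessarily what the repository uses.

```python
def filter_by_score(boxes, scores, classes, threshold=0.6):
    """Keep only the boxes whose score is above the threshold."""
    mask = [s > threshold for s in scores]

    def keep(items):
        return [item for item, m in zip(items, mask) if m]

    return keep(boxes), keep(scores), keep(classes)


# Hypothetical detections: (x1, y1, x2, y2) boxes, their scores and class ids.
boxes = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)]
scores = [0.9, 0.3, 0.75]
classes = [2, 2, 0]

kept_boxes, kept_scores, kept_classes = filter_by_score(boxes, scores, classes)
print(kept_boxes)  # [(0, 0, 10, 10), (20, 20, 30, 30)]
```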

This step gets rid of anomalous detections. However, even after such filtering, we end up with many boxes for each object detected, and we only need one. That box is selected using non-max suppression.

Non-max suppression makes use of a concept called “intersection over union” or IoU. It takes two boxes as input and, as the name implies, calculates the ratio of the area of their intersection to the area of their union.

For example, given two boxes A and B , IoU is calculated as :
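In words, IoU(A, B) is the area of the overlap between A and B divided by the area covered by A and B together. A minimal sketch, assuming each box is given by its corner coordinates (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (this box format is my assumption for the example):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
print(round(iou((0, 0, 2, 2), (1, 1, 3, 3)), 3))  # 0.143
```

An IoU of 1 means the boxes coincide, and 0 means they do not overlap at all.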

Having defined the IoU, non-max suppression works as follows :

Repeat until there are no boxes left to process:

  • Select the box with the highest probability of detection.
  • Remove all the boxes that have a high IoU (above a chosen threshold) with the selected box.
  • Mark the selected box as “processed”
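The loop above can be sketched in plain Python as follows. This is a simplified, single-class illustration and not the repository's code; the (x1, y1, x2, y2) box format, the IoU threshold of 0.5 and the function names are my assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, highest score first."""
    # Indices of unprocessed boxes, sorted by descending score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)   # select the highest-scoring remaining box
        kept.append(best)     # mark it as processed
        # Drop every remaining box that overlaps the selected one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return kept


# Two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

In TensorFlow, the same job is usually handed to tf.image.non_max_suppression rather than written by hand.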

This type of filtering makes sure that only one bounding box is returned per object detected. (Exceptions include cases where there is more than one object in a single grid block, but I omit that case here.)

Demonstration : Detecting things around Berkeley:

I put this program to the test on footage shot around Berkeley, CA. I chose a time when the weather was good and there was a lot of sun.

The pretrained model was trained on images from drive.ai, which were labelled as described in the first section of this article. The model can be found in the folder “model”.

The videos were recorded using an iPhone 8, and then resized to 1280x720.

Internally, the program resizes these images into squares of (608,608), as described in the previous sections.

Here are the results :

As can be seen in the above image, not all instances of the predicted classes are detected. For example, there are two cars and a bicycle which did not get detected in the above image.

This is because I am using only 5 anchor boxes for simplicity, and not all instances of the classes might fall into the five anchor boxes at every frame of the video.

Feel free to try running the code using more anchor boxes. The heights and widths of the anchor boxes can be modified in the folder “anchors”

Finally, here is a GIF of a video that I passed through this program.

If you analyse it frame by frame, you will notice that it still makes a lot of mistakes. Most of the mistakes are in the form of missing out some objects.

Some of them are outright stupid, like this one :

Cellphone ?! Whaa ???

But overall, even this basic model performs well. Especially regarding important classes like “car” and “person”.

References :

  1. You Only Look Once (YOLO): Unified, Real-Time Object Detection, Redmon, Divvala, Girshick, Farhadi
  2. YAD2K: Yet Another Darknet 2 Keras, Allan Zelener
