Tutorial: Understanding Intersection over Union and Non-Maximum Suppression

Jesse Annan
Apr 22, 2024 · 5 min read


A simple implementation of Intersection over Union (IoU) and Non-Maximum Suppression (NMS) for accurate object detection in PyTorch, written for easy understanding

Object detection involves localizing (drawing a bounding box around an object in an image) and classifying (identifying what object(s) is/are within an image, e.g., cat) all objects in a visual domain, such as an image or video. Earlier object detection approaches, like R-CNN and Faster R-CNN, rely on region proposals drawn from the image or its feature maps to predict which object each region contains and the extent of that object in the visual domain. The object can be any label in the dataset or background (any object not considered a label), and its extent is represented by x, y coordinates, conventionally simplified to the top-left (x1, y1) and bottom-right (x2, y2) corners of the box.

A challenge with region proposal methods is that they may propose far more regions than there are actual objects of interest in the visual domain (for example, R-CNN’s selective search algorithm proposes approximately 2000 regions). So how do we evaluate the quality of these proposed boxes and determine which box captures the object of interest most accurately? Two techniques are commonly used to score and filter them:

  • Intersection over Union (IoU)
  • Non-Maximum Suppression (NMS)

These techniques are used to refine the accuracy of object localization and classification in object detection tasks.

Object Detection Algorithm; Image Source — Mathworks

INTERSECTION OVER UNION

As the name suggests, IoU measures the similarity between any two bounding boxes by calculating the ratio of the area of their intersection to the area of their union. This metric quantifies the likelihood that two boxes contain the same object of interest or, more broadly, capture overlapping areas within an image. IoU also reflects the uniqueness of bounding boxes: a smaller IoU between two boxes indicates that they cover distinct regions of an image or capture different objects, while a higher IoU signifies greater overlap, indicating that the boxes roughly capture the same object.
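
Concretely, for two boxes given in (x_min, y_min, x_max, y_max) format, IoU reduces to a few lines of arithmetic. Below is a minimal from-scratch sketch (the helper name compute_iou is my own; later in this tutorial we use torchvision’s built-in box_iou instead):

def compute_iou(box_a, box_b):
    # corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # the intersection area is zero when the boxes do not overlap
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    # union = sum of both areas minus the double-counted intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

Identical boxes give an IoU of 1.0; disjoint boxes give 0.0.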

Intersection over Union; Image Source — DataCamp

NON-MAXIMUM SUPPRESSION (NMS)

When using, say, Faster R-CNN, how do we determine which proposed anchor box captures the object of interest most accurately? NMS is pretty much the judge of this. NMS is a procedure (not a metric) that eliminates redundant boxes and retains only distinct ones based on their IoU and objectness scores. The objectness score reflects the model’s confidence that a bounded area contains an object of interest, regardless of its class. Here’s how NMS works:

  1. NMS starts by selecting the box with the highest objectness score.
  2. It then computes the IoU of this box with every remaining box that has a lower objectness score.
  3. Boxes with an IoU greater than a specified threshold (e.g., 0.7) are eliminated, since they likely capture the same object as the selected box.
  4. The process repeats on the surviving boxes until every box has been either kept or suppressed, retaining only the most relevant and distinct bounding boxes (a minimal from-scratch sketch follows below).
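
For concreteness, here is a minimal from-scratch sketch of that loop (the helper name simple_nms is my own; we use torchvision’s built-in nms later in this tutorial):

import torch
from torchvision.ops import box_iou

def simple_nms(boxes, scores, iou_threshold):
    # boxes: float tensor of shape [N, 4] in (x_min, y_min, x_max, y_max) format
    # scores: tensor of shape [N] with one objectness score per box
    order = scores.argsort(descending=True)  # box indices, highest score first
    keep = []
    while order.numel() > 0:
        best = order[0]  # highest-scoring box still in play
        keep.append(best.item())
        if order.numel() == 1:
            break
        # IoU of the best box against all lower-scoring survivors
        ious = box_iou(boxes[best].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        # suppress boxes that overlap the best box more than the threshold
        order = order[1:][ious <= iou_threshold]
    return torch.tensor(keep)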

In summary, NMS and IoU assist in refining the boxes proposed by a model/algorithm (anchor boxes) by prioritizing those with high objectness scores and removing redundant or overlapping predictions.

Before [left] and after [right] applying NMS; Image Source — DataCamp

A simple implementation of IoU and NMS in PyTorch

Loading the dataset and creating data loader

# loading helpful libraries
import torch
import matplotlib.pyplot as plt
from torchvision import transforms
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

torch.manual_seed(6)  # random seed for reproducibility

""" What we need:
1. Transformations for our images
2. Dataset (structured this way):
   |__ dogs-vs-cats/train
       |__ dog (folder with dog images)
       |__ cat (folder with cat images)
3. Dataloader
"""
# transformations for our images
trainformers = transforms.Compose([
    transforms.PILToTensor(),  # uint8 image tensor, pixel range [0, 255]
    transforms.Resize((224, 224)),
])

# loading our dataset
train_dataset = ImageFolder(
    root='./datasets/dogs-vs-cats/train',
    transform=trainformers,
)

# creating train dataloader
trainloader = DataLoader(
    dataset=train_dataset,
    batch_size=1,
    shuffle=True,
)

# now, we can inspect an image in our dataset (dataloader)
sample_image, image_label = next(iter(trainloader))  # dimensions: [batch_size, # channels, image height, image width]
sample_image.squeeze_(0)  # removes the batch_size dimension in place
# move the channel dimension to the last position for matplotlib
this_image = sample_image.permute(1, 2, 0)  # dimensions: [image height, image width, # channels]
plt.imshow(this_image)
plt.title(
    f'Ground truth label: {train_dataset.classes[image_label.item()]}'
)
plt.show()
Cat Image; Dataset source — Kaggle

Drawing Sample Bounding Boxes

# drawing (arbitrary) bounding boxes on our sample image above
from torchvision.utils import draw_bounding_boxes

# x_min, y_min, x_max, y_max
box1 = [20, 5, 190, 220]   # green box
box2 = [15, 30, 210, 200]  # red box
bounding_box_tensor = torch.tensor([box1, box2])

# drawing the bounding boxes on our sample image (expects a uint8 image tensor)
draw_boxes = draw_bounding_boxes(
    image=sample_image,
    boxes=bounding_box_tensor,
    width=2,
    colors=['green', 'red'],
)

# convert the tensor to a PIL image for visualization
img_transformer = transforms.ToPILImage()
sample_boxed_image = img_transformer(draw_boxes)

plt.imshow(sample_boxed_image)
# let's add arbitrary objectness scores
plt.text(107, 17, "P(cat) = 0.95", color="green", fontsize="large", fontweight="bold")
plt.text(25, 193, "P(cat) = 0.80", color="red", fontsize="large", fontweight="bold")
plt.show()
Cat Image with two bounding boxes

Calculating IoU

from torchvision.ops import box_iou

# our bounding boxes: x_min, y_min, x_max, y_max
box1 = [20, 5, 190, 220]   # green box
box2 = [15, 30, 210, 200]  # red box

iou_score = box_iou(
    torch.tensor(box1).unsqueeze(0),  # convert each box from a list to a [1, 4] tensor
    torch.tensor(box2).unsqueeze(0),
)

print(
    f'IoU score = {iou_score.item():.2f}'
)
# score = 0.71 (approximately)

IoU score = 0.71 (approximately)
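
As a sanity check, we can reproduce this by hand. The intersection rectangle spans x in [20, 190] and y in [30, 200], for an area of 170 × 170 = 28,900. The green box has area 170 × 215 = 36,550 and the red box 195 × 170 = 33,150, so the union is 36,550 + 33,150 - 28,900 = 40,800, and IoU = 28,900 / 40,800 ≈ 0.71.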

Applying NMS

from torchvision.ops import nms

# our bounding boxes and scores as tensors
box1 = [20, 5, 190, 220]   # green box with score 0.95
box2 = [15, 30, 210, 200]  # red box with score 0.80
bounding_box_tensor = torch.tensor([box1, box2]).float()  # nms expects float boxes
scores_tensor = torch.tensor([0.95, 0.80])

choose_box_idx = nms(
    boxes=bounding_box_tensor,
    scores=scores_tensor,
    iou_threshold=0.70,  # suppress any box whose IoU with a higher-scoring box exceeds 0.70
)

keep_box = bounding_box_tensor[choose_box_idx]
print(
    f'Best bounding box: {keep_box}'
)
# we keep the green box :D

Best bounding box: tensor([[ 20.,   5., 190., 220.]])
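
Note that the outcome hinges on the threshold: the two boxes have an IoU of about 0.71, so with iou_threshold=0.75 neither would be considered redundant and nms would keep both.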

Expected Prediction: cat image with the green bounding box

Resources

  1. Justin Johnson’s Deep Learning for Computer Vision; Lecture 15
  2. DataCamp’s Deep Learning in Python

“May The Data Be With You” — Anonymous
