Vision beyond classification, Task 1: Object detection

Nghi Huynh
5 min read · Apr 9, 2022


Notes from Lecture 4 of the 2020 DeepMind Deep Learning Lecture Series

What is object detection?

Object detection is a combined classification and localization task. Its goal is to locate the objects present in an image with bounding boxes and to assign each one a class label.

Input: an RGB image of shape H x W x 3 containing one or more objects.

Output: for each object present in the image (Figure 1):

  • Class label: a one-hot encoding of the class, e.g., (0, 0, 0, 1, 0).
  • Object bounding box: for the location of each object, we output (xᶜ, yᶜ, h, w), where (xᶜ, yᶜ) is the coordinate of the box center and (h, w) are the corresponding height and width.

Figure 1: The desired output for object detection

How to prepare a dataset to train for object detection?

We need N_train training samples and N_test test samples. Each sample contains an RGB image together with a list of objects, where each object has a one-hot label over the K object classes and a bounding box giving its coordinates.

N_train, N_test samples:
{'image': p ∈ [0, 1], H x W x 3,
 'objects': [
   {'label': one_hot(K), 1 x K,
    'bbox': (xᶜ, yᶜ, h, w) ∈ ℝ⁴, 1 x 4},
   {'label': one_hot(K), 1 x K,
    'bbox': (xᶜ, yᶜ, h, w) ∈ ℝ⁴, 1 x 4},
   ...
 ]}
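
As a concrete illustration, here is a minimal Python sketch of one such sample, assuming NumPy; the class count K = 5 and the helper names (one_hot, make_sample) are hypothetical, not from the lecture.

```python
import numpy as np

K = 5  # hypothetical number of object classes

def one_hot(label_index, num_classes=K):
    """Return a length-K one-hot vector for a class index."""
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[label_index] = 1.0
    return vec

def make_sample(height=480, width=640):
    """Build one hypothetical training sample matching the schema above."""
    return {
        # RGB image with pixel values in [0, 1], shape H x W x 3
        "image": np.random.rand(height, width, 3).astype(np.float32),
        "objects": [
            {"label": one_hot(3),                      # class index 3 -> (0, 0, 0, 1, 0)
             "bbox": np.array([0.5, 0.4, 0.3, 0.2])},  # (xc, yc, h, w)
        ],
    }
```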

How to learn to predict bbox coordinates?

Since bounding-box coordinates are real-valued, a classification loss cannot quantify how far off a prediction is, so we predict the coordinates with regression. We use a quadratic loss (Figure 2), ℓ(x, t) = (t − x)², to give feedback on the accuracy of a predicted bounding box. Our goal is then to minimize the quadratic loss (the mean squared error) over all samples.

Figure 2: Quadratic loss

Where:

  • t: ground truth
  • x: prediction
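
To make the loss concrete, here is a minimal NumPy sketch of the quadratic loss between a predicted box x and a ground-truth box t (the function name is mine, not from the lecture):

```python
import numpy as np

def quadratic_loss(x, t):
    """Mean squared error between prediction x and ground truth t,
    each of shape (4,) holding (xc, yc, h, w)."""
    return np.mean((t - x) ** 2)

pred   = np.array([0.48, 0.41, 0.32, 0.19])  # predicted (xc, yc, h, w)
target = np.array([0.50, 0.40, 0.30, 0.20])  # ground-truth box
print(quadratic_loss(pred, target))          # small value -> accurate box
```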

Generally, an image contains more than one object (Figure 1). Thus, we use two steps to identify which ground-truth bounding box each prediction should be compared with; a common matching criterion, intersection over union, is sketched after this list.

  1. Classification: we first discretize the output space into one-hot labels to classify which ground-truth bounding box the prediction belongs to.
  2. Regression: once the ground-truth bounding box is identified, we refine the predicted coordinates through regression.
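
A common way to implement the matching in step 1 is via intersection over union (IoU): assign each prediction to the ground-truth box it overlaps most. This is a minimal sketch under that assumption; the helper names are hypothetical:

```python
import numpy as np

def to_corners(box):
    """Convert (xc, yc, h, w) to corner format (x1, y1, x2, y2)."""
    xc, yc, h, w = box
    return xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2

def iou(box_a, box_b):
    """Intersection over union of two (xc, yc, h, w) boxes."""
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))  # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))  # overlap height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def match_ground_truth(pred_box, gt_boxes):
    """Step 1: assign the prediction to the ground-truth box with highest IoU."""
    return int(np.argmax([iou(pred_box, gt) for gt in gt_boxes]))
```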

Faster R-CNN:

Faster R-CNN (Figure 3) is a two-stage object detector composed of two modules:

Figure 3: Faster R-CNN: a single, unified network for object detection (Source: Figure from Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Ren et al., 2016)

1. Region Proposal Network (RPN): a deep, fully convolutional network that takes an image (of any size) as input and predicts a set of rectangular object proposals (candidate bounding boxes), each with an objectness score.

How does the RPN work?

  • Discretize the bbox space (xᶜ, yᶜ, h, w) (Figure 4):
    → anchor points for (xᶜ, yᶜ): to discretize the space of box centers, distribute anchor points uniformly over the image.
    → scales and ratios for (h, w): choose candidate bboxes with different scales and aspect ratios.

Figure 4: Discretizing bbox space

  • n candidates per anchor: generally, we choose 3 different scales with 3 different aspect ratios, giving n = 9 candidate boxes per anchor point (sketched below).
  • Predict an objectness score for each bbox: we train a classifier to predict whether a bbox contains an object.
  • Sort the candidates by objectness score and keep the top K.
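
As a rough illustration of the anchor enumeration above, here is a minimal NumPy sketch; the stride, scales, and ratios are illustrative values (the Faster R-CNN paper uses 3 scales × 3 ratios = 9 anchors per location), not necessarily the lecture's exact settings:

```python
import itertools
import numpy as np

def generate_anchors(img_h, img_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate candidate boxes (xc, yc, h, w) on a uniform grid of anchor points."""
    anchors = []
    for yc in range(stride // 2, img_h, stride):      # uniform anchor points
        for xc in range(stride // 2, img_w, stride):
            for scale, ratio in itertools.product(scales, ratios):
                h = scale * np.sqrt(ratio)            # ratio = h / w, h * w = scale^2
                w = scale / np.sqrt(ratio)
                anchors.append((xc, yc, h, w))
    return np.array(anchors, dtype=np.float32)

anchors = generate_anchors(600, 800)
print(anchors.shape)  # (number of anchor points * 9, 4)
```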

2. A detection network: a Fast R-CNN head (fully connected layers) that takes the regions proposed by the RPN as input and performs classification and bounding-box refinement.

Note: the RPN and Fast R-CNN are trained independently, with an alternating training technique used to share convolutional layers between them.

RetinaNet:

RetinaNet (Figure 5) is a one-stage object detector composed of four components.

Figure 5: The one-stage RetinaNet network architecture that uses a Feature Pyramid Network (FPN) backbone on top of a feedforward ResNet architecture.

1. Feature Pyramid Network (FPN) Backbone: augments a standard convolutional network with a top-down pathway and lateral connections. Hence, the network efficiently constructs a rich, multi-scale feature pyramid from a single resolution input image.

2. Anchors: use translation-invariant anchor boxes similar to those in the RPN variant of Faster R-CNN. Each anchor is assigned:

  • a length K one-hot vector of classification targets, where K is the number of object classes.
  • a 4-vector of box regression targets.

3. Classification subnet: a small fully convolutional network (FCN) that predicts the probability of object presence at each spatial position for each of the A anchors and K object classes. In brief, it performs convolutional object classification on the backbone’s output.

4. Box regression subnet: another FCN that regresses the offset from each anchor box to a nearby ground-truth object, if one exists. In brief, it performs convolutional bounding-box regression.
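
As a sketch of what these regression targets typically look like, here is the standard offset parametrization introduced in Faster R-CNN and reused by RetinaNet: the network regresses normalized offsets between an anchor and its matched ground-truth box rather than raw coordinates.

```python
import numpy as np

def encode_offsets(gt, anchor):
    """Standard box-regression targets (tx, ty, th, tw) for one anchor.

    gt, anchor: boxes given as (xc, yc, h, w)."""
    gx, gy, gh, gw = gt
    ax, ay, ah, aw = anchor
    return np.array([
        (gx - ax) / aw,   # horizontal centre shift, scaled by anchor width
        (gy - ay) / ah,   # vertical centre shift, scaled by anchor height
        np.log(gh / ah),  # log height ratio
        np.log(gw / aw),  # log width ratio
    ])
```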

Note: at the time of this lecture, RetinaNet was the state-of-the-art model in object detection. However, this one-stage detector achieves top results not through innovations in network design but through a novel loss: the Focal Loss (FL).

What is Focal Loss (FL)?

Focal Loss (FL) is designed to address the one-stage object detection setting, in which there is an extreme imbalance between foreground and background classes during training (Figure 6). The FL is built on top of the cross-entropy (CE) loss for binary classification: a modulating factor (1 − pₜ)ᵞ scales the CE loss, giving FL(pₜ) = −(1 − pₜ)ᵞ log(pₜ), with tunable focusing parameter γ ≥ 0.

Figure 6: The Focal Loss

The FL is used as the loss on the output of the classification subnet. It focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training.
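
Below is a minimal NumPy sketch of the binary focal loss, including the α-balanced weighting used in the paper (γ = 2 and α = 0.25 are the paper's reported defaults):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: FL(pt) = -alpha_t * (1 - pt)^gamma * log(pt).

    p: predicted probability of the foreground class, in (0, 1).
    y: ground-truth label, 1 for foreground, 0 for background.
    """
    pt = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - pt) ** gamma * np.log(pt)

# Easy negatives (confidently classified background) are strongly down-weighted
# relative to plain cross-entropy, so they no longer dominate training.
print(focal_loss(0.05, y=0))  # easy negative -> tiny loss
print(focal_loss(0.05, y=1))  # hard positive -> large loss
```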

Summary:

This post explored the first task beyond classification, object detection, along with state-of-the-art deep learning models designed to address it.

Specifically, we addressed:

  • Object detection: a computer vision task that locates the presence of objects in an image and indicates their location with a bounding box.
  • Faster R-CNN: a two-stage detector composed of a Region Proposal Network (RPN) and a Fast R-CNN detection network.
  • RetinaNet: a one-stage detector that uses Focal Loss to address the class imbalance during training.

Coming next are notes from Lecture 4 of DeepMind’s deep learning series: Vision beyond classification, Task 2: Semantic segmentation.


Nghi Huynh

I’m enthusiastic about applications of deep learning in the medical domain. Here, I want to share my learning journey with you. ^^