Object Detection and Classification in Computer Vision

Rina Mondal
5 min read · Jan 2, 2024


In computer vision, the combined task of object detection and classification refers to simultaneously locating multiple objects within a single image (finding the bounding box of each object) and categorizing them (identifying the class of each object).

In this blog, we will discuss different approaches to object detection and classification:

  1. Sliding window approach
  2. Region-based Convolutional Neural Network (R-CNN)
  3. Fast R-CNN
  4. Faster R-CNN
  5. YOLO
  6. SSD

1. Sliding window approach: In this method, we systematically move a rectangular window of a fixed size across the input image and apply a CNN (convolutional neural network) to the crop inside each window, classifying it as either an object class or background. To find objects of different sizes, the process is repeated with windows of several scales. Because the CNN has to be applied at a huge number of locations and scales, this approach is computationally expensive.
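To get a feel for why this becomes expensive, here is a minimal sketch of the window enumeration. The names `sliding_windows` and `detect`, the window sizes, and the stride are illustrative assumptions, and `classify` stands in for whatever CNN scores a crop.

```python
from typing import Callable, Iterator, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def sliding_windows(img_w: int, img_h: int, win: int, stride: int) -> Iterator[Box]:
    """Yield every square window of side `win` that fits fully inside the image."""
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            yield (x, y, win, win)


def detect(img_w: int, img_h: int, classify: Callable[[Box], float],
           sizes: Tuple[int, ...] = (32, 64), stride: int = 16,
           thresh: float = 0.5) -> List[Box]:
    """Score every window at every scale; keep windows scored at or above `thresh`.

    `classify` is a hypothetical stand-in for a CNN that returns an object score
    for the crop described by the box.
    """
    hits = []
    for win in sizes:
        for box in sliding_windows(img_w, img_h, win, stride):
            if classify(box) >= thresh:
                hits.append(box)
    return hits
```

Even a single 64-pixel window moved in steps of 8 pixels over a 640x480 image yields several thousand crops, and each one would need its own forward pass through the classifier.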

2. Region-based Convolutional Neural Network (R-CNN): R-CNN operates by first proposing potential object regions in an image using an external algorithm, such as selective search. These regions are then individually passed through a pre-trained convolutional neural network (CNN) to extract features. Subsequently, support vector machines (SVMs) classify the content of each region into different object classes, and bounding box regressors refine the proposed bounding box coordinates.
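The bounding box refinement step can be sketched as follows. This assumes the commonly used parameterization in which the regressor outputs four offsets (dx, dy, dw, dh) relative to the proposal; the function name `apply_box_deltas` is made up for illustration.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (center_x, center_y, width, height)


def apply_box_deltas(proposal: Box, deltas: Tuple[float, float, float, float]) -> Box:
    """Refine a proposal with regressor outputs (dx, dy, dw, dh).

    The center shifts by a fraction of the proposal size, and the width and
    height are rescaled in log space so they always stay positive.
    """
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    cx = px + pw * dx
    cy = py + ph * dy
    w = pw * math.exp(dw)
    h = ph * math.exp(dh)
    return (cx, cy, w, h)
```

With all four offsets at zero the proposal is returned unchanged, so the regressor only has to learn small corrections.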

3. Fast R-CNN: Fast R-CNN addressed the computational inefficiencies of its predecessor. Instead of processing each region proposal independently, it passed the entire image through a convolutional neural network (CNN) once to generate feature maps. These feature maps were then used for region of interest (RoI) pooling, allowing efficient extraction of fixed-size feature vectors for each proposed region. The model combined feature extraction, object classification, and bounding box regression into a single, end-to-end trainable network, significantly speeding up both training and inference. The region proposals themselves still came from an external algorithm such as selective search.
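RoI pooling is easier to understand with a small example. The sketch below max-pools a rectangular region of a single-channel feature map into a fixed grid. It is a simplified illustration, not the layer from any particular library: the name `roi_max_pool`, the integer bin edges, and the assumption that the region is non-empty and lies inside the map are all choices made here.

```python
from typing import List, Tuple

FeatureMap = List[List[float]]  # single-channel map indexed as [row][col]


def roi_max_pool(fmap: FeatureMap, roi: Tuple[int, int, int, int],
                 out_size: int = 2) -> FeatureMap:
    """Max-pool region (x0, y0, x1, y1), end-exclusive, into out_size x out_size bins."""
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_size):
        # Integer bin edges; every bin covers at least one row.
        r0 = y0 + (i * h) // out_size
        r1 = max(y0 + ((i + 1) * h) // out_size, r0 + 1)
        row = []
        for j in range(out_size):
            c0 = x0 + (j * w) // out_size
            c1 = max(x0 + ((j + 1) * w) // out_size, c0 + 1)
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Regions of different sizes all come out as the same `out_size` x `out_size` grid, which is what lets one set of fully connected layers handle every proposal.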

Drawbacks:

  1. Fast R-CNN still depends on an external algorithm, such as selective search, to generate region proposals. This step runs separately from the network and becomes the main bottleneck, especially when dealing with large datasets.
  2. RoI pooling produces fixed-size features for regions of any size, but in practice input images are still rescaled to a standard scale during training and testing, so handling images of very different sizes requires additional pre-processing.

4. Faster R-CNN: In this method, we run the entire image through some convolutional layers once to get a convolutional feature map representing the whole high-resolution image. A separate region proposal network (RPN) then works on top of those convolutional features and predicts its own region proposals inside the network, replacing the external proposal algorithm. Once we have the predicted region proposals, the rest looks just like Fast R-CNN: we crop the convolutional features for each proposal and pass them on to the rest of the network for classification and box refinement.

Figure: Faster R-CNN

This whole family of methods (R-CNN, Fast R-CNN and Faster R-CNN) is known as the region-based family. But there is another family of methods for object detection that works in a single feed-forward pass.

Two of them are YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector).

The idea is that, rather than processing each potential region independently, we treat detection as a regression problem and make all the predictions at once with one big convolutional network.

Working Methodology of YOLO:

  1. Grid Division: The input image is divided into a grid. The original YOLO uses a 7x7 grid, and YOLOv2 uses a 13x13 grid.
  2. Bounding Box Prediction: Each grid cell is responsible for detecting objects whose center falls inside it. YOLO predicts multiple bounding boxes for each grid cell, along with a confidence score for each box.
  3. Class Prediction: YOLO also predicts a probability distribution over the object classes. In the original YOLO this is predicted once per grid cell, while later versions predict it for each box.
  4. Confidence Score: The confidence score reflects how certain the algorithm is that the predicted bounding box contains an object and how well the box fits that object. Multiplying it by the class probabilities gives a class-specific score for each box.
  5. Non-Maximum Suppression: After predictions are made for the entire image, a post-processing step called non-maximum suppression is applied. This step removes duplicate or low-confidence predictions, retaining only the most confident and accurate ones.
  6. Final Detection: The remaining bounding boxes with their associated class labels and confidence scores are considered as the final detections.
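The first few steps above can be sketched in a few lines. The defaults below (a 7x7 grid, 2 boxes per cell, 20 classes) follow the original YOLO setup; the function names are hypothetical and no real network is involved.

```python
from typing import List, Tuple


def responsible_cell(cx: float, cy: float, img_w: int, img_h: int,
                     S: int = 7) -> Tuple[int, int]:
    """Return (row, col) of the grid cell that contains the object center."""
    col = min(int(cx / img_w * S), S - 1)  # clamp centers on the right/bottom edge
    row = min(int(cy / img_h * S), S - 1)
    return row, col


def output_size(S: int = 7, B: int = 2, C: int = 20) -> Tuple[int, int, int]:
    """Each cell predicts B boxes (x, y, w, h, confidence) plus C class probabilities."""
    return (S, S, B * 5 + C)


def class_scores(box_confidence: float, class_probs: List[float]) -> List[float]:
    """Class-specific score for one box: box confidence times each class probability."""
    return [box_confidence * p for p in class_probs]
```

For example, an object centered in the middle of a 448x448 image falls into cell (3, 3) of a 7x7 grid, and with these defaults the network output has shape 7x7x30.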

SSD (Single Shot MultiBox Detector):

Feature Extraction: SSD uses a base convolutional neural network (CNN), often based on a pre-trained model like VGG, ResNet, or MobileNet, to extract features from the input image.

Multiscale Feature Maps: Instead of using a single grid as in YOLO, SSD uses multiple feature maps at different scales. Each feature map is responsible for detecting objects of specific sizes.

Default Boxes (Anchor Boxes): For each position on the feature maps, SSD predicts bounding boxes with different aspect ratios and scales. These predefined bounding boxes are called “default boxes” or “anchor boxes.”
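Here is a small sketch of how default boxes could be generated for one square feature map. It assumes scales are spaced linearly between a minimum and a maximum across the feature maps, and that a box with aspect ratio r has width scale * sqrt(r) and height scale / sqrt(r). The function names and the default ratios are illustrative, and the extra intermediate-scale box that SSD adds for ratio 1 is left out to keep it short.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h), relative to image size


def layer_scale(k: int, m: int, s_min: float = 0.2, s_max: float = 0.9) -> float:
    """Scale for the k-th of m feature maps (k starts at 1, m must be at least 2)."""
    return s_min + (s_max - s_min) * (k - 1) / (m - 1)


def default_boxes(fmap_size: int, scale: float,
                  ratios: Tuple[float, ...] = (1.0, 2.0, 0.5)) -> List[Box]:
    """One box per aspect ratio, centered on every cell of the feature map."""
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx = (j + 0.5) / fmap_size
            cy = (i + 0.5) / fmap_size
            for r in ratios:
                boxes.append((cx, cy, scale * math.sqrt(r), scale / math.sqrt(r)))
    return boxes
```

A 4x4 feature map with three aspect ratios gives 48 default boxes. Larger feature maps get smaller scales, so they cover small objects, while the coarse maps cover large ones.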

Bounding Box and Class Predictions: For each default box, SSD predicts the offsets for the bounding box and the class probabilities. The offsets adjust the position and dimensions of the default box to better fit the true object bounding box.

Confidence Scores: Unlike YOLO, SSD does not predict a separate objectness score. Instead, it predicts a confidence score for every class, including a background class, for each default box. A high background score means the box is unlikely to contain an object.

Non-Maximum Suppression: After predictions are made for all default boxes, non-maximum suppression is applied to remove redundant or low-confidence predictions.

Final Detection: The remaining bounding boxes with their associated class labels and confidence scores are considered as the final detections.
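Both YOLO and SSD finish with non-maximum suppression, so here is a minimal sketch of the greedy version. The thresholds are arbitrary example values, boxes are assumed to be (x1, y1, x2, y2) corners, and real detectors usually run this separately for each class.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_thresh: float = 0.5, score_thresh: float = 0.0) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    # Drop low-confidence boxes, then visit the rest from most to least confident.
    order = sorted((i for i, s in enumerate(scores) if s >= score_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep
```

Two heavily overlapping boxes around the same object collapse to the one with the higher score, while a box elsewhere in the image is left alone.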

These are the popular approaches used today for object detection and localization in an image. Try them out yourself. :) If you have any questions, please ask in the comments.


Rina Mondal

I have 8 years of experience and have always enjoyed writing articles. If you appreciate my work, please follow me so that I can continue pursuing my passion.