How Object Detection Evolved (Part 2)

From Region Proposals and Haar Cascades to Zero-Shot Techniques

Andrii Polukhin
5 min read · Jun 26, 2023

Object detection algorithms have advanced from early computer-vision techniques to deep learning, and modern systems draw on a range of methods to achieve accurate detection.

This blog post is the next chapter in my ongoing series, “The Evolution of Object Detection: From Region Proposals and Haar Cascades to Zero-Shot Techniques”.

For a comprehensive understanding of the subject, I highly recommend exploring the preceding and next sections of this story, which can be accessed through the following links:

Without further ado, let us resume our expedition through the realm of object detection.

Traditional Detection Methods

The world of object detection has changed significantly since methods for face detection were first actively researched. In this article, we pick up the story in 2001, by which time several surveys of face detection methods had already been published.

At that time, there were two main approaches: Image-based and Feature-based. Image-based approaches used methods such as SVMs (Support Vector Machines) and Linear Subspace Methods. They also made use of convolutional neural networks (CNNs) like LeNet, which played a significant role in early image recognition tasks. Statistical methods were also employed, including techniques such as Gaussian mixture models and probabilistic models based on the normal distribution.

Although some of these methods remain interesting from a research perspective and are worth knowing for background, they are no longer used in modern object detection systems. Modern approaches instead rely on large neural networks that enable efficient image comparison and object recognition, and they deliver far better results.

Figure 1. Face Detection Methods in 2001 | Source: Face Detection: A Survey

Viola-Jones Detectors (2001)

One of the earliest of these algorithms is the Haar cascade, also known as the Viola-Jones algorithm.

Figure 2. Viola-Jones algorithm parts: (a) combination of regions, (b) Haar features, (c) cascade classifier, (d) Haar feature applied to the image, and (e) LBP feature. | Source: Selection of Viola–Jones algorithm parameters for specific conditions

The Haar cascade algorithm is based on a simple idea: if we want to detect faces in an image, we can exploit the fact that, generally speaking, all faces share similar characteristics, such as two eyes, a nose, and a mouth. For example, the eyes usually have a certain shape, the eye region tends to be darker because of shadows, and the cheeks and the bridge of the nose appear brighter in a photograph.

Thus, we can form a set of templates that describe these facial characteristics. Each template is a small arrangement of light and dark rectangles. Sliding a template over image patches and taking the difference between the pixel sums under its light and dark rectangles yields a feature value, and these sums can be computed in constant time with an integral image. The resulting feature responses are then analyzed for object detection.
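To make this concrete, here is a minimal NumPy sketch (not the original implementation) of evaluating a two-rectangle Haar-like feature in constant time via an integral image:

```python
import numpy as np

def integral_image(img):
    """Integral image with a zero border: ii[y, x] = sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in any rectangle via four integral-image lookups, O(1)."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_edge_feature(ii, x, y, w, h):
    """Two-rectangle 'edge' feature: top half-rectangle minus bottom half."""
    half = h // 2
    return rect_sum(ii, x, y, w, half) - rect_sum(ii, x, y + half, w, half)
```

A patch whose upper band is brighter than its lower band (like a forehead above an eye region) produces a large positive response, which is exactly what such a feature is selected to detect.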

The cascade structure of the Haar algorithm is what makes it fast. The authors use a boosting method and apply different templates sequentially, which allows the detection of faces under considerable variability, such as tilt and lighting conditions. As the cascade of classifiers is applied in sequence, the algorithm decides at each stage whether to continue evaluating a candidate region as a face or to reject it immediately.
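The early-rejection logic of the cascade can be sketched as follows. This is a simplified illustration with hypothetical stage parameters; in practice each stage is a boosted ensemble learned with AdaBoost:

```python
def weak_classifier(feature_value, theta, polarity=1):
    """Decision stump over a single Haar feature value: returns 1 ('face-like')
    if polarity * feature_value < polarity * theta, else 0."""
    return 1 if polarity * feature_value < polarity * theta else 0

def cascade_classify(features, stages):
    """Each stage is (weak_params, alphas, stage_threshold): a weighted vote of
    decision stumps. A window is rejected at the first failing stage, so most
    non-face windows are discarded after only a few cheap checks."""
    for weak_params, alphas, stage_threshold in stages:
        score = sum(alpha * weak_classifier(features[i], theta, pol)
                    for alpha, (i, theta, pol) in zip(alphas, weak_params))
        if score < stage_threshold:
            return False  # rejected early, no further stages evaluated
    return True  # survived every stage: candidate face
```

The key design choice is that early stages are tiny (a handful of features), so the average cost per window stays low even though later stages are more discriminative.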

As a result, we get an object detector that works quickly and can show good results when various factors, including training data, feature selection, and application context, are considered.

HOG Detector (2005)

The HOG (Histogram of Oriented Gradients) algorithm was introduced by Dalal and Triggs in 2005 and, unlike deep learning methods for image processing, does not use neural networks.

  1. First, the image is divided into small cells of 8x8 pixels. For each cell, gradients are calculated, resulting in a set of gradient values. These values are accumulated into a histogram with a specified number of orientation bins, representing the distribution of gradient directions in that cell. The histograms from multiple cells are concatenated to form the feature vector.
  2. Next, the histograms are normalized across overlapping blocks of cells, typically with an L2 norm. This block normalization makes the descriptor robust to local changes in illumination and contrast across different parts of the image.
  3. After normalizing the histograms, a descriptor is computed for each region covered by a sliding window that moves across the image at multiple scales and aspect ratios. By examining these detection windows and comparing the feature vectors extracted from them, objects like faces can be detected. A trained classifier, often a support vector machine (SVM), is used to determine whether the object of interest is present.
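A minimal NumPy sketch of the first two steps above, building one cell's orientation histogram and L2-normalizing a block vector, might look like this (simplified: real HOG implementations also interpolate votes between neighboring bins and cells):

```python
import numpy as np

def cell_histogram(cell, n_bins=9):
    """Orientation histogram for one 8x8 cell: each pixel votes for the bin of
    its (unsigned) gradient direction, weighted by gradient magnitude."""
    gy, gx = np.gradient(cell.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned, [0, 180)
    bin_width = 180.0 / n_bins
    hist = np.zeros(n_bins)
    for mag, ang in zip(magnitude.ravel(), orientation.ravel()):
        hist[int(ang // bin_width) % n_bins] += mag
    return hist

def l2_normalize(block_vector, eps=1e-6):
    """L2 block normalization, which gives HOG its illumination robustness."""
    return block_vector / np.sqrt(np.sum(block_vector ** 2) + eps)
```

A cell containing a pure horizontal intensity ramp, for instance, puts all of its votes into the 0-degree bin.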

While this method can detect faces, it may not be as effective in detecting fine-grained details or complex structures such as scratches or brain tumors, limiting its use for such tasks.

At first glance, one might suggest incorporating more complex features that consider color and other parameters, and indeed, further research has explored such modifications. For instance, combining HOG with other feature descriptors like Histograms of Color or Haar-like features has shown promising results. Additionally, there exist effective methods that leverage partial features for object detection, such as combining multiple feature descriptors to find objects like a person or a face. Although these methods can be more intricate, they have demonstrated improved accuracy in certain scenarios.

Overall, the HOG method is an effective approach for detecting objects in images, particularly for tasks like face detection. By utilizing mathematical methods and gradient-based features, it achieves good results. Nevertheless, further research and modifications of the method can lead to improvements in its efficiency and accuracy.

Part-based Approaches

  • Deformable Part-based Model (2010)
  • Implicit Shape Model (2004)

The Deformable Part-based Model (DPBM), proposed by Felzenszwalb et al. in 2010, is an object detection method based on the concept of variable-shaped parts. The Implicit Shape Model (ISM), proposed by Leibe et al. in 2004, is an object detection method that represents the shape of an object as a set of local features and uses statistical methods to find the most likely areas of an object in an image. Both methods have been widely used in object detection tasks, helping to improve the accuracy and reliability of image processing algorithms.
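To illustrate the core idea of the DPBM, here is a hypothetical, simplified sketch of placing one deformable part: the part's filter response at each location is penalized by a quadratic deformation cost measuring how far it strays from its anchor offset relative to the root. (Real DPM implementations compute this efficiently for all positions at once with generalized distance transforms.)

```python
import numpy as np

def best_part_placement(part_scores, anchor, dx_weight=0.1, dy_weight=0.1):
    """Maximize (filter response - quadratic deformation cost) over all
    positions of one part. `anchor` is the part's ideal offset from the root;
    the weights control how stiffly the part is tied to that anchor."""
    ax, ay = anchor
    best_score, best_pos = -np.inf, None
    rows, cols = part_scores.shape
    for y in range(rows):
        for x in range(cols):
            cost = dx_weight * (x - ax) ** 2 + dy_weight * (y - ay) ** 2
            score = part_scores[y, x] - cost
            if score > best_score:
                best_score, best_pos = score, (x, y)
    return best_score, best_pos
```

With stiff deformation weights, a strong but distant response loses to a slightly weaker response near the anchor, which is exactly the trade-off that lets the model tolerate pose variation without matching parts arbitrarily far from where they belong.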

Figure 3. An example detection obtained with the Deformable Part Model | Source: Computer Vision for Autonomous Vehicles: Problems, Datasets and State of the Art

I’ll discuss Deep Learning-based Detection Methods and Zero-, One-, and Few-Shot Object Detection Methods in later sections.

Coming Up: Part 3 🔥

Thank you for taking the time to read this article. If you found it informative and engaging, feel free to connect with me through my social media channels.

If you have any questions or feedback, please feel free to leave a comment below or contact me directly via any of my communication channels.

I look forward to sharing more insights and knowledge with you in the future!


Andrii Polukhin

I am a deep learning enthusiast. Currently, I am an ML Engineer at Data Science UA and Samba TV, writing about neural networks and artificial intelligence.