[CV] 9. Object detection with Sliding Window and Feature Extraction (HoG)

jun94 · Published in jun-devpBlog · Dec 2, 2020 · 9 min read

1. Basic Idea of Sliding Window

Let’s say we want to detect cars in an image with the sliding window approach. What do we need to fulfill the task?

Figure 1. Given image for car detection, from [1, 2]

The basic component for the task is a binary classifier. The part of the image that overlaps with the window is cropped and fed into the car classifier, which determines whether a car is in it or not. The window is then slid to the next position, and this process is repeated until the window has covered the whole image. In short, the described process is a brute-force approach with many local decisions.

Another point to keep in mind is that the raw image window cannot be fed into the classifier directly. First, we need to extract features from the image window, and the classifier makes its decision based on those features. Therefore, the complete pipeline for object detection with the sliding-window method is as below, and the key component is how we extract the features.

Figure 2. Pipeline of object detection with sliding window, from [1, 2]
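To make the brute-force loop concrete, here is a minimal sketch in Python. It is not the author's exact implementation: `extract_features` is a placeholder for the feature extractor from Sections 2 and 3, `image` is assumed to be a NumPy array, and `classifier` is assumed to expose a scikit-learn-style `decision_function` (e.g., the SVM from Section 4).

```python
def sliding_window_detect(image, classifier, extract_features,
                          window_size=(64, 128), stride=8):
    """Brute-force sliding window: score every window with a binary classifier."""
    win_w, win_h = window_size
    detections = []
    for y in range(0, image.shape[0] - win_h + 1, stride):
        for x in range(0, image.shape[1] - win_w + 1, stride):
            patch = image[y:y + win_h, x:x + win_w]       # crop the part under the window
            features = extract_features(patch)            # e.g. HoG (Section 3)
            score = classifier.decision_function([features])[0]
            if score > 0:                                  # one local decision per window
                detections.append((x, y, win_w, win_h, score))
    return detections
```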

2. Feature Extraction

Features are values derived from an initial set of data (here, images) that are supposed to be informative, condensed, and non-redundant. There are numerous feature extraction methods and feature representations, but in this article, only the three simplest and most popular types of representation are addressed.

2.1 Pixel-based representations

The first approach is the pixel-based representation. It typically treats the image as a grid of pixels and flattens it into a vector to obtain the features. In other words, let’s say there is an image of size 100 × 100. By flattening all 100 × 100 pixels, we obtain a vector with 10,000 entries, and this vector is the feature representation of the given image.
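In NumPy, this flattening is a one-liner (a minimal sketch; the 100 × 100 grayscale image here is just a stand-in):

```python
import numpy as np

image = np.random.rand(100, 100)     # stand-in for a 100 x 100 grayscale image
pixel_features = image.flatten()     # row-major flattening into a single vector
print(pixel_features.shape)          # (10000,)
```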

The advantage of this representation is its simplicity. However, pixel-based representations have a crucial drawback: they are sensitive to even small shifts. This disadvantage can be observed in the following figure.

Figure 3. Images of Koala, from [1, 2]

In Fig 3, we have three images of the same koala, each shifted a little bit. Once we build the vector representations of (1), (2), and (3), we obtain three significantly different vectors even though they are constructed from almost the same image. This behavior of pixel-based representations makes it hard for a classifier to classify correctly.

2.2 Color-based representations

Another type of representation is the color-based representation. As the name implies, the color features are extracted as a histogram in RGB space and used to build a vector representation. This process is illustrated in Fig 4.

Figure 4. Visualization of extracting color features from input image, from here

The above figure also includes shape feature extraction, but let us focus only on the color feature extraction for the moment.
One might already notice that color-based representations suffer from a similar problem as pixel-based representations. Let’s take a look at an example.

We have two images, (1) and (2). They look alike since they show the same object, a koala, and therefore the extracted feature representations of (1) and (2) are expected to be similar. However, here comes the problem: since the color of the koala in (1) differs from that in (2), the corresponding color-based vector representations are largely different from each other. In general, we can say that color-based representations are sensitive to color (illumination).
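For concreteness, a color-based representation can be sketched as per-channel histograms over RGB (a minimal sketch; the choice of 8 bins per channel is an assumption, not taken from the figure):

```python
import numpy as np

def color_histogram(image_rgb, bins=8):
    """Concatenate per-channel intensity histograms into one feature vector."""
    hists = []
    for c in range(3):                                            # R, G, B channels
        h, _ = np.histogram(image_rgb[..., c], bins=bins, range=(0, 256))
        hists.append(h / h.sum())                                 # normalize each histogram
    return np.concatenate(hists)                                  # 3 * bins entries

# Two koala images with different fur colors yield clearly different histograms,
# which is exactly the sensitivity to color (illumination) described above.
```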

Therefore, we need to find some representations that can generally be applied to a certain object regardless of color (i.e., intra-class appearance variation) and shift.

The solution to this is the Gradient-based representations.

2.3 Gradient-based representations

Gradient-based representations are vector representations composed of edge, contour, and intensity gradient information. Gradient-based approaches work well because they summarize the local distribution of gradients with histograms, where the term ‘local’ refers to the equally-divided image patches (the regions divided by green lines in Figure 5).

Figure 5. Illustration of local distributions of gradients, from [1, 2]

The advantages of gradient-based representations are as follows.

  • Locally orderless: invariant to small shifts and rotations
  • Localized histograms from local distributions (the distributions from each green grid cell in Fig 5) offer more spatial information than a single global histogram (the distribution over the complete image)
  • Includes contrast normalization: reduces the impact of variable illumination (color)

Some terms like ‘histogram of gradients from local distributions’ might be confusing. I will elaborate on them in the following section.

Note that only the simplest pixel- and color-based representations are addressed here, in order to emphasize their drawbacks compared to gradient-based representations. In practice, they are more sophisticated than what is described above.

3. Histograms of Oriented Gradients (HoG)

The ‘Histograms of Oriented Gradients (HoG)’ is one of the representative gradient-based feature descriptors and is commonly used to extract strong features for object detection.

In order to prevent one from getting lost, let’s start with a bird’s-eye view of HoG. Figure 6 shows the complete steps of how HoG works to extract local features from a given image patch, which is a part of the image overlapped with the sliding window.

Figure 6. Complete process of HoG, adapted from [8]

3.1 Brief HoG descriptor processing chain

(1) By cropping, we obtain an image patch.

(2) Compute the gradients for all pixels in the patch.

(3) Divide the image patch into cells, where each cell consists of 8 ⨯ 8 pixels (local regions in the image patch).

(4) Make histograms using the computed gradients (magnitude, orientation) for each cell.

(5) Perform L2 normalization with neighboring histograms.

(6) Concatenate all histograms into one vector; this is the vector representation produced by HoG for the given image patch.
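Assuming scikit-image is available, this whole chain corresponds roughly to a single call to `skimage.feature.hog`; the 64 ⨯ 128 patch size below is the classic pedestrian-detection choice and is only an example:

```python
import numpy as np
from skimage.feature import hog

patch = np.random.rand(128, 64)        # stand-in for a 128 x 64 grayscale image patch

features = hog(
    patch,                             # gradients of step (2) are computed internally
    orientations=9,                    # step (4): 9 orientation bins (0, 20, ..., 160 degrees)
    pixels_per_cell=(8, 8),            # step (3): 8 x 8 cells
    cells_per_block=(2, 2),            # step (5): blocks of 2 x 2 cells (16 x 16 pixels)
    block_norm='L2',                   # step (5): L2 normalization per block
    transform_sqrt=True,               # gamma (square-root) compression, see Step 2 in 3.2
    feature_vector=True,               # step (6): concatenate into one vector
)
print(features.shape)                  # (3780,) = 7 * 15 blocks * 36 values per block
```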

3.2 More details on the step-by-step HoG descriptor

Note that the figures below are adapted from [1, 8, 9].

step 1. Image window is given to HoG

Step 1. From a given image, HoG descriptor gets an image patch (window).

step 2. preprocess the image patch with Gamma compression

Step 2. To reduce the effect of overly strong gradients caused by uneven lighting (think of the area inside and outside a shadow: the gradients around the boundary will have too strong magnitudes compared to other areas), HoG performs gamma compression. This preprocessing is done simply by replacing each pixel intensity with its square root.
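In NumPy this gamma compression is a one-liner (a minimal sketch on a stand-in patch):

```python
import numpy as np

patch = np.random.randint(0, 256, size=(128, 64)).astype(np.float64)  # stand-in image patch
compressed = np.sqrt(patch)            # replace every pixel intensity with its square root
```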

step 3. compute the image patch gradients (magnitudes and orientations for each pixel in the patch)

Step 3. Subsequently, the x- and y-derivatives of the image patch are computed, and the corresponding per-pixel magnitudes and orientations are stored.

Note that, unlike in other cases, no Gaussian smoothing is applied here; simple finite-difference filters already show decent performance.
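A minimal sketch of this gradient step with simple [-1, 0, 1] finite-difference filters (no Gaussian smoothing), producing per-pixel magnitudes and unsigned orientations:

```python
import numpy as np

def compute_gradients(patch):
    """Per-pixel gradient magnitude and unsigned orientation (0-180 degrees)."""
    patch = patch.astype(np.float64)
    gx = np.zeros_like(patch)
    gy = np.zeros_like(patch)
    gx[:, 1:-1] = patch[:, 2:] - patch[:, :-2]           # [-1, 0, 1] filter in x
    gy[1:-1, :] = patch[2:, :] - patch[:-2, :]           # [-1, 0, 1] filter in y
    magnitude = np.hypot(gx, gy)
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180   # 'unsigned' gradient directions
    return magnitude, orientation
```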

step 4. subdivision of an image patch into cells, and weighted voting

Step 4. Afterward, the given image patch is divided into cells of size 8 ⨯ 8. As a result, the number of cells in the patch is (width / 8) ⨯ (height / 8), and each cell contains 8 ⨯ 8 = 64 pixels. For example, a 64 ⨯ 128 patch yields 8 ⨯ 16 = 128 cells.

The next step is to create a local histogram of gradients for each 8 × 8 cell.

The histogram contains 9 bins corresponding to the angles 0, 20, 40, …, 160 degrees. Note that a histogram will later be represented as a 1 ⨯ 9 vector.

The figure below illustrates how the histogram for one cell in the image patch is filled using the gradient orientations (directions represented as unsigned angles between 0 and 180 degrees) and magnitudes.

step 4. Illustration of weighted voting (1)

In the gradient direction and magnitude maps, there are 64 values each, as the maps are computed from an 8 ⨯ 8 cell.

To fill the bins, each pixel in the maps is considered. HoG first looks at the pixel’s value in the gradient direction map and selects a bin based on that direction. Then, the corresponding gradient magnitude value is used for voting (filling the bins).

For example, in the above figure, the gradient direction and magnitude values of the first pixel in the current cell are marked in blue. First, the gradient direction (angle) of the pixel, which is 80 degrees, is used to find the appropriate bin (the 5th bin). Then the corresponding magnitude value of 2 goes into the 5th bin.

The same applies to the direction and magnitude values of all other pixels, but keep in mind that this process uses weighted voting. Let’s see this with an example. Take a look at the red circle this time. The gradient angle is 10 degrees, which is exactly halfway between 0 and 20 degrees. This means the red pixel contributes evenly to the 1st and 2nd bins, so the corresponding magnitude of 4 is split evenly and each bin gets two votes.

Another example of weighted voting is as follows.

step 4. Illustration of weighted voting (2)

This time, the given direction (angle) is 165 degrees. It lies between 160 and 180 degrees, but closer to 160. Since the number of votes a bin receives is proportional to how close the given angle is to the bin’s angle, the bin for 160 degrees gets more votes than the one for 0 degrees.

step 4. complete histogram after weighted voting for all pixels in the image patch

Note that the angle 0 also corresponds to 180 degrees, so the histogram wraps around.
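The weighted vote for a single 8 ⨯ 8 cell can be sketched as follows (a minimal sketch; angles are unsigned, bins are 20 degrees wide, and the last bin wraps around to bin 0 because 180 degrees equals 0 degrees):

```python
import numpy as np

def cell_histogram(magnitude, orientation, n_bins=9, bin_width=20):
    """Weighted (bilinear) voting of an 8 x 8 cell's gradients into 9 orientation bins."""
    hist = np.zeros(n_bins)
    for mag, ang in zip(magnitude.ravel(), orientation.ravel()):
        position = ang / bin_width                # e.g. 80 deg -> 4.0, 10 deg -> 0.5
        lower = int(np.floor(position)) % n_bins
        upper = (lower + 1) % n_bins              # wrap-around: 165 deg votes into bins 8 and 0
        frac = position - np.floor(position)      # how far the angle is from the lower bin
        hist[lower] += mag * (1.0 - frac)         # the closer bin receives the larger share
        hist[upper] += mag * frac                 # 10 deg with magnitude 4 -> 2 votes each
    return hist
```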

step 5. 16 ⨯ 16 block normalization

Step 5. Image gradients are sensitive to lighting, and this can cause unexpected behavior. To make the gradient-based feature representation robust to lighting variations, normalization is performed to suppress the effect of lighting and to remove the scale, e.g., a color value range of [0, 255] becomes [0, 1] after normalization.

But before performing the normalization, note that each cell updates its 9 ⨯ 1 histogram by concatenating it with the histograms from its 3 neighboring cells, so the histogram for each cell becomes 36 ⨯ 1 (combining four 9 ⨯ 1 histograms into one 36 ⨯ 1 histogram). For example, look at the right panel of the above figure, where each green grid cell indicates an 8 ⨯ 8 cell. The 9 ⨯ 1 histogram (after weighted voting) of the top-right cell in the blue box becomes a 36 ⨯ 1 histogram by concatenating the histograms from the other 3 cells in the blue box. After repeating this process for all cells in the image patch, every histogram has size 36 ⨯ 1. Then, the normalization is performed.

Step 6. As the last step of the HoG descriptor, all 36 ⨯ 1 histograms (vectors) are concatenated; the result is the final feature vector for the given image patch.

For instance, a 64 ⨯ 128 image patch has 8 ⨯ 16 cells and therefore 7 ⨯ 15 = 105 such 36 ⨯ 1 histograms (cells on the right and bottom borders have no complete set of neighbors), so the feature vector will be a 36 * 105 = 3780-dimensional vector.
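Continuing the sketch from step 4, block normalization and the final concatenation could look like this (`cell_histograms` is assumed to be an array of the 9-bin cell histograms arranged on the cell grid):

```python
import numpy as np

def hog_feature_vector(cell_histograms, eps=1e-6):
    """L2-normalize every 2 x 2 block of cell histograms and concatenate the results."""
    n_cells_y, n_cells_x, _ = cell_histograms.shape   # e.g. (16, 8, 9) for a 64 x 128 patch
    blocks = []
    for y in range(n_cells_y - 1):                    # neighboring blocks overlap by one cell
        for x in range(n_cells_x - 1):
            block = cell_histograms[y:y + 2, x:x + 2].ravel()              # 4 histograms -> 36 values
            blocks.append(block / np.sqrt(np.sum(block ** 2) + eps ** 2))  # L2 normalization
    return np.concatenate(blocks)                     # e.g. 7 * 15 * 36 = 3780 values
```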

4. Example of HoG descriptor for object detection

So far, we have learned how to extract features with HoG for a given image patch in order to determine whether an object exists in the patch or not. The last component of the object detection pipeline is the ‘Classifier.’

The common choice of classifier for conventional object detection (without deep learning) is the SVM (support vector machine).

With HoG and SVM combined, we can get the object detection results as below.

Figure 7. Object detection with HoG and SVM. Left panel: illustration of the histograms for each cell from HoG; center panel: score map obtained by passing the extracted features to the SVM; right panel: visualization of the detected object with a bounding box and confidence score.
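A minimal end-to-end sketch of that combination, assuming scikit-image and scikit-learn are available; the training patches and labels below are placeholders for a real dataset of positive and negative examples:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def extract_hog(patch):
    """HoG features of a 128 x 64 patch, as described in Section 3."""
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2', transform_sqrt=True)

# Train a linear SVM on HoG features of labeled patches (stand-in data here).
train_patches = [np.random.rand(128, 64) for _ in range(20)]   # placeholder training patches
train_labels = [1] * 10 + [0] * 10                             # 1 = car, 0 = background
svm = LinearSVC().fit([extract_hog(p) for p in train_patches], train_labels)

# At detection time, every sliding window (Section 1) is scored by the SVM;
# windows with a high decision value become detections with a confidence score.
window = np.random.rand(128, 64)                               # one window of the test image
confidence = svm.decision_function([extract_hog(window)])[0]
```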

The detail of SVM for object detection will be addressed in a later article.

5. Reference

[1] RWTH Aachen, computer vision group

[2] CS 376: Computer Vision of Prof. Kristen Grauman

[3] S. Seitz
[4] Forsyth & Ponce

[5] Prof. Svetlana Lazebnik

[6] Prof. M. Irani and R. Basri

[7] Prof. Li Fei-Fei

[8] openCV HoG

[9] Navneet Dalal

Any corrections, suggestions, and comments are welcome.
