# Object Detection From Scratch (Part 1)

This year, with the pandemic influencing our lives more and more, I found myself taking on numerous projects and self-improvement tasks. As more than one of them involved object detection, I realised that I needed a detailed theoretical background while still working on some code-based project parts. I wanted to start a little in the past and describe how object detection has evolved over time. Throughout these posts I will try to explain Gradient Vectors, HOG (Histogram of Oriented Gradients), CNN (Convolutional Neural Network), DPM (Deformable Parts Model), Overfeat, YOLO (You Only Look Once) and R-CNN (Regional CNN) in my own words.

First of all, in this post I will introduce gradient vectors for images and then explain the Histogram of Oriented Gradients (HOG), in which gradient vectors are used. Let's start with gradient vectors.

The gradient vector is a primary concept in image processing, and its aim is to capture an image's gradient. That is, it is used to analyse the changes throughout the image: the direction and magnitude of those changes, and the general tendencies. Its formula is:

$$
\nabla f(x, y) = \begin{bmatrix} g_x \\ g_y \end{bmatrix} \approx \begin{bmatrix} f(x+1,\, y) - f(x-1,\, y) \\ f(x,\, y+1) - f(x,\, y-1) \end{bmatrix}
$$

If we study this formula further, we skip the central pixel that we are working on and take the difference of the pixel values around it. By doing this for every pixel in an image, we analyse the magnitude and direction of the changes and reach a general inference about the photo. Applying this formula gives us a vector (namely the gradient vector), and its information is used in two ways: magnitude and direction.
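As a quick sketch of this per-pixel difference (a minimal NumPy example; the image array and the pixel coordinates are made up for illustration):

```python
import numpy as np

# A small grayscale image (values 0-255); made up for illustration.
img = np.array([
    [10, 10, 10, 10],
    [10, 50, 90, 10],
    [10, 60, 120, 10],
    [10, 10, 10, 10],
], dtype=float)

def gradient_at(image, x, y):
    """Central-difference gradient vector at pixel (x, y).

    The centre pixel itself is skipped (multiplied by 0);
    only its left/right and top/bottom neighbours contribute.
    """
    gx = image[y, x + 1] - image[y, x - 1]  # horizontal change
    gy = image[y + 1, x] - image[y - 1, x]  # vertical change
    return gx, gy

gx, gy = gradient_at(img, 1, 1)
print(gx, gy)  # 80.0 50.0
```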

For the magnitude and the direction we use formulas similar to the ones we have seen in mathematics:

$$
g = \sqrt{g_x^2 + g_y^2}, \qquad \theta = \arctan\left(\frac{g_y}{g_x}\right)
$$
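In code, these two quantities might be computed like this (a NumPy sketch with arbitrary gradient values):

```python
import numpy as np

# Example gradient components (arbitrary values for illustration).
gx, gy = 3.0, 4.0

magnitude = np.hypot(gx, gy)                 # sqrt(gx^2 + gy^2)
direction = np.degrees(np.arctan2(gy, gx))   # angle in degrees

print(magnitude)  # 5.0
print(direction)  # ~53.13
```

`np.arctan2` is used instead of a plain `arctan` so that the sign of both components is taken into account and the full range of angles is covered.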

Our gradient is 2-dimensional; the x-gradient and the y-gradient correspond to the horizontal and vertical filters that we use in computer vision. If you have already studied deep learning, these are the filters we talk about for a Conv2D layer, used to capture a picture's overall change in the x-direction and y-direction. The usage of these filters is also an introduction to the mathematical convolution operation (*) within the computer vision community. In early times, only the two adjacent pixels were used to calculate the gradient (just as I explained above): [-1, 0, 1] was used to calculate the x-gradient (the middle pixel is neglected as it is multiplied by 0, and the other coefficients compute the difference), and the transpose of [-1, 0, 1] was used to calculate the y-gradient.

Here we have mentioned two new concepts: filter and convolution. In image processing, a filter is an n×n matrix used to detect a certain feature in an image; the convolution operation multiplies each pixel by the corresponding value in the filter and sums the results as the filter slides over the image.
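A minimal sketch of this sliding multiply-and-sum, using the [-1, 0, 1] horizontal filter mentioned above (the tiny image is made up; real implementations also handle padding and border modes):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image`, multiply element-wise and sum.

    Note: this is the cross-correlation form commonly used in
    computer vision; a strict mathematical convolution would flip
    the kernel first.
    """
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# The 1x3 horizontal filter [-1, 0, 1] on a tiny example image.
img = np.array([[1., 2., 4.],
                [1., 3., 9.]])
kernel = np.array([[-1., 0., 1.]])
print(convolve2d(img, kernel))  # [[3.], [8.]]
```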

In the beginning, [−1, 0, 1] and its transpose were used as the first filters. Over time, the Prewitt and Sobel operators came into common use to obtain better and more detailed information (using 8 adjacent pixels instead of 4):
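As a small sketch of how these operators behave, here the Sobel kernels are applied to a made-up image with a sharp vertical edge (the Prewitt kernel is defined only for comparison):

```python
import numpy as np

# Prewitt and Sobel x-direction kernels (y-kernels are their transposes).
prewitt_x = np.array([[-1, 0, 1],
                      [-1, 0, 1],
                      [-1, 0, 1]], dtype=float)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

def apply_filter(image, kernel):
    """Valid-mode filtering: slide the 3x3 kernel over the image."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)
    return out

# A vertical edge: dark left half, bright right half (made-up image).
img = np.zeros((5, 5))
img[:, 3:] = 255.0

gx = apply_filter(img, sobel_x)  # strong response where the edge is crossed
gy = apply_filter(img, sobel_y)  # no vertical change, so all zeros
print(gx)
print(gy)
```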

To summarise the information above, I want to go through an example. We will analyse a white circle on a black background and the images we obtain by filtering the photo in the x- and y-directions. By combining them, we will obtain the gradient vector.

In this picture there are only two colours, and when we filter this image we should observe the colour transitions between them and make out the circle shape. When we use a horizontal filter we get an image like this one:

In this image the circle shape is easily observable; we also see that the left side of the circle is white whereas the right side is black. From this we should understand that the colour transition from the background into the circle is a transition towards white, and the transition from the circle back to the background is a transition towards black. So the region from the white edge to the black edge is white, and it is black afterwards. We can also say that the transitions are observed along the x-axis, since this is a horizontal filter, so this axis's transitions show up best in this filtered image. If we repeat the same process with a vertical filter:

Again, we observe the colour transitions, this time along the y-axis. Now, if we combine these two images (that is, combine the x-gradient and the y-gradient), we reach the image gradient:

In grayscale pictures, pixels take values between 0 and 255: 0 for black and 255 for white. In this image, the magnitude of the gradient vector can be observed, and it indicates the magnitude of the change: where the change is large, the pixel value is at its maximum, so it is white; otherwise it is black. In the individually filtered images, by contrast, an unchanging region is pictured as gray, so anything other than gray indicates a change.
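As a sketch of that magnitude-to-brightness mapping (assuming a simple min-max style scaling; real tools may scale differently):

```python
import numpy as np

# Made-up gradient magnitudes for a 2x2 patch.
magnitude = np.array([[0.0, 10.0],
                      [40.0, 80.0]])

# Scale so the largest change maps to white (255) and no change to black (0).
scaled = (255.0 * magnitude / magnitude.max()).astype(np.uint8)
print(scaled)  # 0 stays black, 80 (the largest change) becomes 255
```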

This brings us to HOG, often cited as one of the oldest methods for object detection with traditional computer vision techniques. It uses the image gradient vectors that we discussed above to reach generalised information about the image's tendencies.

For this, our steps are :

1. Resizing the image so that we can divide it into 8×8 cells
2. Calculating the image gradient for every pixel in every cell
3. Creating a histogram of the gradient directions in each 8×8 cell
4. (Optional) Normalizing the image to make things easier
5. Visualizing this histogram for every cell
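The steps above can be sketched roughly like this in NumPy (a simplified version: nearest-bin voting, unsigned angles in 0-180°, and no block normalization):

```python
import numpy as np

def hog_cells(image, cell=8, bins=9):
    """Per-cell orientation histograms (simplified HOG, no block norm)."""
    # Step 2: per-pixel gradients via central differences (borders stay 0).
    gx = np.zeros_like(image)
    gy = np.zeros_like(image)
    gx[:, 1:-1] = image[:, 2:] - image[:, :-2]
    gy[1:-1, :] = image[2:, :] - image[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned angles [0, 180)

    # Step 3: one `bins`-bin histogram per cell, weighted by magnitude.
    h, w = image.shape
    bin_width = 180.0 / bins  # 20 degrees per bin
    hist = np.zeros((h // cell, w // cell, bins))
    for i in range(h // cell):
        for j in range(w // cell):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            idx = np.round(a / bin_width).astype(int) % bins  # nearest bin
            np.add.at(hist[i, j], idx, m)
    return hist

# Toy 16x16 "image" (step 1 assumes the size is already a multiple of 8).
rng = np.random.default_rng(0)
img = rng.random((16, 16)) * 255.0
feat = hog_cells(img)
print(feat.shape)  # (2, 2, 9): a 9-bin histogram for each 8x8 cell
```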

For an example, I will use the image from learnopencv.com :

After the preprocessing, let's observe its division into 8×8 cells:

For each cell, we calculate the gradient vector at every pixel. After this step we will only use the direction matrix, but for the sake of visualization:

Later, we build a histogram over the degree values 0, 20, 40, …, 160 (nine bins; 180° wraps around to 0°). The intermediate values are added to the closest bin:
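A tiny sketch of that binning rule (nearest-bin voting; note that many HOG implementations instead split the vote proportionally between the two neighbouring bins):

```python
import numpy as np

bins = np.arange(0, 180, 20)  # bin centres: 0, 20, ..., 160

def vote(angle_deg, magnitude, hist):
    """Add `magnitude` to the bin whose centre is closest to the angle."""
    idx = int(np.round((angle_deg % 180) / 20)) % len(bins)
    hist[idx] += magnitude

hist = np.zeros(len(bins))
vote(25.0, 10.0, hist)   # 25 deg is closer to 20 -> bin 1
vote(165.0, 5.0, hist)   # 165 deg is closer to 160 -> bin 8
print(hist)
```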

In this table 0 degrees is represented by bin 0, 20 degrees by bin 1, and so on. After this step, we combine this information into one big multi-dimensional vector and visualize it on the image itself:

We can observe similar patterns in similar parts of the image. That is, the background has a horizontal tendency whereas the middle of the image is more vertical. The changes in the directions help us define where an object is detected and where its boundaries lie. In this way, we have defined an object detection algorithm without any learned component (such as deep learning), working only with traditional image processing methods.

As a final note, if we feed these multi-dimensional vectors to a Support Vector Machine, we generally observe that it gives good results.
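As a sketch of that final step, here is a toy example with scikit-learn's `LinearSVC`; the feature vectors are random stand-ins for real HOG descriptors:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-in "HOG descriptors": random vectors for two made-up classes,
# shifted apart so they are linearly separable (real descriptors would
# come from the histogram step described above).
rng = np.random.default_rng(42)
pos = rng.random((50, 36)) + 1.0   # e.g. 2x2 cells x 9 bins = 36 features
neg = rng.random((50, 36))
X = np.vstack([pos, neg])
y = np.array([1] * 50 + [0] * 50)

clf = LinearSVC().fit(X, y)
print(clf.score(X, y))  # should separate the toy classes easily
```

In the classic HOG + SVM detector, a classifier like this is slid over the image at multiple scales and each window's descriptor is scored.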

References and images: