Introduction to object detection

Ornela Megne
May 9, 2022

In machine learning, computers are able to “identify” objects thanks to computer vision. But have you ever asked yourself how computers identify and locate objects in a given scene? In this post we aim to answer this question. Here we go…!😎

What is computer vision?

Computer vision (CV) is a field of AI that tries to get computers to extract information from images and videos. CV covers many different tasks, such as image classification, semantic segmentation, object detection, and instance segmentation. Here we will introduce the object detection task for a single object in a given image.

Object Detection

In general, object detection tries to answer the question:

“What” objects are in an image and “where” are they?

  • “What” means classification: assigning a label to each object in a given image;
  • “Where” means localization: finding the position of each object in the image, expressed as a bounding box.

Figure 1: Made in canvas

So for a given input (image or video), an object detector aims to predict both the bounding box and the label of each object.

For example, suppose we want to detect a single object in a given image, as shown in Figure 2.

Figure 2: Detecting a single object

  • To extract features, the input image passes through a convolutional neural network (CNN). This feature extractor can be a pretrained model such as AlexNet or VGGNet.
  • The output of the CNN feeds into a fully connected layer to produce the final feature vector. From here the task splits into two parts. The first is a classifier (e.g. a linear layer with softmax, an SVM, or a random forest) that assigns the vector to one of the known classes. The second is a regressor that predicts the coordinates of the bounding box associated with the classified object. The output of the regressor is the X and Y coordinates, the width (W), and the height (H) of the located object in the image.
  • Object detection therefore requires a multi-task loss that covers both classification and localization.
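The two-branch design described above can be sketched in plain Python. This is only an illustration under stated assumptions: the feature vector is random rather than produced by a real CNN, the weights are untrained, and the names `init_layer`, `detect_single_object`, `FEATURE_DIM`, and `NUM_CLASSES` are hypothetical and not taken from any library.

```python
import math
import random

def linear(x, weights, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_out, n_in, rng):
    """Random weights and zero biases (stand-ins for trained parameters)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def detect_single_object(features, cls_head, box_head):
    """Run both heads on one feature vector.

    Returns (class_probabilities, [x, y, w, h]).
    """
    class_probs = softmax(linear(features, *cls_head))  # "what"
    box = linear(features, *box_head)                   # "where"
    return class_probs, box

rng = random.Random(0)
FEATURE_DIM, NUM_CLASSES = 8, 3               # hypothetical sizes
cls_head = init_layer(NUM_CLASSES, FEATURE_DIM, rng)
box_head = init_layer(4, FEATURE_DIM, rng)    # 4 outputs: x, y, w, h

# Stand-in for the vector a CNN plus fully connected layer would produce.
features = [rng.random() for _ in range(FEATURE_DIM)]
class_probs, box = detect_single_object(features, cls_head, box_head)
predicted_class = class_probs.index(max(class_probs))
```

Both heads read the same feature vector, which is why a single backbone can serve classification and localization at once.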

a. Classification loss

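For the classification task we use the cross-entropy loss. For a single example with C classes it is L_cls = -Σ_c y_c · log(p_c), summed over c = 1…C, where y_c is 1 for the true class and 0 otherwise, and p_c is the predicted probability of class c.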
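As a small illustration, here is a sketch of the cross-entropy loss for one prediction, assuming a one-hot target so that only the true-class term remains. The function name and the `eps` clamp (which avoids taking the log of zero) are choices made for this example.

```python
import math

def cross_entropy(probs, true_class, eps=1e-12):
    """Cross-entropy for one example with a one-hot target.

    Only the true class contributes, so the loss is -log(p[true_class]).
    """
    return -math.log(max(probs[true_class], eps))

# The true class is 0 in both cases.
confident_right = cross_entropy([0.9, 0.05, 0.05], true_class=0)
confident_wrong = cross_entropy([0.05, 0.9, 0.05], true_class=0)
```

The loss is small when the model puts high probability on the true class and grows quickly when it is confidently wrong.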

It is also generally useful to look at the confusion matrix to understand how the model behaves and which classes it predicts best. The same information can be used to compute metrics such as accuracy, precision, and recall.
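To make this concrete, the sketch below builds a confusion matrix from made-up labels and derives the three metrics from it. It assumes rows are true classes and columns are predicted classes. The labels and function names are invented for the example.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts examples of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def accuracy(matrix):
    """Fraction of all examples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0

def precision(matrix, cls):
    """Of everything predicted as cls, the fraction that really is cls."""
    predicted = sum(row[cls] for row in matrix)
    return matrix[cls][cls] / predicted if predicted else 0.0

def recall(matrix, cls):
    """Of everything that really is cls, the fraction predicted as cls."""
    actual = sum(matrix[cls])
    return matrix[cls][cls] / actual if actual else 0.0

y_true = [0, 0, 1, 1, 2, 2]   # made-up ground-truth labels
y_pred = [0, 1, 1, 1, 2, 0]   # made-up predictions
cm = confusion_matrix(y_true, y_pred, num_classes=3)
```

Here class 1 has a recall of 1.0 (both true examples were found) but a precision of 2/3, because one example of class 0 was also predicted as class 1.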

b. Regression loss

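To measure how far the predicted bounding box is from its ground truth, we can use the L1 loss (mean absolute error) or the L2 loss (mean squared error, MSE) as our loss function:

L1 = (1/n) Σ_i |y_i - y_hat_i|
L2 = (1/n) Σ_i (y_i - y_hat_i)^2

where y_hat is the predicted coordinate of the bounding box, y is the ground truth coordinate, and n is the number of coordinates (here 4: X, Y, W, H).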
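A minimal sketch of both regression losses on a single box, assuming the box is given as four numbers (X, Y, W, H) and the loss is averaged over them. The box values are made up for illustration.

```python
def l1_loss(pred, target):
    """Mean absolute error over the box coordinates."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def mse_loss(pred, target):
    """Mean squared error over the box coordinates."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

y = [50.0, 40.0, 100.0, 80.0]       # ground truth (X, Y, W, H)
y_hat = [54.0, 38.0, 110.0, 80.0]   # prediction   (X, Y, W, H)

l1 = l1_loss(y_hat, y)    # (4 + 2 + 10 + 0) / 4 = 4.0
mse = mse_loss(y_hat, y)  # (16 + 4 + 100 + 0) / 4 = 30.0
```

Because the errors are squared, MSE penalizes the large width error (10 pixels) far more than L1 does.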

Another widely used regression loss in object detection models is based on Intersection over Union (IoU) [1], also known as the Jaccard index. IoU measures the similarity between two arbitrary boxes: by definition, it is the area of overlap of the two bounding boxes divided by the area of their union. It ranges from 0 (no overlap) to 1 (identical boxes).

Figure 4: IoU
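The IoU computation can be sketched as follows. This assumes each box is given as (x, y, w, h) with (x, y) as the top-left corner, which is one common convention but not the only one. Turning IoU into a loss as 1 - IoU is likewise just one simple choice; the UnitBox paper [1] uses -ln(IoU) instead.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) boxes.

    Assumes (x, y) is the top-left corner and w, h are non-negative.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (0 if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0

def iou_loss(pred_box, true_box):
    """One simple IoU-based loss: 0 for a perfect match, 1 for no overlap."""
    return 1.0 - iou(pred_box, true_box)

overlap = iou((0, 0, 10, 10), (5, 5, 10, 10))     # 25 / 175 = 1/7
identical = iou((0, 0, 10, 10), (0, 0, 10, 10))   # same box
disjoint = iou((0, 0, 10, 10), (20, 20, 5, 5))    # no overlap
```

Unlike L1 or L2, IoU treats the box as a whole, so it does not depend on the scale of the coordinates.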

The final loss of the detector is the weighted sum of the classification loss and the regression loss. The weights control the contribution of each term: one loss may have a much larger magnitude than the other, and weighting keeps the larger one from dominating the overall loss.
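A minimal sketch of the weighted sum, with made-up loss values. The weight names `cls_weight` and `reg_weight` and the value 0.01 are assumptions for this example. In practice the weights are hyperparameters to tune.

```python
def detection_loss(cls_loss, reg_loss, cls_weight=1.0, reg_weight=1.0):
    """Weighted sum of the classification and regression losses."""
    return cls_weight * cls_loss + reg_weight * reg_loss

cls_loss = 0.7    # e.g. a cross-entropy value
reg_loss = 30.0   # e.g. an MSE value in squared pixels

unweighted = detection_loss(cls_loss, reg_loss)
balanced = detection_loss(cls_loss, reg_loss, reg_weight=0.01)
```

Without weighting, the regression term makes up almost all of the total. With `reg_weight=0.01` the two terms contribute on a similar scale.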

Images can have more than one object! In this case how do we identify and locate multiple objects in a single image?

Several approaches have been used so far, including sliding windows, R-CNN (using selective search), Fast R-CNN, and Faster R-CNN.

Conclusion

In this post we gave a brief description of the architecture of an object detection model and how it works for a single object in an input image. In our next post we will briefly outline how Mask R-CNN works and showcase its application using Detectron2.

Thanks for reading, and hope it was insightful 😉.

References:

[1] Yu, J., Jiang, Y., Wang, Z., Cao, Z. and Huang, T., 2016, October. UnitBox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia (pp. 516–520).
