Object Detection and Facial Recognition using OpenCV and TensorFlow

aishwarya murkute
Apr 25, 2020


There are various approaches to object detection. In this project, we use Google’s TensorFlow, an open-source machine learning framework, to detect the objects in an image and then classify and recognize them. It is one of the most popular frameworks used for identifying objects in the real world. Here, TensorFlow is used with Faster R-CNN, a detector built on convolutional neural networks (CNNs), to identify the objects in an image.

A unified model for object detection is used to identify the objects in an image. The model is simple to construct and is trained on a loss function that directly corresponds to detection performance, with the entire model trained jointly.

TensorFlow is an open-source machine learning library developed by researchers and engineers in Google’s Machine Intelligence research organization. It can run on multiple computers to distribute training workloads.

The TensorFlow Object Detection API is an open-source framework built on top of TensorFlow that makes it easy to construct, train, and deploy object detection models.

We used the OpenCV computer vision library to identify and detect objects in real time.

R-CNN relies on external region proposal methods such as Selective Search or Edge Boxes. Faster R-CNN replaces selective search with a very small convolutional network, called the Region Proposal Network (RPN), to generate regions of interest.
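
To make region proposals more concrete, here is a minimal sketch of how the candidate boxes (anchors) that a Region Proposal Network scores could be laid out at a single image location. The three scales and three aspect ratios are the defaults reported in the Faster R-CNN paper; the function name and the corner-based box format are assumptions made for illustration, and this is not the code used in the project.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchor boxes (xmin, ymin, xmax, ymax) centered at (cx, cy).

    Each anchor has area scale**2 and a height/width ratio equal to `ratio`.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((cx - width / 2, cy - height / 2,
                            cx + width / 2, cy + height / 2))
    return anchors


# Nine candidate boxes (3 scales x 3 ratios) around one location.
anchors = make_anchors(300, 200)
print(len(anchors))
```

The network then scores each anchor as object or background and refines its coordinates; only the best-scoring boxes are passed on as regions of interest.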

Image Classification using CNN

Instead of predicting only the class of an object in an image, we now have to predict the class as well as a rectangle (called a bounding box) containing that object. It takes four variables to uniquely identify a rectangle. So, for each instance of an object in the image, we predict the following variables:

Label 1: class_name

Variable 1: bounding_box_top_left_x_coordinate

Variable 2: bounding_box_top_left_y_coordinate

Variable 3: bounding_box_width

Variable 4: bounding_box_height

All four coordinates are associated with a class label so that the classifier can distinguish the different objects in the region of interest (ROI). In this project, 36 different objects are used, hence there are 36 different labeled classes. The labeling of the classes using ImageLabel is shown in figure 3, where a bounding box is drawn around the ROI. As seen in the figure, a green bounding box is drawn around the face in order to extract the ROI, i.e. the pixels lying within that area.
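
As a small illustration of the label and the four box variables, the sketch below stores one annotation and uses it to cut the ROI out of an image. The class name Annotation, its methods, and the tiny 4x4 "image" of numbers are assumptions made for this example, not part of the project code.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    class_name: str
    x: int       # bounding_box_top_left_x_coordinate
    y: int       # bounding_box_top_left_y_coordinate
    width: int   # bounding_box_width
    height: int  # bounding_box_height

    def corners(self):
        """Return (xmin, ymin, xmax, ymax), the corner form of the same box."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)

    def crop(self, image_rows):
        """Extract the ROI pixels from an image given as a list of rows."""
        xmin, ymin, xmax, ymax = self.corners()
        return [row[xmin:xmax] for row in image_rows[ymin:ymax]]


# A made-up 4x4 image whose pixel values are just 0..15.
image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
face = Annotation("face", x=1, y=1, width=2, height=2)
print(face.crop(image))  # [[5, 6], [9, 10]]
```

The corner form (xmin, ymin, xmax, ymax) is shown because many labeling tools and training pipelines store boxes that way rather than as a top-left point plus width and height.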

Drawing bounding boxes to show the area of interest
CNN method for ROI extraction and classification
CSV File for labeling

Since we are using the TensorFlow API, and TensorFlow does not accept training input in the .CSV format, we have to convert these files to the TFRecord format.

We feed these TFRecord files to the training job, and the following are the results that we get from the training.
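
The conversion script itself is not shown in this article. As a rough sketch of its first half, the snippet below reads a label CSV and groups the boxes by image, scaling the coordinates to the [0, 1] range that the TensorFlow Object Detection API expects in its records. The column names (filename, width, height, class, xmin, ymin, xmax, ymax) and the sample rows are assumptions about the CSV layout, and the final step of serializing each group into a tf.train.Example is deliberately left out.

```python
import csv
import io
from collections import defaultdict


def group_boxes_by_image(csv_file):
    """Group CSV rows by image and normalize box coordinates to [0, 1]."""
    images = defaultdict(list)
    for row in csv.DictReader(csv_file):
        w, h = float(row["width"]), float(row["height"])
        images[row["filename"]].append({
            "class": row["class"],
            "xmin": float(row["xmin"]) / w,
            "ymin": float(row["ymin"]) / h,
            "xmax": float(row["xmax"]) / w,
            "ymax": float(row["ymax"]) / h,
        })
    return dict(images)


# Hypothetical two-row label file, inlined so the sketch runs on its own.
SAMPLE_CSV = """filename,width,height,class,xmin,ymin,xmax,ymax
img1.jpg,640,480,face,160,120,320,240
img1.jpg,640,480,cup,0,0,64,48
"""

groups = group_boxes_by_image(io.StringIO(SAMPLE_CSV))
print(groups["img1.jpg"][0])
```

Grouping by image matters because one record normally holds a single image together with all of its boxes, not one box per record.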

The following graph shows the loss function of the trained model. To obtain higher accuracy, we trained the model for up to 6000 epochs. (Here, “epochs” refers to the number of training steps; strictly speaking, an epoch is one full pass over the training data.) More training steps generally give a more accurate model, until the loss levels off.

The loss function keeps dropping as the number of epochs increases.

The loss function keeps dropping as the model is trained, as seen in the graph from figure 1. This happens because the model is training and learning.

The part where we are training the model

In most machine learning, the amount a model has learned is inversely related to the loss function during training: as the loss keeps dropping, the model has learned more. Note that this is not the learning rate hyperparameter, which only sets the size of each update step. The more the neural net learns, the greater the recognition accuracy.

Learning during the initial stages of the training

The above graph is the result of the initial stages of training. From figure 2 we can see that the loss function fluctuates during the initial stages, and so does the learning of the model (i.e. the learning after 300 epochs).

Learning during the initial stages of the training

The following graph shows the batch size used during the initial stages of training. The batch size is the number of images fed to the neural net per epoch, that is, per training step. To improve learning, we increased the batch size, which in turn improves accuracy.
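
Because steps, epochs, and batch size are easy to mix up, the relationship between them can be written out explicitly. In the sketch below, the batch size and the number of training images are made-up numbers for illustration (the article does not state them); only the figure of 6000 steps comes from the text.

```python
def passes_over_dataset(steps, batch_size, num_images):
    """Return how many full passes over the data `steps` updates amount to."""
    return steps * batch_size / num_images


# 6000 steps is from the article; batch size 24 and 3600 images are hypothetical.
print(passes_over_dataset(steps=6000, batch_size=24, num_images=3600))  # 40.0
```

With these assumed numbers, doubling the batch size while keeping 6000 steps would double the number of passes over the data, which is one reason a larger batch can appear to improve accuracy.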

Results of the trained model using a live camera and the input image

The following are the results of object detection using input images, video, and a live feed. The initial prototype of the model detected objects in input images supplied by the user. Figure 26 shows the result for an input image, and figure 27 shows the result for the live feed of a web camera.

Input image detection
Face Recognition in webcam
Object detection in webcam
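
For readers who want to try the live-feed setup, the following is a minimal sketch of a webcam loop in OpenCV that draws green boxes around detections. The `detect` argument is a hypothetical stand-in for the trained model: it is assumed to take a BGR frame and return a list of (label, score, box) tuples, with each box given as normalized (ymin, xmin, ymax, xmax), the order used by the TensorFlow Object Detection API. The helper names and the 0.5 score threshold are also assumptions.

```python
def to_pixel_box(box, frame_width, frame_height):
    """Convert a normalized (ymin, xmin, ymax, xmax) box to pixel (x1, y1, x2, y2)."""
    ymin, xmin, ymax, xmax = box
    return (int(xmin * frame_width), int(ymin * frame_height),
            int(xmax * frame_width), int(ymax * frame_height))


def keep_confident(detections, min_score=0.5):
    """Keep only (label, score, box) detections scoring at least min_score."""
    return [d for d in detections if d[1] >= min_score]


def run_webcam(detect, camera_index=0):
    """Show webcam frames with boxes drawn until the user presses q."""
    import cv2  # imported here so the helpers above work without OpenCV

    cap = cv2.VideoCapture(camera_index)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:  # camera unavailable or stream ended
                break
            height, width = frame.shape[:2]
            for label, score, box in keep_confident(detect(frame)):
                x1, y1, x2, y2 = to_pixel_box(box, width, height)
                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
                cv2.putText(frame, f"{label}: {score:.2f}", (x1, max(y1 - 5, 15)),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
            cv2.imshow("detections", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    finally:
        cap.release()
        cv2.destroyAllWindows()
```

Calling run_webcam with a function that wraps the trained model would start the live view; passing `lambda frame: []` simply shows the camera feed with no boxes.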

Hence, we have proposed one of the most versatile methods for object detection and recognition. Further, the method would require only a small, completely supervised training set, so that it can deal with unsupervised datasets that may be added to the system at any stage. The foundation of the proposed project has been laid, and the discussion of existing methods and the future path adequately demonstrates the feasibility of the proposed approach.

