In my last blog, I introduced my project (Poor Man's Rekognition) and shared my own experience and opinions. Here I will discuss the project in greater detail and explain how I plan to build it.
In this project, I am building a free version of Amazon Rekognition with as many of its features as possible. Amazon Rekognition has a single API to recognize and analyze multiple images and video scenes. It can identify objects, people, text, scenes, and activities, as well as detect inappropriate content. Facial recognition is one of its unique selling points: it can detect, analyze, and compare faces for a wide variety of use cases such as user verification and people counting.
I propose to implement 7 use-cases and integrate them with a web app.
1. Face and eye detection using OpenCV
OpenCV comes with a trainer as well as a detector. Here I have used OpenCV for detection; later in the project, I will use it to create an XML file of faces for face recognition. OpenCV already contains many pre-trained cascade classifiers for faces, eyes, smiles, etc., stored as XML files under Library/etc/haarcascades. In this part, I have used the face cascade and the eye cascade to detect faces and eyes in an image. The classifier breaks the image into blocks of pixels and runs a very rough, quick test on each block; if that passes, it runs a slightly more detailed test, and so on. The algorithm may have 30 to 50 of these stages, or cascades, and it will only report a face if all stages pass.
This technique is based on the Viola-Jones algorithm, which is a classical machine learning approach (boosted cascades of hand-crafted features) rather than deep learning. For context: machine learning algorithms can learn in supervised (e.g., classification) and/or unsupervised (e.g., pattern analysis) manners, and Viola-Jones falls into the supervised category.
This face detection step is also used in the facial recognition section, where a detected but unrecognized face will be saved in the database as a face with no name registered.
2. Face recognition using deep metric learning

For face recognition I have used deep learning face embeddings; this technique is called deep metric learning. In deep learning, a network is typically trained to:
1. Accept a single input image
2. And output a classification/label for that image
However, deep metric learning is different: instead of outputting a single label (or even the coordinates/bounding boxes of objects in an image), the network outputs a real-valued feature vector. For the dlib facial recognition network, the output feature vector is 128-d (i.e., a list of 128 real-valued numbers) and is used to quantify the face. Training the network is done using triplets:
Facial recognition via deep metric learning involves a “triplet training step”.
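The idea behind the triplet step can be sketched in a few lines. Here is a toy triplet loss in NumPy; the 2-d vectors and the margin value are illustrative, not the ones dlib uses:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin): pull the anchor toward the
    positive (same person) and push it away from the negative (different person)."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])       # anchor embedding (toy 2-d instead of 128-d)
p = np.array([0.1, 0.0])       # same identity, already close -> loss is 0
n = np.array([1.0, 0.0])       # different identity, far away
print(triplet_loss(a, p, n))   # -> 0.0
```

Minimizing this loss over many triplets is what makes embeddings of the same person cluster together in the 128-d space.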
I first created a database for the training set, encoded each face image into a 128-d numpy array, and turned the encodings into an XML file. Second, I imported that trained XML file into the main script to detect and recognize faces.
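At recognition time, matching a new encoding against the database reduces to a nearest-neighbour search in the 128-d space. A sketch with toy random vectors standing in for real dlib encodings (the 0.6 tolerance mirrors the commonly used dlib default):

```python
import numpy as np

def match_face(known_encodings, known_names, candidate, tolerance=0.6):
    """Return the name of the closest stored encoding, or None if no
    Euclidean distance falls within the tolerance."""
    distances = np.linalg.norm(np.asarray(known_encodings) - candidate, axis=1)
    best = int(np.argmin(distances))
    return known_names[best] if distances[best] <= tolerance else None

# Toy 128-d vectors standing in for real face encodings
rng = np.random.default_rng(0)
alice, bob = rng.normal(size=128), rng.normal(size=128)
db, names = [alice, bob], ["alice", "bob"]

print(match_face(db, names, alice + 0.01))   # slight perturbation -> alice
print(match_face(db, names, np.zeros(128)))  # far from everyone -> None
```

The "unrecognized face" path from the detection section corresponds to the None branch: the face is saved with no name registered.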
3. Celebrity recognition

This part is the same as the one above; the only reason I made it a separate section is that this feature is listed in Amazon Rekognition, and as this is a similar project I have to add an additional name tag and create a whole new dataset consisting of many known actors. Here I have also used deep metric learning techniques.
4. Object detection in TensorFlow
Object detection can be done in multiple ways, such as feature-based methods, Viola-Jones, SVM classification with HOG features, ResNet-50, and deep learning. I have used TensorFlow, Google's open source machine learning framework for dataflow programming across a range of tasks, because its many features make it well suited to deep learning. Tensors are just multidimensional arrays, an extension of 2-dimensional tables to data with higher dimensions. In a TensorFlow graph, nodes represent mathematical operations, while the edges represent the multidimensional data arrays (tensors) communicated between them.
Object detection algorithms work on a principle known as feature extraction: features are extracted from the input image at hand, and these features are used to determine the class of the image.
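To make the feature-extraction idea concrete, here is a deliberately simple classifier that uses a colour histogram as its feature vector; this is a toy stand-in for the learned features a TensorFlow detector would extract, not the project's actual pipeline:

```python
import numpy as np

def color_histogram(image, bins=8):
    """Feature extraction: per-channel intensity histograms, flattened and normalised."""
    feats = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    v = np.concatenate(feats).astype(float)
    return v / v.sum()

def classify(image, prototypes):
    """Assign the class whose prototype feature vector is nearest in Euclidean distance."""
    feat = color_histogram(image)
    return min(prototypes, key=lambda name: np.linalg.norm(prototypes[name] - feat))

# Two toy "classes": a mostly-red and a mostly-blue image
red = np.zeros((16, 16, 3), np.uint8);  red[..., 0] = 220
blue = np.zeros((16, 16, 3), np.uint8); blue[..., 2] = 220
protos = {"red_object": color_histogram(red), "blue_object": color_histogram(blue)}

sample = np.zeros((16, 16, 3), np.uint8); sample[..., 0] = 200
print(classify(sample, protos))  # -> red_object
```

A deep network follows the same extract-then-classify pattern, only the features are learned from data instead of being hand-designed histograms.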
5. Read text from images using Tesseract
The process of extracting text from an image is called Optical Character Recognition (OCR). Here I am going to use Tesseract and OpenCV as open source tools. I will also implement this using Kraken so as to compare the results. The reasons for using Tesseract first instead of Kraken are:
1. Tesseract works best on high-definition images.
2. It works better when a normal picture is first converted into grayscale.
3. Its processing time is much better.
Steps of how OCR works:
- Remove all noise particles from the image so that only the characters remain.
- Align the obtained characters and convert them into a black-and-white format.
- Move through the image pixel by pixel and compare the obtained character data against the database to find a matching character.
6. Facial analysis using CNN
This use-case is divided into 2 sections:
- First, detect the frontal face in an image using Haar feature-based cascade classifiers. This part I have already completed in the first use-case.
- Second, train an Xception CNN model architecture that takes a bounded face (48×48 pixels) as input and predicts the probabilities of 7 emotions in the output layer.
I will also have to download the facial expression recognition (FER) dataset. The data consists of 48×48 pixel grayscale images of faces. The faces have been automatically registered so that each face is more or less centered and occupies about the same amount of space in each image. The task is to categorize each face, based on the emotion shown in the facial expression, into one of seven categories (0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral).
The trained CNN architecture is comparatively small and achieves almost state-of-the-art performance at classifying emotions on this dataset.
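Decoding the output layer into one of the seven FER categories is straightforward; here is a sketch of that last step (the CNN itself is omitted, and the logit values are made up for illustration):

```python
import numpy as np

# FER label order: 0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral
EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

def softmax(logits):
    e = np.exp(logits - np.max(logits))   # shift for numerical stability
    return e / e.sum()

def decode_emotion(logits):
    """Map the network's 7 output scores to (label, probability)."""
    probs = softmax(np.asarray(logits, dtype=float))
    i = int(np.argmax(probs))
    return EMOTIONS[i], float(probs[i])

label, prob = decode_emotion([0.1, 0.0, 0.2, 2.5, 0.3, 0.1, 0.4])
print(label)  # -> Happy
```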
7. Scene detection in a video

Here I will just be detecting scene changes in a video, be it any kind of scene. Before and after a scene change there are sharp drops in the frame-content graph, which indicate that an event occurred in that period of time.
The algorithm will essentially have 2 different methods for detecting scene changes:
1. The content method compares each frame sequentially, looking for changes in content, which is useful for detecting quick cuts between scenes. This is the default method.
2. The threshold method compares each frame to a set black level. It is faster than the content method, but useful only when there are cuts and fades to/from black.
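The content method can be sketched as a frame-to-frame difference check. This is a simplified stand-in for what a library such as PySceneDetect does, and the threshold value is illustrative:

```python
import numpy as np

def detect_cuts(frames, threshold=30.0):
    """Flag a scene change wherever the mean absolute pixel difference
    between consecutive frames exceeds the threshold (content method)."""
    cuts = []
    for i in range(1, len(frames)):
        diff = np.mean(np.abs(frames[i].astype(float) - frames[i - 1].astype(float)))
        if diff > threshold:
            cuts.append(i)
    return cuts

# Toy video: three dark frames, then a hard cut to three bright frames
dark = [np.full((8, 8), 10, np.uint8)] * 3
bright = [np.full((8, 8), 200, np.uint8)] * 3
print(detect_cuts(dark + bright))  # -> [3]
```

The threshold method would instead compare each frame's mean brightness against a fixed black level, which is cheaper but only catches fades and cuts to/from black.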
Next, I will be writing a blog for week 1 of GSoC.
References:
- Rapid Object Detection using a Boosted Cascade of Simple Features — https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/viola-cvpr-01.pdf
- Face recognition with OpenCV, Python, and deep learning — https://www.pyimagesearch.com/2018/06/18/face-recognition-with-opencv-python-and-deep-learning/
- Histograms of Oriented Gradients for Human Detection — https://hal.inria.fr/file/index/docid/548512/filename/hog_cvpr2005.pdf
- Rich feature hierarchies for accurate object detection and semantic segmentation — https://www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Girshick_Rich_Feature_Hierarchies_2014_CVPR_paper.pdf
- Using Tesseract OCR with Python — https://www.pyimagesearch.com/2017/07/10/using-tesseract-ocr-python/
- Facial Expression Recognition — https://www.researchgate.net/publication/227031714_Facial_Expression_Recognition
- Deep Learning based Text Recognition (OCR) using Tesseract and OpenCV — https://www.learnopencv.com/deep-learning-based-text-recognition-ocr-using-tesseract-and-opencv/
- CNNs for Face Detection and Recognition — http://cs231n.stanford.edu/reports/2017/pdfs/222.pdf