Celebrity Recognition: Building a Custom Scalable Pipeline
Celebrity recognition is one of the most popular use cases in computer vision, driven by the growing consumption of visual content over the internet. Detecting people of interest (politicians, actors, sportsmen, businessmen, journalists, activists, etc.) in an image or video is an interesting and challenging problem to solve. Annotated content helps businesses build better recommender systems and provide personalised user experiences.
Data Curation
Create one named folder per person, each holding multiple images, preferably covering diverse facial expressions, poses, ages and make-up. This diversity lets the model learn varied face encodings for the same person, which in turn boosts prediction confidence across different settings.
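As a minimal sketch of how such a folder layout could be read, the snippet below maps each folder name (treated as the celebrity label) to the image files inside it. The function name list_celebrity_images and the accepted file suffixes are assumptions made for illustration.

```python
from pathlib import Path

# Assumed image file types; adjust to whatever the curated dataset contains.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def list_celebrity_images(root):
    """Map each folder name (the celebrity label) to its sorted image paths."""
    dataset = {}
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue
        images = sorted(
            p for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
        )
        if images:  # skip folders that hold no images
            dataset[folder.name] = images
    return dataset
```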
Training Data Generation
Encode all the detected faces in each celebrity folder. Numerous open-source CV- and DL-based algorithms are available to detect facial bounding boxes, locate facial landmarks and encode faces. Choose one based on the available compute resources and the desired performance. For each detected face, save the celebrity name, the bounding box coordinates, the encoding vector and a copy of the image with the box drawn around the face.
Every image in the dataset is passed through a multitask cascaded convolutional network (MTCNN) to detect and align the face(s) in it. These faces are then passed through a deep CNN model to extract a 128-dimensional feature vector.
Data Cleaning
More often than not, a significant share of the curated images is unusable as-is, either because a single image contains multiple faces (of the same or different people) or because images were placed in the wrong folders.
Graph algorithms are a promising way to filter out the correct face encodings per celebrity automatically, because they can discover structural relationships between components.
A network graph is built in which each node represents a detected face (celebrity name, bounding box coordinates, encoding, image with marked bounding box) and each edge represents the relationship between two nodes. An adjacency matrix is used to build an undirected, unweighted graph. Faces with pair-wise distance < 0.5 are considered similar and share an edge, whereas pairs with distance >= 0.5 are not connected.
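The thresholding rule can be sketched in a few lines. This is an illustration only: it assumes Euclidean distance (the text just says "distance"), uses toy 2-D vectors in place of 128-D encodings, and the name build_adjacency is hypothetical.

```python
import math

def build_adjacency(encodings, threshold=0.5):
    """Return a symmetric 0/1 adjacency matrix for an undirected, unweighted graph."""
    n = len(encodings)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Assumption: Euclidean distance between encodings.
            if math.dist(encodings[i], encodings[j]) < threshold:
                adjacency[i][j] = adjacency[j][i] = 1  # similar faces share an edge
    return adjacency

# Toy 2-D vectors stand in for 128-D face encodings.
faces = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.7), (2.0, 2.0)]
adjacency = build_adjacency(faces)
```

In this toy input only the first two faces are closer than 0.5, so they are the only pair connected by an edge.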
Community Detection: This identifies highly connected, tightly knit sub-groups of nodes within a complex network to reveal its community structure.
As seen in the images above, the top communities contain non-overlapping face groups, with differences among them that range from apparent to indistinct. Communities with fewer vertices than a certain percentage of the total number of vertices can be ignored. The retained communities are merged to generate an exhaustive list of relevant faces.
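The filter-and-merge step might look like the sketch below. It assumes the communities have already been produced by whichever detection algorithm is in use and are given as collections of node ids; the name merge_large_communities and the 20% cut-off in the example are illustrative choices, not values from the pipeline.

```python
def merge_large_communities(communities, total_vertices, min_fraction=0.1):
    """Drop communities smaller than min_fraction of all vertices, merge the rest."""
    merged = set()
    for community in communities:
        if len(community) >= min_fraction * total_vertices:
            merged |= set(community)  # retained community joins the final list
    return sorted(merged)

# Three toy communities over a 10-node graph.
communities = [{0, 1, 2, 3, 4, 5}, {6, 7, 8}, {9}]
relevant_faces = merge_large_communities(communities, total_vertices=10, min_fraction=0.2)
```

Here the single-node community falls below the cut-off and is ignored, while the other two are merged.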
Face Clustering: Chinese whispers is a graph clustering algorithm that separates the different people in a folder by assigning their images to different clusters.
The top clusters (other than the celebrity's) generally belong to family members, spouses, friends, contemporaries, doppelgängers, etc. Long-tail clusters are mostly noisy and sparse.
The largest cluster is assumed to be the celebrity's cluster, since most images in a correctly named folder should show that person. This works in most scenarios, except when the folder itself is misnamed or the majority of its images are incorrect. Alternatively, we can use a few reference images of the celebrity to confirm the cluster choice: average the encoding vectors of each top cluster and choose the cluster whose mean encoding is closest to the reference encoding.
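For intuition, here is a small, simplified sketch of Chinese whispers on an unweighted graph. Every node starts in its own cluster and repeatedly adopts the most common label among its neighbours. One deliberate simplification: ties are broken by the smallest label rather than at random, which keeps the example reproducible. The function name and parameters are assumptions for illustration.

```python
import random
from collections import Counter

def chinese_whispers(adjacency, iterations=20, seed=0):
    """Cluster nodes of an unweighted graph given as a 0/1 adjacency matrix."""
    rng = random.Random(seed)
    n = len(adjacency)
    labels = list(range(n))  # every node starts in its own cluster
    order = list(range(n))
    for _ in range(iterations):
        rng.shuffle(order)  # visit nodes in a fresh random order each pass
        changed = False
        for node in order:
            votes = Counter(labels[nb] for nb in range(n) if adjacency[node][nb])
            if not votes:
                continue  # isolated node keeps its own label
            top = max(votes.values())
            # Simplification: break ties by smallest label instead of randomly.
            best = min(label for label, count in votes.items() if count == top)
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break  # labels are stable, stop early
    return labels

# Two disconnected triangles (two people) plus one isolated face.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
adjacency = [[0] * 7 for _ in range(7)]
for a, b in edges:
    adjacency[a][b] = adjacency[b][a] = 1
labels = chinese_whispers(adjacency)
```

Each triangle ends up sharing a single label, and the isolated face stays in a cluster of its own.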
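The reference-based check can be sketched as follows, assuming Euclidean distance and a single reference encoding. The names mean_vector and pick_celebrity_cluster, and the top_k parameter, are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def pick_celebrity_cluster(clusters, reference, top_k=3):
    """Among the top_k largest clusters, return the id whose mean encoding
    is closest to the reference encoding."""
    largest = sorted(clusters, key=lambda cid: len(clusters[cid]), reverse=True)[:top_k]
    return min(largest, key=lambda cid: math.dist(mean_vector(clusters[cid]), reference))

# Toy 2-D encodings: cluster "a" is the largest, but "b" matches the reference.
clusters = {
    "a": [(0.0, 0.0), (0.2, 0.0), (0.1, 0.1)],
    "b": [(1.0, 1.0), (1.2, 1.0)],
}
chosen = pick_celebrity_cluster(clusters, reference=(1.1, 1.0))
```

The example shows the case the largest-cluster heuristic would get wrong: the biggest cluster is not the one that matches the reference.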
Model Building
Discard the rejected clusters and retain only the encodings from the chosen cluster as training data for the celebrity recognition model.
Classification Model: Build a classifier on the labeled face encodings using an SVM or KNN algorithm. KNN is an instance-based lazy learner that does not need to delineate the boundaries between classes, so we can add more celebrities without re-training the model. Lazy learners are slow to query, however, which hurts inference run-time performance. SVM, on the other hand, is an eager learner but requires re-training whenever we add new celebrities.
Approximate Nearest Neighbours Search: ANN search algorithms scan large search spaces faster at the cost of reduced accuracy. Indexed search is a lot faster than brute-force search, where the query is compared against every observation to find the most similar matches. The trade-off between accuracy and retrieval speed can be tuned through the number-of-trees and number-of-nodes parameters.
We can feed the predicted labels for test-set images back into the training dataset to augment the training data in a semi-supervised manner.
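To make the lazy-learner point concrete, here is a minimal brute-force KNN sketch (not a production classifier). "Training" only stores labelled encodings, so a new person can be added at any time, while every prediction compares the query against all stored encodings. The class name KnnFaceClassifier and the toy 2-D data are assumptions.

```python
import math
from collections import Counter

class KnnFaceClassifier:
    """Brute-force k-nearest-neighbour classifier over face encodings."""

    def __init__(self, k=3):
        self.k = k
        self.encodings = []
        self.names = []

    def add(self, name, encodings):
        # "Training" a lazy learner is just storing the labelled examples.
        for encoding in encodings:
            self.encodings.append(encoding)
            self.names.append(name)

    def predict(self, query):
        # All the work happens at query time: rank every stored encoding.
        ranked = sorted(range(len(self.encodings)),
                        key=lambda i: math.dist(self.encodings[i], query))
        votes = Counter(self.names[i] for i in ranked[:self.k])
        return votes.most_common(1)[0][0]

model = KnnFaceClassifier(k=3)
model.add("person_a", [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)])
model.add("person_b", [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1)])
first = model.predict((0.05, 0.05))

# A new celebrity is added later; nothing is re-trained.
model.add("person_c", [(3.0, 3.0), (3.1, 3.0), (3.0, 3.1)])
second = model.predict((3.0, 3.05))
```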
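The idea behind tree-based indexing can be illustrated with a toy random-projection forest. This is a simplified sketch written for this article, not the API of any particular ANN library: each tree recursively splits the points with a random hyperplane, and a query only ranks the points found in the leaf it reaches in each tree. The names build_forest, query, n_trees and leaf_size are assumptions; in this sketch, more trees or larger leaves mean more candidates, so better recall at lower speed.

```python
import math
import random

def _side(point, normal, offset):
    """True if the point lies on the positive side of the hyperplane."""
    return sum(n * p for n, p in zip(normal, point)) > offset

def _build_tree(points, ids, leaf_size, rng):
    """Recursively split ids by a random hyperplane until leaves are small."""
    if len(ids) <= leaf_size:
        return ids  # leaf: a list of point ids
    a, b = rng.sample(ids, 2)
    # Hyperplane halfway between two randomly chosen points.
    normal = [x - y for x, y in zip(points[a], points[b])]
    midpoint = [(x + y) / 2 for x, y in zip(points[a], points[b])]
    offset = sum(n * m for n, m in zip(normal, midpoint))
    left = [i for i in ids if _side(points[i], normal, offset)]
    right = [i for i in ids if not _side(points[i], normal, offset)]
    if not left or not right:
        return ids  # degenerate split (e.g. duplicate points): stop here
    return (normal, offset,
            _build_tree(points, left, leaf_size, rng),
            _build_tree(points, right, leaf_size, rng))

def build_forest(points, n_trees=5, leaf_size=4, seed=0):
    rng = random.Random(seed)
    ids = list(range(len(points)))
    return [_build_tree(points, ids, leaf_size, rng) for _ in range(n_trees)]

def query(points, forest, vector, top_n=3):
    """Collect candidates from one leaf per tree, then rank only those."""
    candidates = set()
    for node in forest:
        while isinstance(node, tuple):  # descend until a leaf is reached
            normal, offset, left, right = node
            node = left if _side(vector, normal, offset) else right
        candidates.update(node)
    return sorted(candidates, key=lambda i: math.dist(points[i], vector))[:top_n]

# 200 random 8-D vectors stand in for face encodings.
data_rng = random.Random(1)
points = [[data_rng.random() for _ in range(8)] for _ in range(200)]
forest = build_forest(points, n_trees=5, leaf_size=4)
nearest = query(points, forest, points[17])
```

Because only a handful of leaves are inspected, the true nearest neighbour of an unseen query can occasionally be missed, which is the accuracy cost mentioned above.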
Model Inferencing
Images: Using the indexed files, retrieve the faces most similar to the query face encoding whose distance is below a pre-set threshold. Return the relative frequency of the modal (most frequent) name among them as the confidence score.
Videos: Extract the key frames of a video and build face clusters. It is fair to assume that a video generally contains only a couple of people, so running inference on every detected face would be computationally expensive. Add a pre-processing layer to remove near-duplicate faces and run inference only on the most distant pair of faces per cluster. Ideally, even the two faces with the largest pair-wise distance in a cluster should resolve to the same celebrity name. Return all the predicted names from the different clusters as the output.
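A minimal sketch of this scoring rule is shown below. A brute-force scan stands in for the indexed lookup, Euclidean distance is assumed, and the function name, top_n and max_distance values are illustrative.

```python
import math
from collections import Counter

def predict_with_confidence(index, query, top_n=5, max_distance=0.5):
    """index: list of (name, encoding) pairs. Returns (name, confidence),
    or (None, 0.0) when no stored face is close enough."""
    ranked = sorted(index, key=lambda item: math.dist(item[1], query))[:top_n]
    matches = [name for name, enc in ranked if math.dist(enc, query) < max_distance]
    if not matches:
        return None, 0.0  # nothing within the threshold: treat as unknown
    name, count = Counter(matches).most_common(1)[0]
    return name, count / len(matches)  # relative frequency of the modal name

index = [
    ("person_a", (0.0, 0.0)), ("person_a", (0.1, 0.0)), ("person_a", (0.0, 0.1)),
    ("person_b", (0.3, 0.3)),
    ("person_c", (5.0, 5.0)),
]
name, confidence = predict_with_confidence(index, (0.05, 0.05))
unknown = predict_with_confidence(index, (10.0, 10.0))
```

In the toy index, four faces fall within the threshold and three of them carry the same name, so that name is returned with a confidence of 3/4.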
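The per-cluster shortcut can be sketched as follows. The predict argument is a hypothetical callable that maps one encoding to a name (for example, a recognition model like the ones discussed earlier), and returning None when the two extreme faces disagree is an assumption of this sketch rather than something the pipeline prescribes.

```python
import math
from itertools import combinations

def most_distant_pair(encodings):
    """Return the indices of the two encodings that are farthest apart."""
    if len(encodings) < 2:
        return tuple(range(len(encodings)))
    return max(combinations(range(len(encodings)), 2),
               key=lambda pair: math.dist(encodings[pair[0]], encodings[pair[1]]))

def label_cluster(encodings, predict):
    """Run `predict` on the extreme pair only; accept the name if both agree."""
    names = {predict(encodings[i]) for i in most_distant_pair(encodings)}
    # Assumption: a disagreement means the cluster is left unlabelled.
    return names.pop() if len(names) == 1 else None

# Toy cluster of three face encodings from one video.
cluster = [(0.0, 0.0), (0.1, 0.0), (0.4, 0.0)]
pair = most_distant_pair(cluster)
agreed = label_cluster(cluster, lambda enc: "person_a")
disagreed = label_cluster(cluster, lambda enc: "a" if enc[0] < 0.2 else "b")
```

Only two predictions are made per cluster regardless of how many faces it contains, which is where the savings come from.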