Celebrity Recognition: Building a Custom Scalable Pipeline

DS Musings · Published in Analytics Vidhya · 6 min read · Aug 3, 2020

Celebrity recognition is one of the most popular use cases in computer vision, driven by the growing consumption of visual content over the internet. Detecting people of interest (politicians, actors, sportsmen, businessmen, journalists, activists, etc.) in an image or video is an interesting and challenging problem. Annotated content helps businesses build better recommender systems and provide personalised user experiences.

Data Curation

Create a named folder for each person containing multiple images, preferably covering a range of facial expressions, poses, ages and make-up. This diversity lets the pipeline learn varied face encodings, which in turn boosts prediction confidence across different settings.

Training Data Generation

Encode all the detected faces in each celebrity folder. Numerous open-source computer vision (CV) and deep learning (DL) based algorithms are available to detect facial bounding boxes and landmarks and to encode faces. Choose one based on the available compute resources and the desired performance. For each face, save the celebrity name, the bounding box coordinates, the encoding vector and a marked image with the box drawn around the face.

Every image in the dataset is passed through a multi-task cascaded convolutional network (MTCNN) to detect and align the face(s) inside. These faces are then passed through a deep CNN model to extract a 128-D feature vector.

Data Cleaning

More often than not, a significant share of the curated images is unusable, either because a single image contains multiple faces (of the same or different people) or because images were placed in the wrong folders.

To filter out the correct face encodings per celebrity automatically, graph algorithms are promising because they can discover structural relationships between components.

A network graph is built in which each node represents a detected face (celebrity name, bounding box coordinates, encoding, image with marked bounding box) and each edge represents the relationship between two nodes. An adjacency matrix is used to build an undirected, unweighted graph. Faces with a pairwise distance < 0.5 are considered similar and share an edge, whereas pairs with a distance >= 0.5 share no relationship.
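
The thresholding rule above can be sketched in a few lines. This is a minimal illustration rather than the pipeline's actual code: it assumes Euclidean distance between encodings (the text does not name the metric), stores the graph as an adjacency list instead of a matrix, and uses toy 2-D vectors in place of real 128-D encodings. The names `euclidean` and `build_face_graph` are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length encoding vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def build_face_graph(encodings, threshold=0.5):
    """Return an undirected, unweighted adjacency list.

    Node i and node j share an edge when their distance is below the threshold.
    """
    n = len(encodings)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if euclidean(encodings[i], encodings[j]) < threshold:
                graph[i].add(j)
                graph[j].add(i)
    return graph

# Toy 2-D "encodings": two tight groups of faces plus one outlier.
toy_encodings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2),
                 (2.0, 2.0), (2.1, 2.0),
                 (5.0, 5.0)]
graph = build_face_graph(toy_encodings)
print(graph)
```

The outlier ends up with no edges at all, which is what lets the later graph steps isolate it.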

Community Detection: This identifies highly connected, tightly knit sub-groups of nodes within a complex network to reveal its community structure.

Different colours represent communities. Anomalous faces are isolated from major communities as outliers.
Top communities for Angelina Jolie

As seen in the images above, the top communities contain non-overlapping face groups whose differences range from apparent to subtle. Communities with fewer vertices than a certain percentage of the total number of vertices can be ignored. The retained communities are merged to generate an exhaustive list of relevant faces.
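
The filter-and-merge step might look like the sketch below. It assumes the communities have already been produced by some community detection algorithm (the text does not say which one) and are given as non-overlapping sets of node ids. The function name and the `min_fraction` parameter are hypothetical, and the cut-off value is purely illustrative.

```python
def merge_major_communities(communities, min_fraction=0.1):
    """Drop communities holding fewer than min_fraction of all vertices, then merge the rest."""
    total = sum(len(c) for c in communities)
    kept = [c for c in communities if len(c) >= min_fraction * total]
    merged = set().union(*kept) if kept else set()
    return sorted(merged)

# Three communities over 10 faces; the single-face community is treated as noise.
communities = [{0, 1, 2, 3, 4, 5}, {6, 7, 8}, {9}]
relevant_faces = merge_major_communities(communities, min_fraction=0.2)
print(relevant_faces)
```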

Face Clustering: Chinese whispers is a graph clustering algorithm. Here it separates the different people in a folder by assigning their images to different clusters.
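
For readers unfamiliar with the algorithm, here is a minimal sketch of Chinese whispers on an unweighted adjacency-list graph. Every node starts in its own cluster and repeatedly adopts the most common label among its neighbours. One deliberate simplification: ties are broken by taking the smallest label so that the run is reproducible, whereas the usual formulation breaks ties at random. The `iterations` and `seed` parameters are illustrative assumptions.

```python
import random
from collections import Counter

def chinese_whispers(graph, iterations=20, seed=0):
    """Label the nodes of an undirected, unweighted adjacency-list graph."""
    rng = random.Random(seed)
    labels = {node: node for node in graph}  # every node starts in its own cluster
    nodes = list(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)  # visit nodes in a random order on each pass
        changed = False
        for node in nodes:
            if not graph[node]:
                continue  # an isolated face keeps its own label
            counts = Counter(labels[nb] for nb in graph[node])
            top = max(counts.values())
            # Assumption: ties go to the smallest label (the usual variant picks randomly).
            best = min(lbl for lbl, c in counts.items() if c == top)
            if labels[node] != best:
                labels[node] = best
                changed = True
        if not changed:
            break  # labels are stable, stop early
    return labels

# Two groups of mutually similar faces and one isolated face.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1},
         3: {4, 5}, 4: {3, 5}, 5: {3, 4},
         6: set()}
labels = chinese_whispers(graph)

clusters = {}
for node, label in labels.items():
    clusters.setdefault(label, []).append(node)
print(clusters)
```

The number of clusters is not specified in advance; it falls out of the graph structure, which suits folders where the number of distinct people is unknown.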

Benedict Cumberbatch cluster: Observe diverse facial expressions and character makeovers
Arnold Schwarzenegger cluster: Observe diverse facial makeovers, physical transformations and ageing
Angelina Jolie look-alike cluster in Angelina Jolie folder & Marilyn Monroe cluster in Audrey Hepburn folder

Top clusters (other than the celebrity's) generally belong to family members, spouses, friends, contemporaries, doppelgängers, etc. Long-tail clusters are mostly noisy and sparse.

The largest cluster is assumed to be the celebrity cluster, since most images in a folder should show the named person. This works in most scenarios, except when the folder itself is misnamed or the majority of its images are incorrect. Alternatively, we can use a few reference images of the celebrity to confirm the cluster choice: average the encoding vectors of each top cluster and choose the cluster whose mean encoding is closest to the reference encoding.
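
Both selection rules can be sketched together. The code below is an illustration under stated assumptions: Euclidean distance is assumed for comparing encodings, the reference is a single encoding vector, and the function names and toy 2-D data are hypothetical.

```python
import math

def mean_encoding(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def pick_celebrity_cluster(clusters, reference=None):
    """Return the key of the chosen cluster.

    Without a reference encoding, pick the largest cluster. With one, pick the
    cluster whose mean encoding is closest to the reference.
    """
    if reference is None:
        return max(clusters, key=lambda k: len(clusters[k]))
    return min(clusters, key=lambda k: math.dist(mean_encoding(clusters[k]), reference))

# Toy folder in which the largest cluster "a" is NOT the celebrity.
clusters = {
    "a": [[0.0, 0.0], [0.2, 0.0], [0.1, 0.3]],
    "b": [[1.0, 1.0], [1.2, 1.0]],
}
print(pick_celebrity_cluster(clusters))                          # largest-cluster rule
print(pick_celebrity_cluster(clusters, reference=[1.1, 0.9]))    # reference rule
```

In this toy folder the two rules disagree, which is the misnamed-folder situation where a reference image earns its keep.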

Model Building

Discard the rejected clusters and retain encodings only from the chosen cluster as training data for the celebrity recognition model.

Classification Model: Build a classifier on the labeled face encodings using an SVM or KNN algorithm. KNN is an instance-based lazy learner that does not need to delineate the boundaries between classes, so we can add more celebrities without re-training the model. Lazy learners are slow to query, however, which hurts inference run-time performance. An SVM, on the other hand, is an eager learner, but it requires re-training whenever we add new celebrities.

Approximate Nearest Neighbours Search: ANN search algorithms scan large search spaces faster at the cost of reduced accuracy. Indexed search is a lot faster than brute-force search, in which the query is compared against every observation to find the most similar matches. The trade-off between accuracy and retrieval speed can be tuned through the number-of-trees and number-of-nodes parameters.

We can feed the predicted labels for test-set images back into the training dataset to augment it in a semi-supervised manner.

Model Inferencing

Images: Using the index files, retrieve the faces most similar to the query face encoding whose distance is below a pre-set threshold. Return the most frequent celebrity name among them, with its relative frequency as the confidence score.
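
The voting logic can be illustrated with a brute-force stand-in for the indexed search. This sketch assumes Euclidean distance, and it computes the confidence as the share of the modal name among the neighbours that pass the threshold, which is one reasonable reading of "relative frequency". The function name and the parameters `k` and `max_distance` are hypothetical.

```python
import math
from collections import Counter

def predict_celebrity(query, index, k=5, max_distance=0.5):
    """Brute-force stand-in for an ANN index lookup.

    index is a list of (name, encoding) pairs.
    Returns (name, confidence), or (None, 0.0) when no neighbour is close enough.
    """
    scored = sorted((math.dist(query, enc), name) for name, enc in index)
    names = [name for dist, name in scored[:k] if dist < max_distance]
    if not names:
        return None, 0.0
    name, votes = Counter(names).most_common(1)[0]
    return name, votes / len(names)

# Toy 2-D index: three faces of "A" and two of "B".
index = [("A", (0.0, 0.0)), ("A", (0.1, 0.0)), ("A", (0.0, 0.1)),
         ("B", (0.3, 0.0)), ("B", (3.0, 3.0))]

print(predict_celebrity((0.05, 0.05), index))   # close to the "A" faces
print(predict_celebrity((10.0, 10.0), index))   # far from everything
```

Returning no name for a face with no close neighbours keeps unknown people from being forced into a celebrity label.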

Inference output on images: face bounding box, celebrity name and confidence score

Videos: Extract the key frames of a video and build face clusters. It is fair to assume that a video generally contains only a few people, so running inference on every detected face would be needlessly expensive. Add a pre-processing layer to remove near-duplicate faces and run inference only on the most distant pair of faces in each cluster. Ideally, even the faces with the largest pairwise distance in a given cluster should resolve to the same celebrity name. Return all the predicted names from the different clusters as the output.
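
The pre-processing layer could be sketched as follows for a single face cluster. The greedy de-duplication rule and the `min_gap` value are assumptions made for illustration (the text does not say how near-duplicates are detected), Euclidean distance is assumed, and the function names are hypothetical.

```python
import math
from itertools import combinations

def drop_near_duplicates(encodings, min_gap=0.1):
    """Greedily keep a face only if it is at least min_gap away from every face kept so far."""
    kept = []
    for enc in encodings:
        if all(math.dist(enc, other) >= min_gap for other in kept):
            kept.append(enc)
    return kept

def farthest_pair(encodings):
    """Return the two encodings with the largest pairwise distance.

    A cluster with fewer than two faces is returned unchanged.
    """
    if len(encodings) < 2:
        return list(encodings)
    return list(max(combinations(encodings, 2), key=lambda pair: math.dist(*pair)))

# Toy 2-D cluster from one video: the second face nearly duplicates the first.
cluster = [(0.0, 0.0), (0.01, 0.0), (0.3, 0.0), (0.1, 0.2)]
kept = drop_near_duplicates(cluster)
to_infer = farthest_pair(kept)
print(kept)
print(to_infer)
```

Only the faces in `to_infer` would then be sent to the recognition model; if both resolve to the same name, that name is reported for the whole cluster.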

Ben Affleck, Anna Kendrick and interviewer clusters. Farthest face pairs per cluster are marked in yellow.
