Countermeasures to tackle spoofing challenges in face recognition

Published in BetterPlace · Jun 6, 2022

Pawan Kumar Singh, Kaiwalya Shukla

Face recognition has become a widely used technology worldwide, especially since the Covid-19 outbreak, owing to its contactless, biometric nature. It has found applications in security and surveillance, authentication and attendance management, remote access to control systems, digital healthcare and more, eliminating the need for manual scanning. However, it comes with its fair share of challenges, such as distinguishing similar-looking faces and spoofing.

BetterPlace, a SaaS-based HRMS platform that enables enterprises to manage the entire life cycle of their blue- and grey-collar workforce, uses a face recognition-based attendance system called ‘Attend’. Attend has grown from contributing 0.3 percent of revenue in FY 20–21 to 3 percent in FY 21–22, on a significantly broader revenue base. Like any other face recognition product, Attend has faced its share of challenges. In this blog, we look at how BetterPlace used machine learning to resolve the problem of spoofing — using a video or photo to mark the attendance of a Target Group (TG) member who is not at the site.

How does ‘Attend’ use face recognition?

Attend uses a two-step process. First, an employee's face is captured along with their basic information, and the face is indexed as a vector in a collection specific to the employee's organization. Second, attendance is marked: a live image of the person is captured and matched against the pre-registered vectors using face recognition. We use ML Kit for live face recognition, and internal modules for face registration (creating the vector representation of a face) and for search.
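For illustration only, here is a minimal sketch of the registration-and-search flow described above, using a plain in-memory dictionary and cosine similarity. The embedding model, the matching threshold and the function names (register_face, match_face) are hypothetical; the actual internal modules work differently at scale.

```python
import numpy as np

# Hypothetical in-memory "collection": employee_id -> L2-normalized face embedding.
# In production this would be an indexed vector store, scoped per organization.
collection = {}

def _normalize(vec):
    vec = np.asarray(vec, dtype=np.float32)
    return vec / (np.linalg.norm(vec) + 1e-9)

def register_face(employee_id, embedding):
    """Step 1: index the employee's face embedding in the organization's collection."""
    collection[employee_id] = _normalize(embedding)

def match_face(live_embedding, threshold=0.6):
    """Step 2: match a live capture against registered embeddings via cosine similarity."""
    query = _normalize(live_embedding)
    best_id, best_score = None, -1.0
    for employee_id, stored in collection.items():
        score = float(np.dot(query, stored))  # cosine similarity of unit vectors
        if score > best_score:
            best_id, best_score = employee_id, score
    return (best_id, best_score) if best_score >= threshold else (None, best_score)
```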

Here are some of the challenges that Attend has faced in the past:

  1. Image quality issues affecting attendance marking in offline mode
  2. Spoofing — use of a video or photo to mark the attendance of a TG member who is not at the site

How has BetterPlace resolved the spoofing problem?

The earlier version of Attend lacked a liveness check, which meant attendance could be marked inappropriately: TGs used images of their coworkers to mark attendance. To solve this, we introduced a liveness check, wherein attendance is only recorded when the person blinks, so static images can no longer be used for misrepresentation. Following this development, we faced another challenge: TGs began marking attendance with pre-recorded videos of their colleagues blinking. This demanded a model that could classify faces as spoof or non-spoof during the attendance marking process or shortly after, without relying on a blink.
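As a side note, one common way to implement a blink-based liveness check is to track the eye aspect ratio (EAR) across consecutive frames and treat a short dip below a threshold as a blink. The sketch below assumes eye landmarks are already available from a face detector; Attend uses ML Kit in production, and the thresholds here are illustrative rather than the values used in the app.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """eye: six (x, y) landmarks around one eye, ordered as in the common
    68-point layout. EAR drops sharply when the eye closes."""
    eye = np.asarray(eye, dtype=np.float32)
    vertical = np.linalg.norm(eye[1] - eye[5]) + np.linalg.norm(eye[2] - eye[4])
    horizontal = np.linalg.norm(eye[0] - eye[3])
    return vertical / (2.0 * horizontal + 1e-9)

def blink_detected(ear_per_frame, closed_thresh=0.2, min_closed_frames=2):
    """Treat a short run of frames with EAR below the threshold as one blink."""
    run = 0
    for ear in ear_per_frame:
        run = run + 1 if ear < closed_thresh else 0
        if run >= min_closed_frames:
            return True
    return False
```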

Initial Approach and Quick Experimentation

To begin with, we did not have any image data that could help us understand whether there is a way to tell spoof and real faces apart once captured. All we had was a couple of videos, like the example below, showing the process of spoofing.

Taking a cue from the video, we began by assuming that if a video is being used, a portion of the phone might appear in the captured image, and hence an out-of-the-box object detection model trained to identify phones [1] should be able to flag such cases. However, out-of-the-box object detection models did not help: although their overall accuracy was high, that was largely due to the rarity of spoofing events. They had high False Positive Rates (FPR) and False Negative Rates (FNR), which reduced trust in the model and rendered it unusable. The reason for this failure was the gap between the images these object detection models were trained on and the images we see in reality. In most of the training data behind these models, the entire phone is visible [2], whereas that was not the case in the manually identified spoofed videos we examined. Panel 1 shows a few manually identified spoof images. The first was identified as spoof due to the time stamp on the image and the black panels at the top and bottom; the second due to the glare on the image; the third had a partially visible phone.
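For context, the out-of-the-box approach boils down to running a COCO-trained detector and flagging any image in which a phone is found. The TF Hub checkpoint and score threshold below are illustrative, not the exact model we evaluated.

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from PIL import Image

# A COCO-trained SSD detector from TF Hub; the exact checkpoint we evaluated may differ.
detector = hub.load("https://tfhub.dev/tensorflow/ssd_mobilenet_v2/2")
CELL_PHONE_CLASS_ID = 77  # "cell phone" in the standard COCO label map

def phone_visible(image_path, min_score=0.5):
    """Return True if the detector sees a phone anywhere in the image."""
    image = np.array(Image.open(image_path).convert("RGB"))
    batch = tf.convert_to_tensor(image[np.newaxis, ...], dtype=tf.uint8)
    outputs = detector(batch)
    classes = outputs["detection_classes"][0].numpy().astype(int)
    scores = outputs["detection_scores"][0].numpy()
    return bool(np.any((classes == CELL_PHONE_CLASS_ID) & (scores >= min_score)))
```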

Panel 1: Examples of manually identified spoof images, which form part of our golden test data. We make sure that any model we develop performs well on this ever-expanding dataset.

After this analysis, we manually curated a dataset (our initial internal dataset) of images in which only a portion of the phone is visible and attempted to fine-tune a model using the TensorFlow Object Detection API [1]. This model did not perform well in the field (see Table 2), as TGs frequently came too close to the screen, obscuring the entire phone (second image in Panel 1).
We also tested models described in the literature [3]. Most of these use SVMs or other classifiers trained on vector representations of images in color spaces other than RGB, such as HSV and LUV. These models performed admirably on the open source data they were trained on; however, they failed completely on our manually handpicked test data — the golden test data.
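A representative version of this literature approach, sketched below, converts a face crop into HSV and LUV, builds per-channel color histograms and feeds them to a Support Vector Classifier. The feature and classifier settings are illustrative assumptions, not a reproduction of any specific paper.

```python
import cv2
import numpy as np
from sklearn.svm import SVC

def color_histogram_features(face_bgr, bins=16):
    """Concatenated per-channel histograms of the face crop in HSV and LUV."""
    feats = []
    for code in (cv2.COLOR_BGR2HSV, cv2.COLOR_BGR2LUV):
        converted = cv2.cvtColor(face_bgr, code)
        for channel in range(3):
            hist = cv2.calcHist([converted], [channel], None, [bins], [0, 256])
            feats.append(cv2.normalize(hist, hist).flatten())
    return np.concatenate(feats)

def train_spoof_svc(face_crops, labels):
    """face_crops: BGR images (as read by cv2.imread); labels: 1 = spoof, 0 = real."""
    features = np.stack([color_histogram_features(f) for f in face_crops])
    classifier = SVC(kernel="rbf", probability=True)
    classifier.fit(features, labels)
    return classifier
```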

Large Scale Data Analysis

While creating the model, we manually analyzed close to 10,000 photographs every day to see whether there were any differences between spoof and genuine images. We discovered that a degree of blurriness, reflection, and other high-level traits distinguish regular and spoof photos. The top row of Panel 2 below illustrates some of these traits. For example, the first image contains reflection, which is common when a photo is taken of an image displayed on a phone; the second image has both reflection and blurriness; and the third image has vertical streaks, which is also typical when a photo of an on-screen image is taken.

Panel 2: Top row of the panel shows sample spoof faces, while bottom row shows sample real faces

The bottom row of Panel 2 shows images of regular faces. The first image is not very sharp and has a vertical streak of light, but it has no reflections; the second image is very sharp and has neither reflections nor streaks of light; the third image is sharp and bright as well, and lacks the traits observed in spoof images.

As spoofs are rare events, we had manually identified only about 40 spoofs in a review of over 300,000 images. Since our attempt to solve this via object detection had unfortunately failed, we had to curate a dataset with a significant number of spoofs alongside real images in order to train an ML classifier.

Constructing a dataset

We already had regular images available for model training. To create the spoof set, we manually uploaded these images to a phone and re-captured them from the screen while varying the screen brightness, the background lighting, and the angle at which the device was held, so as to vary the reflections. We then extracted faces from the spoof and normal images using MTCNN [6], and our models were trained on these faces. We also used an open source dataset along with our custom set. The table below shows the number of faces present in the various datasets we created and used.

Panel 3: Illustrates the process of creating the spoof image set from real images
Table 1: Distribution of spoof and real faces in the various datasets used for experimentation
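Face extraction for both sets followed the usual MTCNN flow; a minimal sketch using the open source MTCNN package [6] is shown below. The confidence threshold is illustrative.

```python
import cv2
from mtcnn import MTCNN  # https://github.com/ipazc/mtcnn

detector = MTCNN()

def extract_faces(image_path, min_confidence=0.9):
    """Detect faces with MTCNN and return the cropped face regions (RGB)."""
    rgb = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    crops = []
    for detection in detector.detect_faces(rgb):
        if detection["confidence"] < min_confidence:
            continue
        x, y, w, h = detection["box"]
        x, y = max(x, 0), max(y, 0)  # MTCNN can return slightly negative coordinates
        crops.append(rgb[y:y + h, x:x + w])
    return crops
```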

On the assumption that spoof images and faces would carry little depth information, since spoof images are captured from a phone screen, we also created a depth map based dataset using Intel MiDaS [4]; an example is provided in Panel 5.

Panel 5: The top two rows show the depth maps of spoof and real faces as derived using Intel MiDaS. Visually, the depth maps of spoof and real faces show no difference. However, if we create depth maps of the complete spoof and real images, as illustrated in the bottom two rows, we see that spoof images lack depth, which is not the case with real images, thereby showing that our assumption holds.
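Depth maps like those in Panel 5 can be derived with a few lines of MiDaS [4]. The sketch below uses the small MiDaS model via torch.hub, which is an assumption on our part; the exact variant and pre/post-processing we used may differ.

```python
import cv2
import torch

# MiDaS via torch.hub (https://github.com/isl-org/MiDaS); the small model trades
# accuracy for speed, which matters at the volumes we process.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.small_transform

def depth_map(image_path):
    """Return a relative (inverse) depth map at the input image's resolution."""
    rgb = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    batch = transform(rgb)
    with torch.no_grad():
        prediction = midas(batch)
        prediction = torch.nn.functional.interpolate(
            prediction.unsqueeze(1), size=rgb.shape[:2],
            mode="bicubic", align_corners=False,
        ).squeeze()
    return prediction.cpu().numpy()
```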

Taking hints from the literature, we experimented with converting faces into other color spaces such as LUV, YCbCr and HSV [5], as illustrated in Panel 6.

Panel 6: The top row shows real faces in different color spaces, while the bottom row shows spoof faces in the same color spaces. From left to right, the color spaces are RGB, HSV, LUV and YCbCr, followed by the depth map. The difference between real and spoof faces may be minuscule in RGB space, but it is enhanced in other color spaces.

Creating a Machine Learning Model

To ensure faster attendance marking and lower bandwidth usage, we only store the faces of the people whose attendance is being marked, so all our experiments are planned around faces rather than full images. Another reason for using faces is that in group mode, multiple TGs can mark attendance at once, whereas a model built on the full image would mark either all TGs as spoof or none.

We used MTCNN [6] to extract faces from the manually curated spoof and real data. The extracted faces were divided into train, validation and test sets in the ratio 0.7:0.1:0.2. Any model that identifies all spoofs correctly on this dataset is considered deployable, provided it has below 3% FPR and FNR on the test data, which is our initial target metric.
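Since FPR and FNR drive the deployability decision, the short helper below spells out the computation, with spoof treated as the positive class (so a false positive is a real face wrongly flagged as spoof, matching how false positives are discussed later in this post).

```python
import numpy as np

def fpr_fnr(y_true, y_pred):
    """y_true, y_pred: 1 = spoof (positive class), 0 = real.
    FPR = share of real faces wrongly flagged as spoof;
    FNR = share of spoofs that are missed."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tp = np.sum((y_pred == 1) & (y_true == 1))
    return fp / max(fp + tn, 1), fn / max(fn + tp, 1)
```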

At first, we tried VGG-Face [3] with different color space embeddings and a Support Vector Classifier (SVC). This model had relatively high FPR and FNR, around 4.5% and 6% respectively. We then tried the depth-based model [4] along with VGG-19 and saw lower FPR and FNR; however, it failed to identify many obvious spoofs in the golden test data. At the same time, deriving depth maps using Intel MiDaS was very slow and not suited to the scale at which we operate during peak hours. We moved on to a combination of VGG-Face and the depth-based model, which did not show much improvement over its constituents. We then fine-tuned the VGG-19 model, which performed well on test data but not on the golden test data. Finally, we trained the complete VGG-19 model using pre-trained parameters as a starting point, yielding a False Positive Rate (FPR) of 1.2 percent and a False Negative Rate (FNR) of 1.5 percent, comfortably below our goal of less than 2% FPR and FNR. On our golden test data, it detected all but one spoof. However, with roughly 30k attendances marked daily, an FPR of 1.2 percent translates to around 400 false positives per day, which is enough to erode trust in the system.

To reduce FPR and FNR further while increasing inference speed, we tried MobileNet v2. We achieved an FNR of 0.2 percent and an FPR of 0.18 percent, which amounts to roughly 60 false positives on a similar corpus of 30k attendances.
The video below illustrates the model working in real time, and the collage below shows the performance of the MobileNet v2 model on our golden test data.

Panel 8: Illustrates the performance of MobileNet v2 on Golden Test Data. All 30 spoofs are identified correctly
Table 2: Performance of various models in detecting spoofs
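For reference, a MobileNet v2 spoof classifier of the kind summarized in Table 2 can be set up as sketched below. The head architecture, preprocessing and hyperparameters are illustrative assumptions, not the deployed configuration.

```python
import tensorflow as tf

def build_spoof_classifier(input_shape=(224, 224, 3)):
    """MobileNet v2 backbone with a small binary head: 1 = spoof, 0 = real."""
    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights="imagenet")
    base.trainable = True  # fine-tune the whole backbone (an assumption here)
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1.0),  # MobileNetV2 expects [-1, 1]
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.FalsePositives(),
                           tf.keras.metrics.FalseNegatives()])
    return model
```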

The images in Panel 6 show that spoof and real images are distinguishable in color spaces other than RGB. However, models trained on images in different color spaces did not perform better than those trained on RGB images, so we did not document their results. This can be attributed to the fact that CNNs, the basic unit of all the models we experimented with, can learn filters that effectively convert images from RGB into the more discriminative HSV, LUV or YCbCr color spaces.

Deployment

A MobileNet v2 based model has been deployed in batch processing mode: it classifies faces as spoof or non-spoof within a couple of minutes of attendance being marked, and sends suspected spoofs to site managers for verification and regularization of attendance. During the initial deployment we monitored more than 95 percent of the sites and saw FPR and FNR in the wild similar to our test results. We process around 20k images every day. The model does not perform well when the device is positioned behind a glass wall or when the illumination is dim; this finding helped us update our device installation instructions.
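Conceptually, the batch step boils down to scoring each extracted face and routing anything above a spoof threshold to the site manager. The sketch below is a simplified stand-in for the production pipeline, with placeholder file names and threshold.

```python
import numpy as np
import tensorflow as tf

# Placeholder model path and threshold; not the production pipeline.
model = tf.keras.models.load_model("spoof_mobilenetv2.h5")

def score_batch(face_crops, spoof_threshold=0.5):
    """face_crops: list of 224x224x3 uint8 face images from recent attendance marks.
    Returns a boolean array; True means "route to the site manager for verification"."""
    batch = np.stack(face_crops).astype("float32")
    spoof_probs = model.predict(batch, verbose=0).ravel()
    return spoof_probs >= spoof_threshold
```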

Next Steps:

To reduce this degradation in performance, we have started capturing full images along with the faces, to make monitoring easier and to better understand why some normal images are tagged as spoof. Once the monitoring pipeline is set up, we plan to create a more diverse dataset to reduce the difference between training images and the images actually captured in the field. We also plan to make spoof detection more real time by deploying it on the device.
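One plausible route to on-device inference, shown below, is converting the trained Keras model to TensorFlow Lite with default optimizations. This is a sketch of the idea, not our finalized deployment path, and the file names are placeholders.

```python
import tensorflow as tf

# Placeholder file names; a sketch of converting the classifier for on-device use.
model = tf.keras.models.load_model("spoof_mobilenetv2.h5")

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
tflite_model = converter.convert()

with open("spoof_mobilenetv2.tflite", "wb") as f:
    f.write(tflite_model)
```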

Reference:

  1. TensorFlow Object Detection API tutorial: https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/
  2. Open Images dataset visualizer: https://storage.googleapis.com/openimages/web/visualizer/index.html?set=train&type=segmentation&r=false&c=%2Fm%2F050k8
  3. Face Spoof Detection Using VGG-Face Architecture
  4. MiDaS (Intel ISL): https://github.com/isl-org/MiDaS
  5. Face Anti-Spoofing Using Patch and Depth-Based CNNs
  6. MTCNN: https://github.com/ipazc/mtcnn
