Video KYC — Part II

Face detection

Elina Maliarsky
Oct 24, 2022 · 10 min read

“Know Your Customer” (KYC) refers to the process of verifying the identity of customers, either before or at the start of the business relationship, based on documentary evidence from an authoritative source. Video-based KYC means that users can complete remote KYC from anywhere via a video call. The process starts when the customer displays their face in front of a smartphone camera or a computer webcam, while the system performs identity checks: it extracts facial biometrics, runs a liveness check to confirm that a real person is present, and authenticates the documents (which are also presented in front of the camera later) in order to automatically complete the KYC verification requirements.

In this series of articles, I show how the different components of a generic video-KYC system can be implemented. In the first article of the series, I demonstrated how to access a webcam and capture its video stream. This article is dedicated to face detection, the process of examining a photo or a video frame in order to distinguish faces from other objects in the background. First I'll present a number of face detectors, from classical to DNN-based, along with the libraries that implement them; then we'll discuss how to choose a suitable detector; and finally, we'll see how to apply a face detector to the frames captured by the webcam.

OpenCV Haar-Cascade face detector

Haar Cascade is a machine learning-based approach to object detection in which many positive images (for face detection, images of faces) and negative images (images without faces) are used to train a classifier. The algorithm uses the edge and line detection features proposed by Paul Viola and Michael Jones in their 2001 paper, “Rapid Object Detection using a Boosted Cascade of Simple Features”.

Haar cascade classifiers are not as accurate as more “state-of-the-art” algorithms (HOG + Linear SVM, SSDs, Faster R-CNN, YOLO, etc.), but they are extremely fast, which keeps them relevant and useful today, particularly when computational resources are limited.

OpenCV comes with pre-trained Haar-Cascade models that can be used for face detection on an image or real-time video. The pre-trained models are located in the data folder in the OpenCV installation and the one I’m going to use is haarcascade_frontalface_default.xml. As its name implies, it is trained to detect frontally-viewed faces.

If you haven’t installed OpenCV yet, visit this page in order to get instructions on how to do it.
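To make sure everything is in place, you can install the package with pip install opencv-python and run a quick import check (this check is my addition, not part of the original instructions):

import cv2
print(cv2.__version__)  # the DNN examples further below need at least OpenCV 3.3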

Let's see how to locate faces on the image and surround them with a border.

First, we need to define the classifier object

faceCascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')

Then we call the detectMultiScale method. If the frame came from the OpenCV reader, it is in BGR order, so don’t forget to convert it to RGB before displaying it with matplotlib

rgb = cv2.cvtColor(resultImg, cv2.COLOR_BGR2RGB)
faces = faceCascade.detectMultiScale(
    rgb,
    scaleFactor=1.3,    # how much the image is scaled down at each step of the search
    minNeighbors=2,     # how many overlapping candidate detections are needed to keep a face
    minSize=(100, 100)) # ignore candidate faces smaller than 100x100 pixels

Now we need to get the coordinates of the faces and draw the surrounding rectangles. The OpenCV Haar cascade implementation returns the top-left coordinates, the width, and the height, which lets us calculate the bottom-right coordinates and draw the rectangle

for (x, y, w, h) in faces:
    x1, y1, x2, y2 = x, y, x + w, y + h
    faceBoxes.append([x1, y1, x2, y2])
    cv2.rectangle(rgb, (x1, y1), (x2, y2), (255, 0, 0), int(round(frameHeight/150)), 8)

Putting it all together from the moment of reading the image:

def CC_highlight_face(frame):
    faceBoxes = []
    resultImg = frame.copy()
    frameHeight = resultImg.shape[0]
    frameWidth = resultImg.shape[1]
    faceCascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
    rgb = cv2.cvtColor(resultImg, cv2.COLOR_BGR2RGB)
    faces = faceCascade.detectMultiScale(
        rgb,
        scaleFactor=1.3,
        minNeighbors=2,
        minSize=(100, 100))

    for (x, y, w, h) in faces:
        x1, y1, x2, y2 = x, y, x + w, y + h
        faceBoxes.append([x1, y1, x2, y2])
        cv2.rectangle(rgb, (x1, y1), (x2, y2), (255, 0, 0), int(round(frameHeight/150)), 8)

    return rgb, faceBoxes

img = cv2.imread('data/dw_10_2.jpg')
result, boxes = CC_highlight_face(img)
plt.imshow(result)

And here is the result. We can see that both faces are located correctly and accurately.

But this specific model locates only clearly defined faces in a frontal position:
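OpenCV also ships a haarcascade_profileface.xml cascade in the same data folder. A naive way to catch profile faces as well (a sketch of my own, not part of the original pipeline) is to run both cascades and concatenate the detections:

profileCascade = cv2.CascadeClassifier('haarcascade_profileface.xml')
frontal = faceCascade.detectMultiScale(rgb, scaleFactor=1.3, minNeighbors=2, minSize=(100, 100))
profile = profileCascade.detectMultiScale(rgb, scaleFactor=1.3, minNeighbors=2, minSize=(100, 100))
allFaces = list(frontal) + list(profile)  # naive merge; overlapping boxes are not de-duplicated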

Dlib Histogram of Oriented Gradients (HOG) + Linear SVM face detector

As the name suggests, it uses a Histogram of Oriented Gradients (HOG) feature descriptor and a Linear SVM classifier for face detection. The method was originally described in the paper by Dalal and Triggs.

The idea behind HOG is as follows: first, the distributions (histograms) of the gradient directions (oriented gradients) over the image are calculated; those histograms are the features that are fed into a classification algorithm such as an SVM, which decides whether an object (for example, a face) is present in a region.

In the original paper, the process was applied to human body detection, and the detection chain was the following:

Diagram from the paper by Dalal and Triggs
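As a side illustration (my own addition, not from the original article), OpenCV's HOGDescriptor shows what such a feature vector looks like for a single detection window:

import cv2
import numpy as np

hog = cv2.HOGDescriptor()                     # default 64x128 detection window
window = np.zeros((128, 64), dtype=np.uint8)  # placeholder grayscale patch
features = hog.compute(window)                # flattened gradient-orientation histograms
print(features.shape)                         # several thousand values per window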

The implementation of HOG+SVM is offered by the Dlib library. Dlib was originally introduced by Davis King as a C++ library for machine learning, but a Python API was added later. If you don’t have it, look into the installation guide for Windows 10. If pip/conda installation does not work, you can install the package directly from a wheel; wheels can easily be googled, and here, for example, is the link to wheel files for different Python versions on Windows 10.
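Once Dlib is installed, a quick import check (my addition) confirms the setup:

import dlib
print(dlib.__version__)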

Dlib has a straightforward function that returns the HOG face detector: dlib.get_frontal_face_detector(). For each detected face we retrieve the top-left and bottom-right coordinates in this way:

x1=face.left()
y1=face.top()
x2=face.right()
y2=face.bottom()

Here is the whole piece of code

def DLIB_highlight_face(frame):
    faceBoxes = []
    resultImg = frame.copy()
    frameHeight = resultImg.shape[0]
    frameWidth = resultImg.shape[1]

    detector = dlib.get_frontal_face_detector()
    faces = detector(resultImg)
    for face in faces:
        x1 = face.left()
        y1 = face.top()
        x2 = face.right()
        y2 = face.bottom()

        faceBoxes.append([x1, y1, x2, y2])
        cv2.rectangle(resultImg, (x1, y1), (x2, y2), (255, 0, 0), int(round(frameHeight/150)), 8)

    return resultImg, faceBoxes

img = cv2.imread('data/dw_10_2.jpg')
result, boxes = DLIB_highlight_face(img)
plt.imshow(result)

The results are rather accurate.

Although the method is intended for frontal face detection, it also locates half-profile faces, unlike the frontal Haar cascade.

OpenCV deep learning-based face detector

In 2017 OpenCV 3.3 was officially released, bringing with it a highly improved deep learning module that supports a number of deep learning frameworks, including Caffe, TensorFlow, Torch/PyTorch, and Darknet.

While we cannot train deep learning models using OpenCV, this does allow us to take our models trained using dedicated deep learning libraries/tools and then efficiently use them directly inside our scripts.

OpenCV has its own accurate deep learning-based face detector included in the official release. It is based on the Single Shot MultiBox Detector (SSD) and uses a ResNet-10 architecture as the backbone. The model was trained on images available from the web, but the source is not disclosed. The TensorFlow version comes with .pb and .pbtxt files: the .pb file is a protocol buffer file in binary format that holds the graph definition and the trained weights of the model, and the .pbtxt file holds the same in text format. The files can be found here. For simplicity, I’ve put both files in the root folder with my notebook.

faceProto = "opencv_face_detector.pbtxt"
faceModel = "opencv_face_detector_uint8.pb"
faceNet = cv2.dnn.readNet(faceModel, faceProto)

The blobFromImage method converts an image into the input format the network expects. We need to supply a width and height compatible with the DNN architecture; in the case of the OpenCV DNN face detector, both are 300.

blob=cv2.dnn.blobFromImage(rgb, 1.0, (300, 300), [104, 117, 123], True, False)

The setInput(blob) function sets the blob as the input to the DNN, and the forward method “runs” the network.

faceNet.setInput(blob)
detections=faceNet.forward()

The results come in a rather complicated format: a 4-dimensional array. If we dive inside it, we see that we actually need to treat it as a matrix with n rows, where n is the number of detected objects. We filter the objects by a confidence threshold and retrieve the coordinates of the detected objects in order to draw the surrounding borders.

for i in range(detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    if confidence > conf_threshold:
        x1 = int(detections[0, 0, i, 3] * frameWidth)
        y1 = int(detections[0, 0, i, 4] * frameHeight)
        x2 = int(detections[0, 0, i, 5] * frameWidth)
        y2 = int(detections[0, 0, i, 6] * frameHeight)
        faceBoxes.append([x1, y1, x2, y2])
        cv2.rectangle(rgb, (x1, y1), (x2, y2), (0, 255, 0), int(round(frameHeight/150)), 8)
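Wrapped into a helper analogous to the previous ones (this wrapper is a sketch of my own; the name DNN_highlight_face and the 0.7 threshold are my choices, the detection logic is the code above):

def DNN_highlight_face(frame, conf_threshold=0.7):
    faceBoxes = []
    resultImg = frame.copy()
    frameHeight, frameWidth = resultImg.shape[:2]
    rgb = cv2.cvtColor(resultImg, cv2.COLOR_BGR2RGB)

    blob = cv2.dnn.blobFromImage(rgb, 1.0, (300, 300), [104, 117, 123], True, False)
    faceNet.setInput(blob)
    detections = faceNet.forward()

    for i in range(detections.shape[2]):
        confidence = detections[0, 0, i, 2]
        if confidence > conf_threshold:
            x1 = int(detections[0, 0, i, 3] * frameWidth)
            y1 = int(detections[0, 0, i, 4] * frameHeight)
            x2 = int(detections[0, 0, i, 5] * frameWidth)
            y2 = int(detections[0, 0, i, 6] * frameHeight)
            faceBoxes.append([x1, y1, x2, y2])
            cv2.rectangle(rgb, (x1, y1), (x2, y2), (0, 255, 0), int(round(frameHeight/150)), 8)

    return rgb, faceBoxes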

Here are the results

Yolo face detector and OpenCV dnn module

YOLOv3 (You Only Look Once) is a state-of-the-art, real-time object detection algorithm. The published model recognizes 80 different object classes in images and videos. For more details, you can refer to this paper.

Yoloface is a face detector based on the YOLOv3 architecture. For face detection, you should download the pre-trained YOLOv3 weights file, trained on the WIDER FACE: A Face Detection Benchmark dataset, from this link. In addition, you’ll need a .cfg file (in text format). I’ve put both the weights and the .cfg file into the root folder.

Let’s initialize the DNN object. As you can see, Yoloface has been trained with Darknet, which is supported by OpenCV

faceYolocfg = "yolov3.cfg"
faceYoloWeights = "face.weights"
faceYoloNet = cv2.dnn.readNetFromDarknet(faceYolocfg, faceYoloWeights)

Let’s prepare the blob for the DNN. YOLO requires the width and height to be 416.

blob=cv2.dnn.blobFromImage(rgb, 1.0, (416, 416), [104, 117, 123], True, False)

Let's run the network

faceYoloNet.setInput(blob)
detections = faceYoloNet.forward()

And finally, we need to retrieve the coordinates and draw the border. The code is the same as for the OpenCV DNN, except for one little detail: the returned x coordinate looks like it belongs to the center of the box rather than to its edge, so it should be adjusted accordingly.

x2=int(detections[0,0,i,5]*frame_width) 
x2 += int((x2-x1)/2)

And here are the results:

The Dlib CNN Face Detector

Dlib’s HOG + Linear SVM model for face detection performs well on clear, front-facing faces, but can fail when the orientation of the face changes, when the faces are shaded, or when they appear very small or unclear in the image/video frame.

Let’s try to address this issue by using Dlib’s CNN model for face detection. It is a pre-trained model that is loaded when our script executes.

One of the major benefits of this detector is that it can use the computational power of a GPU if one is available. This makes the detection pipeline faster and takes the load off the CPU, which can then focus on other tasks. If a GPU is not present, the model will use the CPU for all the necessary processing by default, but the process becomes really slow.
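A quick way to check (my addition, not from the article) whether your Dlib build can actually use the GPU:

import dlib
print("Dlib compiled with CUDA support:", dlib.DLIB_USE_CUDA)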

Now, for this, we also need a pre-trained face detection model. You can find it in the official Dlib models GitHub repository.

Now we can initialize the detector with this model by calling the function cnn_face_detection_model_v1

detector = dlib.cnn_face_detection_model_v1('data/mmod_human_face_detector.dat')

The rest of the code is almost the same as in the Dlib HOG+SVM script; the only difference is that the CNN detector returns objects whose bounding rectangle is accessed through the rect attribute.

def DLIB_CNN_highlight_face(frame):
    faceBoxes = []
    resultImg = frame.copy()
    frameHeight = resultImg.shape[0]
    frameWidth = resultImg.shape[1]

    detector = dlib.cnn_face_detection_model_v1('data/mmod_human_face_detector.dat')
    faces = detector(resultImg)
    for face in faces:
        # the CNN detector wraps each detection, so the coordinates live on the rect attribute
        x1 = face.rect.left()
        y1 = face.rect.top()
        x2 = face.rect.right()
        y2 = face.rect.bottom()

        faceBoxes.append([x1, y1, x2, y2])
        cv2.rectangle(resultImg, (x1, y1), (x2, y2), (255, 0, 0), int(round(frameHeight/150)), 8)

    return resultImg, faceBoxes

img = cv2.imread('data/dw_10_2.jpg')
result, boxes = DLIB_CNN_highlight_face(img)
plt.imshow(result)

And the results are:

And on images with not only frontal faces:

I don’t have a GPU on my machine, so the first run was very long, about 10–12 seconds. This means that if we want to include this algorithm in our pipeline and we don’t have a GPU, we need to load the model before starting face detection.
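A minimal sketch of that idea (the structure below is mine, not from the original code): create the detector once at start-up instead of inside the per-frame function, so the first frame of the live stream does not pay the loading cost.

# load the CNN detector once, when the application starts
cnn_detector = dlib.cnn_face_detection_model_v1('data/mmod_human_face_detector.dat')

def detect_faces_cnn(frame):
    # reuse the already-loaded detector for every frame
    return cnn_detector(frame)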

What to choose

We’ve seen a number of face-detection algorithms along with the libraries that implement them. Which algorithm/library should we choose? When we compare the methods, we rely on speed and accuracy, and those two factors are problem-dependent. I’m working on the video-KYC pipeline and I have two entry points for face detection: detection of faces on the live video stream and, later, detection of faces on documents. For the first problem, the face detector should be very fast but still effective, because a live person usually moves and turns their head. For the second problem, speed is not so critical, but the face on a document is usually clear and face-forward, so no heavy artillery is required. Also, don’t forget that face detection is not enough: faces should also be recognized. For now, I’ll choose the Dlib HOG+Linear SVM face detector for the whole pipeline: it is fast and works well on slightly turned faces. And (spoilers!!!!) Dlib HOG+Linear SVM is the default face detector of the face-recognition library which, as its name implies, does face recognition; I’ll talk about it in my next article.
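As a small teaser (my own illustration; the image path is just an example), the face_recognition library exposes that same HOG detector through a one-liner:

import face_recognition

image = face_recognition.load_image_file('data/dw_10_2.jpg')
# model="hog" uses Dlib's HOG+Linear SVM detector under the hood;
# boxes come back as (top, right, bottom, left) tuples
boxes = face_recognition.face_locations(image, model="hog")
print(boxes)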

Live face detection on the video stream

Now let's see how the Dlib HOG+Linear SVM detector works on a live video stream. Here is the code that enables the camera, performs face detection, and writes the video to disk:

# imports
import cv2
from matplotlib import pyplot as plt
import numpy as np

# create the video capture and video writer objects
cap = cv2.VideoCapture(0)
fourcc = cv2.VideoWriter_fourcc(*'XVID')
out = cv2.VideoWriter('demo.avi', fourcc, 10.0, (640, 480))

frames = []
while True:
    result, frame = cap.read()
    if not result:  # stop if the camera did not return a frame
        break
    frame, boxes = DLIB_highlight_face(frame)
    frames.append(frame)
    out.write(frame)
    cv2.imshow("frame", frame)  # this will open an independent window
    if cv2.waitKey(1) & 0xFF == ord('q'):  # quit when 'q' is pressed
        break

# close the already opened camera
cap.release()
out.release()
# destroy all the windows
cv2.destroyAllWindows()

And here is the recorded live demo. It is not so smooth, because Dlib HOG+SVM is not the fastest face detector and may be replaced later, but it is good enough for now.
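If you want to quantify “not so smooth” (a rough measurement of my own, not from the article), time a single detector call on a captured frame:

import time

start = time.time()
_, _ = DLIB_highlight_face(frame)  # any captured frame will do
print("one frame took %.3f seconds" % (time.time() - start))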

Summary

In this article, we’ve become acquainted with a number of face-detection algorithms and the libraries that implement them, learned that the choice of algorithm/library depends mostly on the problem and the resources we have, and built the second component of the video-KYC pipeline. The next post of the series will be about face recognition.

Actimize

NICE Actimize leverages machine learning and AI to detect and prevent financial crimes across the financial services industry, including some of the largest global FIs. Our AI and analytics teams create models to detect anomalous activities associated with AML, Fraud, and market abuse.

The NICE Actimize KYC/CDD solution uses the latest technological innovations to provide complete customer lifecycle risk coverage — accounting for customer onboarding, ongoing due diligence, and enhanced due diligence (EDD) processes.
