PyTorch and Face Liveness Detection: A Comprehensive Guide

Hazqeel Afyq
5 min read · Jan 14, 2024


While the world increasingly turns its focus to the wonders of facial recognition technology, an equally critical aspect often slips under the radar: face liveness detection. It is not always required for facial recognition, but in situations where security and authenticity are crucial, face liveness detection is implemented.

A common example of face liveness is in eKYC (electronic Know Your Customer) processes, such as when creating an account for an e-wallet or online trading platform. The process might seem straightforward: you simply capture your face. Behind the scenes, however, that step is an application of face liveness detection. Face liveness can be divided into two types: active and passive.

Active Face Liveness:

  • Requires the user to perform an action, such as blinking or smiling.
  • More challenging to spoof.
  • Longer completion time.
  • Could be problematic for users with disabilities affecting smiling or blinking.

Passive Face Liveness:

  • Operates using just the user’s image.
  • Easier to spoof.
  • Faster to complete.
  • More inclusive and accessible.

For this walkthrough, we will embark on creating a passive face liveness detection system using PyTorch, complemented by an example dataset available online. Our journey will unfold through the following steps:

  1. Research Papers Overview.
  2. Data Acquisition.
  3. Train Teacher Model.
  4. Pre-train Student Model.
  5. Distillation.

Research Papers Overview

The model we will use is a Vision Transformer (ViT) combined with ideas from additional architectures and trained using SimMIM and Feature Distillation.

ViT:

ViT architecture from An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

ViT divides an image into patches of a predefined size and applies a linear embedding transformation to each patch. Additionally, it incorporates position embeddings to represent the spatial information of these patches. Finally, it feeds the resulting sequence of embedded vectors into a conventional Transformer encoder.
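
Below is a minimal sketch of that pipeline in PyTorch. The patch size, embedding dimension, and depth are illustrative placeholders, not the configuration we will ultimately train:

import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style model: patchify -> linear embedding -> position embeddings -> Transformer encoder."""
    def __init__(self, img_size=224, patch_size=16, dim=384, depth=6, heads=6, num_classes=2):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Linear patch embedding implemented as a strided convolution.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=dim * 4,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.patch_embed(x)                # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)       # (B, num_patches, dim)
        x = x + self.pos_embed                 # add learnable position embeddings
        x = self.encoder(x)                    # standard Transformer encoder
        return self.head(x.mean(dim=1))        # mean-pool patches -> live/spoof logits

logits = TinyViT()(torch.randn(2, 3, 224, 224))   # torch.Size([2, 2])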

Shifted Window Approach from Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

We intend to utilize a ViT that incorporates elements of the Swin Transformer architecture, namely relative position bias and window-based self-attention. The relative position bias encodes precise spatial relationships that are sensitive to the content of the image, while window-based self-attention focuses on local features rather than global ones. As a result, the model has to combine local attention and the relative position bias to build a good understanding of the whole image. The code we plan to use is a modified version based on this repository.
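
To make the idea concrete, here is a simplified window-attention layer with a learnable relative position bias, in the spirit of Swin. The window size and head count are illustrative, and the real implementation we use comes from the modified repository mentioned above:

import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Self-attention restricted to a (ws x ws) window, with a learnable relative position bias."""
    def __init__(self, dim, window_size=7, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        # One bias per head for every possible relative offset inside the window.
        self.rel_bias = nn.Parameter(torch.zeros((2 * window_size - 1) ** 2, num_heads))
        coords = torch.stack(torch.meshgrid(torch.arange(window_size),
                                            torch.arange(window_size), indexing="ij"))
        coords = coords.flatten(1)                              # (2, ws*ws)
        rel = coords[:, :, None] - coords[:, None, :]           # pairwise offsets
        rel = rel.permute(1, 2, 0) + window_size - 1            # shift offsets to be >= 0
        index = rel[..., 0] * (2 * window_size - 1) + rel[..., 1]
        self.register_buffer("rel_index", index)                # (ws*ws, ws*ws)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                       # x: (num_windows*B, ws*ws, dim)
        B_, N, C = x.shape
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        attn = (q * self.scale) @ k.transpose(-2, -1)           # (B_, heads, N, N)
        bias = self.rel_bias[self.rel_index.view(-1)].view(N, N, -1).permute(2, 0, 1)
        attn = (attn + bias.unsqueeze(0)).softmax(dim=-1)       # add bias before softmax
        out = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        return self.proj(out)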

GFNet Architecture from Global Filter Networks for Image Classification

We will also experiment with the Global Filter used in GFNet. A similar idea appears in SpectFormer, which combines Spectral Blocks and Attention Blocks. For our experiment, we will create a variation that combines both Attention and Spectral elements within each block. We have chosen to use ‘Spectral’ instead of ‘Global Filter’ in our terminology because ‘Spectral’ sounds cooler.
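
As a rough sketch, the global filter (our ‘Spectral’ layer) transforms the patch grid with a 2D FFT, multiplies it by learnable complex weights, and transforms it back. The block pairing it with attention below is just one possible ordering of our variation, with illustrative dimensions:

import torch
import torch.nn as nn

class SpectralFilter(nn.Module):
    """GFNet-style global filter: mix tokens by multiplying their 2D Fourier
    transform with learnable complex-valued weights."""
    def __init__(self, dim, h=14, w=14):
        super().__init__()
        self.h, self.w = h, w
        # rfft2 keeps w//2 + 1 frequencies along the last spatial axis; store (real, imag) pairs.
        self.weight = nn.Parameter(torch.randn(h, w // 2 + 1, dim, 2) * 0.02)

    def forward(self, x):                               # x: (B, N, dim) with N = h*w
        B, N, C = x.shape
        x = x.view(B, self.h, self.w, C)
        freq = torch.fft.rfft2(x, dim=(1, 2), norm="ortho")
        freq = freq * torch.view_as_complex(self.weight)
        x = torch.fft.irfft2(freq, s=(self.h, self.w), dim=(1, 2), norm="ortho")
        return x.view(B, N, C)

class SpectralAttentionBlock(nn.Module):
    """One possible ordering for combining Spectral and Attention elements in a single block."""
    def __init__(self, dim, heads=4, h=14, w=14):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.spectral = SpectralFilter(dim, h, w)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        x = x + self.spectral(self.norm1(x))            # spectral (global filter) mixing
        y = self.norm2(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]  # standard self-attention
        return x

out = SpectralAttentionBlock(256, heads=4, h=14, w=14)(torch.randn(2, 196, 256))  # (2, 196, 256)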

Implementation of RoPE from RoFormer: Enhanced Transformer with Rotary Position Embedding

In addition to these experiments, we will integrate Rotary Position Embedding (RoPE) from RoFormer into our model’s architecture. RoPE encodes absolute positions within an image using a rotation matrix, which is particularly effective at preserving positional information, while also introducing an explicit representation of relative position into the model’s self-attention mechanism.
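
A compact sketch of the idea: each query/key channel pair is rotated by an angle proportional to the token position. This is the common “rotate-half” 1D variant for illustration only; for a 2D patch grid, RoPE is typically applied per axis:

import torch

def rope(q, k, base=10000.0):
    """Apply rotary position embedding to q and k of shape (B, heads, N, head_dim)."""
    B, H, N, D = q.shape
    half = D // 2
    # One rotation frequency per channel pair, one angle per token position.
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(N, dtype=torch.float32)[:, None] * freqs[None, :]   # (N, half)
    cos, sin = angles.cos(), angles.sin()

    def rotate(x):
        x1, x2 = x[..., :half], x[..., half:]
        # Rotate each (x1, x2) channel pair by the position-dependent angle.
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    return rotate(q), rotate(k)

q = k = torch.randn(2, 4, 196, 64)
q_rot, k_rot = rope(q, k)   # same shapes, now position-aware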

SimMIM:

SimMIM architecture from SimMIM: A Simple Framework for Masked Image Modeling

SimMIM is a straightforward framework for masked image modeling. It employs only a single-layer prediction head to predict the raw pixel values of randomly masked patches and learns with a simple ℓ1 loss. The paper studies both ViT and Swin models; since we will be using ViT, our focus is on the ViT parameters. The study experimented with different masking strategies and masked patch sizes, achieving its best result with a random masking strategy, a masked patch size of 32×32, and a mask ratio of 0.5. For our use case, however, we will follow the study’s default setting, a mask ratio of 0.6.
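
A rough sketch of this recipe is shown below. For simplicity it masks at the embedding-patch granularity and omits position embeddings; the patch_embed, encoder, mask_token, and head names are placeholders rather than the exact components we will train:

import torch
import torch.nn as nn
import torch.nn.functional as F

def simmim_step(patch_embed, encoder, images, mask_token, head, patch_size=16, mask_ratio=0.6):
    """One SimMIM-style step: mask random patches, reconstruct their raw pixels, l1 loss on masked patches."""
    B, C, H, W = images.shape
    tokens = patch_embed(images).flatten(2).transpose(1, 2)         # (B, N, dim)
    B, N, D = tokens.shape

    # Randomly mask a fraction of the patches (we follow the default mask ratio of 0.6).
    mask = torch.rand(B, N, device=images.device) < mask_ratio      # (B, N) bool
    tokens = torch.where(mask[..., None], mask_token.expand(B, N, D), tokens)

    features = encoder(tokens)                                      # (B, N, dim)
    pred = head(features)                                           # (B, N, patch_size**2 * C)

    # Ground truth: the raw pixel values of each patch.
    target = F.unfold(images, kernel_size=patch_size, stride=patch_size).transpose(1, 2)

    # l1 loss computed on masked patches only.
    loss = (pred - target).abs().mean(dim=-1)
    return (loss * mask).sum() / mask.sum().clamp(min=1)

dim = 384
patch_embed = nn.Conv2d(3, dim, 16, 16)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, 6, dim * 4, batch_first=True, norm_first=True), num_layers=6)
mask_token = nn.Parameter(torch.zeros(1, 1, dim))
head = nn.Linear(dim, 16 * 16 * 3)      # single-layer prediction head
loss = simmim_step(patch_embed, encoder, torch.randn(2, 3, 224, 224), mask_token, head)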

Feature Distillation:

Distillation architecture from Understanding the Role of the Projector in Knowledge Distillation

This paper proposes the use of normalization, projection layers, and a soft maximum function for knowledge distillation. In essence, the student model’s features pass through a linear projection and batch normalization, while the teacher model’s feature output passes through batch normalization alone. The student and teacher feature spaces are then compared using the logsum distance metric. The loss function for this approach consists of both the task loss and the feature distillation loss.
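
A hedged sketch of this loss is shown below. The projector here is a single linear layer, and the logsum soft-maximum distance is approximated with a log-sum-exp over absolute feature differences; treat both as stand-ins for the exact formulation in the paper:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillation(nn.Module):
    """Sketch of projector-based feature distillation: student features -> linear projection -> BN,
    teacher features -> BN, then a soft-maximum style distance plus the usual task loss."""
    def __init__(self, student_dim, teacher_dim, alpha=1.0):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)         # projector on the student side
        self.bn_s = nn.BatchNorm1d(teacher_dim)
        self.bn_t = nn.BatchNorm1d(teacher_dim, affine=False)   # only normalizes the teacher features
        self.alpha = alpha

    def forward(self, feat_s, feat_t, logits, labels):
        zs = self.bn_s(self.proj(feat_s))                       # (B, teacher_dim)
        zt = self.bn_t(feat_t.detach())                         # teacher is frozen
        # Soft maximum over per-dimension absolute differences (assumption: logsumexp
        # stands in for the paper's logsum soft-maximum distance).
        fd_loss = torch.logsumexp((zs - zt).abs(), dim=-1).mean()
        task_loss = F.cross_entropy(logits, labels)             # live vs. spoof classification
        return task_loss + self.alpha * fd_loss

crit = FeatureDistillation(student_dim=384, teacher_dim=768)
loss = crit(torch.randn(8, 384), torch.randn(8, 768), torch.randn(8, 2), torch.randint(0, 2, (8,)))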

Data Acquisition

For this walkthrough, we will use a publicly available dataset for experimentation. Specifically, we will be utilizing the NUAA dataset, which can be downloaded from here. Since our primary focus is on face liveness, we will use the second dataset, which already provides the detected face images.
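
As a quick sketch, the detected-face images can be loaded for binary live/spoof classification with torchvision’s ImageFolder, assuming the archive extracts into one folder per class (for example ClientFace for live faces and ImposterFace for spoof attempts); adjust the root path to wherever you unpack the dataset:

import torch
from torchvision import datasets, transforms

# Basic preprocessing; 224x224 matches the ViT input size used in the sketches above.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Assumes the detected-face release is extracted so that each class has its own folder,
# e.g. Detectedface/ClientFace (live) and Detectedface/ImposterFace (spoof).
dataset = datasets.ImageFolder("Detectedface", transform=transform)
print(dataset.classes)   # classes are assigned in alphabetical order

loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)
images, labels = next(iter(loader))   # images: (64, 3, 224, 224)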

Python

Before proceeding further, please create a Python environment and install the necessary Python packages.

pip install ellzaf_ml numpy optuna jupyter notebook tensorboard scikit-learn

Training the teacher model will continue in Part 2.
