Face Anonymization Pipeline in PyTorch

François David
Jumio Engineering & Data Science
10 min read · Nov 5, 2021

Introduction

Protecting data privacy is critical to preserving customer trust and is also gaining increasing attention from policy makers. Staying ahead of these expectations requires continual improvements to AI toolchains. Anonymizing image data without badly degrading the quality of the samples is particularly challenging. We developed the capability to anonymize images while preserving the image distribution, which lets us maintain the anonymity of the people in the images while still using them for data augmentation tasks.

Our approach is based on the paper “DeepPrivacy: A Generative Adversarial Network for Face Anonymization,” published in 2019 at the International Symposium on Visual Computing. The authors describe a face anonymization pipeline composed of three blocks: a face detection block, a key point extraction block, and a face generation block. The face detection block defines a bounding box for each face in an image; for ID pictures it recognizes all faces, including the ghost pictures (the small secondary portraits printed on many IDs). The key point extraction block identifies the key points of each face so the face generation block can mimic the pose in the anonymized face. Finally, the face generation block starts from the background of the detected face (with the sensitive face information removed) and the detected key points, and then generates a face that fits seamlessly into the larger image. When all three blocks are combined, the pipeline anonymizes images to produce an ID picture that is safe to use for multiple data science and machine learning applications. Our implementation employs improved key point and face generation blocks to create a more capable and efficient pipeline.

Face Detection

Our face detection block employs a pre-trained Dual Shot Face Detector (DSFD) architecture. This architecture uses a Feature Enhance Module (FEM) to improve the original feature maps of a single-shot detector, combining different dimension information from the upper layers with the non-local cells of each feature map. This process should yield more robust and discriminative features. The DeepPrivacy paper describes the use of the same pre-trained model. We found that the DSFD-Pytorch-Inference implementation produces extremely good results.

Left: representation of the inner workings of the DSFD algorithm; Right: generated output sample (Link).
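
For illustration, running detection with the DSFD-Pytorch-Inference package amounts to only a few lines; the thresholds below are the repository's suggested defaults rather than our production settings.

```python
import cv2
import face_detection  # provided by the DSFD-Pytorch-Inference repository

# Build the pre-trained DSFD detector (thresholds are illustrative defaults).
detector = face_detection.build_detector(
    "DSFDDetector", confidence_threshold=0.5, nms_iou_threshold=0.3)

# The detector expects an RGB image; OpenCV loads BGR, so reverse the channels.
image = cv2.imread("id_sample.jpg")[:, :, ::-1]

# Each row is [xmin, ymin, xmax, ymax, confidence] -- one bounding box per
# detected face, including the small ghost portrait on an ID document.
boxes = detector.detect(image)
```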

Key Points Extraction

For this block we chose a deep learning model trained to extract specific key points of the human face. The estimated key points are passed to the face generation block to guide the creation of a synthetic face with a pose and expression similar to the original. This helps avoid situations such as generating a face with a yelling expression when a country mandates neutral expressions in ID images. It also helps the generator place the eyes, nose, and mouth in the correct positions for the overall structure of the face. We used transfer learning with a ResNet101 model pre-trained on ImageNet to predict nine different key points, as illustrated in the image below.

Detected key points on two publicly available images; the key points are: left eyebrow, right eyebrow, left eye, right eye, nose, upper lip, lower lip, left lip, right lip. (Left Link | Right Link)
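
A minimal sketch of this transfer-learning setup is shown below: a torchvision ResNet101 pre-trained on ImageNet with its classification head replaced by a regression head predicting 18 values (x and y for each of the nine key points). The exact head and training details in our pipeline may differ.

```python
import torch
import torch.nn as nn
from torchvision import models

# ResNet101 backbone pre-trained on ImageNet, with the 1000-class classifier
# swapped for a regression layer predicting nine (x, y) key points.
model = models.resnet101(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 9 * 2)

# 224x224 RGB crops, as described above.
batch = torch.randn(4, 3, 224, 224)
keypoints = model(batch).view(-1, 9, 2)  # (batch, key point, x/y)
```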

We found these nine key points are optimal for generating the same pose as the original photo: increasing the number of points limits the diversity of the generated faces, while reducing it loses some elements of the pose. The DeepPrivacy baseline uses a pre-trained Mask-RCNN network, which acts as a framework for object detection, segmentation, and key point detection. This pipeline, however, performs face detection twice, and we sought a solution incorporating more key points than the three detected by that algorithm. Additionally, training a better Mask-RCNN requires a dataset containing both the bounding boxes of faces and their corresponding key points, which we did not have.

We chose to train our model on the YouTube Faces Dataset with key point annotations (https://www.kaggle.com/selfishgene/youtube-faces-with-facial-keypoints), a set of 3462 training files and 2308 testing files. Each file is annotated with 68 different key points, though we only kept the nine mentioned above. To reconstruct mouth poses more accurately, we supplemented the training with approximately 600 manually annotated images from the smile-detection dataset. We also incorporated traditional image augmentation techniques such as scaling, cropping, rotating, and varying the saturation and hue. All images are in RGB format and are 224x224 pixels in size.
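
Geometric augmentations have to move the key points together with the pixels. One way to express this, shown here purely as an illustration with the albumentations library (an arbitrary choice for the sketch), is:

```python
import numpy as np
import albumentations as A

# Keypoint-aware augmentation: geometric transforms update the (x, y) targets,
# while colour transforms (hue/saturation) leave them untouched.
augment = A.Compose(
    [
        A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=15, p=0.7),
        A.HueSaturationValue(p=0.5),
        A.Resize(224, 224),
    ],
    keypoint_params=A.KeypointParams(format="xy", remove_invisible=False),
)

image = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)  # dummy face crop
keypoints = [(120.0, 100.0), (136.0, 100.0)]                       # e.g. left/right eye
out = augment(image=image, keypoints=keypoints)
augmented_image, augmented_keypoints = out["image"], out["keypoints"]
```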

The Mean Squared Error loss on the validation dataset converges around 19.24, which corresponds to the estimated key points being, on average, four to five pixels off of the ground truth (the square root of the per-coordinate MSE is roughly 4.4 pixels). Considering the size of the images, this is very close to our desired range. Reducing this metric further becomes impractical, as the centers of key points like eyebrows are somewhat arbitrary. Similar experiments with ResNet34 and ResNet50 yielded MSEs of 25.96 and 22.03, respectively.

Face Generation

Face generation relies on two key algorithms. The first is Progressive Growing of GANs, which we incorporated from the baseline. To improve on the results of this algorithm we applied an adapted version of StyleGAN2. We trained both architectures on 27,000 of the 30,000 images in the CelebA-HQ dataset and reserved the remaining 3,000 images for calculating the evaluation metrics. At test time, the generated images have a resolution of 128x128 pixels.

Progressive Growing of GANs

The architecture, often called “ProGAN,” starts by generating low-resolution images and then increases the resolution through subsequent layers until it reaches a pre-set target. Training the initial layers to generate low-resolution images is comparatively easy, and their output then serves as a guideline for the additional layers in the network to generate larger faces while progressively refining the details.

Representation of the procedure of training a ProGAN; Link.
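
The key mechanic when a new resolution is added is a fade-in: the new block's output is blended with an upsampled copy of the previous resolution while a weight alpha ramps from 0 to 1. The sketch below uses illustrative module names rather than our exact implementation.

```python
import torch.nn.functional as F

# Fade-in used while "growing" the generator to the next resolution: alpha is
# increased from 0 to 1 over the course of training the new block.
def grow_step(features, old_to_rgb, new_block, new_to_rgb, alpha):
    low_res = F.interpolate(old_to_rgb(features), scale_factor=2, mode="nearest")
    high_res = new_to_rgb(new_block(features))
    return alpha * high_res + (1.0 - alpha) * low_res
```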

Producing faces that follow the pose estimates and transition seamlessly into the background required a few tweaks to the ProGAN architecture. We took a cue from the DeepPrivacy paper and started with a U-Net generator to keep background information. For each upsampling pass in the decoder we concatenate the current feature maps with the key point pose information (a one-hot mask) and the corresponding skip connection to preserve the background information.

Representation of the Generator of DeepPrivacy; Link.
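
A rough sketch of one such decoder step, with illustrative channel counts: the upsampled feature maps are concatenated with the one-hot key point mask (resized to the current resolution) and the matching encoder skip connection before the next convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStep(nn.Module):
    """One upsampling pass of the U-Net style generator (illustrative sketch)."""

    def __init__(self, in_ch, skip_ch, n_keypoints, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + n_keypoints + skip_ch, out_ch, 3, padding=1)

    def forward(self, x, skip, keypoint_mask):
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        # Resize the one-hot key point mask to the current resolution and
        # concatenate it with the features and the encoder skip connection.
        mask = F.interpolate(keypoint_mask, size=x.shape[-2:], mode="nearest")
        x = torch.cat([x, mask, skip], dim=1)
        return F.leaky_relu(self.conv(x), 0.2)
```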

The discriminator consists of a simple binary encoder architecture into which we pass background information, together with the one-hot masks of the key point estimates after each downsampling layer. This teaches the discriminator that a real picture has the appropriate background and that the key points should align with specific facial features, thereby guiding the generator to output similar images with the correct backgrounds.

StyleGAN2

The StyleGAN2 architecture, proposed in late 2019 (https://arxiv.org/abs/1912.04958), achieves state-of-the-art results in generative image modeling. The original StyleGAN architecture has a strong location preference for certain facial features that is sometimes undesirable. The StyleGAN2 architecture is similar to MSG-GAN: it achieves the same goal of focusing first on low-resolution images and then progressively on larger images with finer detail. The revised approach also swaps the Adaptive Instance Normalization technique for weight demodulation, gaining both speed and efficiency. The authors additionally suggest incorporating path length regularization to enforce smoother latent space interpolation, which is extremely valuable in the context of ID anonymization for producing images containing specific features.
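
For reference, the weight demodulation idea can be sketched as follows (a direct reading of the StyleGAN2 paper rather than our exact layer code): the convolution weights are scaled per sample by the style vector and then re-normalized so the output activations keep roughly unit variance.

```python
import torch

def demodulate(weight, style, eps=1e-8):
    # weight: (out_ch, in_ch, k, k), style: (batch, in_ch)
    w = weight[None] * style[:, None, :, None, None]         # modulate with the style
    sigma = torch.rsqrt((w ** 2).sum(dim=[2, 3, 4]) + eps)   # per-sample, per-output-channel norm
    return w * sigma[:, :, None, None, None]                 # demodulated weights
```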

Our code is based on the StyleGan2-pytorch project. As with ProGAN, we preserve background information by encoding the original (anonymized) image. We then concatenate the feature maps of the decoder and the key points estimation at each resolution in the generator and the discriminator.

Representation of the changes made to StyleGAN to obtain StyleGAN2 (c & d); Link.

We experimented with two variations of this model, which we refer to as StyleGAN2-A and StyleGAN2-B below. The following table describes the basic characteristics of each training experiment. We decided to stop training StyleGAN2-A at 150 epochs simply because the results were not as promising as those of StyleGAN2-B.

Basic training characteristics of both StyleGAN training jobs.

Discussion

We trained all the algorithms (except for the DeepPrivacy baseline) on SageMaker with a 4-GPU instance for a combined total of over 300 hours of computing time. To obtain quantifiable indicators for a proper comparison of the architectures, we calculated the metrics below, all on the validation dataset (3,000 images).

Fréchet Inception Distance (FID): First proposed in 2017, this metric evaluates the quality of images created by a generative adversarial network. The lower the FID score, the closer the distributions of the generated and original samples, so we aim for a low score here to indicate high-quality generated images.
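
For reference, the FID between two sets of Inception activations can be computed as below; this is the standard formulation rather than necessarily our exact evaluation code.

```python
import numpy as np
from scipy import linalg

def fid(real_activations, fake_activations):
    """Fréchet Inception Distance between two activation matrices (rows = samples)."""
    mu_r, mu_f = real_activations.mean(axis=0), fake_activations.mean(axis=0)
    cov_r = np.cov(real_activations, rowvar=False)
    cov_f = np.cov(fake_activations, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r.dot(cov_f), disp=False)
    if np.iscomplexobj(covmean):  # drop tiny imaginary parts from numerical error
        covmean = covmean.real
    diff = mu_r - mu_f
    return diff.dot(diff) + np.trace(cov_r + cov_f - 2.0 * covmean)
```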

Face Matching Metric: For this metric we chose to employ FaceNet, developed in 2015 by Google researchers. This pre-trained architecture outputs an embedding vector for each face such that the Euclidean distance between two embeddings reflects the similarity of the faces. Images of the same person should result in a distance of about 0.5, while images of different people should yield a distance greater than 1.0. We aim for a high distance between the original face and the generated face to ensure it is truly anonymized.
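
A sketch of how such a distance can be computed with the facenet-pytorch package, one readily available FaceNet implementation (the exact pre-trained weights we used may differ):

```python
import torch
from facenet_pytorch import InceptionResnetV1

# Pre-trained FaceNet-style embedder; inputs are aligned face crops
# normalized to [-1, 1] with shape (batch, 3, 160, 160).
embedder = InceptionResnetV1(pretrained="vggface2").eval()

@torch.no_grad()
def face_distance(face_a, face_b):
    emb_a, emb_b = embedder(face_a), embedder(face_b)
    # ~0.5 for images of the same person, typically > 1.0 for different people.
    return torch.norm(emb_a - emb_b, dim=1)
```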

Key Points Distance: For this metric we calculate the mean distance between key points on the generated image and the original image. A small measure here indicates the pose of the generated image is similar to the pose of the original.

Generation Failures Percentage: This metric uses the DSFD network to check whether the generated image has a recognizable face. We count it as a failure if no faces are detected.

Results

Key metrics of the different architectures trained on the test set described above.

Right away we notice the StyleGAN2-B experiment shows the lowest FID by a fairly large margin compared to ProGAN and the DeepPrivacy baseline. Our enhanced key point detector, estimating nine key points rather than the three used by DeepPrivacy, likely plays a significant role in producing images that are similar to the original distribution. StyleGAN2-B also shows a huge improvement in Key Points Distance over DeepPrivacy.

While DeepPrivacy scored best on the Face Matching Metric, the other models still score significantly higher than would be found with different pictures of the same face. ProGAN showed the lowest percentage of failures, but both StyleGAN2 experiments produced results much closer to ProGAN than to DeepPrivacy. Overall, we found our experiments are much less prone to failures than DeepPrivacy while still providing adequate anonymization. The results of the StyleGAN2-B experiment are the best overall, and we chose this architecture for our pipeline.

Interpolation in the Latent Space

One interesting capability of GANs is the ability to use vector arithmetic in the latent space to constrain the generator toward a desired feature in the synthetic image. For example, we generated many images with random style vectors and kept only the ones that produced a prominent smile; averaging the corresponding latent vectors yields a reasonably accurate representation of that feature. In many countries ID pictures must be non-smiling, so we can ensure we generate pictures with non-smiling faces if desired. We can also extract other features like big/small eyes, big/small nose, eye color, mustache, etc. Below is a sample representation of a generated face where we gradually change the latent vector from a neutral vector to a smiling vector.

Sample interpolation in the latent space from a neutral latent vector (Left) to a smiling latent vector (Right).
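
A sketch of this interpolation is given below; the generator signature (latent, key point mask, background) is illustrative of our conditional setup rather than the exact interface, and smile_latents stands for the style vectors whose samples showed a prominent smile.

```python
import torch

def interpolate_faces(generator, neutral_latent, smile_latents,
                      keypoint_mask, background, steps=8):
    # Average the "smiling" latents into a single target vector.
    smile_latent = smile_latents.mean(dim=0)
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        # Linear interpolation between the neutral and smiling latent vectors.
        latent = (1.0 - t) * neutral_latent + t * smile_latent
        frames.append(generator(latent, keypoint_mask, background))
    return frames
```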

Final Touches to the Pipeline

We also made a few optional adjustments to the pipeline to better fit the requirements of Jumio’s augmentation tooling. The first is the use of Poisson blending instead of directly pasting the generated face into the larger image (see Seamless Cloning using OpenCV). This produces a smoother image thanks to the minor color corrections performed during blending. The second option generates only one face per image and uses it to replace all detected faces; the ghost photo is typically required to be the same as the main picture, with only variations in hue and saturation. After face detection we extract the key points and generate a face for the largest detected bounding box, then rescale the generated face and use Poisson blending so its color and shape fit each detected face. The final option is the possibility of generating only faces with neutral expressions using a dedicated latent vector, which is required for the IDs of many countries. Here are some representative results of our pipeline on publicly available sample ID images.

Representative results of our pipeline on 3 different sample IDs publicly available ( Link1, Link2, Link3).
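
The Poisson blending step mentioned above maps directly onto OpenCV's seamlessClone; a minimal sketch (box handling and mask shape simplified for illustration):

```python
import cv2
import numpy as np

def blend_face(generated_face, document, box):
    """Poisson-blend a generated face into the document image at the given box."""
    x_min, y_min, x_max, y_max = box
    face = cv2.resize(generated_face, (x_max - x_min, y_max - y_min))
    # Blend the whole rectangle; gradient-domain blending smooths the seams.
    mask = 255 * np.ones(face.shape[:2], dtype=np.uint8)
    center = ((x_min + x_max) // 2, (y_min + y_max) // 2)
    return cv2.seamlessClone(face, document, mask, center, cv2.NORMAL_CLONE)
```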

Conclusion

Our model shows a marked improvement over DeepPrivacy in anonymizing the faces in images. Implementing a detector operating on nine key points and using StyleGAN2 as the generator model seem to significantly improve the performance of the pipeline: on our testing metrics, we significantly improved the FID, Key Points Distance, and percentage of failures compared to the DeepPrivacy baseline. This set of networks is relatively complex but can be a great tool for data augmentation and especially data balancing. Additionally, depending on the needs of the user, optional changes and improvements can be added to the pipeline; for example, basic data augmentation techniques can enhance its output. We are currently fine-tuning the model with the FairFace dataset so the generated images incorporate greater diversity. One major future enhancement to the pipeline is changing the skin color of the generated person using a latent vector to make the resulting dataset less biased and more inclusive.


François David
Jumio Engineering & Data Science

François is a Master's student at Mila with a broad range of interests. He is currently a Machine Learning Engineer Intern at Jumio.