Indust.ai’s Computer Vision pipeline for Deepfake detection
Edit on June 22nd: Final results of the Deepfake Detection Challenge using the privately-held test set are in, and we ended up in the top 5% of teams (92nd out of 2114)! Thanks to Bastien, Benoît-Marie, Jean-Baptiste, and Maxime for making this happen!
In the last couple of years, deepfakes, or deep-learning generated manipulated videos, have emerged as both a significant social and political issue. They have also become a headache for tech and media platform companies that are now facing increasing pressure to remove misleading content. As the technology for making realistic deepfakes improves, the ability to automatically spot them is becoming increasingly important.
As detecting deepfakes is still very much an unsolved problem, Facebook, Microsoft, AWS, and the Partnership on AI joined forces to create the global DeepFake Detection Challenge on the Kaggle platform. With a sizeable 1 million USD prize pool, their objective was to attract top AI talent from around the world to compete to find ways of improving deepfake detection.
At indust.ai we were keen to help address this issue. As a newly constituted company with a mission to help European companies transform their core business activities using AI, we saw this challenge as a great opportunity. It was a perfect way for our new team to get used to working together, to help our more junior members acquire valuable skills in computer vision, and to confirm our readiness to solve important problems for industry.
In this post, we describe the machine learning pipeline that got us into the top 10% of all contestants and share what we found worked well, and what did not, for this challenge. You can also find a more process-oriented, non-technical overview of what we did here.
Exploring and preparing the data
The training dataset provided consisted of 120,000 videos, with about 500 hired actors performing in multiple 10-second clips. The faces of the actors in these clips were then modified using one of several deepfake generation techniques, so that the final dataset had about 80% fake videos and 20% real videos. The resulting dataset was by no means clean, and the first step we took was to remove duplicates and corrupted files.
A major difficulty with this challenge was that the process used to generate the test videos for the leaderboard scores was completely different to the one used to generate the train videos, as described in the paper that accompanied the challenge. This caused the train and test distributions to differ, so it was essential both to apply good generalization methods and to use a good training/validation split. The videos were split into 50 folders, with not much overlap between the actors in different folders. We chose to use 40 folders for our train set and 10 for our validation set, thus minimizing the overlap of actors between the two sets while keeping enough actors in our validation set to produce a reliable validation score.
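The folder-level split described above can be sketched as follows. This is a minimal illustration, not our exact code: the folder names are hypothetical, and choosing the validation folders at random with a fixed seed is an assumption made here for reproducibility.

```python
import random


def split_folders(folders, n_val=10, seed=0):
    """Hold out whole folders for validation so that actors rarely
    appear in both the train and validation sets."""
    shuffled = sorted(folders)
    # Assumption: validation folders are picked at random with a fixed seed.
    random.Random(seed).shuffle(shuffled)
    val = sorted(shuffled[:n_val])
    train = sorted(shuffled[n_val:])
    return train, val


# Hypothetical folder names; the real archive names may differ.
folders = [f"part_{i:02d}" for i in range(50)]
train_folders, val_folders = split_folders(folders)
```

Splitting by folder rather than by video is what keeps the validation score meaningful: a random split over videos would put the same actors on both sides.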
Presenting our ML Pipeline
The first choice we had to make involved what to process: should we focus on the entire videos, or on individual frames sampled from the videos? This choice would affect the type of machine learning model we used, the computation time, and probably the final performance. After observing that many fake videos contained obvious facial manipulations and could be identified from still images, we decided early on to focus on detecting a deepfake from individual frames.
Our decision to focus on frames informed our choice of pipeline, shown above. We first created a balanced training set by sampling 10 frames per real video and 2 frames per fake video. We then cropped these images to focus on the faces and used transfer learning to fine-tune a pretrained model on this face dataset. Finally, at inference time, we ensembled predictions across several models and several video frames.
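As a rough sketch of the class-balancing step, the function below picks which frames to extract from a video depending on its label. The per-class counts (10 and 2) come from the text; taking evenly spaced frames and the label strings "REAL" and "FAKE" are assumptions made for this example.

```python
def frame_indices(n_frames, n_samples):
    """Return n_samples evenly spaced frame indices in [0, n_frames)."""
    if n_frames <= 0 or n_samples <= 0:
        return []
    n_samples = min(n_samples, n_frames)
    step = n_frames / n_samples
    # Take the middle frame of each of the n_samples equal segments.
    return [int(step * i + step / 2) for i in range(n_samples)]


def frames_to_sample(label, n_frames, per_real=10, per_fake=2):
    """Sample more frames from real videos to offset their lower count."""
    n_samples = per_real if label == "REAL" else per_fake
    return frame_indices(n_frames, n_samples)
```

With roughly 20% real and 80% fake videos, this yields on average 0.2 × 10 = 2.0 real frames for every 0.8 × 2 = 1.6 fake frames per video, which is close to balanced.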
We reached the top 10% of participants thanks to a combination of choices we made throughout the ML pipeline, focusing on:
- face detection
- model selection
- generalization methods
- frame and model ensembling
We’re going to explore these aspects in more detail below.
Since the video manipulations focused on people’s faces, which were only a small part of the videos, it was essential to develop a good face detection system. Specifically, we sought a face detection system that would maximize the detection probability (given a frame, the probability of detecting a face) and minimize the false positive rate (the probability that what is detected is not a face).
We benchmarked and compared several different open source face detectors, and ended up testing two different approaches:
- Face detection with no false positives. We hypothesized that noise in the data created by frames wrongly identified as faces could adversely affect training. For this approach, we used a combination of two different face detectors: faced and dlib. By requiring that a face must be detected by both detectors, we reached 0.1% false positives, at the cost of a rather low detection rate of 60%.
- Balanced false positive rate and detection probability. For this approach, we used the BlazeFace detector and immediately reached a 97% detection rate with 7% false positives. However, we noticed that this detector performed poorly when the faces were far from the camera. By adding a step in which we split each frame into horizontal, vertical, and matrix sliding windows with overlapping cells, and then ran face detection on each window, we reached a 99% detection rate with 8% false positives.
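One way to express the "detected by both detectors" rule from the first approach is to keep a box only if it overlaps a box from the other detector. The sketch below is an illustration under assumptions: boxes are (x1, y1, x2, y2) tuples, agreement is measured by intersection over union (IoU), and the 0.5 threshold is a hypothetical value, not one we report.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def agreed_faces(boxes_a, boxes_b, threshold=0.5):
    """Keep boxes from detector A that overlap some box from detector B."""
    return [a for a in boxes_a
            if any(iou(a, b) >= threshold for b in boxes_b)]
```

Requiring agreement removes most spurious detections, but any face missed by either detector is lost, which is consistent with the low detection rate of this approach.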
We found that the second approach yielded better results on the Kaggle leaderboard, and then adopted it as our baseline for face detection. We parallelized this approach to make it faster than what other contestants were reporting, and were extracting frames from videos, detecting faces, and performing model inference at 25 frames per second in the Kaggle environment.
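The sliding-window step can be sketched as below. This is our reading of the idea rather than the exact implementation: a grid with one row gives horizontal windows, a grid with one column gives vertical windows, and a grid with several rows and columns gives the matrix case. The 20% overlap and the helper names are assumptions. Each window is cropped and passed to the detector, and detected boxes are shifted back into full-frame coordinates.

```python
def spans(length, n_cells, overlap):
    """Split [0, length) into n_cells equal spans where neighbouring
    spans share a fraction `overlap` of their size."""
    if n_cells == 1:
        return [(0, length)]
    size = length / (n_cells - (n_cells - 1) * overlap)
    stride = size * (1 - overlap)
    return [(round(i * stride), min(length, round(i * stride + size)))
            for i in range(n_cells)]


def sliding_windows(width, height, cols, rows, overlap=0.2):
    """Return (x1, y1, x2, y2) crops forming a cols x rows grid."""
    return [(x1, y1, x2, y2)
            for (y1, y2) in spans(height, rows, overlap)
            for (x1, x2) in spans(width, cols, overlap)]


def to_frame_coords(box, window):
    """Shift a box detected inside a window back to frame coordinates."""
    x1, y1, x2, y2 = box
    ox, oy = window[0], window[1]
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)


# Example: a 2 x 2 grid of overlapping cells on a 1920 x 1080 frame.
windows = sliding_windows(1920, 1080, cols=2, rows=2)
```

Because each window is smaller than the full frame, a distant face occupies a larger share of the detector's input. The overlap reduces the chance that a face is cut in two by a window border.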
Choice of model
By choosing to focus only on frames as opposed to entire videos, we had reduced the challenge to an image classification problem and could rely on pre-existing methods. Specifically, we used transfer learning¹ with Convolutional Neural Networks (CNNs) trained on the ImageNet classification task. Although the ImageNet dataset and the Deepfake dataset are very different, all natural images share common features, and we expect that starting with a pretrained model allows it to focus on the specifics of the problem at hand.
As we tested several models, we found a correlation between the model’s results on ImageNet and the validation loss on our dataset. As a result, we selected two of the more recent and best performing ImageNet models: ResNext and DenseNet.
The biggest problem we had to solve in this DeepFake challenge was that of generalization, as the training data provided and the test data used for the leaderboard came from different distributions. Consequently, we found that there was often a significant gap between our validation score and the leaderboard score.
We combined three different approaches to improve generalization:
- Regularization: we tried standard methods such as dropout², weight decay³ and label smoothing⁴. However, we found that these methods only yielded a small improvement.
- Adding more data: to increase the diversity of images in our training set, we added data from a different source. Specifically, we used the FaceForensics dataset, released in 2018, which is also focused on deepfake detection.
- Data augmentation: We synthetically created more data by randomly introducing slight variations to the existing images using translation, rotation, brightness variation, compression, and resolution reduction.
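To make the label smoothing item concrete, here is a small sketch using the 0.01 and 0.99 targets mentioned in the glossary. Pairing it with binary cross-entropy as the training loss is an assumption made for this illustration.

```python
import math


def smooth_label(y, eps=0.01):
    """Map a hard label (0 or 1) to eps or 1 - eps."""
    return y * (1 - 2 * eps) + eps


def binary_cross_entropy(target, p, tiny=1e-7):
    """Cross-entropy between a (possibly soft) target and a prediction p."""
    p = min(max(p, tiny), 1 - tiny)  # avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


# With a smoothed target, the loss is lowest at p = 0.99, so the model
# is no longer rewarded for pushing its prediction all the way to 1.
target = smooth_label(1)
loss_calibrated = binary_cross_entropy(target, 0.99)
loss_overconfident = binary_cross_entropy(target, 0.999999)
```

Here `loss_overconfident` is larger than `loss_calibrated`, which is the mechanism by which smoothing discourages overconfident predictions.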
We found that the data augmentation techniques led to the highest improvement in our score.
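The toy sketch below illustrates three of the augmentations listed above (translation, brightness variation, and resolution reduction) on a grayscale image stored as nested lists. It is only meant to show the idea: a real pipeline would use an image library, rotation and compression are omitted, and all parameter ranges are hypothetical.

```python
import random


def adjust_brightness(img, factor):
    """Scale pixel values by `factor`, clipped to the range [0, 255]."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]


def translate(img, dx, fill=0):
    """Shift the image horizontally by dx pixels (positive = right)."""
    width = len(img[0])
    out = []
    for row in img:
        if dx >= 0:
            out.append([fill] * min(dx, width) + row[:max(0, width - dx)])
        else:
            out.append(row[-dx:] + [fill] * min(-dx, width))
    return out


def reduce_resolution(img, k):
    """Nearest-neighbour downsample by k, then upsample to the same size."""
    return [[img[(y // k) * k][(x // k) * k] for x in range(len(img[0]))]
            for y in range(len(img))]


def random_augment(img, rng):
    """Apply all three augmentations with randomly drawn parameters."""
    img = adjust_brightness(img, rng.uniform(0.8, 1.2))  # hypothetical range
    img = translate(img, rng.randint(-2, 2))             # hypothetical range
    return reduce_resolution(img, rng.choice([1, 2]))    # hypothetical choices
```

Drawing new random parameters every time an image is loaded means the model rarely sees exactly the same input twice.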
To make predictions on videos, we performed two types of ensembling:
- Frame ensembling: we randomly sampled frames from a video, performed face detection, made individual predictions for each face detected, and then used the average over all the predictions as our prediction for the video. We started out doing this averaging with 5 frames per video, but quickly realized the more frames we used, the better. In our final submission, we were taking the average over 100 frames per video.
- Model ensembling: since different models tend to learn slightly different patterns in the data, to achieve better generalization we averaged the predictions made by several different ResNext and DenseNet models.
The use of both types of ensembling significantly improved our score, and made us shoot up about 100 places in the ranking.
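Both kinds of averaging can be combined in a few lines. In this sketch, each model contributes one list of per-face fake probabilities collected over the sampled frames. Averaging within each model first, and falling back to 0.5 when no face was detected, are assumptions made for the example.

```python
from statistics import mean


def video_prediction(face_probs_per_model, default=0.5):
    """Average per-face probabilities within each model (frame
    ensembling), then average across models (model ensembling)."""
    per_model = [mean(probs) for probs in face_probs_per_model if probs]
    # Assumption: return an uninformative 0.5 if no face was found.
    return mean(per_model) if per_model else default


# Two hypothetical models, each scoring two detected faces.
score = video_prediction([[0.9, 0.7], [0.6, 0.8]])
```

In this example the two models average to 0.8 and 0.7, so the video score is 0.75.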
What we did and did not try
There were a few other things we tried that ended up not making the cut in our final pipeline:
- We hypothesized that we may be missing out on important information captured in the temporal dimension of the video. We thus trained a Recurrent Neural Network (RNN) on top of the pre-trained Convolutional Neural Network (CNN) with sequences of frames as input. However, we found that our results were not very different from training just a CNN and averaging several predictions. We thus concluded that most information about the real or fake nature of the data could be found in individual frames.
- We hypothesized that simply averaging predictions made by a CNN over several frames of a video was suboptimal, and that better results could be reached by training a separate network to aggregate the predictions. However, despite having a huge impact on our validation score, we found that this procedure led to overfitting to our train dataset and worse generalization to the test dataset.
As a small team working to a deadline, we had to make difficult decisions about what to prioritize. Some things we did not try, but which could have improved our score, are:
- We knew from the start that around 5% of the videos were fake because of manipulations not of the image, but of the audio. Although we decided to focus on the visual fakes, extra gains could have been made, either by sticking with our approach and cleaning our train dataset to remove the fake audios, or by building a more comprehensive model that could deal with both fake audios and fake visuals.
- Training a classifier with an auxiliary (but relevant) objective is known to often lead to improvement. For instance, we could have trained our model to not only classify the video frames into fake and real categories, but also predict which pixels were most likely to have been altered.
- We expect that more data augmentations and fine tuning of our models and pipeline would have led to improved generalization. For example, we found after the competition had officially ended that a wider crop around the faces we detected would have led to a better score.
We finished the competition placed in the top 10% of 2,281 teams! The plot above shows the evolution of our leaderboard score (in yellow, smaller is better) and of our ranking among competitors (blue, higher is better). Our score consistently improved over time, as we incrementally tested different approaches and made changes to our pipeline.
Working on the DeepFake detection challenge was a great way for our new team to get used to working together, share knowledge from our different backgrounds and experiences, and upskill in computer vision. Our final score is a testament to our ability to come together and solve difficult and important problems.
The end of the competition is only one step on our journey in applied computer vision. We are keen to discuss computer vision and how it can be applied to business challenges, so do get in touch to set up a call with us! You can also read more about how we worked together as a team despite the coronavirus crisis in our accompanying non-technical article.
References and Glossary
 DeepFake Challenge: https://arxiv.org/abs/1910.08854
 dlib C++ Library
 ResNext: https://arxiv.org/abs/1611.05431
 DenseNet: https://arxiv.org/abs/1608.06993
¹ transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second task.
² dropout refers to randomly dropping subsets of neurons during the training phase.
³ weight decay involves adding L2-regularization on the parameters to the loss during training.
⁴ label smoothing involves replacing the binary 0 and 1 labels of the data (for real or fake) with other values such as 0.01 and 0.99 to avoid overconfident predictions.