DeepFake Detection Challenge: an Overview
At MLJCUnito we deeply believe that, first of all, AI should be ethical. That’s why a few months ago we decided to take part in the DeepFake Detection Challenge, a joint initiative by Amazon AWS, Facebook and Microsoft. Deepfake techniques, in short, produce realistic AI-generated videos of people doing and saying fictional things. They have the potential to significantly affect how people determine the legitimacy of information presented online, and, as citizens, we should be well aware of this technology.
DeepFake: a threat to democracy or just a bit of fun?
“We are already at the point where you can’t tell the difference between deepfakes and the real thing,” Professor Hao Li, University of Southern California
Facebook has announced it will remove videos modified by artificial intelligence, known as deepfakes, from its platform.
Kaggle is an AirBnB for Data Scientists
This is where they spend their nights and weekends. It’s a crowd-sourced platform that attracts, nurtures, trains and challenges data scientists from all around the world to solve data science, machine learning and predictive analytics problems. It has over 536,000 active members from 194 countries and receives close to 150,000 submissions per month. Started in Melbourne, Australia, Kaggle moved to Silicon Valley in 2011, raised some 11 million dollars from the likes of Hal Varian (Chief Economist at Google), Max Levchin (PayPal), Index Ventures and Khosla Ventures, and was ultimately acquired by Google in March 2017. Kaggle is the number one stop for data science enthusiasts all around the world who compete for prizes and boost their Kaggle rankings. There are only 94 Kaggle Grandmasters in the world to this date.
Did you know that most data scientists are only theorists and rarely get a chance to practice before being employed in the real world? Kaggle solves this problem by giving data science enthusiasts a platform to interact and compete in solving real-life problems. The experience you get on Kaggle is invaluable in preparing you to understand what goes into finding feasible solutions for big data.
Fine, fine, fine, but what do we do on Kaggle? We Learn
Deepfakes are fakes generated by deep learning. So far so easy.
This usually means someone used a generative model like an autoencoder or, more likely, a Generative Adversarial Network (GAN for short). GANs are technically two networks that work against each other, illustrated below. The artist (the generator) draws its inspiration from a noise sample and creates a rendering of the data you are trying to generate with said GAN. The private investigator (the discriminator) randomly gets assigned real and fake data to investigate.
The learning process is adversarial: the generator gets better at fooling the discriminator, and the discriminator gets better at figuring out which data is real and which isn’t. In mathematical terms, they keep learning until a Nash equilibrium is reached, meaning neither can improve any further. They’re a really cool concept, and they are even used in scientific simulation at CERN.
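To make the artist/detective analogy concrete, here is a minimal toy sketch of a GAN training loop in PyTorch. The architectures, dimensions and the synthetic “real” data are our own illustrative assumptions, nothing like the models that actually generate deepfakes:

```python
# Toy GAN on 2-D data: a generator learns to mimic a simple "real" distribution
# while a discriminator learns to tell real samples from generated ones.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2

generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(64, data_dim) * 0.5 + 2.0   # stand-in for "real" data
    noise = torch.randn(64, latent_dim)
    fake = generator(noise)

    # Discriminator step: label real samples 1 and generated samples 0.
    d_loss = bce(discriminator(real), torch.ones(64, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make the discriminator predict 1 on fakes.
    g_loss = bce(discriminator(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```

When training goes well, neither loss collapses to zero: the two networks keep pushing each other toward the equilibrium described above.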
You can probably guess that they can be tricky to train, due to so many moving parts. This has become a very popular area of research, warranting a GAN Zoo of all named GANs. Some important things you may want to check out if you’re interested are keywords like Wasserstein GANs, gradient penalization, attention, and, in this context, style transfer (namely face2face).
It sounds absurd, I know. Here you can find some more practical examples; why don’t you play with them for a while?
A few tips for the Official Challenge on Kaggle
- I strongly encourage you to start with the official Getting Started guide here.
- What is the goal of the Deepfake Detection Challenge? According to the FAQ “The AI technologies that power deepfakes and other tampered media are rapidly evolving, making deepfakes so hard to detect that, at times, even human evaluators can’t reliably tell the difference. The Deepfake Detection Challenge is designed to incentivize rapid progress in this area by inviting participants to compete to create new ways of detecting and preventing manipulated media.”
- In this Code Competition:
- CPU Notebook <= 9 hours run-time, GPU Notebook <= 9 hours run-time on Kaggle’s P100 GPUs, No internet access enabled
- External data is allowed up to 1 GB in size. External data must be freely & publicly available, including pre-trained models
- This code competition’s training set is not available directly on Kaggle, as its size is prohibitively large to train in Kaggle. Instead, it’s strongly recommended that you train offline and load the externally trained model as an external dataset into Kaggle Notebooks to perform inference on the Test Set. Review Getting Started for more detailed information.
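Given those constraints, it pays off to check the submission mechanics end to end before training anything. Here is a minimal sketch that walks the test videos and writes a constant-probability submission.csv; the input path and the flat 0.5 prediction are assumptions for illustration, not part of any real model:

```python
# Naive baseline: predict 0.5 (maximum uncertainty) for every test video.
import os
import pandas as pd

TEST_DIR = "/kaggle/input/deepfake-detection-challenge/test_videos"  # assumed mount point

test_videos = sorted(f for f in os.listdir(TEST_DIR) if f.endswith(".mp4"))

submission = pd.DataFrame({"filename": test_videos,
                           "label": [0.5] * len(test_videos)})
submission.to_csv("submission.csv", index=False)
```

A constant 0.5 prediction is also a useful mental anchor for the metric below: it scores log(2) ≈ 0.693, so anything above that is worse than guessing.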
Submissions are scored on log loss:

$$\mathrm{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\left[\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \,\right]$$

where:
- $n$ is the number of videos being predicted
- $\hat{y}_i$ is the predicted probability of the video being FAKE
- $y_i$ is 1 if the video is FAKE, 0 if REAL
- $\log()$ is the natural (base e) logarithm
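As a quick sanity check of the metric, here is a short sketch that computes it on made-up labels and predictions (the arrays are hypothetical; this mirrors what sklearn.metrics.log_loss does):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    # Clip predictions away from 0 and 1 so the logarithm stays finite.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    y_true = np.asarray(y_true, dtype=float)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(log_loss([1, 0, 1, 1], [0.9, 0.1, 0.8, 0.4]))   # mostly confident and right -> low loss
print(log_loss([1, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))   # all 0.5 -> log(2) ~ 0.693
```

Note how the clipping matters: a single confident-but-wrong prediction at 0 or 1 would otherwise blow the score up to infinity.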
Data
- We have a bunch of .mp4 files, split into compressed sets of ~10 GB apiece. A metadata.json accompanies each set of .mp4 files and contains the filename, label (REAL/FAKE), original and split columns, listed below under Columns.
- The full training set is just over 470 GB (yeah, it’s huge!).
There are 4 groups of datasets associated with this competition.
Training Set: This dataset, containing labels for the target, is available for download for competitors to build their models. It is broken up into 50 files, for ease of access and download. Due to its large size, it must be accessed through a GCS bucket which is only made available to participants after accepting the competition’s rules. Please read the rules fully before accessing the dataset, as they contain important details about the dataset’s permitted use. It is expected and encouraged that you train your models outside of Kaggle’s notebooks environment and submit to Kaggle by uploading the trained model as an external data source.
Public Validation Set: When you commit your Kaggle notebook, the submission file output that is generated will be based on the small set of 400 videos/ids contained within this Public Validation Set. This is available on the Kaggle Data page as test_videos.zip
Public Test Set: This dataset is completely withheld and is what Kaggle’s platform computes the public leaderboard against. When you “Submit to Competition” from the “Output” file of a committed notebook that contains the competition’s dataset, your code will be re-run in the background against this Public Test Set. When the re-run is complete, the score will be posted to the public leaderboard. If the re-run fails, you will see an error reflected in your “My Submissions” page. Unfortunately, we are unable to surface any details about your error, so as to prevent error-probing. You are limited to 2 submissions per day, including submissions with errors.
Private Test Set: This dataset is privately held outside of Kaggle’s platform, and is used to compute the private leaderboard. It contains videos with a similar format and nature as the Training and Public Validation/Test Sets, but they are real, organic videos with and without deepfakes. After the competition deadline, Kaggle transfers your 2 final selected submissions’ code to the host. They will re-run your code against this private dataset and return prediction submissions back to Kaggle for computing your final private leaderboard scores.
Review of Data Files Accessible within the Kernel
- train_sample_videos.zip — a ZIP file containing a sample set of training videos and a metadata.json with labels. The full set of training videos is available through the links provided above.
- sample_submission.csv — a sample submission file in the correct format.
- test_videos.zip — a zip file containing a small set of videos to be used as a public validation set. To understand the datasets available for this competition, review the Getting Started information.
- filename — the filename of the video
- label — whether the video is REAL or FAKE
- original — in the case that a train set video is FAKE, the original video is listed here
- split — this is always equal to “train”.
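For a first look at those columns, a minimal sketch like the one below loads the sample metadata into pandas and counts REAL vs FAKE clips. The local path is an assumption (it presumes train_sample_videos.zip has been extracted to ./train_sample_videos/):

```python
import json
import pandas as pd

# metadata.json maps each filename to its label, original and split fields.
with open("train_sample_videos/metadata.json") as f:
    meta = json.load(f)

df = pd.DataFrame.from_dict(meta, orient="index")
df.index.name = "filename"

print(df["label"].value_counts())   # how many REAL vs FAKE clips in the sample
print(df.head())
```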
Detection Starter Kit
A quickstart guide on DeepFakes: “DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection”
What follows is a DeepFakes video EDA (Exploratory Data Analysis). It relies on a static FFMPEG build to read/extract data from videos.
- It extracts metadata, which tells us the frame rate, dimensions and audio format (we can forget about the “display_ratio” leak, as it will be fixed).
- It extracts frames of videos as PNG.
- It extracts the audio track as AAC (this step is disabled).
- It compares a few face detectors (OpenCV HaarCascade, MTCNN). In the end, we opted for RetinaFace (a minimal detection sketch follows this list).
- It provides basic statistics on faces per video, face width/height and face detection confidence. It computes an average face width/height.
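Here is a minimal sketch of the frame-extraction and face-detection step using OpenCV’s HaarCascade. The video path is a placeholder, and the actual kernel uses static FFMPEG plus MTCNN/RetinaFace rather than this exact code:

```python
import cv2

# Ship-with-OpenCV frontal face detector.
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture("train_sample_videos/some_video.mp4")  # hypothetical file
faces_per_frame = []
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % 30 == 0:   # sample roughly one frame per second at 30 fps
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        faces_per_frame.append(len(faces))
    frame_idx += 1
cap.release()

print("frames sampled:", len(faces_per_frame),
      "avg faces per frame:", sum(faces_per_frame) / max(len(faces_per_frame), 1))
```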
We notice that face detection (currently with OpenCV) is far from perfect. An additional stage to clean up detected faces is required before training a model! Maybe some kind of voting/ensembling with different detectors would help.
In this kernel you will also see some interesting edge cases of face detection:
- Face detected on a t-shirt.
- Face detected on a background board.
- Face detected inside a face.
What is ffprobe, then? Basically, ffprobe gathers information from multimedia streams and prints it in a human- and machine-readable fashion.
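For instance, here is a minimal sketch of calling ffprobe from Python and parsing its JSON output. The video path is a placeholder, and ffprobe must be on the PATH (or be the static build shipped with the kernel):

```python
import json
import subprocess

def probe(path):
    # Ask ffprobe for container and per-stream information as JSON.
    cmd = ["ffprobe", "-v", "quiet", "-print_format", "json",
           "-show_format", "-show_streams", path]
    return json.loads(subprocess.run(cmd, capture_output=True, text=True).stdout)

info = probe("train_sample_videos/some_video.mp4")  # hypothetical file
video_stream = next(s for s in info["streams"] if s["codec_type"] == "video")
print(video_stream["width"], video_stream["height"], video_stream["avg_frame_rate"])
print("container bit_rate:", info["format"].get("bit_rate"))
```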
A few notes on bitrate
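The average bitrate of a clip is simply its size in bits divided by its duration. A quick sketch (reusing the hypothetical probe() helper from the previous snippet) cross-checks that against what ffprobe reports:

```python
import os

path = "train_sample_videos/some_video.mp4"        # hypothetical file
info = probe(path)
duration = float(info["format"]["duration"])        # seconds
size_bits = os.path.getsize(path) * 8

print("computed bitrate:", size_bits / duration, "bit/s")
print("ffprobe bitrate: ", info["format"]["bit_rate"], "bit/s")
```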
Try to run the code yourself: you can find a Jupyter notebook on my GitHub. If you explore the same repository you’ll find the code we used for the whole pipeline; soon we’re going to turn it into a series of articles.
Thank you for your attention. If you like our projects, follow us on Medium and GitHub. Soon we’ll be out with more articles.