Detecting Audio Deepfakes With AI
Imagine the following scenario…
Your phone rings, and you pick up. It’s your spouse asking you for details about your savings account: they don’t have the account information on hand, but want to deposit money there this afternoon. Later, you realize a bunch of money has gone missing! After investigating, you find out that the person masquerading as them on the other end of the line was a voice 100% generated with AI. You’ve just been scammed, and on top of that, you can’t believe the voice you thought belonged to your spouse was actually a fake.
Until recently, if we told you this you’d think we were in desperate need of a good night’s sleep. But with AI-powered synthetic media like deepfakes on the rise, scenarios like the one above are already starting to happen.
Take, for instance, the world’s first AI-powered cybercrime, reported earlier this September. Using speech synthesis technology, thieves fooled an energy executive into thinking he was on the phone with his parent company’s CEO, tricking him into wiring over $250,000 into their account.
To build public awareness about the risks of AI-powered speech synthesis, a few months ago we shared our own example of the technology with the public. Using RealTalk, a proprietary speech synthesis model they built, our engineers recreated the voice of the popular podcaster Joe Rogan.
Here’s the video we shared featuring the fake Rogan voice back in May:
Today, we’re sharing the next step of our engineers’ work—a detector system built to discern between real and fake audio examples.
Detecting Audio Deepfakes With AI
As mentioned above, malicious uses of deepfakes are not only terrifying, but actually beginning to happen. Building tools that can accurately discern between real and fake media is an increasingly urgent matter. As machine learning practitioners, we have the capabilities to do this, and can help mitigate a real-world problem with drastic consequences.
How it works:
To discern between real and fake audio, the detector uses visual representations of audio clips called spectrograms, which are also used to train speech synthesis models. You can read more about how spectrograms help create synthesized audio in our technical post on speech synthesis here.
While to the unsuspecting ear they sound basically identical, spectrograms of real audio vs. fake audio actually *look* different from one another.
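To make the idea of a spectrogram concrete, here is a minimal sketch of how an audio clip can be turned into a log-mel spectrogram using only NumPy. The sample rate, frame size, hop length, and number of mel bands are illustrative assumptions, not necessarily the settings used in our preprocessing, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(hz):
    # Common HTK-style conversion from Hz to the mel scale.
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_bins)
    bank = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def mel_spectrogram(audio, sample_rate=16000, n_fft=512, hop=128, n_mels=40):
    """Return a log-mel spectrogram with shape (n_mels, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2           # (n_frames, n_bins)
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T  # (n_frames, n_mels)
    return np.log(mel + 1e-10).T                                # small offset avoids log(0)

# Assumed test signal: one second of a 440 Hz tone standing in for real speech.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)
spec = mel_spectrogram(tone, sample_rate=sample_rate)
print(spec.shape)
```

Each column of the result is one short slice of time and each row is one frequency band, which is what lets a model treat the clip like an image.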
We trained the detector on Google’s 2019 ASVspoof dataset, which the company released earlier this year to encourage the development of audio deepfake detection. The dataset contains over 25,000 audio clips, both real and fake, from a variety of male and female speakers.
The deepfake detector model is a deep neural network that uses temporal convolution. Here’s a high-level overview of the model’s architecture:
First, raw audio is preprocessed and converted into a mel-frequency spectrogram — this is the input for the model. The model performs convolutions over the time dimension of the spectrogram, then uses masked pooling to prevent overfitting. Finally, the output is passed into a dense layer and a sigmoid activation function, which ultimately outputs a predicted probability between 0 (fake) and 1 (real).
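The steps above can be sketched as a single forward pass in NumPy. This is a toy, untrained, single-layer illustration and not our released model: the layer sizes and random weights are made up, all function names are hypothetical, and "masked pooling" is interpreted here as averaging only over the frames that contain real audio rather than padding, which is an assumption on our part for the sake of the example.

```python
import numpy as np

def temporal_conv(x, kernels, bias):
    """'Valid' 1D convolution along time, followed by ReLU.

    x: (time, n_mels), kernels: (width, n_mels, n_filters), bias: (n_filters,)
    """
    width = kernels.shape[0]
    n_out = x.shape[0] - width + 1
    out = np.stack([
        np.tensordot(x[t:t + width], kernels, axes=([0, 1], [0, 1]))
        for t in range(n_out)
    ])
    return np.maximum(out + bias, 0.0)

def masked_mean_pool(features, n_valid):
    """Average over the first n_valid time steps, ignoring padded ones."""
    mask = (np.arange(features.shape[0]) < n_valid)[:, None]
    return (features * mask).sum(axis=0) / max(n_valid, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_real_probability(spec, n_valid_frames, params):
    """Return a probability between 0 (fake) and 1 (real)."""
    h = temporal_conv(spec, params["kernels"], params["conv_bias"])
    # Only conv outputs whose window lies fully inside the valid frames count.
    n_valid = max(n_valid_frames - params["kernels"].shape[0] + 1, 1)
    pooled = masked_mean_pool(h, n_valid)
    logit = pooled @ params["dense_w"] + params["dense_b"]
    return float(sigmoid(logit))

# Random, untrained weights purely for illustration.
rng = np.random.default_rng(0)
n_mels, width, n_filters = 40, 3, 8
params = {
    "kernels": rng.normal(scale=0.1, size=(width, n_mels, n_filters)),
    "conv_bias": np.zeros(n_filters),
    "dense_w": rng.normal(scale=0.1, size=n_filters),
    "dense_b": 0.0,
}

spec = rng.normal(size=(50, n_mels))                    # stand-in spectrogram, 50 frames
padded = np.vstack([spec, np.zeros((30, n_mels))])      # same clip padded to 80 frames

p_plain = predict_real_probability(spec, 50, params)
p_padded = predict_real_probability(padded, 50, params)
print(p_plain, p_padded)
```

With this interpretation of masking, the padded and unpadded versions of the same clip produce the same score, so clips of different lengths can be batched together without the padding affecting the prediction.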
Dessa’s baseline model achieved 99%, 95%, and 85% accuracy on the train, validation, and test sets respectively. The differing performance is caused by differences between the three datasets. While all three datasets feature distinct speakers, the test set also uses a different set of fake audio generation algorithms that were not present in the train or validation sets.
Put more simply, our detector model can currently identify over 90% of the fake audio clips it is shown.
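Overall accuracy and the share of fake clips caught are two different measurements, which is why the two figures do not have to match. Here is a small sketch of how each could be computed from the model's output probabilities. The probabilities, labels, and 0.5 threshold below are made-up toy values for illustration, not results from our model, and `evaluate` is a hypothetical helper.

```python
def evaluate(probs_real, labels, threshold=0.5):
    """labels: 1 = real, 0 = fake. A clip is predicted real if prob >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs_real]
    correct = sum(pred == label for pred, label in zip(preds, labels))
    fake_preds = [pred for pred, label in zip(preds, labels) if label == 0]
    caught = sum(pred == 0 for pred in fake_preds)
    return {
        # Fraction of all clips (real and fake) classified correctly.
        "accuracy": correct / len(labels),
        # Fraction of fake clips that were flagged as fake.
        "fake_detection_rate": caught / len(fake_preds) if fake_preds else float("nan"),
    }

# Toy example: three real clips followed by five fake clips.
probs = [0.92, 0.81, 0.40, 0.07, 0.15, 0.62, 0.03, 0.30]
labels = [1, 1, 1, 0, 0, 0, 0, 0]
metrics = evaluate(probs, labels)
print(metrics)  # {'accuracy': 0.75, 'fake_detection_rate': 0.8}
```

Lowering the threshold would let more real clips through at the cost of missing more fakes, so the right setting depends on which kind of mistake is more costly in a given application.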
Build your own deepfake detector
Since detecting audio deepfakes is a mission-critical problem, we’ve open-sourced the code for our model to encourage others to develop their own fake voice detection models.
We’ve provided all pre-processed data, training code and inference code to make things as accessible as possible. We’ve also included a walkthrough of the code to help get started. The code works easily with the free Community Edition of our ML development platform, Foundations Atlas. Atlas will make viewing spectrogram and audio artifacts as well as running hyperparameter searches simple.
- Code and detailed instructions: Download on our GitHub.
- Install Atlas: Download the free Community Edition of Atlas.
The hope is that by releasing the tutorial, other ML practitioners will be able to reproduce the results of our model, in turn coming up with new ideas on how to make it even more impactful.
Looking ahead, our vision is for a model like this to fit into the real-world infrastructure powering our phones and other media. Going back to that first scenario we told you about, for example: imagine if, before you took the call, your phone buzzed violently, letting you know that the voice on the other end of the line wasn’t actually your spouse, but a deepfake.
Here are some other ideas that you can use to jumpstart your own model-building efforts:
- Build a low-latency serving system that runs in real time
- Try more interesting data augmentations
- Add audio from other speech datasets
- Add audio sourced from ‘the wild’ (e.g. from YouTube)
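As one possible starting point for the data augmentation idea, here is a minimal sketch of two simple waveform augmentations in NumPy: adding background noise at a chosen signal-to-noise ratio, and shifting the clip in time. These are generic techniques rather than something taken from our released code, and the function names and parameter values are hypothetical.

```python
import numpy as np

def add_noise(audio, snr_db, rng):
    """Add white Gaussian noise so the result has roughly the requested SNR (in dB)."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise

def random_time_shift(audio, max_shift, rng):
    """Circularly shift the clip by up to max_shift samples in either direction.

    np.roll wraps samples around the ends; zero-padding is an alternative.
    """
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(audio, shift)

# Assumed test signal: one second of a 220 Hz tone standing in for real speech.
rng = np.random.default_rng(0)
sample_rate = 16000
tone = np.sin(2 * np.pi * 220.0 * np.arange(sample_rate) / sample_rate)

noisy = add_noise(tone, snr_db=20.0, rng=rng)
shifted = random_time_shift(tone, max_shift=1600, rng=rng)

measured_snr_db = 10.0 * np.log10(np.mean(tone ** 2) / np.mean((noisy - tone) ** 2))
print(round(float(measured_snr_db), 1))
```

Augmentations like these are applied to the raw waveform before the spectrogram is computed, so each training pass can see a slightly different version of the same clip.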
We’re excited to hear more about what you build. Learn how to get in touch with us and share your results below.
Ready to build? If you end up working on new experiments related to deepfake detection using our code and Atlas, we’d love to hear about your results. Share them with us by commenting on our GitHub.
Learn more: You can also learn more about the detector and how it relates to a bigger industry push to combat deepfakes over on Axios, where our work was covered in their excellent Axios Future newsletter.