We Just Created The Deepest Deepfake Yet. Here’s Why.

Raising Awareness About The Dangers of Synthetic Media

Ragavan Thurairatnam
10 min read · Nov 25, 2019
If you think this is an image of Joe Rogan, think again. It’s actually a frame from a hyper-realistic deepfake we created, overlaid with an AI facsimile of his voice. Learn more after the jump.

TL;DR: We’re raising public awareness about the power and danger of deepfakes by creating a hyper-realistic deepfake that’s 100% generated with AI. The work is now the subject of an episode of The Weekly, The New York Times’ television show, which takes a close look at our process of creating it and at deepfakes’ implications for society. To mitigate the risk that sharing this deepfake example presents, we won’t be publicly releasing the model, code or data used to create it.

During our work on this project, we have also made a significant contribution to the field of deepfake detection, making it easier to detect real-world deepfake videos. Earlier today, this work was also covered in The New York Times by the newspaper’s AI reporter Cade Metz; find the article here.

Introduction

In a small meeting room on a Tuesday morning, I showed the people gathered there a short video we had created. After it finished, I glanced over to find the first and only journalist to win four Pulitzer Prizes with his head in his hands, contemplating the end of truth and reality as we know it. How did we get to this place?

A new version of the deepfake video we initially shared with The NYT is now available on YouTube.

Earlier this summer, our company Dessa became known around the world for making the most realistic AI voice to date, which replicated one of the most famous voices in podcasting, Joe Rogan. The AI voice, which was built using a text-to-speech synthesis system we developed called RealTalk, could say any words we wanted it to say, even if Rogan had never said them before.

We didn’t just do this for kicks. One of our main objectives with releasing RealTalk was to get the public to take notice of just how realistic deepfakes were becoming, and to understand the tremendous implications at hand for society as a result.

It worked.

Since we released RealTalk in May, we have been shocked by the huge public response to it. More than 2 million people have viewed our YouTube video featuring clips of the faux Rogan voice. Awareness of the work also quickly spread around the world, with journalists from over 20 different countries covering it (you can find a selection of the news coverage here).

One of the journalists who became interested in the work was David Barstow, the journalist we mentioned earlier, who emailed us shortly after we shared RealTalk publicly. At first, I had to do a double take to see whether or not the email was a prank. Here was someone who had taken on some of the world’s biggest stories (most recently, revealing Trump’s tax evasion schemes), getting in touch with us, wanting to work with us on a story about deepfakes.

Not nervous or heart-attacky at all, I agreed to have a call with him the same day. David said he had first become interested in learning more about deepfakes after his son showed him the now legendary deepfake of Obama that Jordan Peele had created, and that he had been thinking about their implications, especially as the 2020 American election drew nearer.

During the call, David asked me if we had thought about pairing synthetic audio with synthetic video to create a fully fledged deepfake (one that would be nearly impossible to discern as fake), and likened such a feat to an Edison moment. If the team were to attempt such a project, he wondered, would we also be interested in being filmed during the process, and ultimately becoming the subject of a documentary on The Weekly?

We had to pinch ourselves after that initial phone call. But we were confident that if anyone could distill why it was important for the world to know about deepfakes, it was David Barstow. And the truth was, it was like he had read our minds. Shortly after releasing RealTalk, and before we had that call, we had already started to think about merging audio and video as a next phase of the project, certain it would help us in our mission to spread even more awareness about the power and danger of deepfakes.

After that call with David, we quickly decided that opening our doors to The New York Times in the way we did was a no-brainer. Having the work showcased in a documentary by The Times would mean that millions more people would become aware of deepfakes and their implications. And we would have the credibility of The New York Times behind us. It really seemed like the ultimate soapbox for our message.

At the same time, we were starkly aware of the risk it posed to our reputation. What if they took our work, but chose not to share the intentions we had behind it? There was a real chance we would come across as evil geniuses or, only slightly better, idealistic tech bros. But ultimately, we determined that this was a risk worth taking.

Six months later, we’re excited to share this work with the world.

It may not seem like it, but the timeline we set aside to create this work was a very short one (especially since we were juggling many other projects throughout this period). To maximize our control of the message, we knew we had to act fast: we were confident that many others with machine learning expertise were working on similar projects at the same time we were.

The unfortunate reality is that at some point in the near future, deepfake audio and video will be weaponized. Before that happens, it’s crucial that machine learning practitioners like us who can spread the word do so proactively, helping as many people as possible become aware of deepfakes and the ways they can be misused to distort the truth or harm the integrity of others. Ultimately, that is why we decided we had to be the first to reach the finish line.

At this point, it’s also worth noting that we will not be sharing the model, code or data used to create the deepfake publicly. You might be thinking that this sounds extreme, unfair to ML researchers, or just a rehashing of OpenAI’s decision to do a staged release of its GPT-2 language model (the AI that was “too dangerous to release”), first announced earlier this year.

It isn’t, because the reality is that tools for generating synthetic audio and video (and especially for generating them in tandem) are a lot more dangerous. If we were to open source the tools we used to make RealTalk, they could be used to scam individuals or disrupt democracy on a mass scale, and it wouldn’t be that hard. In contrast, for an individual to use a model like GPT-2 to spread disinformation, they would need to somehow make tons of fake news websites and also get traffic to them.

That said, there are select pieces of information about the technical process that we feel comfortable sharing. Here is a brief overview of the building blocks we used to create our hyper-realistic deepfake video of Joe Rogan announcing the end of his podcast.

Technical Overview

Audio: To synthesize the audio used in the final version of the video, we used the same RealTalk model we developed to recreate Joe Rogan’s voice with AI back in May. RealTalk uses an attention-based sequence-to-sequence architecture for its text-to-spectrogram model, paired with a modified version of WaveNet that functions as a neural vocoder.
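To make that two-stage structure concrete, here is a minimal PyTorch sketch of an attention-based text-to-spectrogram model feeding a simplified neural vocoder. The module names, layer sizes, and the stand-in vocoder are illustrative assumptions for readers unfamiliar with the architecture, not the RealTalk implementation itself (which, as noted above, we aren’t releasing).

```python
# Illustrative two-stage TTS skeleton: text tokens -> mel spectrogram -> waveform.
# All sizes and modules here are placeholders, not the actual RealTalk model.
import torch
import torch.nn as nn

class TextToSpectrogram(nn.Module):
    """Attention-based seq2seq: character tokens in, mel-spectrogram frames out."""
    def __init__(self, vocab_size=64, emb_dim=128, hidden=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attention = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.decoder = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)

    def forward(self, tokens, n_frames):
        enc, _ = self.encoder(self.embed(tokens))              # (B, T_text, 2H)
        # One query per output frame attends over the encoded text.
        queries = enc.mean(dim=1, keepdim=True).repeat(1, n_frames, 1)
        context, _ = self.attention(queries, enc, enc)          # (B, n_frames, 2H)
        dec, _ = self.decoder(context)
        return self.to_mel(dec)                                 # (B, n_frames, n_mels)

class Vocoder(nn.Module):
    """Stand-in for the modified WaveNet vocoder: spectrogram frames in, waveform out."""
    def __init__(self, n_mels=80, hop_length=256):
        super().__init__()
        self.upsample = nn.ConvTranspose1d(n_mels, 1, kernel_size=hop_length, stride=hop_length)

    def forward(self, mel):
        return self.upsample(mel.transpose(1, 2)).squeeze(1)    # (B, n_frames * hop_length)

if __name__ == "__main__":
    text = torch.randint(0, 64, (1, 40))            # a fake 40-character sentence
    mel = TextToSpectrogram()(text, n_frames=200)   # 200 spectrogram frames
    audio = Vocoder()(mel)
    print(mel.shape, audio.shape)                   # [1, 200, 80] and [1, 51200]
```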

The final dataset our machine learning engineers used to replicate Rogan’s voice consists of eight hours of clean audio and transcript data, optimized for the task of text-to-speech synthesis. The eight-hour dataset contains approximately 4,000 clips, each consisting of Joe Rogan saying a single sentence. The clips range from 7 to 14 seconds in length. You can find the more in-depth blog post about RealTalk’s technical underpinnings that we released earlier this summer here.
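As a rough idea of what assembling such a dataset can involve, the hypothetical sketch below exports sentence-level clips within the 7–14 second range from a longer recording. The file layout, CSV format, and function names are assumptions for illustration only, not our actual data pipeline.

```python
# Hypothetical sketch: cut a long recording into single-sentence clips for TTS training.
# Assumes a CSV of sentence boundaries with rows: start_sec, end_sec, transcript.
import csv
import soundfile as sf

MIN_SEC, MAX_SEC = 7.0, 14.0  # keep only clips in the target length range

def export_clips(audio_path, segments_csv, out_dir):
    audio, sr = sf.read(audio_path)
    kept = []
    with open(segments_csv, newline="") as f:
        for i, (start, end, text) in enumerate(csv.reader(f)):
            start, end = float(start), float(end)
            if not (MIN_SEC <= end - start <= MAX_SEC):
                continue  # discard clips that are too short or too long
            clip = audio[int(start * sr):int(end * sr)]
            clip_path = f"{out_dir}/clip_{i:05d}.wav"
            sf.write(clip_path, clip, sr)
            kept.append((clip_path, text.strip()))
    return kept  # (wav_path, transcript) pairs ready for text-to-speech training
```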

The following spectrogram was generated by the RealTalk model to showcase a sample of the text-to-spectrogram model’s output. Areas highlighted in red designate normal, deep, short and long breaths respectively. In the training data, the speaker’s speech is much flatter than this sample, so the engineers used these breath features to add extra intonation to the speech output.
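As a rough illustration of the idea only (not the actual RealTalk preprocessing), breath events like these could be exposed to the text-to-spectrogram model as special tokens inserted into the transcript, letting the model learn where and how the speaker breathes:

```python
# Hypothetical example: mark breath events in a transcript with special tokens
# so a text-to-spectrogram model can condition on them. Token names are made up.
BREATH_TOKENS = {"normal": "<breath_normal>", "deep": "<breath_deep>",
                 "short": "<breath_short>", "long": "<breath_long>"}

def annotate_breaths(transcript, breath_events):
    """breath_events: list of (char_index, kind) pairs, e.g. [(20, "short")]."""
    out, prev = [], 0
    for idx, kind in sorted(breath_events):
        out.append(transcript[prev:idx])
        out.append(" " + BREATH_TOKENS[kind] + " ")
        prev = idx
    out.append(transcript[prev:])
    return "".join(out).split()  # token list handed to the text encoder

print(annotate_breaths("I trained for months before the fight", [(20, "short")]))
# ['I', 'trained', 'for', 'months', '<breath_short>', 'before', 'the', 'fight']
```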

Video: For the video portion of the work, we used a FaceSwap deepfake technique that many in the machine learning community are already familiar with. The main reason we chose FaceSwap over other techniques is that it looked realistic and was reliable, without requiring many hours of training data to employ properly. This was an important factor for us given our expedited timeline. To use the FaceSwap technique, we only needed one high-quality, authentic video of Joe Rogan as source data, plus an actor who resembled him for the footage we would swap Rogan’s face onto. We ultimately picked Paul, shown below, because his physique closely resembled Joe Rogan’s; a rough sketch of how the swap works follows the image.

The image of our actor is on the left, used to generate the deepfake of Joe Rogan seen on the right.
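For readers unfamiliar with how FaceSwap-style models work, here is a rough, hypothetical sketch of the shared-encoder, two-decoder autoencoder idea behind them: both identities share one encoder, each identity gets its own decoder, and at inference time a frame of the actor is decoded with the Rogan decoder. The layer sizes and training details are placeholders, not the exact model we used.

```python
# Illustrative FaceSwap-style autoencoder: one shared encoder, one decoder per identity.
# All architecture choices here are placeholders, not the exact model used for this project.
import torch
import torch.nn as nn

def down(c_in, c_out):  # halve spatial resolution
    return nn.Sequential(nn.Conv2d(c_in, c_out, 4, stride=2, padding=1), nn.LeakyReLU(0.1))

def up(c_in, c_out):    # double spatial resolution
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1), nn.ReLU())

class Encoder(nn.Module):
    """64x64 face crop -> 512-dim latent shared by both identities."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(down(3, 32), down(32, 64), down(64, 128),
                                 nn.Flatten(), nn.Linear(128 * 8 * 8, 512))

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """512-dim latent -> reconstructed 64x64 face for one specific identity."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(512, 128 * 8 * 8)
        self.net = nn.Sequential(up(128, 64), up(64, 32),
                                 nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, z):
        return self.net(self.fc(z).view(-1, 128, 8, 8))

encoder = Encoder()
decoder_actor, decoder_rogan = Decoder(), Decoder()

# Training would reconstruct each identity through the shared encoder and its own decoder.
# The "swap" at inference time: encode a frame of the actor, decode with the Rogan decoder.
actor_frame = torch.rand(1, 3, 64, 64)
swapped_face = decoder_rogan(encoder(actor_frame))
print(swapped_face.shape)  # torch.Size([1, 3, 64, 64])
```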

Combining Audio & Video: Combining the synthetic audio and synthetic video together is the most difficult part of the process, because there are multiple disconnected steps in the process that make it hard to perfect. First off, the fake audio is generated separately from the video. The fake audio replicates the tone, breathing and pace at which the target speaks (in this case, Joe Rogan). For the video to look convincing, the actor posing as Rogan needed to match the same expressions and speed as the synthesized audio.

This presented us with a huge challenge: the actor had to focus not only on copying the original audio exactly to ensure a convincing lip sync, but also on matching the expressions implied by the audio, all while moving and behaving in a natural way. The first time an actor did manage to match the lip sync, he basically just stared at one spot unblinkingly while copying the lip movements. Unless we were replicating a robot from an 80s sci-fi movie, this wasn’t going to be very convincing.

In short, building a deepfake convincing enough to fool everyone is thankfully something that for now is still really hard to do. But this won’t be the case forever.

Conclusion

Why Building Awareness Is Crucial

Dessa may have been the first to merge synthetic audio and video at this level of fidelity, but we will not be the only ones to do it. Many others are working on the same technology, and future versions of it will be much faster and easier to use than ours. As mentioned above, the reality is that deepfakes can be (and already are being) used for malicious purposes.

We need to make sure as many people as possible are aware of how advanced this technology is, and we need to do it fast. Malicious actors will move far faster than the rest of us, because they don’t care about getting things perfect and don’t have bureaucracy slowing them down. They will simply do whatever it takes to demean innocent individuals, steal money, or disrupt democracy.

Next steps: Contributing to the field of deepfake forensics

Our machine learning engineers’ development of hyper-realistic deepfakes has also led us to examine what we can do to help make detecting them easier and more reliable in the wild. Earlier today, we shared a new technical blog post from our engineers that shows how a recent dataset released by Google for deepfake detection falls short at reliably identifying fake videos found in the wild on YouTube. Our engineers also provide a solution for making detection more robust and functional on real-world data. This work has been open-sourced on GitHub for anyone who wants to replicate and expand on our results. The work we did on deepfake detection was featured today in The New York Times by Cade Metz, the newspaper’s AI reporter; find the article here.

Earlier this fall, we also released an article and open source tools for building systems for detecting deepfake audio, which you can find respectively on our Medium and GitHub.

For those of you reading from America: the full version of The Weekly’s documentary originally aired on FX at 10pm ET on November 24. From November 25, the documentary will also be available to Hulu subscribers.

Eventually (date TBA), subscribers to The New York Times (including global subscribers) will also have access to the full episode on the newspaper’s digital platform here.

To reach out to us with questions about the work, you can email us at foundations@dessa.com.

Acknowledgements

A sincere thanks and huge praise to Rayhane Mama and Joseph Palermo for your tireless efforts taking on this project. Turning a part-time side project into something earth-shattering is an insane achievement, and you guys made me super proud. Another big thank you goes out to Hashiam Kadhim, who has since moved on from Dessa but was one of the main contributors to the first wave of RealTalk.

Giving proper thanks to everyone who helped out at Dessa to make this reality would take as long as this blog post itself, but I want to give special thanks to Stephen Piron, Vincent Wong, Matthew Killi, Alex Krizhevsky, Michael Jia, Austin Mackillop, Alyssa Kuhnert, and Mackenzie Nicol for all of their hard work that helped make this project a success.

Thank you David Barstow for your journalistic integrity, eternal curiosity about deepfake technology, and for taking time away from investigating presidents to talk to crazy engineers trying to save the world.

Another big shoutout goes to Andréa Schmidt and the production crew of The Weekly, for making sure that as many people as possible hear the important warning about deepfakes.

We also need to thank Paul Rasmussen, for learning how to act so quickly and for repeating the inane lines of our scripts so much.

And finally, a very special thank you to Joe Rogan for being such a good sport about us using his voice and face for RealTalk, for spreading awareness about deepfakes, and for *hopefully* inviting the team to his podcast!


Ragavan Thurairatnam

Ragavan is the Co-founder & Chief of Machine Learning at Dessa, an AI company in Toronto. He does *not* support vector machines.