How the “Access Hollywood” tape could have been faked using deep learning

Andreas Kirsch
Dec 2, 2017

The infamous “Access Hollywood” tape triggered a huge political earthquake during the 2016 Presidential campaign. President Trump has recently hinted that it might have been fake.

What would actually be needed to fake it?

To fake it, one could try to find someone whose voice is similar enough and get them to record the incriminating monologue, but this is difficult to do without other people becoming aware of it and exposing it. Or one could use modern technology to generate a voice that sounds just like the real one.

Most people won’t believe that this is possible yet and will say that it sounds too much like science fiction, but the future has arrived in the form of WaveNet and similar technologies. WaveNet was invented by DeepMind, a Google subsidiary, and generates speech that sounds markedly more natural than earlier text-to-speech systems. Since October 2017, Google Assistant has used it to generate its US English and Japanese voices.

WaveNet

You’ll interject now: but this is just some voice, can you train it to sound like a specific voice? Indeed, this is what happens! The current Google Assistant is based on an actual voice actor’s voice. She recorded about 50 hours of speech, which was used to teach WaveNet how to produce a voice that sounds just like her. Here are samples that show how WaveNet can easily learn different voices:

male voice, female voice.

There are easily more than 50 hours of voice recordings available of Donald Trump speaking at campaign events and during interviews. You just need to collect them and feed them to WaveNet, and it will start to sound just like Donald Trump.

Obviously, Google and DeepMind do not explicitly emphasize these capabilities anywhere, because no one has answers to the wider questions about how to live in a world where everything can be faked convincingly with ease and most evidence cannot be trusted anymore by itself.

Easier and better by voice transfer

Now you’ll interject: but this voice lacks emotion and intonation! Even though it sounds good, it still doesn’t sound convincing. This is because Google uses a general text-to-speech system to steer WaveNet, and it lacks an understanding of emotion and context. What if we could use someone else’s voice to steer the generated voice instead, and just transfer it over to a WaveNet that sounds like Donald Trump? One could just learn to imitate Donald Trump’s way of speaking, record that, and then make it sound like his voice using WaveNet and voice transfer.

Lo and behold, research recently published, again by Google’s DeepMind, shows that voice transfer works too.

Original recording and the result after voice transfer.

Thus, one could now indeed teach WaveNet to sound like the President of the United States, record anything they want, and with ease make that recording sound like it was spoken by the President.

Now make it fast and available to everyone

The last interjection will be: but this will take super long and cost way too much time and money; moreover, this is just research anyway, right? Well, the original WaveNet was published in 2016. Back then, generating 1 minute of speech took 50 minutes on Google’s machines. Within a year, engineers and researchers at Google and DeepMind managed to speed this up by 1000x, so generating 1 minute of speech now takes 3 seconds. The results on how to do this have been published recently, too. Obviously, Google has access to superior hardware, but in general, the performance of computers has historically doubled every two years or so, so soon enough everybody will have access to the resources to do this without much effort.
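The arithmetic behind these numbers is easy to check. Here is a minimal Python sketch that uses only the figures quoted above (50 minutes per minute of speech, a 1000x speed-up, a doubling roughly every two years). The function names and the example hardware gap of 8x are my own illustrative assumptions, not measured values.

```python
import math

# Figures quoted in the text above.
ORIGINAL_SECONDS_PER_AUDIO_MINUTE = 50 * 60  # 2016: 50 minutes of compute per minute of speech
SPEEDUP = 1000                               # reported speed-up one year later

# Compute time per minute of speech after the speed-up.
fast_seconds_per_audio_minute = ORIGINAL_SECONDS_PER_AUDIO_MINUTE / SPEEDUP


def generation_seconds(audio_minutes, seconds_per_audio_minute):
    """Compute time needed to generate `audio_minutes` of speech."""
    return audio_minutes * seconds_per_audio_minute


def years_to_close_gap(hardware_gap, doubling_period_years=2.0):
    """Years of steady doubling needed to make up a `hardware_gap`-fold
    difference in compute. The doubling period is the rough rule of thumb
    from the text; the gap itself is a hypothetical input."""
    return doubling_period_years * math.log2(hardware_gap)


print(fast_seconds_per_audio_minute)                          # seconds per minute of speech
print(generation_seconds(10, fast_seconds_per_audio_minute))  # a 10-minute fake recording
print(years_to_close_gap(8))                                  # assumed 8x hardware gap
```

The last line is the interesting one: under the doubling assumption, every factor of two that separates your hardware from Google’s costs only about two more years of waiting.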

Now let me ask: how long will it take for any guy (or girl) with a computer to do something like that and mess with the news?

Bonus: generating fake video to go along with it

Now that we have covered how to generate a fake voice that sounds convincing, let’s look at generating video. Again, this has been looked at already by scientists and there are some astonishing results. In this case, they looked at how to track the target’s head in an actually recorded video and how to generate lip movements for provided audio, and then insert that into the existing recording. Have a look for yourself:

This video is actually showing a generated video of former President Obama saying things he didn’t say in the original recording. Now imagine the possibilities when combined with fake speech generated by WaveNet!

Parting words

If you’re not sufficiently convinced by what is possible today, just wait another year or two and be ready to be convinced then. Personally, I think the “Access Hollywood” recording is not fake (also given the circumstantial evidence surrounding the recordings). Next time something like this shows up in a year or two, however, I’ll definitely be more cautious about anything that I haven’t seen or heard myself, and so should you! In general, this is not a good place to be in as a society, and while conspiracy theorists have reasons to rejoice, we need to tackle the erosion of trust in publicly available information and prevent any further division of our society into separate belief systems that hinder discourse and compromise.

I hope you enjoyed reading this and let me know what you think!

Andreas Kirsch

DPhil student at AIMS in Oxford; former RE at DeepMind, former SWE at Google; fellow at Newspeak House.