Deepfakes and Deep Video Portraits — What are they and what is the difference?

You have almost certainly seen a Deepfake, but you may not know what it’s called or how it’s made. You may also not know that there are different types of technology for creating fake videos and voices.

We are all familiar with “photoshopped” images, and how rampant they have become. We are also acquainted with the visual effects (VFX) and special effects that have been used in films for decades. But a new age of fabricated media is upon us thanks to “Deepfakes.”

Deepfake and Deep Video Portrait technology are two similar but distinct techniques used in Hollywood movies, YouTube videos, and yes, pornography. But what is this technology really, and how does it all work?

If you have not yet seen a video where Nicolas Cage’s face has been superimposed over another film actor, then you have almost definitely seen one of the many social media “filters” or “masks” that can turn you into a cat, add a chef’s hat to your head, or make you a unicorn.

Instagram Filter — Photo by mikemirabal360

Perhaps you have seen BuzzFeed’s video in which comedian and impressionist Jordan Peele demonstrates how someone’s face (in this case, former President Barack Obama) in a video can be manipulated so as to appear to say something it never did.

“We’re entering an era in which our enemies can make anyone say anything at any point in time.” https://www.youtube.com/watch?v=cQ54GDm1eL0

Maybe you have even heard about Wonder Woman star Gal Gadot supposedly appearing in an adult video, faked by a Reddit user named “deepfakes.” Gadot’s face was superimposed onto an adult performer’s body in December 2017, and the video became one of the first widely discussed Deepfakes.

So what is a “Deepfake”?

At the core of Deepfakes is what you might think of as “face swapping”.

A Deepfake is an AI-assisted video created by taking a number (usually hundreds or thousands) of photos of a source person. The images can come from many places, such as the person’s Instagram, Facebook, or Snapchat accounts, or even a Google image search.

The Deepfake AI software maps the face in the source images and generates a 3-D face model from the photos it is fed. The model captures the boundary and features of the source person’s face:

Source: https://hackernoon.com/building-a-facial-recognition-pipeline-with-deep-learning-in-tensorflow-66e7645015b8 by Cole Murray

The software is also given a target video containing the face the user wants to replace. The AI maps the face of the person in that video as well, again creating a 3-D model.

CMU Associate Research Professor Simon Lucey uses himself as an example to show off his facial mapping software developed for an online eyeglass retailer. Credit: Simon Lucey / CMU
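Under the hood, “mapping” a face largely comes down to locating a consistent set of landmark points (eyes, nose, mouth corners) and aligning one set to another. Below is a minimal sketch of that alignment step as a similarity transform, computed with a Procrustes-style analysis in NumPy. The landmark coordinates are invented for illustration; real pipelines use dozens of points in 3-D:

```python
import numpy as np

def align_landmarks(source, target):
    """Find scale s, rotation R, translation t so that
    s * source @ R.T + t best matches target (least squares)."""
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    src, tgt = source - mu_s, target - mu_t
    # Optimal rotation via SVD of the cross-covariance matrix
    U, S, Vt = np.linalg.svd(tgt.T @ src)
    R = U @ Vt
    if np.linalg.det(R) < 0:          # avoid mirror reflections
        U[:, -1] *= -1
        R = U @ Vt
    s = S.sum() / (src ** 2).sum()    # optimal isotropic scale
    t = mu_t - s * (R @ mu_s)
    return s, R, t

# Toy 2-D landmarks: eyes, nose tip, mouth corners (made up)
source = np.array([[30, 40], [70, 40], [50, 60], [35, 80], [65, 80]], float)
# The "target" here is the same face, shifted and uniformly scaled
target = source * 1.5 + np.array([10.0, 5.0])

s, R, t = align_landmarks(source, target)
warped = s * source @ R.T + t
print(np.allclose(warped, target))  # True: the transform is recovered
```

Aligning landmarks this way is only the geometric first step; generating realistic pixels for the aligned face is where the deep learning comes in.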

This is where the AI starts to match the source model to the target model. It “learns” the faces via the images it’s given (training data), which looks a bit like this:

Face-swap training model example — Elon Musk to Jeff Bezos by Adi Robertson, Source: https://www.theverge.com/2018/2/11/16992986/fakeapp-deepfakes-ai-face-swapping

Then, the AI superimposes the generated 3-D face from the source photos over the target video’s 3-D model and outputs a video in which the movements of the face, mouth, eyes, etc. match up, working within the bounds of the original face.
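A common way the popular desktop tools implement this matching is a single shared encoder paired with one decoder per identity; the swap happens by encoding a frame of one face and decoding it with the other face’s decoder. The sketch below wires up that architecture with untrained linear stand-ins for the networks (all sizes made up), purely to show the data flow:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for learned networks: a shared linear "encoder" and
# one linear "decoder" per identity. Real tools train convolutional
# networks; only the wiring is shown here.
D_IMG, D_LATENT = 64, 8          # flattened-image and latent sizes (made up)
encoder   = rng.normal(size=(D_LATENT, D_IMG))
decoder_a = rng.normal(size=(D_IMG, D_LATENT))   # reconstructs face A
decoder_b = rng.normal(size=(D_IMG, D_LATENT))   # reconstructs face B

def encode(face):
    # Shared across both identities: learns pose/expression structure
    return encoder @ face

def swap_a_to_b(face_a):
    # The Deepfake trick: encode a frame of face A with the *shared*
    # encoder, then decode with face B's decoder, yielding face B
    # performing face A's expression and pose.
    return decoder_b @ encode(face_a)

frame_of_a = rng.normal(size=D_IMG)   # stand-in for one video frame
fake_frame = swap_a_to_b(frame_of_a)
print(fake_frame.shape)               # (64,) -- an image-sized output
```

Because the encoder is shared, it is forced to learn what the two faces have in common (pose, lighting, expression), while each decoder learns one identity’s appearance.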

The source of a Deepfake is a series of still photographs, and the result is a video with a replaced face, such as in these examples:

Several examples of Deepfakes

How is that different from a Deep Video Portrait?

The difference between a Deepfake and a Deep Video Portrait (DVP, for brevity’s sake) lies in two key distinctions:

  1. The output video from a DVP does not replace the face; it only manipulates its features.
  2. The source for a DVP originates from a live-action actor, not from individual photographs.

DVP is not face swapping. It is facial manipulation. Video puppetry.

The video linked earlier in this article, showing Obama talking about fake videos, is an example of a DVP, not a Deepfake. An actor’s face is mapped, and because the target face is not replaced, only made to move, the result can be even more believable than a photo-based Deepfake.

DVP creators can make the target blink, open the mouth, raise the eyebrows, and turn the head from side to side based on the source actor’s movements. Deepfakes, by contrast, cannot stray far from the original video’s movements. This is why a DVP can be more believable than a Deepfake.
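One way to picture this kind of puppetry, assuming a blendshape-style face model (neutral geometry plus weighted expression offsets, a common representation in facial-animation research), is that the source actor contributes only the expression weights while the target keeps its own identity. A toy NumPy sketch, with all shapes and numbers invented:

```python
import numpy as np

# Each face: neutral geometry + weighted expression offsets ("blendshapes").
# Here, 5 landmarks x 2 coordinates flattened to length-10 vectors (made up).
neutral_target = np.zeros(10)
blendshapes = np.array([
    np.r_[np.zeros(8), 0.0, 1.0],   # "open mouth": moves last landmark down
    np.r_[0.0, -1.0, np.zeros(8)],  # "raise brow": moves first landmark up
])

def apply_expression(neutral, weights):
    # face = neutral identity + sum_i (w_i * blendshape_i)
    return neutral + weights @ blendshapes

# Expression weights estimated from the *source* actor's current frame
source_weights = np.array([0.8, 0.3])

# Puppetry: the target keeps its identity (its neutral shape) but takes
# on the source actor's expression weights.
puppet = apply_expression(neutral_target, source_weights)
print(puppet[9], puppet[1])  # 0.8 (mouth opened), -0.3 (brow raised)
```

The key point matches the list above: the target’s identity never changes; only the expression parameters driving it do.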

This video explains in more depth how this “face capture and reenactment” technology works:

How face capture and reenactment work

A Snapchat or Instagram filter mask is a DVP, not a Deepfake. The target (you) keeps the same face; it has simply been mapped so the app can overlay something on top of it:

How a Snapchat filter works
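At its simplest, applying such a filter is alpha-compositing: paste the sticker’s pixels over the frame at coordinates derived from the tracked landmarks. A toy grayscale sketch in NumPy (the arrays and hat position are made up; a real app would recompute the position from the landmark tracker every frame):

```python
import numpy as np

def overlay(frame, sticker, alpha, top_left):
    """Alpha-composite `sticker` onto `frame` at `top_left` (row, col).
    `alpha` is the sticker's per-pixel opacity in [0, 1]."""
    out = frame.copy()
    r, c = top_left
    h, w = sticker.shape[:2]
    region = out[r:r + h, c:c + w]
    out[r:r + h, c:c + w] = alpha * sticker + (1 - alpha) * region
    return out

# A tiny 6x6 grayscale "frame" and a 2x2 fully opaque "chef's hat"
frame = np.zeros((6, 6))
sticker = np.full((2, 2), 255.0)
alpha = np.ones((2, 2))

# In a real app this position would come from the face-landmark tracker
# (e.g. just above the detected forehead); here it is hard-coded.
hat_position = (0, 2)
result = overlay(frame, sticker, alpha, hat_position)
print(result[0:2, 2:4])  # the hat pixels are now 255; the rest unchanged
```

A partial `alpha` (values between 0 and 1) is what makes semi-transparent effects like blush or lens flares blend smoothly with your skin.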

Voice Fakes and Deep Video Portraits

There is another type of fake content that has recently become both more convincing and more accessible — voice generation.

At the Adobe Max Creativity Conference in 2016, Adobe demonstrated VoCo: an audio suite that can help users make people say whatever they want. Think of text-to-speech, but based on someone’s real voice.

According to the company, about 20 minutes of a person’s recorded speech is enough for VoCo to output a realistic vocal track that sounds like the source. The output is generated locally, on the computer running the software.

Adobe VoCo demo at Adobe Max Creativity Conference 2016
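As a grossly simplified picture of what such a system does, imagine building a lookup table of sound units from the recorded speech and concatenating them to render text the speaker never actually said. Real systems like VoCo model phonemes and prosody rather than whole words, and the “waveforms” below are just random arrays, but the pipeline has the same shape:

```python
import numpy as np

SAMPLE_RATE = 16_000
rng = np.random.default_rng(1)

# "Voice model" built from the ~20 minutes of listening input — here,
# literally a lookup table of fake 1-second word-level waveforms.
voice_model = {
    word: rng.normal(size=SAMPLE_RATE)
    for word in ["we", "are", "entering", "an", "era"]
}

def synthesize(text):
    # Render text the speaker never said by concatenating units
    # from the voice model.
    return np.concatenate([voice_model[w] for w in text.lower().split()])

fake_audio = synthesize("We are entering an era")
print(len(fake_audio) / SAMPLE_RATE)  # 5.0 "seconds" of speech
```

The hard part that this sketch skips entirely — and that VoCo demonstrated — is making the joins and intonation sound natural rather than robotic.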

Adobe VoCo has not really been heard from since 2016, perhaps because of the privacy and identity concerns the demo raised. It was presented at an “ideas forum,” not announced as a new product — which generated interest, excitement, and discussion, but set no specific expectation of a release.

Now that the idea and the technology exist, other companies have naturally released their own voice-generating tools. Lyrebird offers a service that generates a “vocal avatar” for you from only 30 recorded sentences of input speech (versus the roughly 20 minutes of data needed for VoCo).

Whereas VoCo generated its output with local computing resources, Lyrebird uses scalable cloud resources, making generation significantly faster. Lyrebird also requires 30 specific sentences rather than 20 minutes of arbitrary speech, which could make it harder to build someone’s voice from surreptitiously recorded audio.

Combined with a DVP, a voice fake can make the result far more believable: instead of an impression of someone, which can give the fake away, you hear a much closer reproduction built from the target person’s own voice.

Hybrid Technology

FaceSwap Live is an app that essentially combines the face swapping of a Deepfake with the puppetry of a DVP: in real time, your own expressions drive the other person’s face.

FaceSwap Live is a hybrid of Deepfake and DVP technology

Conclusion

These technologies will continue to improve. Although many uses are fun and whimsical, the effect this technology will have is significant.

Deepfakes and DVPs will without question have wide-ranging impacts on our views of reality, trust, and privacy. However, a discussion of the ethics, issues, and societal impacts (good and bad) is well beyond the scope of this article.

For now, the only solution is to (continue to) be skeptical of all that you see and hear.

Photo by Mikes Photos from Pexels