Deepfakes and synthetic media are among the most feared developments in journalism today, and the prospect of reporting based on false information created by AI is a genuine worry. In this article, we’ll explain some of the methods being used to create synthetic media and ask whether there are ways it could be used for good. This guide is intended for readers without a background in artificial intelligence.
What is synthetic media?
AI-based models can now produce and manipulate audio and video with extremely realistic results. The output of this process is a new category of content: images, text, audio, video, and data generated by algorithms, known as synthetic media. It is possible to generate faces and places that don’t exist, and even to create a digital voice avatar that mimics human speech.
“Do as I do” motion transfer, the transfer of body movements from a person in a source video to a person in a target video, is one type of synthetic media:
How is synthetic media created?
The creation of synthetic media happens through generative artificial intelligence. The three most common approaches are Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Recurrent Neural Networks (RNNs).
Generative Adversarial Networks use two neural networks (a neural network is a computing system that can model complex relationships and patterns) that compete against each other. The first network — the generator — creates new content based on a dataset. The second network — the discriminator — assesses whether that content is real or fake. Each time the discriminator flags content as fake, the generator refines its output. Over time the generator becomes better at creating content (generally images) that seems real.
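For the technically curious, the adversarial loop described above can be sketched in a few lines of Python. This is a deliberately tiny, hypothetical example, not code from any real deepfake system: the "generator" and "discriminator" here are each just a two-parameter model, and the "content" is a single number rather than an image. The generator learns, through the back-and-forth with the discriminator, to produce numbers that look like they came from the real dataset (a Gaussian centered at 4).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# "Real" data: numbers drawn from a Gaussian centered at 4.
def sample_real(n):
    return rng.normal(4.0, 0.5, n)

# Generator: a linear map from random noise z to a scalar "sample".
w_g, b_g = 1.0, 0.0
# Discriminator: logistic regression that scores how "real" a number looks.
w_d, b_d = 0.0, 0.0

lr, batch, steps = 0.02, 32, 3000
for _ in range(steps):
    # --- Discriminator update: real samples should score high, fakes low ---
    x_real = sample_real(batch)
    z = rng.normal(0.0, 1.0, batch)
    x_fake = w_g * z + b_g
    d_real = sigmoid(w_d * x_real + b_d)
    d_fake = sigmoid(w_d * x_fake + b_d)
    # Hand-derived gradients of -log D(real) - log(1 - D(fake))
    grad_w_d = np.mean(-(1 - d_real) * x_real + d_fake * x_fake)
    grad_b_d = np.mean(-(1 - d_real) + d_fake)
    w_d -= lr * grad_w_d
    b_d -= lr * grad_b_d

    # --- Generator update: try to fool the discriminator ---
    z = rng.normal(0.0, 1.0, batch)
    x_fake = w_g * z + b_g
    d_fake = sigmoid(w_d * x_fake + b_d)
    # Hand-derived gradients of -log D(fake) w.r.t. generator parameters
    grad_x = -(1 - d_fake) * w_d
    w_g -= lr * np.mean(grad_x * z)
    b_g -= lr * np.mean(grad_x)

fake_mean = float(np.mean(w_g * rng.normal(0.0, 1.0, 1000) + b_g))
print(f"generated mean ~ {fake_mean:.2f} (real data mean is 4.0)")
```

The key design point is the alternation: the discriminator is nudged to separate real from fake, then the generator is nudged to close that gap, and the generated samples drift toward the real data.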
Variational autoencoders, by contrast, are most commonly used to make digital artwork or video. In this method, an encoder (a neural network) takes an input and converts it to a compressed representation. A decoder (another neural network) then reconstructs the content from that representation. Because compression is lossy, the decoder relies on a learned probability model to fill in details that would otherwise be lost in the encoding-decoding process.
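To make the encode/decode idea concrete, here is a minimal sketch in Python using numpy. It is not a full variational autoencoder (the probability modeling is omitted), but it shows the core mechanic: an "encoder" squeezes each four-number data point down to two numbers, and a "decoder" reconstructs the original from that compressed code. The data is synthetic, and the encoder and decoder are simple linear maps found with a standard matrix decomposition (SVD) rather than trained neural networks.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "content": 200 four-number data points that secretly lie near a
# two-dimensional surface, so a 2-number code can describe them well.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 4))
data = latent @ mixing + 0.01 * rng.normal(size=(200, 4))

# "Encoder": project each point onto the two strongest directions of
# variation in the dataset, found with a singular value decomposition.
mean = data.mean(axis=0)
_, _, vt = np.linalg.svd(data - mean, full_matrices=False)

def encode(x):
    return (x - mean) @ vt[:2].T   # 4 numbers -> 2 numbers

def decode(code):
    return code @ vt[:2] + mean    # 2 numbers -> 4 numbers

codes = encode(data)               # the compressed representation
recon = decode(codes)              # the reconstruction
err = float(np.mean((data - recon) ** 2))
print(f"reconstruction error after 4 -> 2 -> 4 round trip: {err:.5f}")
```

The reconstruction is nearly perfect here because the data genuinely has only two underlying dimensions; real autoencoders learn nonlinear versions of these maps for images and video.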
A third common approach, recurrent neural networks, is designed to recognize characteristics and patterns within a dataset in order to predict the most likely next element. By learning the structure of a large body of text, the algorithm can predict the next word in a sentence. This is how autocomplete features work, and it is typically the methodology used in text generation.
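Here is a toy illustration of next-word prediction in Python. Real systems use recurrent neural networks trained on enormous text collections; this hypothetical sketch uses simple word-pair counts instead, but the idea is the same: learn which word most often follows the current one, then suggest it.

```python
from collections import Counter, defaultdict

# A tiny training corpus (real systems learn from millions of sentences).
corpus = (
    "the reporter wrote the story . "
    "the editor read the story . "
    "the reporter filed the story ."
).split()

# Count, for each word, which words follow it and how often.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Autocomplete: return the word that most often followed `word`."""
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))       # -> "story" (it follows "the" most often)
print(predict_next("reporter"))  # a word seen after "reporter" in training
```

An RNN improves on this by considering all the previous words, not just the last one, which lets it generate much longer coherent passages.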
Artificial images that look like photographs can be produced using a deep learning model that, after training on a digital dataset, translates silhouettes into images. This process is called image translation: it turns a simple input into a realistic-looking image.
This is the technology behind NVIDIA projects like thispersondoesnotexist.com or GauGAN. In the first one, an AI model creates a headshot that is not real, but is a synthesis of a dataset of human faces. In the second one, a person can generate a realistic landscape or scene by just drawing a sketch. In both cases, the technology used is a GAN model.
With the same technology, NVIDIA has experimented with video-to-video translation to create high-resolution, realistic, temporally coherent video.
Other examples of synthetic video, like the Face2Face project, use algorithms that detect pose and motion data in a video of a human face. The system then maps that data onto another human face, like a digital mask. This process can be used to generate realistic reenactments.
Synthetic voices are already implemented by virtual assistants like Alexa or Siri, which turn text into audio and mimic human speech. Other techniques produce even more realistic results: deep learning algorithms can generate human-sounding voices by learning from recordings of real people’s speech.
Project Revoice, for example, is a non-profit initiative that recreates the voices of people with the degenerative disease amyotrophic lateral sclerosis (ALS) for future use, since one of the symptoms of the disease is the loss of the ability to speak.
“A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.
The incident occurred on the downtown train line, which runs from Covington and Ashland stations. In an email to Ohio news outlets, the U.S. Department of Energy said it is working with the Federal Railroad Administration to find the thief.”
The first paragraph of this text was written by a human, but the second was continued by a machine. Earlier this year, OpenAI presented its GPT-2 model, which was trained to predict the next word given all of the previous words in a text. The model can also translate between languages and answer questions. OpenAI did not release the full model, citing the potential for malicious use, including the generation of misleading news articles. The company says it is following a staged release so it can conduct risk and benefit analyses as it increases the size of the model made available.
The most recent follow-up was published this August and includes an analysis of the social impact of this technology: “Our threat monitoring did not find evidence of GPT-2 direct misuse in publicly-accessible forums but we did see evidence of discussion of misuse.”
While the implications of synthetic media are just starting to be understood, the creation of this content is already causing journalists to be more cautious and put safeguards in place.
Synthetic media created with malicious intent — deepfakes — has been the most discussed form of the technology to date. Deepfakes first came to mainstream attention when Vice reported in 2017 on the emergence of pornographic videos altered with AI algorithms to insert the faces of famous actresses. Since then there has been significant reporting on deepfakes, as well as use of synthetic media for audience education. For instance, in 2018, BuzzFeed circulated a video showing President Barack Obama talking about the risks of manipulated videos — but it was an AI-generated video using Jordan Peele’s voice and Obama’s likeness.
Beyond reporting on synthetic media, newsrooms are focusing on synthetic media detection and validating information. The Wall Street Journal, for instance, created a newsroom guide and committee to detect deepfakes. The New York Times recently announced that it is exploring a blockchain-based system to fight misinformation online.
Many companies and universities are also looking to detect synthetic media. Adobe and UC Berkeley, for example, have recently announced that they are working on a method “for detecting edits to images that were made using Photoshop’s Face Aware Liquify feature.” This work is sponsored by the Media Forensics program of the Defense Advanced Research Projects Agency. Amber, a video verification startup, identifies maliciously altered audio and video using signal processing and artificial intelligence. Deeptrace, a deepfake detection company, aims to build “the antivirus for deepfakes” by developing deep learning algorithms that detect synthetic or altered audiovisual media.
While the initial focus of the discussion around synthetic media has been on understanding and assessing its potential to mislead the public and journalists alike, we are seeing nascent considerations of how synthetic media could be used to support the news industry.
An example in the audio space is the company Descript, which is testing a new “Overdub” feature that would enable users to add and change words and whole sentences in the transcript of an audio file using a computer-generated voice that sounds like the speaker. If a podcast guest or host misspoke, for example, a producer could fix the audio instead of re-recording it.
In broadcast media, the BBC offered an example in 2018, when an English-speaking newscaster appeared to present in Spanish, Hindi, and Mandarin. This was made possible by software that captured the facial motion of a person speaking another language and mapped it onto the newscaster’s face. While the BBC used this as an example of how synthetic media could spread misinformation, it also suggests how synthetic media might one day be used to create local programming in an automated way.
Xinhua’s AI anchors offer another example of broadcast journalism created without a human host. To be sure, such uses of synthetic media would be controversial within journalism, and we have not seen them adopted on a routine basis. However, it is an area to watch.
Aldana Vales is a fellow on the Journal’s R&D team and student in the NYU Studio 20 program.