DeepFakes - Seeing is No Longer Believing

Ruichong "Alex" Wang
Thomson Reuters Labs
17 min read · May 15, 2024

Ruichong “Alex” Wang alex.wang@thomsonreuters.com

Cover picture generated by DALL·E

In a world where seeing is no longer believing, DeepFakes have emerged as a powerful and controversial technology. This article dives into what DeepFakes are, how they’re made, and why they matter. We’ll start by defining DeepFakes and the technology behind them, showing how anyone can create convincing fake videos. Then, we’ll discuss the major concerns they raise, including privacy, misinformation, and security. We then pivot to examine legitimate and potentially beneficial applications of DeepFake technology, presenting a balanced view that acknowledges both the groundbreaking opportunities and the ethical quandaries they present. Our goal is to equip you with a nuanced understanding of DeepFakes, so you can appreciate their duality as both a tool for innovation and a weapon for misinformation.

Original Video of me talking about DeepFake:

Original Video of me talking about DeepFake

DeepFaked Video of me talking about my hobbies; there are still some flaws in the mimicked video, but it looks quite realistic for the most part:

DeepFaked Video of me talking about my hobbies

What is DeepFake?

Definition and Overview

DeepFake technology has introduced a new era in digital content creation. Defined as hyper-realistic digital manipulations, DeepFakes can convincingly mimic real people’s appearances and voices. Their initial use in entertainment and satire has rapidly expanded into more controversial applications, raising significant ethical and societal concerns.

The Emergence of DeepFake Technology

DeepFake technology’s roots lie in the significant advancements made in artificial intelligence and machine learning, most notably through the development of Generative Adversarial Networks (GANs). The term ‘DeepFake’ became widely recognized in 2017, marking a pivotal moment in raising public awareness about the technology’s capabilities and potential dangers. This era introduced the ability to manipulate video and audio content so convincingly that distinguishing between real and fabricated content became increasingly challenging, igniting widespread ethical, legal, and security discussions.

Democratization and Misuse of DeepFake

The progression of DeepFake technology has been rapid, with tools becoming more accessible and allowing a broader range of individuals to create fake content with alarming sophistication and minimal effort. This democratization has significantly increased the potential for misuse, leading to concerns about the spread of disinformation, the manipulation of public opinion, and the undermining of digital media trust. The global community is now at a crucial crossroads, necessitating a unified effort from technology creators, policymakers, and citizens to minimize negative impacts while exploring potential positive applications.

Public Perception and Regulatory Response

Initially met with mixed reactions of intrigue and concern, DeepFakes have quickly become a contentious issue within the realms of politics, cybersecurity, and personal privacy. The technology’s capacity for creating believable disinformation has profound implications, emphasizing the need for heightened awareness and regulation. It’s essential to understand that generating and distributing non-consensual DeepFakes breaches privacy and is considered unethical and illegal in many jurisdictions worldwide.

Several countries and regions have enacted laws to address the creation and distribution of DeepFakes:

  • United States: The National Defense Authorization Act (NDAA) includes provisions to combat DeepFakes, with specific acts mandating research into the technology and into content-authenticity measures. States such as Texas, Virginia, and California have passed laws targeting DeepFake misuse in elections, non-consensual pornography, and the creation of doctored political videos, respectively.
  • European Union: Proposed EU regulations aim to hold social media companies accountable for removing DeepFakes and other disinformation, with potential fines reaching up to 6% of global revenue for non-compliance.

These laws, although a step in the right direction, vary significantly by jurisdiction and are constantly evolving to keep pace with technological advancements. The enforcement of these regulations presents its own set of challenges, particularly in tracing the origins of DeepFakes, which remains a critical issue in the ongoing battle against this disruptive technology.

Responsible Demonstration: Choosing Ethical Paths in DeepFake Presentations

In light of the legal and ethical considerations surrounding DeepFake technology, careful thought must be given to its use in demonstrations and public displays. Originally, we contemplated using a DeepFake video of former President Barack Obama to showcase the capabilities of this technology. However, upon careful review of the potential legal implications and out of respect for the integrity and image rights of public figures, we decided against this approach.

Instead, we chose to use a video of me, the author, as the demonstration. This decision ensures full compliance with legal standards and eliminates the ethical quandaries associated with the unauthorized use of a public individual’s likeness. By using my own image and voice, we maintain control over the content and can provide a clear and transparent example of DeepFake’s potential without overstepping any legal boundaries. This approach also serves as a responsible demonstration of how DeepFake technology can be used in a manner that is both ethical and respectful of individual rights.

Our commitment to ethical practices in the demonstration of DeepFake technology is a testament to our recognition of the profound impact that this technology can have on society. It underscores our dedication to promoting a responsible use of AI, where innovation is balanced with the preservation of trust and respect for individual privacy and consent.

How Can One Generate a DeepFake Video?

In this section, we will walk through how to generate a DeepFake video. We also include a Colab notebook for anyone who wants hands-on experience with it.

In general, it takes two steps to create a DeepFake video:

  1. Generate Cloned Voice
    - Collect audio (prefer recordings with low background noise and a clear voice)
    - Imitate the voice using TTS
  2. Lip Sync
    - Collect videos (prefer high-resolution footage with a clear view of the mouth and minimal body and hand movement)
    - Generate lip movements from speech using Wav2Lip

Generate Cloned Voice

Using the TTS package, you can generate a cloned voice from a text script and a reference audio file. A few tricks can improve the quality of the cloned voice:

  1. Avoid long sentences: In our experience, long sentences can make the voice sound robotic, with unnatural pauses or a rushed speaking pace.
    - Solution: Build longer narratives by breaking the text into smaller segments, splitting on commas, periods, and other punctuation marks that signify the end of a sentence.
  2. Use high-quality audio: Many recordings available online, such as speeches, podcasts, or videos, contain background noise like applause or music, which can degrade the quality of the cloned audio.
    - Solution: Use an audio enhancement platform (many are available for free online) to remove such “noise” for improved output quality; a minimal local alternative is sketched right after this list.
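If you prefer to clean up a reference recording locally rather than through a web service, one lightweight option is the open-source noisereduce package. The snippet below is a minimal sketch of ours (not part of the TTS pipeline above): the file paths are placeholders and it assumes a mono WAV recording.

# pip install noisereduce scipy
import noisereduce as nr
from scipy.io import wavfile

# Placeholder paths: replace with your own recording
rate, data = wavfile.read("data/raw_reference.wav")

# Reduce stationary background noise (hum, room tone); heavy noise such as
# applause or music may still need a dedicated enhancement tool
cleaned = nr.reduce_noise(y=data.astype("float32"), sr=rate)

wavfile.write("data/clean_reference.wav", rate, cleaned)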

Setting Up Your Environment

First, ensure your environment is prepared for generating cloned voices by installing the necessary Python packages. For beginners, running the experiment on Google Colab gives you access to free GPUs.

pip install pydub TTS
mkdir -p data # Ensure the data directory exists
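If you are working in a Google Colab notebook rather than a terminal, prefix the same shell commands with an exclamation mark so the notebook passes them to the shell:

!pip install pydub TTS
!mkdir -p data  # ensure the data directory exists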

Import the necessary Python libraries to start processing:

import re
import os
import torch
from TTS.api import TTS
from pydub import AudioSegment
from datetime import datetime

Processing the Text

Split the input text into manageable chunks. This step helps in making the cloned voice sound more natural and is essential for avoiding the robotic tone that can occur with longer sentences.

def split_text(text):
    """
    Splits and combines the input text into manageable chunks based on punctuation
    and word count, ensuring each chunk is suitable for TTS processing.

    Parameters:
    - text: The input text to be split.

    Returns:
    - A list of processed text chunks.
    """
    # Normalize spaces and split the text by the specified punctuation, capturing the punctuation in the result
    chunks = re.split('([,.!?])', text.replace('\n', ' ').strip())

    # Initialize a list to hold combined chunks
    combined_chunks = []

    # Temporarily store each chunk before it's added to the combined_chunks list
    temp_chunk = ""

    for i in range(0, len(chunks), 2):
        # Combine the chunk with its trailing punctuation, if present
        chunk_with_punctuation = chunks[i].strip() + (chunks[i + 1] if i + 1 < len(chunks) else '')

        # Append the chunk to temp_chunk, adding a space if temp_chunk is not empty
        temp_chunk += (" " if temp_chunk else "") + chunk_with_punctuation

        # Emit the accumulated chunk once it has at least 4 words or we've reached the last chunk
        if len(re.findall(r'\b\w+\b', temp_chunk)) >= 4 or i + 2 >= len(chunks):
            combined_chunks.append(temp_chunk.strip())
            temp_chunk = ""  # Reset temp_chunk for the next iteration

    return combined_chunks
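As a quick check (this example is ours, not part of the original walkthrough), here is what the function produces for a short made-up input; fragments are merged with their neighbors until they contain at least four words:

chunks = split_text("Hello there. This is a test, with punctuation!")
print(chunks)
# ['Hello there. This is a test,', 'with punctuation!']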

Setting Up TTS

Initialize the TTS model and set the device based on the availability of CUDA for GPU acceleration. This step ensures optimal performance during voice generation.

model = "tts_models/multilingual/multi-dataset/xtts_v2"
device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS(model).to(device)
print('Cloning on:', device)
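Before running the full pipeline, it can help to confirm the model works with a single call. The file paths below are placeholders for your own reference recording and output file:

# Quick smoke test: clone one short sentence from a reference recording
tts.tts_to_file(
    text="This is a short test of the cloned voice.",
    speaker_wav="data/my_reference_voice.wav",  # placeholder path to your reference recording
    language="en",
    file_path="data/test_clone.wav",
)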

Generating Voice Chunks

For each text chunk, generate an individual voice clip. This modular approach allows for more natural-sounding pauses and intonation, as each chunk is processed independently.

def tts_each_chunk(chunks, tts, speaker_wav, language, output_dir, topic):
    """
    Generates TTS audio files for each text chunk.

    Parameters:
    - chunks: List of text chunks to be processed.
    - tts: The TTS model object used for generating speech.
    - speaker_wav: Path to the speaker's wav file used as a reference for voice cloning.
    - language: Language code for the TTS synthesis.
    - output_dir: Directory where the generated audio files will be saved.
    - topic: Topic of the text, used for naming the audio files.

    Returns:
    - A list of file paths to the generated audio files.
    """
    audio_files = []
    for index, chunk in enumerate(chunks):
        if not chunk.strip():
            # Skip empty or whitespace-only chunks
            continue

        # Format the output file path with the topic, chunk index, and file extension
        output_file_path = os.path.join(output_dir, f"{topic}_chunk_{index}.wav")

        # Generate the TTS audio file for the chunk
        tts.tts_to_file(text=chunk, speaker_wav=speaker_wav, language=language, file_path=output_file_path)

        # Append the output file path to the list of audio files
        audio_files.append(output_file_path)

    return audio_files

Merging Audio Files

After generating the individual voice clips, merge them into a single audio file. This step involves blending the clips together with a crossfade effect for smoother transitions between segments.

def merge_audio_files(audio_files, final_output_file, crossfade_duration=100):
    """
    Merge audio files with a crossfade effect.

    :param audio_files: List of file paths to the audio files.
    :param final_output_file: File path for the merged output file.
    :param crossfade_duration: Duration of the crossfade in milliseconds. Default is 100 ms.
    """
    if not audio_files:
        raise ValueError("No audio files provided for merging.")

    # Load the first audio file
    combined = AudioSegment.from_wav(audio_files[0])

    # Iterate over the rest of the audio files and merge with crossfade
    for file in audio_files[1:]:
        next_audio = AudioSegment.from_wav(file)
        combined = combined.append(next_audio, crossfade=crossfade_duration)

    # Export the combined audio
    combined.export(final_output_file, format="wav")
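For example, merging two previously generated chunk files into a single track with a 100 ms crossfade looks like this (the paths are hypothetical; note that the crossfade must be shorter than the clips being joined):

merge_audio_files(
    ["output/hobbies_chunk_0.wav", "output/hobbies_chunk_1.wav"],  # placeholder paths
    "output/final_output_hobbies.wav",
    crossfade_duration=100,  # 100 ms overlap between consecutive clips
)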

Execution Workflow (Run Everything!)

Bring all the components together: split the text into chunks, generate voice clips for each chunk, and merge all clips into the final audio file. This comprehensive approach allows for generating a high-quality cloned voice that mimics the input audio’s tone and intonation as closely as possible.

# Inputs assumed to be defined earlier (example values shown as comments):
# text = "Your narration script goes here."
# local_wav_file = "data/reference_voice.wav"  # reference recording of the target voice
# topic = "hobbies"  # label used in the output file names

# Pre-process the input text to split into manageable chunks
chunks = split_text(text.strip())

# Prepare settings for the TTS process
now = datetime.now()
formatted_time = now.strftime("%Y%m%d_%H%M%S")
file_name = os.path.basename(local_wav_file).split('.')[0]

# Construct the output directory dynamically based on the input file path and current time
output_dir = os.path.join(os.path.dirname(local_wav_file).replace('/data/', '/output/'), formatted_time)

# Ensure the output directory exists, creating nested directories if needed
os.makedirs(output_dir, exist_ok=True)

# Define the final output file path for the merged audio
final_output_file = os.path.join(output_dir, f'final_output_{file_name}_{formatted_time}.wav')

# Process each text chunk to generate individual audio files
audio_files = tts_each_chunk(chunks, tts, local_wav_file, "en", output_dir, topic)

# Merge the generated audio files into one final output file
merge_audio_files(audio_files, final_output_file)

print('Voice cloning complete!')

Congratulations! You’ve successfully generated a cloned voice using the TTS package. This process, from text processing to audio file merging, allows for the creation of natural-sounding cloned voices suitable for various applications, including video narrations, audiobooks, and more.

Remember to experiment with different text lengths, punctuation, and audio enhancement tools to achieve the best possible quality for your cloned voice.

When selecting audio samples for tone matching, it’s important to ensure that the tone of the original recording aligns with that of the target content.

Lip Sync

In this section, we delve into using the Wav2Lip package, a powerful tool designed to synchronize lip movements in a video with any given audio input. This technology offers an impressive way to create realistic videos where the subject appears to be speaking the audio content.

To ensure the highest quality in your cloned videos, it’s crucial to consider some practical tips when selecting your video and audio pair:

  1. Choose Single-Face Videos: For optimal results, select videos featuring only one visible face. Wav2Lip excels in syncing a single subject but may struggle with accuracy when multiple faces are present, potentially syncing the wrong person’s lips.
  2. Minimize Large Movements: Videos with significant body or hand movements can pose a challenge. Large movements, unless perfectly matched with the audio’s emotional tone, can detract from the realism of the final video. For a more authentic appearance, opt for footage where the subject maintains a relatively steady posture.

Implementing these strategies can significantly enhance the believability of your cloned videos, making the subject’s spoken words appear seamlessly integrated with their lip movements.
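To help with the first tip, a rough pre-check can flag clips that show more than one face before you spend time lip-syncing them. The sketch below is our own heuristic, not part of Wav2Lip: it uses OpenCV's bundled Haar cascade on the first frame only, and the video path is a placeholder.

# pip install opencv-python
import cv2

def count_faces_in_first_frame(video_path):
    """Rough heuristic: count detected faces in the first frame of a video."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise ValueError(f"Could not read a frame from {video_path}")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return len(faces)

# Placeholder path: warn if the clip shows more than one face
if count_faces_in_first_frame("data/my_clip.mp4") > 1:
    print("Warning: multiple faces detected; Wav2Lip may sync the wrong person's lips.")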

How to Lip Sync?

The Wav2Lip package is designed to make anyone speak anything by generating lip-sync videos from an input video and audio. The package is available on GitHub, and the team has also provided a detailed research paper explaining the underlying principles and techniques used in this project.

To get started, we’ll be using the simplified version of the Colab notebook created by Justin John. This notebook provides a user-friendly interface for lip-syncing videos with audio.

Here’s a step-by-step guide on how to use the notebook; we will use a YouTube video as an example:

Setup Wav2Lip

The first step is to set up the Wav2Lip package by installing the necessary dependencies and downloading the pre-trained models. In the notebook, this is done by running the setup cell shown below:

Dependency Installation

or, equivalently, the following code snippet:

# Create a working directory and clone the Wav2Lip fork
mkdir -p /content/sample_data
git clone https://github.com/justinjohn0306/Wav2Lip
cd /content/Wav2Lip

# Download the pretrained models
wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/wav2lip.pth' -O 'checkpoints/wav2lip.pth'
wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/wav2lip_gan.pth' -O 'checkpoints/wav2lip_gan.pth'
wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/resnet50.pth' -O 'checkpoints/resnet50.pth'
wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/mobilenet.pth' -O 'checkpoints/mobilenet.pth'

# Install the remaining Python dependencies
pip install https://raw.githubusercontent.com/AwaleSajil/ghc/master/ghc-1.0-py3-none-any.whl
pip install git+https://github.com/elliottzheng/batch-face.git@master
pip install ffmpeg-python mediapipe==0.8.11

Select a Video

The next step is to select a video that you want to lip-sync. You can either use a YouTube video or a local video file from your computer or Google Drive. The notebook provides a convenient way to download and trim videos from YouTube. If you’re using a local video file, you can directly upload it to the notebook. You can specify the start and end times for trimming the video to focus on the desired section.

Video Trimming
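If you are preparing a local clip instead of letting the notebook download and trim a YouTube video for you, an equivalent trim can be done with ffmpeg; the file names and timestamps below are placeholders:

# Cut a 30-second segment (0:05 to 0:35) and re-encode so the cut lands cleanly between keyframes
ffmpeg -i my_source_video.mp4 -ss 00:00:05 -to 00:00:35 -c:v libx264 -c:a aac trimmed_clip.mp4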

Select Audio

You can either record audio directly in the notebook, upload an audio file from your local drive or Google Drive, or provide a custom path to an audio file on your Google Drive. In this section, we have uploaded the combined audio files referenced previously.

Audio Upload

Start Crunching and Preview Output

In this step, you can adjust various parameters like padding, resize factor, and smoothing options. You can also select the model you want to use (regular or high-definition). Once you’ve set the desired parameters, run the code to start the lip-syncing process. After the process is complete, you can preview the output video directly in the notebook.
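Under the hood, these notebook parameters map onto the arguments of Wav2Lip's inference.py script (flag names from the original Wav2Lip repository; the fork's notebook may wrap them slightly differently). A roughly equivalent command-line call, with placeholder file names, looks like:

# wav2lip_gan.pth trades a bit of sync accuracy for better visual quality; wav2lip.pth is the regular model
# --pads adds pixels around the detected face (top, bottom, left, right); --resize_factor downscales the input video
cd /content/Wav2Lip
python inference.py \
  --checkpoint_path checkpoints/wav2lip_gan.pth \
  --face trimmed_clip.mp4 \
  --audio final_output.wav \
  --pads 0 10 0 0 \
  --resize_factor 1 \
  --outfile results/result_voice.mp4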

Lip-syncing Process Output

A sample result can be found below. Original Video of me talking about DeepFake:

Original Video

DeepFaked Video of me talking about my hobbies; there are still some flaws in the mimicked video, but it looks quite realistic for the most part:

DeepFaked Video

Concerns about DeepFake

The Dark Side of DeepFake Technology

While DeepFake technology showcases remarkable innovation, it also raises significant concerns. Its misuse poses alarming threats in the form of misinformation, privacy violations, legal challenges, cybersecurity risks, and broader societal harm. The sophistication of DeepFakes blurs the line between reality and fabrication, necessitating a critical examination of their darker implications. As the previous section illustrates, it’s relatively easy for anyone to create a DeepFake, which further compounds these concerns.

Misinformation and Propaganda

The capacity of DeepFakes to generate convincing yet entirely fabricated videos and audio clips of public figures represents a profound threat to the integrity of information. This is particularly egregious in the political sphere, where fabricated content can mislead voters and distort democratic processes. One illustrative case occurred during the 2024 election season, when a political party fell victim to a disinformation campaign driven by DeepFake content, demonstrating the direct harm to reputations and the manipulation of public opinion.

Privacy and Identity Fraud

DeepFakes encroach on personal privacy and are instrumental in identity fraud, a trend that disturbingly doubled in prevalence from 2022 to the first quarter of 2023. These sophisticated forgeries enable scammers to impersonate individuals convincingly, facilitating malicious activities ranging from financial fraud to personal harassment and undermining the security of personal identity in the digital age.

Legal and Regulatory Challenges

As DeepFake technology evolves, so too does the legal and regulatory framework designed to mitigate its misuse. Nations worldwide are revising laws and fostering international collaborations to combat the cross-border nature of cyber threats posed by DeepFake. Yet, the delicate balance between curbing malicious uses and preserving free expression, coupled with the technical hurdles in enforcement, presents ongoing challenges.

DeepFake in Cybersecurity

DeepFakes have emerged as a formidable cybersecurity threat, particularly in corporate environments. Cybercriminals adeptly use DeepFake technology to mimic company executives in sophisticated phishing schemes, leading to substantial data breaches and eroding trust within organizations. The urgency for businesses to bolster defenses against misinformation and implement robust authentication measures has never been greater.

Societal and Psychological Effects

The proliferation of DeepFakes fuels a phenomenon known as ‘reality apathy’: a growing skepticism towards the authenticity of any information. This erosion of trust extends beyond societal discourse, deeply affecting individuals who find themselves unwillingly featured in DeepFake content. The psychological ramifications can be severe, impacting personal well-being and societal cohesion.

Legitimate Applications of DeepFake Technology

DeepFake technology has garnered significant attention for its potential for misuse, yet it harbors a wealth of opportunities for beneficial applications across diverse sectors. This section sheds light on the positive uses of DeepFakes, highlighting how they can enrich fields such as film and television, artistic expression, media personalization, and assistive technologies. As we explore these possibilities, it’s crucial to also address the ethical and legal challenges that accompany this advancing technology.

Educational Tools and Simulations

DeepFake technology can revolutionize education by creating immersive and interactive learning experiences. Through realistic simulations and reenactments, students can explore historical events, scientific phenomena, and cultural experiences in a way that traditional textbooks cannot offer. For instance, DeepFake can be used to simulate historical speeches or to create virtual lab experiments, allowing students to witness and interact with historical figures or to observe scientific experiments in a controlled, virtual environment. This application has the potential to enhance student engagement, improve learning outcomes, and make education more accessible by providing a rich, immersive learning experience that transcends geographical and temporal boundaries.

Film and Television Enhancement

DeepFake technology offers a cost-effective and creative alternative to traditional CGI in film and television, allowing for the de-aging of actors, resurrection of historical figures, and seamless dubbing in multiple languages. These applications not only broaden the creative landscape but also prompt important discussions on copyright, consent, and the potential for misinformation. The use of DeepFake to recreate Val Kilmer’s voice in recent projects exemplifies its potential, yet underscores the need for clear ethical guidelines and consent protocols.

Artistic and Creative Expression

DeepFake technology is redefining the boundaries of art and entertainment, offering artists like Gillian Wearing and institutions like the Salvador Dalí Museum innovative tools to engage with audiences. By creating interactive experiences and exploring identity through digital means, these applications push us to reconsider our perceptions of reality and authenticity in the digital age. However, it’s vital that these artistic endeavors are approached with a consideration for ethical implications, ensuring that creative exploration does not veer into manipulation or misinformation.

Media Personalization

In the realm of digital marketing, DeepFake technology promises unprecedented levels of personalization and engagement, potentially revolutionizing how brands connect with their audiences. From adapting spokespersons’ appearances to suit diverse demographics to offering personalized language options, DeepFake can make advertising more inclusive and effective. Nevertheless, this level of personalization must be balanced with respect for privacy and consent, ensuring that marketing practices remain transparent and ethical.

Assistive Technologies

DeepFake technology holds profound promise for assistive technologies, offering individuals who have lost their ability to speak a chance to communicate in their own voice again. This application exemplifies the potential of AI to improve lives but also raises significant privacy and ethical concerns. Developing and implementing strict ethical guidelines and consent mechanisms is crucial to ensure that this technology is used responsibly and with the utmost respect for the individuals it aims to assist.

Conclusion

DeepFake technology presents a paradox of incredible potential and profound risks. Its applications range from enriching creative industries and enhancing communication to posing serious threats like misinformation, identity fraud, and cybersecurity challenges. The duality of DeepFake demands a nuanced approach that embraces innovation while actively mitigating its adverse effects.

A critical step forward is cultivating an ecosystem where ethical use, transparency, and robust regulation are paramount. This involves critical scrutiny from individuals, ethical considerations from creators, and proactive governance from policymakers. The technical insights into DeepFake creation offered in this blog highlight the urgency for all stakeholders to wield this powerful tool with care and responsibility.

In navigating the DeepFake landscape, our collective efforts must focus on leveraging its benefits to advance human creativity and inclusivity, while safeguarding against its potential to undermine trust and security. The future of DeepFake technology hinges on our ability to balance its innovative capabilities with a steadfast commitment to ethical and responsible use.

References

  1. AMT Lab @ CMU. (2021). Positive Implications of DeepFake Technology in the Arts and Culture. https://amt-lab.org/blog/2021/8/positive-implications-of-DeepFake-technology-in-the-arts-and-culture
  2. AMT Lab @ CMU. (2024). DeepFake Technology in the Entertainment Industry: Potential Limitations and Protections. https://amt-lab.org/blog/2020/3/DeepFake-technology-in-the-entertainment-industry-potential-limitations-and-protections
  3. CMsmartvoice. (n.d.). One-Shot Voice Cloning. GitHub. https://github.com/CMsmartvoice/One-Shot-Voice-Cloning
  4. coqui-ai. (n.d.). Coqui AI TTS (Text-to-Speech). GitHub. https://github.com/coqui-ai/TTS
  5. CorentinJ. (n.d.). Real-Time Voice Cloning. GitHub. https://github.com/CorentinJ/Real-Time-Voice-Cloning
  6. Gilpin. (2018). Why “reality apathy” is the next battle in brand reputation. https://sproutsocial.com/insights/reality-apathy-brand-reputation/
  7. Hyscaler. (2024). Cyber Threats in 2024: Navigating the DeepFake Dilemma. https://hyscaler.com/insights/cyber-threats-in-2024-DeepFake-dillema/
  8. Hyscaler. (2024). DeepFake Technology: Unmasking the Ethical Challenges in a World of Digital Illusions. https://hyscaler.com/insights/DeepFake-technology-trend-2024-beyond/
  9. Journalists Resource. (2023). DeepFake technology: These 5 resources can help you keep up. https://journalistsresource.org/politics-and-government/DeepFake-technology-5-resources/
  10. Metaroids. (2024). DeepFake in Movies: The Future of Filmmaking. https://metaroids.com/feature/DeepFake-in-movies-the-future-of-filmmaking/
  11. Murf AI. (2023). Everything You Need to Know About DeepFake Voice. https://murf.ai/resources/DeepFake-voices/
  12. Respeecher. (2023). Everything You Need to Know about DeepFake Voice and Its Synthetic Voice Ethical Counterpart. https://www.respeecher.com/blog/everything-you-need-know-about-DeepFake-voice-its-synthetic-voice-ethical-counterpart
  13. Rudrabha. (n.d.). Wav2Lip. GitHub. https://github.com/Rudrabha/Wav2Lip
  14. Search Engine Journal. (2022). DeepFake Technology Pros & Cons For Digital Marketing. https://www.searchenginejournal.com/DeepFake-technology-digital-marketing/454395/
  15. The Cyber Express. (2024, January 6). DeepFake Gone Wild: 10 Trends That Will Reshape 2024. https://www.thecyberexpress.com/DeepFake-technology-trends-for-2024
  16. VFXwire. (2024). DeepFake And What It Could Mean For The Film Industry. https://www.vfxwire.com/DeepFake-and-what-it-could-mean-for-the-film-industry/
  17. WIPO. (2022). Artificial Intelligence: DeepFake in the Entertainment Industry. https://www.wipo.int/wipo_magazine/en/2022/02/article_0003.html
