ICYMI: Earlier this summer we broke new ground with RealTalk, a speech synthesis system created by Machine Learning Engineers at Dessa (Hashiam Kadhim, Rayhane Mama, Joseph Palermo). With their AI-powered text-to-speech system, the team managed to replicate the voice of Joe Rogan, a podcasting legend known for his irreverent takes on consciousness, sports and technology. On top of that, their recreation of Rogan’s voice is the most realistic AI voice that’s been released to date.
If you haven’t heard the voice yet, you should.
Here’s the video we shared on YouTube featuring a medley of their faux Rogan’s musings:
Our initial post focussed on the ethical implications of fake voices and AI’s growing capabilities in deceiving human senses. Since then, the public’s response to the work has wowed us.
Even Joe Rogan himself weighed in:
With that in mind, we’re sticking to our original plan: to mitigate potential misuse of the technology, we won’t be open sourcing the work. That said, we still wanted to give people interested in RealTalk a high-level technical overview of how it works.
Following the pattern of other great inventions, RealTalk was born out of a confluence of technologies from both recent and more distant pasts. The speech synthesis field isn’t a new one: scientists and technologists have been trying to emulate human voices with machines since at least the nineteenth century.
Until very recently, though, speech synthesis that actually sounds like the real thing has been the stuff of science fiction. Deep learning has changed all of that, in only the span of a few years.
While designing their RealTalk system, the engineers were highly influenced by existing state-of-the-art systems, including Google’s Tacotron 2 and NVIDIA’s WaveGlow. In the same way as Tacotron 2, RealTalk uses an attention-based sequence-to-sequence architecture for a text-to-spectrogram model, also employing a modified version of WaveNet that functions as a neural vocoder.
To all the laypeople out there: don’t worry, we’ll get to explaining how these systems actually work in simpler terms below. And in case you’re just looking for the Cole’s Notes: jump ahead to the bottom of the article, which features an overview of the RealTalk system’s technical accomplishments.
How RealTalk Works: The Model
The RealTalk system consists of multiple deep learning systems that power its ability to produce life-like speech. The first of these does the heavy lifting, predicting how an individual speaker (in this case, Joe Rogan) would say specific words and sentences. Here’s how it works:
The process of training the first system to mimic Joe Rogan’s voice begins by feeding it a dataset that contains thousands of fragments of audio and text. RealTalk’s engineers assembled this dataset directly from the Joe Rogan Experience podcast (which we’ll talk more about a bit later on in this post). Each audio clip contains a sentence said by Rogan on one of his podcast’s episodes, paired with a transcript that spells out what he says in words.
Note: we should mention that the audio clips fed into the model aren’t actually audio clips per se. Instead, the clips are formatted as spectrograms, a visual representation of sound. This is an important distinction that we’ll reference later on in our explanation of the model, so hold onto it for now!
Exposed to textual data contained in the transcripts, the model learns how to pronounce a vast range of English words correctly. Amazingly, this learning process is generalizable after training, which means that the model can learn how to pronounce words that are plugged into the system even if they aren’t in the original dataset (how else do you think we got faux Rogan to say ‘otolaryngology’ correctly?).
This feat is made possible by the fact that the model learns how to pronounce patterns of characters (or the letters that make up words) rather than the words themselves. Without human intervention, the system automatically infers what words mean by looking at a huge number of characters, also learning the patterns that typically occur between them. In this process, each character is also mapped to a corresponding place in the output spectrogram, which visually represents how the letter would be pronounced in speech. This is a relatively standard practice in machine learning called embedding, which you can learn more about here.
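To make the character-level idea concrete, here is a minimal Python sketch of the lookup step only. It is not RealTalk's code: the vocabulary, the embedding size and the random vectors are illustrative assumptions, and in a real system the vectors would be learned during training.

```python
import random
import string

# Hypothetical character vocabulary: lowercase letters, space, and a few
# punctuation marks. RealTalk's actual symbol set is not published.
VOCAB = list(string.ascii_lowercase + " ,.'?!")
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCAB)}

EMBEDDING_DIM = 4  # illustrative; real systems use far larger vectors

# One vector per character. In a trained model these values are learned;
# here they are random placeholders.
rng = random.Random(0)
EMBEDDINGS = [[rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
              for _ in VOCAB]


def text_to_ids(text):
    """Map text to character IDs, dropping characters outside the vocabulary."""
    return [CHAR_TO_ID[ch] for ch in text.lower() if ch in CHAR_TO_ID]


def embed(text):
    """Look up one embedding vector per character in the text."""
    return [EMBEDDINGS[i] for i in text_to_ids(text)]


# A word that never appeared in training still maps cleanly, because every
# one of its characters is in the vocabulary.
vectors = embed("otolaryngology")
print(len(vectors), "characters ->", len(vectors[0]), "numbers each")
```

Because the lookup works on characters rather than whole words, any new word spelled with known characters can be fed to the model without changing the vocabulary.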
From the audio clips of Joe Rogan speaking, the model learns how to reconstruct aspects of language that are referred to in technical communities as prosody. This covers the elements that define individual characteristics of speech, such as a speaker’s tone of voice, how fast they talk, how high or low their voice is, and their accent. The model also learns emotions and other subtle markers of language that contribute to the synthesized speech’s life-like quality.
Exposing RealTalk to this data also ultimately allows the engineers to control and modify different traits of speech it learns from the samples, like the level of breath, or different levels of intonation and emotion.
In addition, exposure to audio data ultimately enables the system to predict emotions from the text alone, a capability powered by an additional set of neural network layers that the engineers incorporated deep within the system. To use an analogy, you could think of this impressive capability as a way of teaching the system to read like an actor using a script as their sole reference of meaning.
Since similar neural speech synthesis approaches tend to be unstable when synthesizing long sentences, and also typically face difficulties producing convincing speech when trained on male voices, the RealTalk team also applied some “secret sauce” tricks to the system which mitigated both issues at once. These proprietary enhancements enable the system to generate around 30 seconds of speech consecutively, without seeing noticeable degradation in the synthesis quality.
Now, remember those spectrograms we mentioned? Okay, good. Here’s where they come into play:
After training RealTalk on pairs of audio and textual data, the system has the map of relationships between speech and text it needs to effectively generate net-new spectrograms from text that’s plugged into the system. As an example: the team types ‘Hey Joe Rogan, it’s me, Joe Rogan.’ into their model. After processing, the model outputs materials required for speech synthesis in the form of a spectrogram, drawing from its now pre-baked ability to pronounce words in the same way the real Joe Rogan would say them.
Spectrograms function more or less like heatmaps for sound, and are closely related to a more common visual representation of sound called waveforms.
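For readers who want to see the heatmap idea in code, here is a small Python sketch that turns a toy signal into a magnitude spectrogram using only the standard library. The frame size, hop size, sample rate and test tone are assumptions chosen for the example. Systems like Tacotron 2 use mel-scaled spectrograms computed with optimized libraries, and RealTalk's exact settings are not published.

```python
import cmath
import math


def hann(n):
    """Hann window of length n, used to soften the edges of each frame."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def magnitude_spectrogram(signal, frame_size=64, hop=32):
    """Return a list of frames; each frame lists one magnitude per frequency bin.

    Uses a naive discrete Fourier transform for clarity, not speed.
    """
    window = hann(frame_size)
    bins = frame_size // 2 + 1  # keep non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_size)]
        frames.append([
            abs(sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n in range(frame_size)))
            for k in range(bins)
        ])
    return frames


# Toy input: a pure tone that completes 8 cycles per 64-sample frame.
SAMPLE_RATE = 8000  # assumed for the example
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(512)]
spec = magnitude_spectrogram(tone)

# The "hottest" cell in each column of the heatmap marks the dominant pitch.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print("frames:", len(spec), "bins:", len(spec[0]), "peak bin:", peak_bin)
print("peak frequency (Hz):", peak_bin * SAMPLE_RATE / 64)
```

Each inner list is one column of the heatmap (a moment in time), and each value in it is how much energy the sound has at one frequency.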
At this stage of the process, it’s possible for the engineers to convert the spectrogram outputs into audio, enabling them to check in on the model’s progress. Samples reconstructed from the spectrograms using deterministic algorithms (like Griffin-Lim) sound quite robotic, but still fundamentally capture the voice of the person they’re replicating.
That’s where the second system that powers RealTalk’s speech synthesis capabilities comes in: a neural vocoder. Its function is to take the lossy spectrograms (the ones produced by the first system) and overlay an additional layer of naturalness. This process transforms the spectrograms from something utilitarian into something that’s a work of art. In technical terms, the neural vocoder is tasked with inverting these spectrograms back to audio while reconstructing the phase. To create this neural vocoder, the team employs a modified version of WaveNet, in the same way as Google’s Tacotron 2.
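To illustrate why phase has to be reconstructed at all, the short Python sketch below compares two frames of the same tone that differ only by a quarter-cycle shift. The frame length and tone are assumptions picked for the example, and the naive transform is for readability only. It shows one thing: a magnitude-only representation cannot tell the two frames apart.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform, written for clarity rather than speed."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


N = 32
CYCLES = 4  # the tone completes 4 cycles across the frame

# Two frames of the same tone, shifted by a quarter cycle.
sine = [math.sin(2 * math.pi * CYCLES * t / N) for t in range(N)]
cosine = [math.cos(2 * math.pi * CYCLES * t / N) for t in range(N)]

sine_bin = dft(sine)[CYCLES]
cosine_bin = dft(cosine)[CYCLES]

# A magnitude spectrogram stores only abs(), so both frames look identical...
print("magnitudes:", round(abs(sine_bin), 6), round(abs(cosine_bin), 6))
# ...while the phase, which a vocoder has to supply, differs between them.
print("phases:", round(cmath.phase(sine_bin), 6), round(cmath.phase(cosine_bin), 6))
```

Since the spectrogram keeps the magnitudes but not the phases, whatever converts it back to audio has to estimate the missing timing information, whether that is an iterative algorithm like Griffin-Lim or a learned neural vocoder.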
How RealTalk Works: The Data
After months of combing through thousands of clips, the final dataset the RealTalk team used to replicate Rogan’s voice consists of eight hours of clean audio and transcript data, optimized for the task of text-to-speech synthesis. The eight-hour dataset contains approximately 4000 clips, each consisting of Joe Rogan saying a single sentence. The clips range from 7 to 14 seconds long.
Both the YouTube video and the clips featured in our Turing Test-like game were generated by a model trained on this eight-hour dataset. Compared to precedents, this is a notable accomplishment, since state-of-the-art models have relied on 20 or more hours of data to produce comparable results (which still fail to be as convincing as the team’s faux Rogan).
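As a quick sanity check on those figures, the arithmetic below works out the average clip length implied by the quoted totals. It uses only the numbers stated above and treats "approximately 4000" as exactly 4000 for simplicity.

```python
TOTAL_HOURS = 8          # dataset size quoted in the post
NUM_CLIPS = 4000         # "approximately 4000 clips", taken as exact here
MIN_CLIP_S, MAX_CLIP_S = 7, 14  # quoted clip length range, in seconds

total_seconds = TOTAL_HOURS * 3600
mean_clip_seconds = total_seconds / NUM_CLIPS

# Bounds on total duration if every clip were the shortest or longest length.
min_total_hours = NUM_CLIPS * MIN_CLIP_S / 3600
max_total_hours = NUM_CLIPS * MAX_CLIP_S / 3600

print(f"mean clip length: {mean_clip_seconds:.1f} s")
print(f"possible total: {min_total_hours:.1f} to {max_total_hours:.1f} hours")
```

The implied average is about 7.2 seconds per clip, which suggests most clips sit near the short end of the quoted range, or that the clip count is somewhat below 4000.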
As is the case with many real-world machine learning projects, assembling the dataset was actually the most time-consuming part of the RealTalk project.
Since the podcast is very noisy in its natural format, and Rogan is seldom the only speaker, significant editing was required to transform the team’s found data into something useable by the model. To make this happen, the engineers designed an efficient pipeline for turning the raw unstructured podcast data into something that could be used for training, which (roughly) looked like this:
Transforming Found Data Into Training Data
- Sourcing audio data from podcasts online. The team also used an automatic transcription tool to generate rough cuts of the corresponding transcript data.
- Reviewing the clips and handpicking the best examples based on range of vocabulary and portions of the podcast where Joe was the solo speaker for a sufficient amount of time. 10 hours of clips were selected in total.
- The resulting clips were sent to a local sound engineer for dataset optimization. He went through the clips and flagged the worst quality examples for removal, while also salvaging rougher ones by clipping out the parts that couldn’t be salvaged. These included parts of the podcast with unnatural speech, yelling, weird voices, etc…
- Once cleaned up by the sound engineer, the clips were then sent out to workers on Amazon’s Mechanical Turk, a crowdsourcing marketplace that the ML community frequently enlists to help with simple tasks like data labelling. Workers on MTurk corrected the transcripts that corresponded to the clips to ensure they were model-ready.
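The selection logic in the steps above can be sketched in a few lines of Python. Everything here is hypothetical: the `Clip` record, its field names and the sample data are invented for illustration, and the real pipeline relied on human review (the team, a sound engineer and MTurk workers) rather than simple boolean flags. The sketch also only filters clips and does not model the trimming the sound engineer performed.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    """Hypothetical record for one candidate clip; all fields are assumptions."""
    clip_id: str
    duration_s: float
    solo_speaker: bool         # set during manual review
    flagged_by_engineer: bool  # True if the sound engineer rejected the clip
    transcript: str            # rough automatic transcript, corrected later


def select_training_clips(clips, min_s=7.0, max_s=14.0):
    """Keep clips that pass review and fall inside the target length range."""
    return [
        c for c in clips
        if c.solo_speaker
        and not c.flagged_by_engineer
        and min_s <= c.duration_s <= max_s
        and c.transcript.strip()
    ]


candidates = [
    Clip("a", 9.5, True, False, "hey joe rogan its me joe rogan"),
    Clip("b", 12.0, False, False, "two people talking over each other"),
    Clip("c", 8.0, True, True, "yelling in a weird voice"),
    Clip("d", 3.0, True, False, "too short"),
]

kept = select_training_clips(candidates)
print([c.clip_id for c in kept])
print(f"{sum(c.duration_s for c in kept) / 3600:.4f} hours kept")
```

Keeping the review decisions as explicit fields on each clip makes it easy to see why a clip was dropped and to re-run the selection with different length limits.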
After this process was complete, the engineers had the eight-hour dataset they needed to craft the life-like clips heard in the YouTube video and on www.fakejoerogan.com. While training the model, the team also began experimenting with smaller datasets to train RealTalk’s speech synthesis capabilities. Using only two hours of data, they were surprised to discover that they could still generate a relatively convincing replica of Joe Rogan’s voice.
As they continue to iterate, the team’s confident that it will be possible to synthesize realistic-sounding speech using a much smaller dataset in the near future.
TL;DR: RealTalk’s Accomplishments
Ingenuity: The RealTalk team created this project as a side venture to their main workloads, and despite this, managed to achieve a new landmark for realistic voice synthesis. In addition, the system the team’s created is totally different in terms of architecture from any other speech synthesis system existing today. In particular, the team’s focus on making the model more stable for production is one area that few other researchers have focussed on.
Generating A Realistic Male Voice: Historically, AI speech synthesis systems have been difficult to train on male and other low-pitched voices. The reason why is quite technical, so we won’t go into depth here, but the main reason they are more difficult to replicate than female and higher-pitched voices is that male voices’ lower register appears less differentiated in spectrogram format. The RealTalk team overcame this hurdle by adopting a handful of proprietary techniques.
Say Anything: The RealTalk system can learn how to pronounce words that fall outside the original dataset it was trained on, which makes it possible for our faux Rogan to say things he’s never said on his podcast before. Certain examples we provided (like the tongue twisters) showcase how mind-blowing these capabilities are: if you were to build an AI facsimile of any one person with the system, you could literally get that person to ‘say anything.’
AI Voices With Feelings: Well, not actually. But with the RealTalk system, the team has managed to train the models to recreate a greatly nuanced level of emotional depth. That’s because the system the team developed to enable RealTalk’s strong ability to grasp emotion uses unsupervised learning, figuring out the relationships between various emotions and contexts automatically.
Recent Developments in Industry: There have been a number of fascinating projects showcasing AI’s growing ability to mimic real voices since we initially announced the RealTalk project in May. If you’re interested in exploring further, here are a few links we suggest checking out:
Making Editing Video as Easy as Editing Text at Stanford
‘Neural Talking Heads’ Created By Researchers at Samsung
Translatotron, Google’s latest voice-centred project
Recreating Bill Gates’ voice at Facebook with MelNet
How To Work With Us: At this time we’re interested in collaborating with organizations on deepfake detection. If you’re interested in learning more, please reach out to us via our Contact Page here.
Subscribe To Our Newsletter: Keep track of our company’s latest work, including what happens next with the RealTalk project, by subscribing to our monthly newsletter here. The newsletter also features a hand-picked selection of what we think are the most compelling AI stories and articles each month, so it’s a great way to keep track of the field as a whole.
RealTalk was created as part of Dessa Labs, where we devote time to dreaming up crazy ways we can use deep learning to solve some of the world’s toughest challenges. If you enjoyed reading about RealTalk, make sure to check out space2vec, another Dessa Labs project here.