RealTalk: This Speech Synthesis Model Our Engineers Built Recreates a Human Voice Perfectly
Today we’re excited to announce that three Machine Learning Engineers at Dessa (Hashiam Kadhim, Rayhane Mama, Joseph Palermo) have produced the most realistic AI simulation of a voice we’ve heard to date.
It’s the voice of someone you’ve probably heard of before: Joe Rogan. (For those who haven’t: Joe Rogan is the creator and host of one of the world’s most popular podcasts, which to date has nearly 1300 episodes and counting.)
Obviously, something like this has to be heard to be believed. So without further ado, check it out for yourself:
The replica of Rogan’s voice the team created was produced using a text-to-speech deep learning system they developed called RealTalk, which generates life-like speech using only text inputs.
Crazy, right? Our Principal ML Architect, Alex Krizhevsky, called it “one of the most impressive things I’ve seen yet in artificial intelligence.” Alex also noted that the work suggests that “human-like speech synthesis is soon going to be a reality everywhere.”
What Does This Mean? Considering Societal Impact
It’s surreal for our engineers to be able to say they’ve legitimately created a life-like replica of Joe Rogan’s voice using AI. Not to mention the fact that the model would be capable of producing a replica of anyone’s voice, provided that sufficient data is available.
As AI practitioners building real-world applications, we’re especially cognizant of the fact that we need to be talking about the implications of this.
Because clearly, the societal implications for technologies like speech synthesis are massive. And the implications will affect everyone. Poor consumers and rich consumers. Enterprises and governments.
Right now, technical expertise, ingenuity, computing power and data are required to make models like RealTalk perform well. So not just anyone can go out and do it. But in the next few years (or even sooner), we’ll see the technology advance to the point where only a few seconds of audio are needed to create a life-like replica of anyone’s voice on the planet.
It’s pretty f*cking scary.
Here are some examples of what might happen if the technology got into the wrong hands:
- Spam callers impersonating your mother or spouse to obtain personal information
- Impersonating someone for the purposes of bullying or harassment
- Gaining entrance to high security clearance areas by impersonating a government official
- An ‘audio deepfake’ of a politician being used to manipulate election results or cause a social uprising
Obviously, though, not everything is doom and gloom. There are also some really good things that could come out of speech synthesis models. Here are some examples:
- Talking to a voice assistant in a way that feels as natural as talking to a friend
- Customized voice applications — for instance, a workout app that contains a personalized pre-workout pep talk from Arnold Schwarzenegger
- Improved accessibility options for people who communicate through text-to-speech devices, for example, people with Lou Gehrig’s disease
- Automating voice dubbing for any media and in any language
As the recent report “The Malicious Use of Artificial Intelligence” by Oxford’s Future of Humanity Institute notes, new advancements in artificial intelligence not only expand existing threats, but also create new ones. (We highly recommend checking out the report, which is freely available to download here.)
We won’t pretend to have all the answers about how to build this technology ethically. That said, we think it will inevitably be built and increasingly implemented into our world over the coming years. So in addition to raising awareness and acknowledging these issues, we also want to show this work as a way of starting a conversation on speech synthesis that must be had.
Everyone should know what kinds of things are possible with the development of speech synthesis technologies. As we’ve seen with deepfakes, public awareness and dialogue also push governments, policymakers and lawmakers to take action and develop countermeasures swiftly.
A crucial advantage and responsibility we have as an applied AI company is knowing that there’s a huge difference between exploring AI in research and implementing it into the real world. To work on things like this responsibly, we think the public should first be made aware of the implications that speech synthesis models present before we release anything open source.
Because of this, at this time we will not be releasing our research, model or datasets publicly.
Update: When we first published this article in May we promised a technical overview of the model and data by way of another blog post on RealTalk. That post is now available here.
For those of you reading, we encourage you to remember that speech synthesis is getting better and better every day. It’s not outlandish to believe that the implications we mentioned (and of course, many more) will soon make their way into the fabric of society.
So pay attention! Join the conversation! Write to some relevant government officials! Knowledge is power, and we encourage individuals, companies and governments to think about how we can responsibly implement these technologies into our society.
Learn more about RealTalk: For anyone who has questions, feedback or inquiries about the project, connect with us by email at email@example.com.
Curious about how RealTalk was built? Check out Pt. II of the blog post here for a technical overview of the text-to-speech synthesis model, data, and more.
We also encourage you to check out a Turing Test-style game the RealTalk team built to showcase the naturalness and intelligibility of this model, which can be found at www.fakejoerogan.com.
Please note that this project does not suggest that we endorse the views and opinions of Joe Rogan. Joe was selected as a demonstrative model for the purposes of displaying the capability of this technology.