Text to Speech and AI Powered Deepfakes

Fhel Dimaano
4 min read · Sep 25, 2019

Deepfake (a blend of deep learning and fake) is a technique for human image synthesis and speech synthesis that can depict a person saying things or performing actions that never occurred in reality.

A few weeks ago, I read an article about how AI was used to make a certain psychology professor turned right-wing celebrity rap a song by Eminem. The controversial Canadian professor did not actually rap the lyrics to “Lose Yourself”; the audio was generated using machine learning to match not only the sound of his voice but also the cadence and rhythm of his speech. Although the clip did not match the flow and delivery of Eminem, it was more of an exercise in how to manipulate someone else’s voice to say anything.

This was achieved by feeding six hours of audio of the professor speaking to a machine learning speech synthesis (text-to-speech) model that uses prosody as a reference to synthesize speech that differs from the “training data.” Prosody refers to the elements of speech that are not individual phonetic segments (vowels and consonants) but properties of syllables and larger units of speech. These include features such as intonation, tone, stress, and rhythm.
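To make the idea of prosody a little more concrete, here is a minimal sketch of how two prosodic features, pitch (intonation) and energy (a rough proxy for stress), might be measured from audio frames. This is not the model from the article: the function names are hypothetical, pure sine tones stand in for voiced speech, and the pitch estimate uses plain autocorrelation, which real systems would replace with a more robust pitch tracker.

```python
import math

def synth_tone(freq_hz, duration_s, sample_rate=8000):
    """Generate a sine tone (assumption: a stand-in for a voiced speech frame)."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def estimate_pitch(frame, sample_rate=8000, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency in Hz from the autocorrelation peak."""
    min_lag = int(sample_rate / fmax)  # shortest period we consider
    max_lag = int(sample_rate / fmin)  # longest period we consider
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # How well does the frame match a copy of itself shifted by `lag` samples?
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

def rms_energy(frame):
    """Root-mean-square amplitude of the frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def prosody_contour(frames, sample_rate=8000):
    """Return one (pitch_hz, energy) pair per frame."""
    return [(estimate_pitch(f, sample_rate), rms_energy(f)) for f in frames]

if __name__ == "__main__":
    # Three frames with rising pitch, loosely imitating a question-like intonation.
    frames = [synth_tone(f, 0.05) for f in (100.0, 160.0, 200.0)]
    for pitch, energy in prosody_contour(frames):
        print(f"pitch ~ {pitch:.1f} Hz, energy ~ {energy:.3f}")
```

A sequence of such per-frame values is one simple way to picture the “cadence and rhythm” a model would need to reproduce, separately from the words being spoken.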

What are the different types of text to…

Fhel Dimaano

Data Scientist. Alum at Flatiron School. Android and Tech enthusiast.