Google’s WaveNet Could Replace Your Voice

Artwork by Daniel Palacio

Technology has a habit of eating jobs. We saw it in the 19th century (machines eliminated 98 percent of the labour required to weave cloth) and we are seeing it today (just this year robots replaced 60,000 factory workers in China). These examples usually get the most attention because the number of employees involved is large and visible. What we don’t see as much, however, are the niche jobs that are also disrupted by technology.

For example, the medieval monks who painstakingly copied books by hand were put out of work by the printing press; the knocker-uppers who manually woke industrial workers in Britain lost their jobs to alarm clocks; and the milkmen who delivered milk every day were displaced by something rather plain: the fridge.

A knocker-upper: the human alarm clock

Such disruption will continue into the future and I can’t help but wonder what other unsuspecting jobs will face competition from the machines. With that said, finding examples isn’t too hard.

Last week Google announced WaveNet, a new speech synthesis platform developed by the company’s artificial intelligence division, DeepMind. Instead of stitching together pre-recorded audio fragments (concatenative synthesis) or relying on robotic-sounding vocoders (parametric synthesis), WaveNet uses a deep neural network to model raw audio waveforms one sample at a time, producing speech that sounds remarkably human.
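Under the hood (per DeepMind’s paper), WaveNet predicts each audio sample from the samples before it using stacks of dilated causal convolutions, so the network’s receptive field grows exponentially with depth. Here is a minimal NumPy sketch of that building block; the function names, the tiny tanh layers, and the toy weights are my own for illustration, not the actual implementation, which uses gated activations, skip connections, and far larger filters.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal convolution: output[t] depends only on
    x[t], x[t - dilation], x[t - 2*dilation], ... never on future samples."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left-pad so no future leakage
    return np.array([sum(w[i] * xp[t + pad - i * dilation] for i in range(k))
                     for t in range(len(x))])

def wavenet_stack(x, weights, dilations):
    """Stack causal layers with dilations 1, 2, 4, ... so the receptive
    field roughly doubles per layer (tanh stands in for WaveNet's gated units)."""
    h = x
    for w, d in zip(weights, dilations):
        h = np.tanh(causal_dilated_conv(h, w, d))
    return h
```

The causality is the key property: at generation time the model can emit sample *t*, feed it back in, and emit sample *t + 1*, because no layer ever looks ahead.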

How real do these artificial waveforms sound? In blind tests with human subjects (over 500 ratings across 100 test sentences), WaveNet outperformed all the other major text-to-speech systems. Check out the samples below to hear the difference.

1. Concatenative system: sample 1, sample 2
2. Parametric system: sample 1, sample 2
3. WaveNet system: sample 1, sample 2

WaveNet isn’t quite there yet, but it’s significantly better than the alternatives. Furthermore, since the software models raw waveforms, users will ultimately be able to fine-tune accent, tone, and emotion. Just think what this could mean for Hollywood and the media industry in general.

At first, systems like WaveNet could be used to add minor bits of dialogue to a movie when an actor isn’t available for additional takes. The actor’s voice would, of course, first have to be fed to the machine learning software for training. Once that process is complete, the machine could step in whenever it’s too expensive to get the actor back into a recording studio.

Fast-forward to a time when WaveNet is indistinguishable from human speech, and it could very well replace human voices in animated movies and other media. The level of control engineers will be able to exercise over speech software (accent, tone, emotion) will surpass anything that can be achieved with a traditional voice actor in a recording studio. It will be faster, more accurate, and cheaper than directing a human being.

But speech software applications don’t stop there. As a hobbyist musician, I look forward to a time when I can make beats and program a Beyoncé or Michael Jackson voice simulation, all without having to pay millions for an exclusive or bringing a legend back from the dead!


Thanks to Wiza Jalakasi and John Sambo for reading an early draft of this blog post.