The exciting future of voice synthesis technology

Euphony
5 min read · Jan 10, 2017


There’s no denying it: voice synthesis technology has become an integral part of our daily lives… and it’s just getting started.

The pursuit of voice synthesis, or the artificial production of human speech, dates back to at least 1789, well before electronic signal processing. But things really picked up in the 1930s, when Bell Labs developed the vocoder [1].

Since then, processing power has become cheaper and more plentiful (see: Moore’s Law), and the quality of voice synthesis technology has steadily improved. That said, the output of contemporary voice synthesis systems remains clearly distinguishable from actual human speech.

So, what’s next?

The future of voice synthesis will center on three areas of focus. The first is the use of neural networks [2] to generate speech: large collections of synthetic neural units (which loosely model the biological brain) trained on extensive amounts of ordinary recorded speech data. The second is voice experience (VX), the practice of optimizing voice as a channel or interface. The third is emotive voice, the ability to generate voice output that is truly authentic and indistinguishable from human speech.

Let’s take a deeper look at each.


Neural networks to generate speech

Voice synthesis today is largely based on two text-to-speech (TTS) approaches: concatenative TTS and parametric TTS. In concatenative TTS, a large dataset of brief speech fragments is recorded from a single speaker and then reassembled to form complete utterances. In parametric TTS, on the other hand, all of the data necessary to generate speech for a particular voice is stored in a model and produced by that model’s parameters and rulesets.
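
To make the contrast concrete, here is a toy Python sketch. It isn’t drawn from any real TTS engine; the fragment names and the sine-wave “model” are purely illustrative. A concatenative system splices pre-recorded fragments together, while a parametric system generates the waveform entirely from stored parameters.

```python
import numpy as np

SAMPLE_RATE = 16_000  # samples per second

# Hypothetical library of pre-recorded fragments (unit -> waveform). A real
# concatenative system stores thousands of such recordings from one speaker.
fragment_library = {
    "heh": 0.1 * np.random.randn(1_600).astype(np.float32),
    "loh": 0.1 * np.random.randn(2_400).astype(np.float32),
}

def concatenative_tts(units):
    """Concatenative TTS in miniature: look up recorded fragments and splice them."""
    return np.concatenate([fragment_library[u] for u in units])

def parametric_tts(f0_hz, duration_s):
    """Parametric TTS in miniature: no recordings at all; the waveform is
    generated from stored parameters (here, just a pitch value)."""
    t = np.arange(int(duration_s * SAMPLE_RATE)) / SAMPLE_RATE
    return 0.1 * np.sin(2 * np.pi * f0_hz * t).astype(np.float32)

spliced = concatenative_tts(["heh", "loh"])                # stitched from recordings
generated = parametric_tts(f0_hz=120.0, duration_s=0.25)   # produced from parameters
print(spliced.shape, generated.shape)
```

Even at this scale the trade-off shows: concatenation can only say what was recorded, while the parametric route can say anything but sounds nothing like a person.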

Both approaches have their shortcomings, which is why today’s synthetic voices sound inauthentic and robotic.

Google recently proposed a new effort, WaveNet, which introduces a new paradigm. WaveNet simulates the sound of speech at the lowest level possible, one sample at a time. That means building a waveform from scratch, at 16,000 samples per second [3].

An animation from DeepMind illustrates how this sample-by-sample generation works.

So what makes this possible? You guessed it. Neural networks.

In the case of WaveNet, the model is fed a tremendous amount of ordinary recorded speech data. From that data, the neural network learns a complex set of rules that capture patterns in tones: which tones follow which others in the most common contexts of speech. Each sample is then determined not just by the sample before it, but by the thousands of samples that came before it [4].
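
Here is a heavily simplified Python sketch of that idea: generate audio one sample at a time, each conditioned on a window of previous samples. To be clear, this is a toy autoregressive loop, not WaveNet itself. The real model uses a deep stack of dilated causal convolutions and predicts a distribution over 256 quantized amplitude levels; this stand-in uses a single random linear filter.

```python
import numpy as np

SAMPLE_RATE = 16_000
RECEPTIVE_FIELD = 4_000  # toy stand-in for the thousands of past samples WaveNet conditions on

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.01, size=RECEPTIVE_FIELD)  # fixed "model" for illustration

def next_sample(history):
    """Toy autoregressive step: the next amplitude is a function of the recent past.
    WaveNet replaces this single dot product with a deep neural network."""
    window = history[-RECEPTIVE_FIELD:]
    return float(np.tanh(weights[-len(window):] @ window))

n = SAMPLE_RATE // 10                 # 0.1 s of audio = 1,600 sequential predictions
waveform = np.zeros(n)
waveform[0] = 1.0                     # seed impulse so the toy filter has something to work with
for i in range(1, n):
    waveform[i] = next_sample(waveform[:i])

print(f"Generated {n} samples, one at a time, each conditioned on up to {RECEPTIVE_FIELD} predecessors")
```

Even this crude version makes the cost obvious: one second of audio means 16,000 sequential predictions, each looking back over thousands of earlier samples.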

Such neural networks require extensive amounts of computing power. Even so, the possibilities they open up for voice synthesis and audio modeling are endless.

VX

Along with neural networks, VX will heavily influence the future of voice synthesis.

So, what is VX exactly?

VX is still very much in its infancy and has an assortment of definitions. In short, VX is the holistic experience of how individuals use voice and/or synthetic voice systems to interact with people, data, and actuators in order to satisfy their needs.

Tech giants like Amazon and Apple are heavily focused on VX because speech interfaces offer an unprecedented level of control over the user channel; that is, a direct channel through which they can interface with existing and prospective customers. Voice as a channel is powerful because it’s fast, convenient, lightweight, and part of everyday life. It holds immense potential to improve both the ways humans communicate with machines and the ways humans use machines to communicate with other humans.

The problem is that without strong VX, all of the benefits of voice as a channel disappear. In the coming years, much of the focus will be on ensuring voice experiences are pleasant, authentic, and effective, and that they ultimately facilitate growth.

Much of this will depend upon emotive synthetic voices that sound natural and humanlike.

Emotive voice

Our voice is deeply personal. It’s a core component of our identity, conveying mood, attitude, emotional state, and personality.

In the context of synthetic voice, the quality and authenticity of the output has a tremendous impact on whether a communication exchange is meaningful. Today, unfortunately, synthetic voices sound robotic and unnatural, which significantly hampers understanding and limits impact (note: our software platform is working to address this).

Much of the future of synthetic voice will revolve around “emotive analytics,” a new field focused on identifying and analyzing the full spectrum of human emotions. Emotive analytics will enable synthetic voice systems to analyze the underlying emotional state in a user’s voice and respond appropriately to the scenario.
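
What might that look like in practice? Here is a hypothetical Python sketch. The acoustic features, thresholds, and response styles are invented for illustration; a real emotive-analytics pipeline would use a trained classifier over far richer features. The shape of the loop is the point: estimate the user’s emotional state, then adapt the synthetic voice’s delivery.

```python
from dataclasses import dataclass

@dataclass
class AcousticFeatures:
    mean_pitch_hz: float     # average fundamental frequency of the user's speech
    energy: float            # loudness, normalized to 0..1
    speech_rate_wps: float   # words per second

def estimate_emotion(f: AcousticFeatures) -> str:
    """Crude rule-based stand-in for an emotion classifier."""
    if f.energy > 0.7 and f.mean_pitch_hz > 220:
        return "agitated"
    if f.energy < 0.3 and f.speech_rate_wps < 1.5:
        return "subdued"
    return "neutral"

# How the synthetic voice might adapt its delivery to each state (illustrative only).
RESPONSE_STYLE = {
    "agitated": {"pace": "slow", "pitch": "low", "tone": "calming"},
    "subdued":  {"pace": "gentle", "pitch": "warm", "tone": "encouraging"},
    "neutral":  {"pace": "normal", "pitch": "neutral", "tone": "friendly"},
}

state = estimate_emotion(AcousticFeatures(mean_pitch_hz=250, energy=0.8, speech_rate_wps=3.2))
print(state, RESPONSE_STYLE[state])
```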

In closing

Much of the focus in the coming years will center on enhancing voice quality and ensuring voice as a channel is optimized.

To do this, neural networks will play an integral role. Neural networks are primed to improve voice quality and introduce innovative ways to model audio. Additionally, increased emphasis will be placed upon VX; much will be done to ensure experiences are intuitive and impactful. Lastly, emotive voice will tie it all together. It will address shortcomings associated with context, intent, meaning and personality. Synthetic voice will *actually* sound humanlike.

Speech technology has come a long way in the last decade. Think about it — just 10 years ago the main use of speech technology was for automated phone systems. Now it’s 2017 and we’ve reached a tipping point. Synthetic voice technology is on the verge of becoming a ubiquitous tool that’s intimately involved in our everyday lives.

One can’t help but ask: if we’ve come this far in just 10 years, where will we be 10 years from now?


[1] The vocoder was a speech coder for telecommunications that encoded speech as a means of reducing the bandwidth needed for multiplexed transmissions.

[2] Neural networks are a computational approach based on a large collection of neural units that loosely model the way a biological brain solves problems with large clusters of neurons connected by axons.

[3] See: WaveNet: A Generative Model for Raw Audio

[4] Google’s WaveNet uses neural nets to generate eerily convincing speech and music


Euphony

A platform for generating authentic text-to-speech voices with emotional range.