The Voice-Enabled Revolution

As voice recognition technology has improved in recent years, I’ve become convinced that voice will be the user interface of the future.

Jerry Lu
Startup Grind

--

Why? When done well, voice wins over any other input method. Typing is the primary mechanism we use to interface with machines today. Yet most English speakers type only about 40 words per minute (wpm).

We can speak at about 3x that rate (130 wpm), read at about 6x that rate (250 wpm), and listen at 10x that rate (450 wpm).
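Those multiples are rounded; a quick back-of-the-envelope check using the wpm figures cited above:

```python
# Approximate throughput of each human I/O channel, in words per minute.
rates_wpm = {"typing": 40, "speaking": 130, "reading": 250, "listening": 450}

for channel, wpm in rates_wpm.items():
    print(f"{channel:>9}: {wpm} wpm ({wpm / rates_wpm['typing']:.1f}x typing)")
# typing 1.0x, speaking 3.2x, reading 6.2x, listening 11.2x
```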

In our quest to optimize for speed, efficiency and convenience, it’s clear that we are moving to a world where we primarily speak to machines and read and listen to their responses.

Why create a market map?

The voice platforms of large incumbents like Apple, Amazon, Google and Baidu are currently driving developers and companies to create innovative voice-native applications.

To better understand this ecosystem, I’ve created a market map that highlights some of the startups and technologies at the forefront of this revolution.

Inspired by the market map made by David Beisel of NextView Ventures, I broke my voice market map into specific vertical foci within both enterprise and industry, as well as the horizontal technologies that enable voice computing across a vast array of problem types (from natural language processing and speech synthesis to analytics).

Which companies populate the landscape?

It’s often hard to pigeonhole startups into one specific category. But having spoken to most of these companies’ founders, I have simplified by grouping each according to its primary use case.

In general, I chose companies that position themselves as voice-enabled or voice-enabling, intentionally omitting those whose core focus is on the underlying technology.

For instance, take natural language processing (NLP). In the healthcare market, Clinithink, Sytrue, and Zephyr Health are using NLP to handle structured and unstructured data, but are not explicitly positioned for voice, so they don’t appear on the map.

With such a broad landscape, the chart is just a small subset of the overall voice ecosystem. If I’ve missed your startup or other key players, I’d love to hear from you. :)

A voice technology landscape

Reflections on the landscape

The most exciting thing for me to see is all that is happening within the “platform / core technologies” layer. Startups are beginning to build broad development platforms to empower voice interaction.

Some of these companies are radically influencing the way audio capture is done, using hardware tech such as chipsets and microphones (Vesper, VocalZoom, Mythic).

Others are enabling companies to create their own voice assistants (Sayspring, WitLingo), while still others are leveraging natural language processing to break speech down into its grammar and meaning (Narrative Science, Cortical.io).

The voice-enabled technology wave will need to leverage platforms to build scale. New platforms will enable new voice-native applications, which will in turn make the platforms more valuable.

At a high level, it’s easy to understand why conversational AI is important, but it wasn’t until I laid out the list of startups that I started to understand why this space is so crowded.

The market wants solutions that can replicate human-to-human conversation, but conversational AI is hard. Unlike previous solutions built on if-then statements, the voice interfaces of today need to understand intent and remember context.

Anybody wanting to build conversational AI today will need to overcome three challenges:

  1. Human hearing is complex and highly nonlinear.
  2. Human-like memory is needed to exploit long-term context.
  3. Human-like attention is required to relate specific inputs to outputs.

To this end, we’re beginning to see recurrent neural networks (RNNs) and other speech recognition models used to enable human-like contextual awareness: identifying which pieces of information to remember, which to update, and which to pay attention to.
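As a rough illustration of that gating idea (a generic sketch, not any particular startup’s model), here is a minimal GRU cell in NumPy; its update and reset gates are exactly the “what to remember, what to update” machinery described above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, params):
    """One GRU step: gates decide what to keep and what to rewrite."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h_prev)             # update gate: how much to rewrite
    r = sigmoid(Wr @ x + Ur @ h_prev)             # reset gate: how much old context to use
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate new state
    return (1 - z) * h_prev + z * h_cand          # blend old memory with new input

# Toy usage: 8-dim inputs, 16-dim hidden state, random weights.
rng = np.random.default_rng(0)
d_in, d_h = 8, 16
params = [rng.normal(scale=0.1, size=shape)
          for shape in [(d_h, d_in), (d_h, d_h)] * 3]
h = np.zeros(d_h)
for _ in range(5):                                # a short "utterance" of 5 frames
    h = gru_cell(rng.normal(size=d_in), h, params)
```

Attention mechanisms address the third challenge, learning which of those past states matter most for the current output.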

Deep learning has made amazing achievements in the past few years. Now that simple voice commands are practically taken for granted, deep learning is enabling more complex interactions such as contextual conversations.

The rest of the map is separated into enterprise, industries, and accessibility.

On the B2B side, companies like DigitalGenius and Syllable are working to unlock voice data to drive efficiency in customer support; Chorus.AI and VoiceOps are doing the same in sales.

On the industry front, fintech startups are providing voice assistants to streamline banking (Kasisto, Personetics), while manufacturing startups are predicting machine failures through purely acoustic anomaly detection (3D Signals, OtoSense).

On the consumer side, large brands and independent developers are leveraging the Alexa platform to build voice applications and heavily investing in tools for application creation.

Voice Labs’ 2017 Voice Report

With the growth of voice-as-a-platform, the voice-native application layer will soon become the next frontier to undergo a substantial change.

Although voice systems are primarily built for standard speech, a group of dedicated startups is striving to make accessibility a core focus in speech recognition systems today.

According to the World Health Organization, over 5 percent of the world’s population — 360 million people — have disabling hearing loss. Ava is using voice-recognition software to translate conversations into text for people with hearing impairments.

Companies like VocalID and VoiceItt are personalizing text-to-speech synthesis for individuals who live with speech impairments, or who just have strong accents.

Opportunities for entrepreneurs

Improving voice control by solving the challenges of far-field speech systems.

Imagine a common scenario in which one person is indoors, speaking to an Amazon Echo.

The audio captured by the Echo will be influenced by:

  1. The speaker’s voice reverberating off the walls of the room.
  2. Background noise coming in from outside.
  3. The acoustic echo from the device’s own loudspeaker.
  4. That output audio also reflecting off the walls of the room.

All of these factors degrade the captured sound and hinder the device’s ability to hear and understand voice commands.
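To make those four factors concrete, here is a toy NumPy model (my own simplification, not how the Echo actually works) of what the device’s microphone ends up capturing:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16000                                        # one second of audio at 16 kHz
speech = rng.normal(size=n)                      # stand-in for the user's voice
device_out = rng.normal(size=n)                  # audio the device itself is playing

room_rir = np.array([1.0, 0.0, 0.6, 0.0, 0.3])   # toy room reflections (factors 1 and 4)
echo_path = np.array([0.0, 0.8, 0.4])            # loudspeaker-to-microphone path (factor 3)

mic = (np.convolve(speech, room_rir)[:n]         # reverberated voice
       + np.convolve(device_out, echo_path)[:n]  # acoustic echo
       + 0.2 * rng.normal(size=n))               # background noise (factor 2)
```

A far-field front end has to undo each term: dereverberation for the room reflections, acoustic echo cancellation (the device knows device_out, which helps), and noise suppression, all before a recognizer ever sees the audio.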

These are some of the challenges faced by existing microphone technology today. To meet them, companies like Vesper are revolutionizing microphone design through piezoelectric technology.

This tech lets small battery-powered devices sit in a low-power, wake-on-sound mode. Others, like VocalZoom, avoid conventional microphones altogether, using optical sensors to capture human speech through facial vibrations.
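The wake-on-sound mode mentioned above boils down to a simple idea: stay in a low-power state and monitor acoustic energy until it crosses a threshold. A sketch of the concept (not Vesper’s actual firmware):

```python
import numpy as np

THRESHOLD = 0.02             # energy threshold, tuned per mic and environment

def frame_energy(frame):
    """Mean-square energy of one short audio frame (e.g. 10 ms at 16 kHz)."""
    return float(np.mean(frame ** 2))

def wake_on_sound(frames):
    """Stay 'asleep' until a frame is loud enough to justify waking the system."""
    for frame in frames:
        if frame_energy(frame) > THRESHOLD:
            yield frame      # hand off to the full recognition pipeline
```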

While “better” microphones are definitely cool, entirely “new” microphones are just as interesting: for instance, Abe Davis’s research at MIT recovered sound using a regular camera pointed at a potato chip bag.

Or take a look at Gierad Laput’s research at CMU, which uses a single small, general-purpose sensor board to monitor what is happening across an entire room.

I’ll be watching advances in adjacent industries (like piezos, lasers, and cameras) to see how they might radically influence the way audio capture is done.

Companies looking to transform the healthcare market.

Adapting voice technology to healthcare has proven tricky. Unlike other industries, where a minimum viable product is often good enough, in healthcare accuracy can literally be a matter of life and death, and confidentiality is crucial.

Patient engagement and interpretation are highly subjective and depend on many variables, including the individual’s age, medical history, and risk factors.

A successful solution will require some level of empathy and approachability to give patients an informative, engaging, and satisfying healthcare experience.

Moreover, a fundamental challenge is domain-specific vocabulary that training datasets simply do not cover. Patients also often describe symptoms in non-medical language, so machine learning has to translate metaphors and imprecise phrasing into medical terminology.
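One crude way to bridge that gap is a synonym table plus fuzzy matching. A production system would map into clinical ontologies such as SNOMED CT or UMLS rather than a hand-written table, but this hypothetical sketch shows the shape of the problem:

```python
import difflib

# Hypothetical lay-term -> clinical-term table (real systems use ontologies).
LAY_TO_CLINICAL = {
    "tummy ache": "abdominal pain",
    "can't breathe": "dyspnea",
    "heart racing": "tachycardia",
    "pins and needles": "paresthesia",
}

def normalize(phrase: str) -> str:
    """Map a patient's phrase to clinical terminology, tolerating misspellings."""
    match = difflib.get_close_matches(phrase.lower(), LAY_TO_CLINICAL, n=1, cutoff=0.6)
    return LAY_TO_CLINICAL[match[0]] if match else phrase

print(normalize("tummy ake"))   # -> abdominal pain
```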

Voice-activated applications in the China market.

With the world’s largest population, China has more mobile users than any other country.

A majority of them already use voice-to-text capabilities. In fact, the Chinese voice market is estimated at RMB 4.68 billion, making up around 12% of the global market.

It’s not uncommon to see people talking into their phones to send short audio clips instead of texting (which admittedly has saved me from learning how to text in Chinese).

WeChat App User (Original photo by bnextbeta, Creative Commons)

In China, messaging and voice interfaces are both central to people’s digital lives, due in large part to the success of WeChat. This combination brings new potential for brand-to-consumer interaction in an ecosystem that processes around $550 billion in person-to-person payments every year (twice PayPal’s volume).

Commerce is ubiquitous. And since people in China are quick to adopt voice and are already tied to the WeChat ecosystem, I would be interested in companies that can find novel commerce applications that are both contextual and conversational for the customer.

Treating voice privacy as a first-class citizen.

The reality is, if we want better voice recognition, companies are going to be hungry for our voice data. I’m interested to see what sorts of solutions come about to provide safeguards, not only in terms of how the data is captured and stored, but also how the data is accessed.

Some privacy concerns might be alleviated by smarter edge devices that process data locally instead of sending it to the cloud, and biometric identification might alleviate access concerns.

However, with startups like Oben and Lyrebird synthesizing our voices, how will we know for sure if we are talking to a living being? Soon even hearing will no longer be believing.

Speech recognition and vocal computing have reached an inflection point.

According to Mary Meeker’s 2017 Internet Trends report, the word recognition accuracy rate has reached 95 percent, effectively achieving human parity.
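For context, “95 percent accuracy” corresponds to a word error rate (WER) of roughly 5 percent, where WER is the word-level edit distance between the reference transcript and the recognizer’s output, divided by the reference length. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("turn on the kitchen lights", "turn on the chicken lights"))  # 0.2
```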

With this kind of accuracy, we can be sure that ultimately, vocal computing is going to replace the traditional graphical user interface.

And as the voice development platform and core technology advances, voice recognition features will become more nuanced and useful. Voice recognition will continue to drive a more sophisticated smart home market.

Deep learning will continue to enable more complex interactions, like contextual conversations and emotion detection. Always-on voice biometric authentication will allow the machine to tell who is talking to it without vision capabilities.

As we move toward this future, I’d love to talk to anyone building companies at the forefront of this voice-enabled revolution.

Thank you to Lee Chang, Louis Fu, Sercan Arik, Stephen Tse, Roger Chen, and Nick Bunch for your insights and perspectives!

For the full-resolution version of the landscape, please click here.
