Voice is the next big thing

Part I: Why now?

Voice Series #1: If you are an early-stage startup in this space, or just interested in exchanging thoughts on the topic, feel free to shoot me an email.

Voice is the most natural way to communicate, but it has not yet become a major interface with machines. Since Edison’s phonograph, people have been talking to machines, but mostly to communicate with each other rather than with the machines themselves. By the 1980s, speech recognition technology had started to become accurate enough to transcribe spoken words into text. By 2001, computer speech recognition was reaching 80% accuracy, so we could start extracting meaning from spoken words and responding to them. However, the technology was still not good enough to offer a better experience than interfaces like keyboards in most use cases.

Over the last few years, we have made huge technological progress. The accuracy of speech recognition engines has improved a lot and is now reaching 95%, which is slightly better than the human success rate. As the technology improved, a voice-first infrastructure became more and more relevant, and voice-first hardware, software building blocks and platforms are now being rapidly deployed by Amazon, Apple, Google, Microsoft and Baidu. Now seems to be the time for voice!

Let’s now have a closer look at i) how we got to the current state of speech recognition technology and ii) how the infrastructure has been developing around voice.

I) A history of speech recognition

Speech recognition is not new: its roots go back to the 1950s, and various approaches have been taken over the years to understand speech. I have tried to summarize a number of articles on the subject to get a high-level understanding of what happened over the last decades.

All the sources I used can be found at the end of this post, but I would like to make a special mention of Voice Recognition Software by Chris Woodford, which served as the main basis for this section.

1950s/1960s

The first speech recognition systems were based on simple pattern matching. A good example of these early systems is the automated system used by utility companies to let their clients leave their meter readings. In this case, the client’s answer was one word or number from a limited list of options, and the computer only needed to distinguish between a limited number of different sound patterns. It did this by comparing each sound chunk with similar patterns stored in its memory.
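To make the pattern-matching idea concrete, here is a minimal sketch in Python. It assumes each sound chunk has already been reduced to a short list of numbers; the three-value vectors, the `TEMPLATES` table and the `recognize` function are all made up for illustration and are not taken from any real system.

```python
import math

# Hypothetical stored patterns: one made-up feature vector per vocabulary word.
TEMPLATES = {
    "one": [0.9, 0.2, 0.1],
    "two": [0.1, 0.8, 0.3],
    "three": [0.2, 0.3, 0.9],
}


def euclidean(a, b):
    """Distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(chunk, templates=TEMPLATES):
    """Return the word whose stored pattern is closest to the incoming chunk."""
    return min(templates, key=lambda word: euclidean(chunk, templates[word]))


heard = [0.15, 0.75, 0.35]  # a slightly noisy version of the "two" pattern
print(recognize(heard))
```

With only a handful of stored patterns, picking the nearest one is enough, which is why this approach only worked for small, fixed vocabularies.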

In 1952, a team at Bell Labs designed Audrey, a machine capable of understanding spoken digits.

1970s

Advances in technology led to speech recognition systems based on pattern and feature analysis, where each word is broken into bits and recognized from key features, such as the vowels it contains. This approach involves digitizing the sound and converting the digital data into a spectrogram, which is broken down into acoustic frames in order to separate the words and identify the key features of each one. To identify what has probably been said, the computer compares the key features of each word to a list of known features. The system gets better the more it is used, as it integrates feedback from its users. This method was much more efficient than the previous one because spoken languages have a fairly limited number of basic component sounds.
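The digitize, frame and spectrogram steps described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the signal is a synthetic tone rather than real speech, the transform is a naive discrete Fourier transform, and the function names (`frames`, `magnitude_spectrum`, `spectrogram`, `dominant_bin`) and the frame sizes are my own choices.

```python
import cmath
import math


def frames(samples, size, hop):
    """Split a digitized signal into overlapping acoustic frames."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, hop)]


def magnitude_spectrum(frame):
    """Naive discrete Fourier transform; magnitudes of the first half of the bins."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


def spectrogram(samples, size=16, hop=8):
    """One magnitude spectrum per frame: a tiny spectrogram."""
    return [magnitude_spectrum(f) for f in frames(samples, size, hop)]


def dominant_bin(spectrum):
    """Index of the strongest frequency bin, used here as a crude 'key feature'."""
    return max(range(len(spectrum)), key=spectrum.__getitem__)


# Toy signal: a pure tone that completes 2 cycles in every 16-sample frame.
signal = [math.sin(2 * math.pi * 2 * t / 16) for t in range(64)]
spec = spectrogram(signal)
print(len(spec), [dominant_bin(s) for s in spec])
```

A real system would extract richer features than a single dominant bin, but the pipeline has the same shape: samples, then frames, then a frequency view of each frame, then features to compare against known ones.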

From 1971 to 1976, DARPA funded five years of speech recognition research with the goal of building a machine capable of understanding a minimum of 1,000 words. The program led to the creation of Harpy by Carnegie Mellon, a machine capable of understanding 1,011 words.

1980s

However, the previous technique was still not very accurate, because the complexity involved in speech is massive: different people can say the same word in different ways, there are many similar-sounding words (e.g. two and too), etc. To counter that, speech recognition systems started to use statistical methods. The key technologies introduced during this period were the Hidden Markov Model (HMM), used to build acoustic models, and stochastic language models.
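For readers who have never seen an HMM at work, here is a small sketch. It assumes the hidden states are phonemes and the observations are coarse acoustic labels, and it uses the Viterbi algorithm, a standard way to find the most likely state sequence in an HMM (the post itself does not name a decoding algorithm). Every probability, state name and observation label below is invented for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for the observations (log-space Viterbi)."""
    # best[s] = (log-probability, path) of the best path that ends in state s
    best = {
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), [s])
        for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the best predecessor state for s
            score, path = max(
                (best[prev][0] + math.log(trans_p[prev][s]), best[prev][1])
                for prev in states
            )
            new_best[s] = (score + math.log(emit_p[s][obs]), path + [s])
        best = new_best
    return max(best.values())[1]


# Made-up two-phoneme model for the word "two": a "t" burst followed by an "uw" hum.
states = ["t", "uw"]
start_p = {"t": 0.8, "uw": 0.2}
trans_p = {"t": {"t": 0.5, "uw": 0.5}, "uw": {"t": 0.1, "uw": 0.9}}
emit_p = {"t": {"burst": 0.9, "hum": 0.1}, "uw": {"burst": 0.2, "hum": 0.8}}

decoded = viterbi(["burst", "hum", "hum"], states, start_p, trans_p, emit_p)
print(decoded)
```

The point of the statistical approach is visible even in this toy: the model does not need an exact match, it just picks the phoneme sequence that best explains noisy observations.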

Acoustic models represent the relationship between the audio signal and the phonetic units of the language in order to reconstruct what was actually uttered (feature → phoneme). Language models predict the next word based on the previous words, e.g. “Queen” is a much more likely continuation of “God save the” than most other words (word → sentence). In addition, a phonetic dictionary (or lexicon) provides data about words and their pronunciations, and links acoustic models and language models (phoneme → word). Ultimately, the language model score is combined with the acoustic score for the current word to determine how probable the hypothesised sequence of words is.

Worlds of Wonder’s Julie doll, a toy that children could train to respond to their voice, brought speech recognition technology to the home in 1987.
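The way the two scores are combined can be sketched as follows. This is a toy illustration, not a real recognizer: the candidate words, all the probabilities and the `lm_weight` parameter are assumptions I made up to mirror the “God save the” example.

```python
import math

# Hypothetical acoustic scores: how well each candidate word matches the audio.
# "cream" is deliberately made to sound slightly closer than "queen".
acoustic_logp = {
    "queen": math.log(0.40),
    "cream": math.log(0.45),
    "green": math.log(0.15),
}

# Hypothetical language model scores: P(next word | "god save the").
lm_logp = {
    "queen": math.log(0.70),
    "cream": math.log(0.01),
    "green": math.log(0.02),
}


def best_word(acoustic, language, lm_weight=1.0):
    """Pick the word with the highest combined log-score."""
    return max(acoustic, key=lambda w: acoustic[w] + lm_weight * language[w])


print(best_word(acoustic_logp, lm_logp))                 # both models combined
print(best_word(acoustic_logp, lm_logp, lm_weight=0.0))  # acoustic model alone
```

On the acoustic score alone the sketch picks “cream”; once the language model is added, “queen” wins. Working in log-probabilities turns the product of the two models into a simple sum.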

Source: https://www.inf.ed.ac.uk/teaching/courses/asr/2011-12/asr-lexlm-nup4.pdf

1990s

Until the 1990s, speech recognition systems were too slow to support useful applications, but the introduction of faster microprocessors at that time allowed major improvements, and the first commercial speech recognition applications started to emerge.

Dragon launched Dragon Dictate, the first speech recognition product for consumers, in 1990. By 1997, dictation software could keep up with speech at about 100 words per minute.

2000s

Computer speech recognition was reaching 80% accuracy in 2001, but little progress followed during the rest of the decade.

2010s

Over the last decade, advances in both machine learning algorithms and computer performance have led to more efficient methods for training Deep Neural Networks (DNNs).

As a result, speech recognition systems started to use DNNs and, more specifically, a particular variant of DNNs: Recurrent Neural Networks (RNNs). Models based on RNNs showed much better accuracy and performance than traditional models. In fact, speech recognition accuracy reached 90% in 2016, and Google claimed to have reached 95% in June 2017.

This is pretty amazing, knowing that researchers estimate human transcription accuracy to be slightly below 95%. However, these published results should be treated with caution, as they are usually measured in perfect conditions, e.g. recordings with no background noise and native English speakers. Accuracy can quickly drop to 75–80% in “non-sterile conditions”.
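It helps to see how a figure like “95% accuracy” can be computed. The sketch below assumes that accuracy means one minus the word error rate (WER), a common convention in speech recognition; the post does not say which metric its sources used. The function name and the example sentences are my own.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # reference word missing from the transcript
                curr[j - 1] + 1,         # extra word in the transcript
                prev[j - 1] + (r != h),  # substitution, or a match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("god save the queen", "god save the cream")
print(f"WER: {wer:.0%}, accuracy: {1 - wer:.0%}")
```

One wrong word out of four gives a 25% error rate, i.e. 75% accuracy. Under this convention, 95% accuracy means roughly one word in twenty is wrong, which is why background noise or an unfamiliar accent can move the number so quickly.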

As you need labelled data to train the algorithms, the challenge is now to collect thousands of hours of spoken audio recorded in real-life situations to feed the neural nets and increase the accuracy of speech recognition systems. And that’s what Google, Amazon, Apple and Microsoft are doing by putting Google Now, Siri and Cortana on every cell phone for free, or by selling Alexa devices at a low price. It is all about getting training data!

II) Voice infrastructure development

The development of voice infrastructure can be broken into three layers that are necessary for new applications to emerge: (1) hardware to allow more people to use voice as an interface, (2) software building blocks to enable developers to build relevant voice-first applications, and (3) ecosystems to enable efficient distribution and monetisation.

A) Proliferation of voice-first hardware

VoiceLabs defines a voice-first device as an always-on, intelligent piece of hardware where the primary interface is voice, for both input and output. The first voice-first hardware on the market was the Amazon Echo at the end of 2014. According to the 2017 VoiceLabs Report, 1.7 million voice-first devices shipped in 2015 and 6.5 million in 2016, and 24.5 million are expected to ship in 2017, leading to 33 million voice-first devices in circulation.
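The 33 million figure is simply the running total of the yearly shipments, as this quick check shows. It assumes that no device is retired and that shipments before 2015 are negligible; the variable names are mine.

```python
# Yearly shipments in millions of devices, as quoted from the 2017 VoiceLabs Report.
shipments_millions = {2015: 1.7, 2016: 6.5, 2017: 24.5}

installed = 0.0
running_total = {}
for year in sorted(shipments_millions):
    installed += shipments_millions[year]
    running_total[year] = round(installed, 1)  # avoid floating-point noise

print(running_total)
```

The total comes to 32.7 million, which the report rounds to 33 million.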

The main speakers on the market are the Amazon Echo (November 2014) and Google Home (November 2016). However, new players are rapidly entering the game: Sony launched the LF-S50G powered by Google Assistant (September 2017), Apple will soon release the HomePod (December 2017), Samsung recently announced that it will release something “soon”, and Facebook may release a smart speaker with a touch screen. Google Assistant will also be coming to a number of new speakers, including the Zolo Mojo by Anker, the TicHome Mini by Mobvoi and the GA10 by Panasonic.

There is no doubt that the voice-first hardware layer is developing fast and is expected to keep growing!

B) Democratization of the software building blocks for voice-first applications

Building a voice-first application from scratch is not easy. Nuance and other big companies in the space have been offering speech recognition APIs to third-party developers, but these APIs have historically been quite expensive to use and have not delivered amazing results.

As speech recognition technology started to deliver much better results, the potential for voice-first applications grew, and big companies like Google, Amazon, IBM, Microsoft and Apple, as well as smaller players like Speechmatics, started to offer various API products at a lower cost.

Some of the most widely used include the Google Speech API, released in July 2016, and Amazon Lex and Amazon Polly, both released in November 2016.

A large number of developers can now start to build voice-first applications at a reasonable cost.

C) The emergence of voice-first ecosystems

As more and more voice-first applications and voice-enabled hardware emerge, platforms that take care not only of distribution and monetisation, but also of third-party services like analytics and marketing automation, become very relevant.

Amazon, Google and Microsoft have already started to build such ecosystems, and Apple is expected to start soon. A good way to measure the success of these ecosystems is the total number of skills available on each platform.

The next article in this Voice Series will attempt to give a high-level overview of the opportunities for startups in this space. Stay tuned! In the meantime, if you are an early-stage startup in this space, or just interested in exchanging thoughts on the topic, feel free to shoot me an email.

Sources