A Brief History of Voice Control


Although voice recognition and control may seem like new technologies, they have been in the works since the middle of the 20th century. Only in the last five to eight years has voice recognition gained mass appeal, and it traveled a long road to get where it is today.


IBM engineer William C. Dersch demonstrating Shoebox (Photo credit)

The road to speech recognition began with a system named Audrey, created by Bell Laboratories in 1952. Audrey was fairly rudimentary, able to understand only digits spoken by specific people. A decade later came IBM’s Shoebox machine, which could understand 16 words spoken in English by a designated speaker. These limitations proved problematic and only fueled skepticism about voice recognition.

From the 1950s onward, numerous approaches to speech recognition did little to advance the field. In the 1980s, however, the Hidden Markov Model (HMM) came into wide use and drastically altered the development of viable speech recognition software. With the HMM, speech recognition moved from matching sounds against stored templates to a statistical method that estimated the probability that an unknown sound was a given word. This allowed the number of recognizable words to grow from a few hundred to a few thousand. The potential to recognize an unlimited number of words was on the horizon.
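
To make the statistical idea concrete, here is a minimal sketch in Python of the forward algorithm, which scores how likely a sequence of observed sounds is under a word's HMM. The two word models ("yes" and "no"), their states, the sound labels, and every probability below are invented for illustration. Real recognizers work on acoustic features and far larger models, so treat this as a toy under those assumptions and not as how any particular product works.

```python
def forward_probability(observations, start, transition, emission):
    """Return P(observations | model) using the HMM forward algorithm."""
    states = list(start)
    # alpha[s] = probability of the sounds seen so far, ending in state s
    alpha = {s: start[s] * emission[s].get(observations[0], 0.0) for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emission[s].get(obs, 0.0)
            * sum(alpha[prev] * transition[prev].get(s, 0.0) for prev in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical word models: each state is a rough "phoneme" that
# emits coarse sound labels with some probability.
WORD_MODELS = {
    "yes": {
        "start": {"y": 1.0, "eh": 0.0, "s": 0.0},
        "transition": {
            "y": {"y": 0.5, "eh": 0.5},
            "eh": {"eh": 0.5, "s": 0.5},
            "s": {"s": 1.0},
        },
        "emission": {
            "y": {"y": 0.9, "eh": 0.1},
            "eh": {"eh": 0.8, "oh": 0.2},
            "s": {"s": 1.0},
        },
    },
    "no": {
        "start": {"n": 1.0, "oh": 0.0},
        "transition": {"n": {"n": 0.5, "oh": 0.5}, "oh": {"oh": 1.0}},
        "emission": {"n": {"n": 0.9, "y": 0.1}, "oh": {"oh": 0.8, "eh": 0.2}},
    },
}


def recognize(observations):
    """Pick the word whose model gives the observed sounds the highest probability."""
    return max(
        WORD_MODELS,
        key=lambda word: forward_probability(observations, **WORD_MODELS[word]),
    )


print(recognize(["y", "eh", "s"]))
print(recognize(["n", "oh", "oh"]))
```

The recognizer never looks for an exact template match. It asks each word model how probable the sounds are and picks the best score, which is why adding a new word only means adding another model.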


DragonDictate originally came out in 1990 for $9,000. In 1997 Dragon NaturallySpeaking sold for $695, but still wasn’t the best. (Photo credit)

Yet for all the innovation of the preceding decades, the technology seemed to have plateaued at the turn of the century. The software available in the early 2000s was expensive and still not accurate or easy enough to use for mass-market appeal. It wasn’t until around 2010 that the technology made it into the hands of the masses. By then, both hardware and software had matured enough that it made sense for companies to bring voice technology to the mass market. With the number of smartphone users growing, it was Apple’s iPhone, first released in 2007, that prompted Google to release a voice search app for the smartphone in 2008. Google was able to draw on the billions of search queries it receives to improve its voice technology and better predict what you’re probably saying.

Google Voice Search on the original iPhone in 2008 (Photo credit)

Smartphones proved to be the best platform for testing voice recognition and control software because they were nearly ubiquitous by 2012 and the computation could be done in the cloud. Yet it is only in the past year that we have seen voice recognition and control marketed as the primary feature of a product. It only makes sense that voice control will become a fully integrated component of most technology in the coming years. It is the most intuitive, hands-free way of interacting with our technology. The ability to speak to our technology naturally, as if it were another person, is what is catapulting voice control to the forefront of product innovation. Being able to process voice commands in English with more than 90% accuracy allows the field of voice technology to progress at a rate that will be hard to keep up with.


Josh.ai — artificial intelligence for the home

Here at Josh.ai we recognize that voice technology is the way of the future. By integrating with other voice control products such as Amazon’s Echo and Google Assistant, we can truly capitalize on the power of voice technology. Doing so makes life at home and on the go increasingly effortless by way of voice commands and machine learning. At Josh.ai we strive to create a product that is not only intuitive and able to handle complex commands, but also able to learn more about the user and predict what they need.


Fusion of human head with artificial intelligence (Photo credit)

Looking to the future, it is important to recognize how much voice control stands to achieve. The easier it becomes for our machines to understand us, the less we will see them as pieces of hardware and the more we will see them as people. The conversations we have with our machines will mimic the conversations we have with friends and family, in that our machines will take note of our habits and tendencies and act accordingly, without us even giving a command. These voice technologies will learn on their own, much like a child maturing into an adult. With the right guidance from the user, our machines will become our virtual progeny.


This post was written by Benji at Josh.ai. Benji is an intern on the Business Development team. Prior to joining, Benji worked in product strategy for Sonos. Outside of work you can find Benji eating at new restaurants, cycling throughout Los Angeles, and woodworking. Benji graduated from UCSB in 2016.