The human-machine interface: voice is the new touch


When it comes to the human-machine interface, the future is arriving fast, propelled by recent technology advances and a growing interest in voice recognition, natural language processing (NLP), and other hands-free technologies.

Although voice interface technology has been around since the 1970s, industrial and consumer device makers have been slow to adopt it. But today, with the near-universal availability of cloud computing, coupled with advances in sound recognition software and semiconductor signal processing, the human voice interface is on the threshold of a new adoption curve, one that aligns with the popularity of home robotics.

Give Apple’s iPhone 4S credit for hastening the adoption of voice recognition technology. Siri, which debuted on that phone, is an artificial intelligence (AI) companion designed to help iPhone users with daily tasks such as finding directions, making phone calls, and adjusting their calendars on the fly.

Not without its faults, Siri has also helped raise the public’s consciousness about voice user interfaces and their complex human interaction potential, best exemplified in Spike Jonze’s 2013 film Her, in which Joaquin Phoenix plays a depressed man who falls in love with the disembodied voice of his new smartphone’s operating system.

The latest AI heartthrob is Alexa, the disembodied, robotic voice of Amazon Echo. The Echo is a 9-inch-high, AI-powered cylindrical speaker that overcomes many of Siri’s shortcomings by using its seven microphones to hear voices coming from any direction and from across the room.

It’s not just cloud computing and AI software that are propelling the human-machine voice user interface into the future. As smart as Siri and Alexa may appear to be, voice recognition systems still have trouble separating user commands from ambient and background noise, a flaw that can often trigger false or improper responses or timing difficulties.

To increase system accuracy, semiconductor maker xCore has developed a technology called far-field voice capture, an integral element in the Amazon Echo. The technology combines echo cancellation, noise suppression, and directional voice activity detection with an array of MEMS microphones that forms a ‘beam’ able to listen in different directions.
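To make one of those building blocks concrete, here is a minimal sketch of voice activity detection based on frame energy. It is an illustration only, not the method used in any shipping product: it is not directional, the function names (frame_energies, detect_voice) and the parameters (frame_len, ratio, noise_frames) are hypothetical, and it assumes the recording opens with a few frames of background noise only.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_voice(samples, frame_len=160, ratio=4.0, noise_frames=5):
    """Flag frames whose energy exceeds `ratio` times the noise floor.

    Assumption: the first `noise_frames` frames contain background
    noise only, so their average energy serves as the floor estimate.
    """
    energies = frame_energies(samples, frame_len)
    floor = sum(energies[:noise_frames]) / noise_frames
    return [e > ratio * floor for e in energies]


# Synthetic input at an assumed 16 kHz sample rate: a steady low-level
# hum, with a louder 200 Hz "voice" burst between samples 1600 and 3200.
fs = 16000
samples = [0.01 * math.sin(0.37 * n) for n in range(4800)]
for n in range(1600, 3200):
    samples[n] += 0.5 * math.sin(2 * math.pi * 200 * n / fs)

flags = detect_voice(samples)
print("frames flagged as voice:", [i for i, f in enumerate(flags) if f])
```

A real system would keep updating the noise floor as the room changes and would combine this decision with direction information from the microphone array, but the basic idea of comparing incoming energy against the background is the same.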

The array can listen for instructions, and then home in on the location of the voice by adjusting the phase of the incoming sounds, matching the different parts of each audio channel and eliminating noise from the signal. The result? Increased accuracy of the voice input, as other sounds are less likely to interfere, allowing different voices to be separated not only by their sound but also by their location in the room. Over time, the MEMS controller can identify the position of audio sources such as the TV, radio, or hi-fi system and either discount or double-check any inputs from these positions in the room.
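The phase adjustment described above is commonly done with a technique called delay-and-sum beamforming. The sketch below is a simplified illustration, not the Echo’s actual implementation: it assumes a straight line of seven microphones spaced 4 cm apart, a far-field source, and whole-sample delays, and every name in it (steering_delays, delay_and_sum, simulate_capture) is hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for room-temperature air


def steering_delays(mic_positions_m, angle_deg, sample_rate_hz):
    """Arrival delay at each mic, in whole samples, for a far-field
    source at angle_deg from a linear array (0 = straight ahead)."""
    sin_a = math.sin(math.radians(angle_deg))
    raw = [x * sin_a / SPEED_OF_SOUND * sample_rate_hz for x in mic_positions_m]
    earliest = min(raw)
    return [round(d - earliest) for d in raw]


def delay_and_sum(channels, delays):
    """Undo each channel's arrival delay, then average the channels."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


def simulate_capture(source, delays, total_len):
    """Hypothetical helper: the source as each mic would record it,
    shifted later in time by that mic's arrival delay."""
    return [[0.0] * d + source[: total_len - d] for d in delays]


def rms(signal):
    return math.sqrt(sum(s * s for s in signal) / len(signal))


fs = 16000                            # assumed sample rate in Hz
mics = [i * 0.04 for i in range(7)]   # assumed geometry: 7 mics, 4 cm apart
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(1600)]

# A 1 kHz "voice" arrives from 40 degrees to one side of the array.
voice_delays = steering_delays(mics, 40.0, fs)
captured = simulate_capture(tone, voice_delays, len(tone))

# Steer the beam toward the voice, then toward the opposite side.
on_target = delay_and_sum(captured, voice_delays)
off_target = delay_and_sum(captured, steering_delays(mics, -40.0, fs))

print("level with beam on the voice: ", rms(on_target))
print("level with beam pointed away: ", rms(off_target))
```

When the beam is steered at the source, the seven copies of the signal line up and reinforce one another; when it is steered elsewhere, they arrive out of step and partly cancel. That difference in level is what lets an array favor one location in the room over another.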

Voice-based user interfaces are expected to play a significant role in the connectivity and communication functions of the Internet of Things. As noted in the white paper Just Talk! Voice Controllers for the Internet of Things:

“For all of us, language is the natural way to communicate with the many and varied smart devices connected to the Internet-of-Things. With NLI and voice capture technologies now economical for integration into consumer products, proliferation is inevitable. Existing electronic products will be updated with voice capture interfaces while different types of products will emerge, many of which will have no visible interface at all. Just as language has enabled humans to communicate within a rich and diverse society, so natural language interfaces (NLI) where interactions are based on everyday words and phrases, will enable us to communicate with this infrastructure of IoT-enabled products in an efficient and intuitive way. “How do I work this?” — the answer is simple: Just talk!”
