Voice Interface: The Touchscreen of the Next Century
How AI and Signal Processing Came Together to Track the DNA of Sound
With its compelling tagline, “Signal processing that thinks,” Boston, Mass.-based startup Yobe Inc. has created software that can accurately track a voice’s “DNA” in any auditory environment, opening up exciting possibilities in a world where humans have begun talking to virtual assistants named Alexa, Siri, and Google to simplify their lives.
But Yobe is more than a voice company. While voice has emerged as a “killer app,” Yobe is at its heart a signal processing company, and the human voice is just one of many auditory signals its technology can isolate, identify, track, and put to good use.
“A year and a half ago we pivoted to voice based on a strategic bet that Amazon was going to make the market. We successfully brought three lines of research together just in time to take advantage of the voice tsunami. Now we live in a world where talking to your connected devices is a common capability.”
— Ken Sutton, Yobe president, CEO, and cofounder
“The Touchscreen of the Next Century”
The idea of making voice the primary way we interact with our smart devices is not just about the convenience of hands-free command or ease of use; it’s something more intuitive.
It’s more human.
“If you ask me ‘Why voice?’ or better yet, ‘Why are we talking to our devices?’ I’ll ask you a question in return: ‘What is the most natural interface between two sentient beings?’ The answer, of course, is speech,” says Sutton.
“The ways we have been interacting with machines up until now have been artificial because these machines haven’t been able to hear us. The natural way to communicate with something is to talk to it. This is not an evolution. We’re really getting back to basics — and these basics will have profound implications.
“Voice will be the touchscreen of the next century.”
A Series of Sonic Breakthroughs
Behind Sutton’s bold prediction lie several sonic breakthroughs he and his Yobe co-founders — Dr. S. Hamid Nawab, chief scientist, and James Fairey, senior advisor/audio innovation — have made in the fields of signal processing, artificial intelligence, and broadcast studio methodologies.
These innovations occurred over the course of 30 years in completely separate research fields, which Sutton likens with good humor to having as much in common as Spanish, Yiddish, and Vulcan. That history not only adds depth to Yobe’s “overnight” success story but also offers a window into how brilliance, determination, good luck, and fate can intertwine to produce game-changing innovation.
A good place to start understanding the Yobe technology story is in the lab of Dr. Nawab.
Over the course of a distinguished 30-year career, Dr. Nawab used his advanced understanding of both signal processing and artificial intelligence — two highly specialized areas and skill sets rarely resident in the same person — to study an array of signal types including EMG signals, those biomedical markers that measure electrical currents during muscle contraction.
Dr. Nawab developed unique AI signal processing algorithms to decompose those EMG signals, isolating them so their relationship to individual muscle responses could be better understood and measured. Nawab was able to effectively separate individual EMG signals from noisy environments where multiple signals are firing.
In parallel to Nawab’s groundbreaking work, Fairey, a lifelong master of the music mixing business and radio studio production, was working to solve a problem close to his heart: his autistic son’s aversion to listening to music in closed-in environments.
Fairey took it upon himself to manipulate sound waves until he found a way of presenting them that his son would perceive favorably.
“What James stumbled across,” recalls Sutton, “was an audio file that passed muster with his son. However, the resulting sound was like nothing I had ever heard; it was like 3D or HD audio on steroids. Unexpectedly, when we compressed it, effectively reducing the amount of data on the file, something counterintuitive happened: it sounded even better.”
Fairey had stumbled upon a technique for signal repair. Manipulating signals typically damages them, which is one of many reasons, for example, that MP3 files can sound so tinny or hollow and why speech processing solutions sound artificial.
“When listening to a clip that was signal processed aggressively, you normally hear artifacts that negatively affect the sound quality. It won’t be natural sounding because you’ve damaged the underlying signal that you really want to preserve,” says Sutton.
The work to automate the manual studio process is where the story intersects with the AI and signal processing world of Dr. Nawab. After working diligently to create IP around Yobe’s broadcast studio technology and methodology for both sound enhancement and signal repair, Sutton found that they were able to repair signals that had been “ripped apart” by aggressive signal processing, a problem that had long challenged Dr. Nawab and other scientists in the field of signal processing.
“Our broadcast studio signal-repair methodology allows Yobe to use a lot of aggressive AI-driven signal processing science — the domain of Dr. Nawab — on the front end, while forgiving us on the back end, because we can post produce the signal to bring it back to its true sound,” says Sutton. “This also enables us to see deeper into the signal itself, identify its DNA, and link it to its individual source and meaning. In the case of voice-enabled applications, we can move the needle from basic speech recognition (where computers understand what is being said) to speaker recognition (where computers also understand who is saying it).”
Yobe’s proprietary combination of signal processing, artificial intelligence, and broadcast studio techniques is overseen by a master abductive reasoning module that applies each discipline in exactly the right measure, audio frame by audio frame. Armed with this technology, Yobe is enhancing the performance of voice-enabled applications in noisy environments. These are the real environments in which we speak: ones with open windows, ambient sound, and the cacophony of conversation all around us.
In other words, it is the “cocktail party problem,” the signal-processing world’s way of framing one of its fundamental, longstanding challenges: isolating a single voice amidst the clatter of the real world’s sonic canvas.
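For readers who want to see the shape of the problem, here is a toy sketch in Python. It is not Yobe's method, which the company keeps proprietary. It assumes two sound sources, two microphones, and a made-up set of mixing weights (`mixing`) that the code already knows. Every name here (`voice`, `chatter`, `mix`, `unmix`) is hypothetical and chosen only for illustration.

```python
import math

def mix(sources, matrix):
    """Each microphone hears a weighted sum of both sources."""
    (a, b), (c, d) = matrix
    s1, s2 = sources
    mic1 = [a * x + b * y for x, y in zip(s1, s2)]
    mic2 = [c * x + d * y for x, y in zip(s1, s2)]
    return mic1, mic2

def unmix(mics, matrix):
    """Invert the 2x2 mixing matrix to recover the sources.

    This works only because the weights are known in advance,
    which is never the case at a real cocktail party.
    """
    (a, b), (c, d) = matrix
    det = a * d - b * c
    m1, m2 = mics
    s1 = [(d * x - b * y) / det for x, y in zip(m1, m2)]
    s2 = [(-c * x + a * y) / det for x, y in zip(m1, m2)]
    return s1, s2

n = 200
# Two stand-in "sounds": pure tones at different frequencies.
voice = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
chatter = [math.sin(2 * math.pi * 13 * t / n + 1.0) for t in range(n)]

# Assumed mixing weights: how strongly each source reaches each microphone.
mixing = ((1.0, 0.6), (0.4, 1.0))

mics = mix((voice, chatter), mixing)
recovered_voice, recovered_chatter = unmix(mics, mixing)

# Largest sample-by-sample gap between the true and recovered voice.
max_error = max(abs(x - y) for x, y in zip(voice, recovered_voice))
print(f"largest recovery error: {max_error:.2e}")
```

With the weights in hand, recovery is simple arithmetic. The real challenge is that nobody hands you the weights, the number of sources is unknown, and rooms add echo and noise, which is why the problem has stayed open for so long.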
“So much work in the voice recognition space is and has been done in controlled, sterile environments, which just isn’t where we as humans live, work, play, and talk,” says Sutton. “We took a different approach, and it has paid off.”
That said, Sutton is just as happy not to discuss Yobe’s accomplishments in terms of the cocktail party problem. He respects the efforts of others too much to claim outsized credit, preferring instead to say, “We’ve come up with a unique way of managing and dealing with it.”
That way is now leading to a new generation of applications and capabilities that are making our conversations with machines safer, more secure, and more productive. It is also ensuring that the touchscreen of the next century will operate well in the real world, not just in a soundproof room.