An Emerging Application of Modern Voice Recognition Software

Did you see that episode of Netflix’s “Black Mirror,” titled “White Christmas,” where Don Draper, err… Jon Hamm, reveals that he once offered a service where individuals could undergo surgery to create a digital clone of their consciousness, which “lived” in a countertop appliance as a personal assistant to the homeowner? The one where the clone appears to have a physical body, free will, and a consciousness of her own, but is forced into servitude and obedience through months of solitary confinement that pass in only seconds or minutes of “real time?”

Jon Hamm enjoying toast while forcing the AI clone into perpetual submission

Still no? That’s okay. I’m mainly referencing the show because 1) I think it lives up to the hype, and 2) I want to talk about voice recognition software (and hardware).

Whether or not you realize it, many of us are exposed to and use modern voice recognition tools constantly. But what is it, and how has it evolved? Though interest in machine-assisted speech has a long history, modern voice recognition software got its start as a keyboard alternative that allows users to quickly and accurately translate their speech into text on a computer. While the average person types fewer than 60 words per minute, we’ve all met those who can speak at nearly double that rate. And while attempts at talking machines date back more than 200 years, only recently has voice recognition really taken off. With the release of Google’s voice search in 2008 and, subsequently, Apple’s Siri, Microsoft’s Cortana, Amazon’s Alexa, and Google Home, this technology has certainly captured the imagination of modern inventors and consumers alike. When used properly, the software can recognize what we say with impressive accuracy, and now that these companies have released devices that live not just on our phones but in our homes, consumers continue to purchase compatible technologies — refrigerators, security systems, light bulbs, and more — that these devices can link to and control.

PSA: Choose your words wisely

While eagerly opening your doors to these companies’ “listening” devices might be cause for concern, it’s also a good idea to consider what this means for those who could benefit from these voice recognition technologies. Along the same lines as my previous article on web accessibility for the visually impaired, it’s worth noting how modern speech recognition tools can improve life for those with other disabilities as well. While you may be well aware that consumers can use these devices to order groceries and household items, control other devices in their home, and search the internet, perhaps there is more to this voice recognition technology than what is commercialized.

At a macro level, voice recognition software “understands” what the user is saying by converting the sound waves of input speech (via an analog-to-digital converter) into a digital representation the computer can work with. It then sends this representation to a central server for decoding. Next, the program breaks the speech into decipherable parts, known as phonemes. By analyzing the order, composition, and context of the 44 distinct phonemes of the English language, the software can “guess” what the user is trying to say and respond accordingly. We have the programmers of voice recognition technology to thank for doing the grunt work of training their software to match speech patterns with their corresponding text equivalents.
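To make that pipeline a little more concrete, here is a minimal Python sketch using the open-source SpeechRecognition package, which, much like the assistants above, captures audio locally and hands the heavy decoding work to a remote service (Google’s free web API in this example). It is only an illustration of the capture-then-decode flow, not how Siri, Alexa, or any particular assistant is built internally.

```python
# A minimal sketch of the speech-to-text flow described above, using the
# open-source SpeechRecognition package (pip install SpeechRecognition).
import speech_recognition as sr

recognizer = sr.Recognizer()

# Capture a short utterance from the default microphone
# (microphone access also requires the PyAudio package).
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)  # calibrate for background noise
    print("Say something...")
    audio = recognizer.listen(source)            # analog sound -> digital audio data

try:
    # The phoneme- and language-level analysis happens on the remote server.
    text = recognizer.recognize_google(audio)
    print("You said:", text)
except sr.UnknownValueError:
    print("Could not understand the audio.")
except sr.RequestError as err:
    print("Could not reach the recognition service:", err)
```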

If these assistive technologies can understand language and translate it into text and actions, it follows that the opposite is also possible. A counterpart to speech-to-text programs, text-to-speech software can help the millions of people who are unable to speak to communicate, and voice recognition is at the forefront of this evolving technology. Individuals lose or never develop the ability to speak for a variety of reasons — cerebral palsy, brain injury, stroke, multiple sclerosis, and autism, to name a few — and over 2 million people in the U.S. alone require alternative forms of communication.

The device famously used by Stephen Hawking, while a revolutionary advancement in “augmentative and alternative communication” and one used by millions, also demonstrates how a person’s voice is an essential component of their identity. One’s voice provides valuable information about the speaker — age, inflection, intonation, gender, mood, emotion, etc. — but until recently, text-to-speech technologies were limited by the variety of voices available to speechless users.

Jordan Kisner writes in The Guardian,

“Walk into a classroom of children with voice disorders and you’ll hear the exact same voice all around you,” Rupal Patel of VocaliD told me. Ten years ago, she was at a speech disorders conference when she came upon a little girl and a man in his mid-50s who were using their devices to have a conversation. They were speaking in the same adult, male voice. Patel was horrified. “This is just continuing to dehumanise people who already don’t have a voice to talk,” she told me.

This lack of variety has robbed people of a major contributor to individuality and identity. Fortunately, Rupal Patel and her team at VocaliD have created an online “voice bank” where volunteers record themselves reading strategically composed stories designed to capture all of the phonemes (see above) of the English language. Using these recordings — just 1,000 sentences, though more is better — VocaliD is able to create a voice and add it to a database of donated voices. Clients who have lost their ability to speak can then choose a voice that most closely matches their own. Those who have always been speechless can select whichever voice they feel best suits them because, as Patel says,

“We wouldn’t dream of fitting a little girl with the prosthetic of a grown man. Why then the same prosthetic voice?”

Watch Rupal Patel’s TED Talk here.

VocaliD’s technology is designed to combine the “source” (the unique sounds produced by the vocal cords, larynx, and throat muscles), taken from the recipient, with the “filter” (the articulation that shapes those vibrations into words), taken from the donor, to produce a personalized voice for the client. The voice is then added to the client’s own communication devices.

VocaliD voices are crafted using contributions from the recipient and donor
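For readers who want a rough feel for that “source” and “filter” idea, here is a toy Python sketch (assuming NumPy and SciPy are available) that pushes a pulse-train source through a few formant filters to produce a vowel-like sound. The pitch, formant frequencies, and output filename are illustrative values from the textbook source-filter model of speech; this is a simplification for intuition, not VocaliD’s actual voice-blending method.

```python
# Toy illustration of the source-filter model of speech:
# a glottal pulse train stands in for the "source" and a cascade of
# formant resonators stands in for the "filter".
import numpy as np
from scipy.signal import lfilter
from scipy.io import wavfile

fs = 16000        # sample rate (Hz)
duration = 1.0    # seconds of audio to generate
pitch = 120       # fundamental frequency of the "source" (Hz)

# Source: an impulse train at the speaker's pitch (the vocal-fold vibration).
n = int(fs * duration)
source = np.zeros(n)
source[::fs // pitch] = 1.0

# Filter: second-order resonators at rough formant frequencies for the vowel "ah".
def resonator(signal, freq, bandwidth, fs):
    r = np.exp(-np.pi * bandwidth / fs)
    theta = 2 * np.pi * freq / fs
    a = [1.0, -2.0 * r * np.cos(theta), r ** 2]  # poles set the resonance
    b = [1.0 - r]                                # crude gain normalization
    return lfilter(b, a, signal)

voice = source
for formant, bandwidth in [(700, 130), (1220, 70), (2600, 160)]:
    voice = resonator(voice, formant, bandwidth, fs)

# Normalize and save so you can listen to the (very robotic) result.
voice = 0.9 * voice / np.max(np.abs(voice))
wavfile.write("ah.wav", fs, (voice * 32767).astype(np.int16))
```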

So, as we explore the potential of voice recognition programs and devices, it’s important to look beyond the ease with which we can now re-order toilet paper and appreciate how some people are using that same technology to create voices and newfound identities.
