When our machines first began speaking to us, it was in the simple language of children. Some of those voices were even designed for kids: my Speak & Spell was a box with a handle and a tiny green screen that tested my skills in a grating tone, but I still hear that voice sometimes in my dreams. Teddy Ruxpin’s words played from cassette tapes popped into his back, but his mouth moved at just the right cadence, which made him feel almost alive. At least to a kid.
For adults, however, the clunky computerized voices of the 1980s, ’90s, and early aughts were far from real. When the train’s voice announced that the next stop was “Port Chester,” in two words instead of “porchester,” we knew: That was a machine. It could not know that we New Yorkers pronounced this place as one word, not two. It was simple: A voice that sounded human was a person; a voice that sounded like a machine was a machine.
This was fine when all we needed were announcements that were basic, short phrases. But if there is a fire on the train, we all instinctively want to hear a human voice guiding us — and not just because it would calm our nerves. It’s because, as studies have shown, mechanized voices are very difficult for us to comprehend for anything longer than a short sentence. We’ve evolved to read nonverbal voice cues while we listen to our fellow humans, and we get distracted when they’re missing — that distraction is what makes computerized voices tough to follow.
If we are going to replace assistants (or ourselves) with Google Assistant, or if we want a real conversation with the Alexa of the future, it has to converse like a human, responding to verbal cues and following the rhythm, music, and often freewheeling flow of human conversation. To be truly useful to us, in other words, computers need to sound human. And that’s extremely difficult.
What stands in the way? Prosody. That’s the intonation, tone, stress, and rhythm that give our voices their unique stamp. It’s not the words we say—it’s how we say them. “The secret to the human voice is the melodies,” says Emma Rodero, a professor in the Department of Communication at Pompeu Fabra University in Barcelona. Rodero has researched nonhuman voices extensively and says that outside the actual words we use, there’s so much going on that it’s tough to teach a computer all of it.
“Intonation is a combination of four qualities: tone (the most important), speech rate, intensity, and loudness. I can do multiple combinations of those when I talk. Siri can’t,” Rodero says. She says she has worked with voice engineers and provided them with a list of intonations connected to emotions, including joy, sadness, and everything in between. But machines have an inherent limitation: They can spit out only what we put in, and each of us is unique in myriad ways. “When you are happy, you have a lot of ways to express this happiness in your voice. The problem is that we cannot put that into a computer,” Rodero says. “This is a problem for engineers: Algorithms are limited, but my voice is not limited.”
Tech companies have gotten around some of this by choosing, from the get-go, a human voice with lots of personality to feed into their A.I., which then puts the recorded sounds together in new combos to form speech. When it came time to choose the voice for IBM’s Project Debater (an A.I. designed to debate humans), the company held an audition and chose 20 voice actors. The winner was picked via a subjective judgment by the IBM team, who asked themselves questions about which debate-style voice they preferred: “Was I moved? Did he or she convince me? Did they have the right amount of persuasion and passion?” says IBM’s Andy Aaron, who worked on Project Debater.
That was just the start of creating the Debater voice: “We collected something like 150,000 words [from our voice actor], which amounts to 20 hours of speech recorded,” says Ron Hoory, also with IBM’s Project Debater. “A team of labelers had to spend a lot of time to annotate it according to word emphasis and then run a lot of analysis. Then we divided that into phonemes, and for each one, we have a lot of metadata (whether the pitch is low or high, the duration, etc.), and we had to extract intonations. Then we had to do a lot of manual correction.” They also applied deep learning to get prosody right, or at least close, Hoory says.
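For readers who want to picture that bookkeeping, here is a minimal Python sketch of what one annotated phoneme record might look like. It is an illustration only, not IBM’s actual format: the `PhonemeUnit` class, its field names, the two helper functions, and every number are hypothetical, chosen to mirror the kinds of metadata mentioned above (pitch, duration, and word emphasis).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhonemeUnit:
    """One recorded phoneme plus illustrative metadata (hypothetical schema)."""
    symbol: str         # phoneme label (ARPAbet-style code, assumed here)
    duration_ms: float  # how long the sound lasts, in milliseconds
    pitch_hz: float     # average pitch of the sound, in hertz
    emphasized: bool    # True if labelers marked the parent word as stressed


def pitch_level(unit: PhonemeUnit, speaker_mean_hz: float) -> str:
    """Label a unit "high" or "low" relative to the speaker's mean pitch."""
    return "high" if unit.pitch_hz > speaker_mean_hz else "low"


def word_duration_ms(units: List[PhonemeUnit]) -> float:
    """Total duration of a word, summed over its phonemes."""
    return sum(unit.duration_ms for unit in units)


# Made-up values for the word "hello"; these are not real measurements.
hello = [
    PhonemeUnit("HH", 60.0, 180.0, True),
    PhonemeUnit("AH", 80.0, 210.0, True),
    PhonemeUnit("L", 70.0, 200.0, True),
    PhonemeUnit("OW", 150.0, 230.0, True),
]

# Compare each phoneme's pitch with the average across the word.
mean_pitch = sum(unit.pitch_hz for unit in hello) / len(hello)
levels = [pitch_level(unit, mean_pitch) for unit in hello]

print(word_duration_ms(hello))
print(levels)
```

Multiply a handful of fields like these across 20 hours of recorded speech and it becomes easier to see why the labeling and manual correction took so much time.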
Amazon is very focused on prosody for Alexa and also spent time looking for the right voice that “has a personality to reflect Alexa’s persona — smart, humble, and helpful,” says Manoj Sindhwani, the director of Alexa Speech at Amazon. But that voice will differ depending on what Amazon calls “locales.” Alexa now speaks in six languages, and its programming reflects 14 localized experiences. “We select a new voice that appeals to our customers in that locale, making sure that the voice reflects the Alexa persona [there], building language understanding, helping her to understand semantics and context that may differ by region, and developing a local ‘personality’ that will surprise and delight customers,” Sindhwani says. The aim is not just one natural-sounding voice, but many, each matching a specific group of people it serves.
So what we hear now and in the near future are manipulated human voices, chosen for us by the people who create them: a voice-only Frankenstein, mostly limited to repeating your grocery list.
Alexa’s voice is also being programmed to be context aware: It can speak differently depending on the setting. “We have used context to make Alexa’s decision-making smarter… even beyond recognizing and understanding words,” Sindhwani says. This ability to vary speaking style based on context matters, because how we speak to our fathers, to our bosses, or during a presentation varies naturally. A really smart voice should do the same. The Amazon team is getting closer: Alexa can even understand when it’s being whispered to, and will whisper back.
We are still on that edge before fake voices seriously compete with the real. Tech’s vocal mashup is still relatively easy to pick out as faux. IBM’s Project Debater, whip-smart as it is at arguing in classic debate style, can only debate. Alexa does its best to respond to general conversation but fails when it’s pushed beyond a certain set of what Amazon calls “skills.” Similarly, Google Assistant responds to “actions.” In either case, it falls on the human to learn how to speak to the machine.
Still, for all those complications, experts believe we’re just a few breakthroughs away from computers that can converse with humans. Getting there will solve a host of technological issues but will introduce just as many legal and ethical ones. When Google first demoed its new Duplex technology last year, it was a remarkable moment: The Google Assistant voice was so natural sounding when it called and asked for a salon appointment and made a dinner reservation (two tasks it had been deeply trained to carry out) that the audience was delighted… and freaked out. Zeynep Tufekci, a professor at the University of North Carolina at Chapel Hill who studies tech’s social impacts, called it “deceitful” and “so obviously wrong” on Twitter. She was far from the only one disturbed by the fact that the harried human worker on the other end of the phone seemed to have no idea they were talking to a machine. It’s a breakthrough potentially ripe for abuse.
Google said it will be “designing this feature with disclosure built in, and we’ll make sure the system is appropriately identified,” and in its very earliest iteration (calling select restaurants for reservations), it appears to do so. IBM’s Andy Aaron sees this as a positive step. “As these voices get better, it’s important for the system not to trick you,” he says. For all the effort to make a voice that talks like a person, “you want a signal to the listener that it’s a robot.”