In Audio Narration, Does Humanity Matter?

As text-to-speech and AI technology improve, there’s still a case to be made for the power of the human voice

By Hayley Grgurich for Gyst Audio

In early May 2018, Google CEO Sundar Pichai stood in front of a big screen and a rapt festival crowd to demo a robot voice that sounds startlingly lifelike. The voice is that of the Google Assistant, and the audience watched the play-by-play as it called a salon, carried on a natural-sounding conversation, and booked a haircut for its “client,” Lisa.

Watching the video, you can hear a collective, gleeful giggle from the crowd as the bot tosses in conversational fillers like “Mm-hmm” to the unsuspecting human on the other end. It almost feels like the bot is showboating, showing off just how lifelike it can be.

Computer-generated speech technology like the Google Assistant is advancing rapidly, and its potential applications are vast. There’s the personal-assistant use case, as championed by Google, but there’s also the opportunity to improve resources for the visually impaired, including making it easier, faster, and cheaper to produce audio descriptions of the visuals in television and film, or to convert written text to sound files with the click of a button.
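That last use case is already within reach of a few lines of code. As a rough illustration (not a tool mentioned in this article), here is a minimal sketch using the open-source gTTS Python library to turn a string of text into an MP3 file; the library choice and output filename are assumptions made for the example only.

```python
# Minimal sketch: turning written text into a sound file with an
# off-the-shelf text-to-speech library. gTTS is used here purely as an
# illustration; the article does not name or endorse a specific engine.
from gtts import gTTS

article_text = (
    "Computer-generated speech technology is growing rapidly, "
    "and its potential applications are vast."
)

# Synthesize the text and write it to disk as an MP3 file.
tts = gTTS(text=article_text, lang="en")
tts.save("narration.mp3")  # hypothetical output filename
```

Whether listeners respond to that kind of machine-generated narration the way they respond to a human voice is the question the studies below take up.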

In an era when robots can not only sound like us but even respond like us, is there still value in using human speech in audio? Or, going forward, should we just ‘make Google do it’?

The science is mixed. A study published in Current Biology found that humans had the same neurological response whether they performed an action themselves or watched another human perform it. That response was absent, however, when they watched a robot perform the same action.

Given the selective preference for biological models when coding movement, do our brains also discriminate between human and artificial speech?

A Washington University in St. Louis study found no significant difference in listeners’ attention to content, or their retention of its details, whether they heard a human voice or a computer (although retention improved for everyone who read the text while listening to the narration). Speaking of which, you can listen to the human-voiced version of this article here.

Still, a study published in the Journal of Specialised Translation found that visually impaired participants showed a modest preference for human voices when rating recorded audio descriptions of TV and film. When they were asked in a post-test questionnaire whether they generally preferred human or synthetic voices for audio description, the bias was more pronounced: a full 81 percent said they favored human speech.

As good as it’s gotten, it seems we still can’t help but listen to computer speech and hear what’s missing. What’s missing is the soul. What’s missing is feeling. What’s missing is understanding. And maybe in more ways than one.

Arjen Stolk, a postdoctoral fellow at Berkeley’s Knight Lab for cognitive neuroscience research, described the findings of a study he co-authored on the extra information human brains use to communicate outside the bounds of pure language.

“Humans take into account what they believe they mutually know,” says Stolk. “As interaction unfolds, they continuously seek and provide evidence for this ‘mutual understanding’, thereby developing a unique and dynamic ‘shared cognitive space’ that is continuously informed by their past interactions and current context.”

“In contrast,” says Stolk, “algorithms used by virtual assistants such as Apple’s Siri are limited to information contained by the words. This information is not dependent on what we believe we both know, but on statistical regularities abstracted from many texts.”

Human communication therefore transcends the mere exchange of words: it builds a shared space, creates a common context, and even draws on recall of past, unrelated events to help participants fully understand not only what’s being said, but what’s meant.

It also involves something computer speech never will: lived experience. In a Mental Floss article on the art of voiceover narration, Tavia Gilbert, narrator of more than 500 audiobooks, says that, just as much as vocal cues like “she whispered” or “he shouted,” she scans new material for physical descriptions of characters to color her reads.

“I’m looking for whatever each character says about themselves or other characters, including their physical description, which affects how somebody sounds,” she says. “An elderly woman with a severely hunched back and hands that flutter like a bird will sound very different than an elderly woman who was a prima ballerina in her youth and still keeps her hair pulled back in a perfect bun.”

For companies like Medium (which produces human-voiced versions of its top articles), Gyst Audio, and Audible (which offers human narration for some titles and computer-generated text-to-speech for others), this extra human touch wins out. You aren’t having a conversation with the human narrator voicing your queue of articles in the Gyst app, but the company’s founder, Osa Osarenkhoe, still feels strongly that real human voices enhance the listening experience.

“Human voices are better,” Osarenkhoe says. “They’re smoother. Machine speech almost has its own dialect. When you listen to a machine read, you have to convert the words back to text in your head and then re-read it.”

Clearly, the choice between human voices and text-to-speech automation isn’t an easy one for audio publishers. Text-to-speech is cheaper at scale. It’s also faster, since computers don’t stumble or take sick days. Even so, until a computer can approach material with the warmth of recognition, the empathy of understanding, and the texture and timbre of a physical body, it seems that, for top publishers and their listening audiences, there is no substitute for the real thing.

