The Era of Voice: From Keyboards to Vocal Cords

By 2018, 30% of our interactions with technology will be through “conversations” with smart machines. Product leaders at technology and service providers need to invest now to improve currently limited voice interfaces.
— Gartner

We used to call them dumb and neglect them in isolated rooms. We now call them smart and interact with them in our living rooms. This Cinderella story is no fairy tale; it's the evolution towards the charmed existence of modern machines. The smarts are derived from artificial intelligence and powered by the cloud, while the interactions are shifting from keyboards to vocal cords.

The era of voice is the result of a confluence of factors:

  • On average, humans can speak 150 words per minute vs. type 40 words per minute.
  • Automatic speech recognition is approaching a 95% accuracy rate. By comparison, humans miss ~5% of words in a conversation.
  • The proliferation of devices with microphones and speakers paired with low-latency cloud computing.
  • Digital assistants (e.g. Siri, Alexa) have replaced robotic speech with life-like speech.
Voice is now a simpler, more convenient, and faster alternative to the keyboard.

Voice-First

“Design for the ear not the eye — throw away what we know about design today and start fresh…We need to focus on how things sound, not how they look.”
— Paul Cutsinger, Amazon Developer Evangelist

Devices designed for voice input/output are growing at an exponential rate.

From hearables (Apple AirPods) to smart speakers (Amazon Echo), voice-first hardware is everywhere.

Speech Design

As we shift from graphical workstations to conversations, end user experience remains paramount. But without screens, how can we delight through a voice user interface (VUI)?

Here are some speech design considerations:

  • Add breaks in the speech for dramatic effect
  • Vary volume, rate, and pitch to generate natural sounding speech
  • Personalize speech with regional dialect pronunciation
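The considerations above map directly onto SSML (Speech Synthesis Markup Language), which Alexa skills use to shape spoken output. The sketch below is illustrative: the tags (`<break>`, `<prosody>`) are part of Amazon's documented SSML subset, but the helper function and sample phrases are hypothetical, not taken from any particular skill.

```python
# Minimal sketch: compose Alexa-style SSML that adds a dramatic pause
# and varies volume, rate, and pitch for more natural-sounding speech.
# The build_ssml helper and the phrases are illustrative assumptions.

def build_ssml(text, pause_ms=500, volume="loud", rate="medium", pitch="+5%"):
    """Wrap text in SSML with a pause for effect and prosody variation."""
    return (
        "<speak>"
        f"{text}"
        f'<break time="{pause_ms}ms"/>'          # dramatic pause
        f'<prosody volume="{volume}" rate="{rate}" pitch="{pitch}">'
        "Welcome home."                           # delivered with varied prosody
        "</prosody>"
        "</speak>"
    )

print(build_ssml("You have one new message."))
```

A skill would return this string in the `outputSpeech` field of its response with type `SSML` rather than `PlainText`.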
8 People Test Their Accents on Siri, Echo and Google Home

With a 4.9% word error rate (reported at Google I/O 2017), voice-first devices have yet to master the input of accents. But can they output a convincing accent?

For the answer to this question, consider how we can teach Alexa to speak with a Boston accent.
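One way to approximate an accent (a sketch, not the original tutorial) is Alexa's SSML `<phoneme>` tag, which overrides the default pronunciation with an IPA transcription. The r-dropping below ("pahk the cah") gestures at a Boston accent; the IPA strings are illustrative assumptions.

```python
# Sketch: use Alexa's documented <phoneme> SSML tag to force non-rhotic
# ("r-dropping") pronunciations, approximating a Boston accent.
# The IPA transcriptions here are illustrative, not authoritative.

boston_ssml = (
    "<speak>"
    '<phoneme alphabet="ipa" ph="pɑːk">park</phoneme> the '
    '<phoneme alphabet="ipa" ph="kɑː">car</phoneme> in Harvard Yard'
    "</speak>"
)
print(boston_ssml)
```

Each `<phoneme>` element tells the synthesizer to render the enclosed word using the supplied IPA string instead of its dictionary pronunciation.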

Voice-Next

“This year, 35.6 million Americans will use a voice-activated assistant device at least once a month.”
— eMarketer

Currently, the Amazon Echo is primarily used for simple tasks, such as setting a timer/alarm, playing a song, and reading the news.

The initial Amazon Echo smart speaker operates most effectively in a hands-free or eyes-free environment.

However, the next iterations of the Echo (Echo Show and Echo Look) will incorporate:

  • A screen to visually display voice responses and enable video chat in the Echo Show
  • A camera that will initially serve as a styling assistant in the Echo Look
  • Opt-in push notifications that proactively alert users, coming to all Echo versions

Hollywood Influence

Speaking to a machine for assistance is nothing new. Hollywood has been indoctrinating us for decades.

  • Digital Assistants: “2001: A Space Odyssey” (HAL 9000), “Her” (Samantha), “Marvel’s Avengers” (Jarvis)
  • Mobile Machines: “Star Wars” (R2-D2, C-3PO), “Knight Rider” (KITT), “Short Circuit” (Johnny 5), “Wall-E”, “Avengers” (Ultron)
  • Organic Hybrids: “Star Trek” (Data), “Blade Runner” (Replicants), “Prometheus” (David), “Marvel’s Avengers” (Vision), “Terminator”, “Ex Machina”, “I, Robot” (Sonny), “Ghost in the Shell” (Major)

In a world where humans and machines coexist, the line between natural speech and synthetic speech becomes blurred.

While this may prove a challenge for some…

SNL’s Echo Silver parody touches upon complications related to volume and personalization.

…most will adopt and embrace our robot assistants. Digital natives will lead the charge.

Today’s children expect to swipe on screens and converse with machines.
Hi Robot…I love you Robot

Towards Ambient Computing

“As speech-recognition accuracy goes from 95% to 99%, we’ll go from barely using it to using it all the time!”
— Andrew Ng, co-founder/chairman of Coursera and former chief scientist at Baidu

The evolution of human-computer interactions is just beginning. As technological innovations bring about exponential improvements, synthetic interactions will become indistinguishable from human counterparts.

ThoughtWorks — Interact or Die Trying