In voice interaction, slow = stupid

In voice interaction, we read a long delay in a device’s response as a lack of intelligence. Anything more than a second and we start to grow impatient, tap our toes, or roll our eyes.

We shouldn’t be too hard on ourselves for that, though. A response that takes more than a second defeats the convenience that is the main argument for voice interaction.

There are several areas where latency can creep into an interaction:

Slow trigger time. The device takes too long to react to the wake word because the sampling period is too long.

Slow processing. Too many applications are running on the device locally for things like audio streaming to open quickly.

Local network latency. Older routers can struggle when many devices share them, and a weak WiFi signal can add a noticeable delay.

Slow Internet connection. This hurts most when the voice service requires multiple round trips.

Slow ASR / TTS / NLU service. Heavy traffic on these services can sometimes add latency.

And sometimes it’s all of the above. However, there are some ways of mitigating these slowdowns:

  • Acknowledging when the device has been triggered through lights, sound, or voice
  • Acknowledging the end of speech detection
  • Playing a canned local response if the server response takes more than 1 second
  • Implementing local secondary trigger words for common commands
  • Playing lights or sound while awaiting response

These will help make voice-based hardware devices seem less dumb.
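
As a rough illustration of the canned-response idea from the list above, here is a minimal Python sketch. It assumes the server call is an async function and that audio output is a simple callback; the names `respond`, `fetch_reply`, `play`, and the filler phrase are hypothetical, and the 1-second threshold is the one suggested in the list.

```python
import asyncio

CANNED_AFTER_S = 1.0  # threshold from the list above: 1 second


async def respond(fetch_reply, play, canned="One moment...", timeout=CANNED_AFTER_S):
    """Play the server reply; if it is slow, play a local filler first.

    fetch_reply: hypothetical async callable that returns the reply text.
    play: hypothetical callback that outputs text or audio on the device.
    """
    task = asyncio.create_task(fetch_reply())
    # Wait up to `timeout` seconds without cancelling the request.
    done, _pending = await asyncio.wait({task}, timeout=timeout)
    if not done:
        play(canned)  # stored locally, so it needs no network round trip
    reply = await task  # keep waiting for the real answer
    play(reply)
    return reply


async def _demo():
    played = []

    async def slow_server():
        # Stand-in for a slow ASR / NLU / TTS round trip.
        await asyncio.sleep(0.05)
        return "It is 72 degrees."

    # A short timeout so the demo finishes quickly.
    await respond(slow_server, played.append, timeout=0.01)
    return played


if __name__ == "__main__":
    print(asyncio.run(_demo()))
```

The request is not cancelled when the timeout passes; the filler only covers the silence while the real answer is still on its way. A fast reply skips the filler entirely.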