In voice interaction, slow = stupid

Published in Social Robots · Aug 5, 2016

In voice interaction, we read a long delay before a device responds as a lack of intelligence. Anything more than a second and we start to grow impatient, tap our toes, or roll our eyes.

However, we shouldn’t be too hard on ourselves for being impatient. If a response takes more than a second, it defeats the convenience that is the main argument for voice interaction.

There are several areas where latency can creep into an interaction:

Slow trigger time. The device takes too long to react to the wake-up word because the sampling period is too long.

Slow processing. Too many applications are running locally on the device, so things like the audio stream can't open quickly.

Local network latency. Older routers can struggle when multiple devices share them. Wi-Fi signal strength can also have a big impact.

Slow Internet connection. This hurts most when the voice service requires multiple round trips.

Slow ASR / TTS / NLU service. Heavy traffic on these services can add latency.

And sometimes it’s all of the above. However, there are some ways of mitigating these slowdowns:

Acknowledging when the device has been triggered through lights, sound, or voice
Acknowledging the end of speech detection
Playing a canned local response if the server response takes more than 1 second
Implementing local secondary trigger words for common commands
Playing lights or sound while awaiting response

These will help make voice-based hardware devices seem less dumb.
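
One of these mitigations, playing a canned local response when the server takes more than a second, could be sketched roughly as follows in Python. This is only an illustration under assumptions: `respond`, `fetch_response`, `speak`, and the phrase "One moment..." are hypothetical names made up for the example, not part of any particular voice platform.

```python
import asyncio

# Hypothetical filler phrase stored on the device (an assumption for this sketch).
CANNED_RESPONSE = "One moment..."


async def respond(fetch_response, speak, timeout_s=1.0):
    """Speak the server's reply, bridging the gap with a canned phrase if it is slow.

    fetch_response and speak are caller-supplied coroutine functions
    (hypothetical stand-ins for the cloud request and local audio output).
    """
    task = asyncio.ensure_future(fetch_response())
    try:
        # shield() keeps the request running even if the timeout fires.
        reply = await asyncio.wait_for(asyncio.shield(task), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The server is slow: acknowledge locally, then keep waiting.
        await speak(CANNED_RESPONSE)
        reply = await task
    await speak(reply)
    return reply


async def demo(delay_s, timeout_s=0.1):
    """Simulate a server that answers after delay_s seconds; return what was spoken."""
    spoken = []

    async def fake_server():
        await asyncio.sleep(delay_s)
        return "It is sunny today."

    async def fake_speak(text):
        spoken.append(text)

    await respond(fake_server, fake_speak, timeout_s=timeout_s)
    return spoken


if __name__ == "__main__":
    print(asyncio.run(demo(0.0)))  # fast server: only the real reply is spoken
    print(asyncio.run(demo(0.3)))  # slow server: canned phrase first, then the reply
```

The demo uses a 0.1 second timeout so it runs quickly; on a real device the threshold would be the one-second mark discussed above. The request is not cancelled when the timeout fires, so the real answer still plays as soon as it arrives.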

Written by Leor Grebler