Authentication & authorization in voice-driven conversational applications

Trust between two conversing entities comes down to knowing who you are talking to & whether they are the appropriate audience.

In traditional software, trust is established through user authentication & authorization, typically using a variety of methods such as usernames & passwords, two-factor authentication & role-based access control. In graphical UX these mechanisms are well established.

But with the emerging conversational UX, how would one accomplish the same? Most devices have some mechanism for device-level authentication & authorization (account linking, OAuth etc.), but how does the application ascertain that the person speaking can be trusted?


There are a few approaches that can be explored, each offering a different degree of trust.

  • Location Driven Trust
  • Voice Recognized
  • Temporary Passwords
  • Image Recognized

Location Driven Trust

This typically involves placing the device in a known location where it is protected by external physical security. Examples are conference rooms in an office or the waiting area at a mechanic's shop. This does not authenticate individual users, but it authorizes a pool of users, all of whom can access some common information, such as the average wait time at the shop, but cannot do anything else.
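As a rough illustration, location-driven trust can be reduced to a lookup from the device to its location, & from the location to the small set of requests anyone there may make. The device IDs, location names & intent names below are all hypothetical; a real platform would supply its own identifiers.

```python
# Hypothetical registry: device ID -> the physically secured location it sits in.
DEVICE_LOCATIONS = {
    "device-lobby-01": "waiting_area",
    "device-conf-07": "conference_room",
}

# Hypothetical policy: location -> intents that anyone at that location may invoke.
LOCATION_INTENTS = {
    "waiting_area": {"get_average_wait_time", "get_opening_hours"},
    "conference_room": {"get_room_schedule"},
}


def is_intent_allowed(device_id: str, intent: str) -> bool:
    """Authorize by location only; the individual speaker is never identified."""
    location = DEVICE_LOCATIONS.get(device_id)
    if location is None:
        # Unknown devices get no location-based trust at all.
        return False
    return intent in LOCATION_INTENTS.get(location, set())
```

With this policy, a request for the average wait time from the lobby device is allowed, while anything outside the shared set, such as a payment, is refused.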

Voice Recognized

Such a mechanism involves the application recognizing the user's voice in order to trust them. There are a few points to consider:

  1. A voice model of each user would have to be built, covering different modulations, scenarios, background noise etc.
  2. It relies on devices supporting it natively

Given this, the cost of user onboarding would be relatively high &, unless the device provider supports it natively, it will be difficult to implement. There are some interesting 3rd party providers who can help, but that would be an added component.
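To make the onboarding cost concrete, here is a minimal sketch of how a voice match might work once some speaker model (assumed here, & not shown) has turned each audio sample into a fixed-length vector. The enrolment-by-averaging step, the function names & the 0.8 threshold are illustrative assumptions, not any device's or provider's actual method.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def enroll(samples):
    """Build a user's voice profile by averaging several sample vectors.

    More samples (different moods, rooms, noise levels) give a more robust
    profile, which is exactly what makes onboarding expensive.
    """
    count = len(samples)
    return [sum(column) / count for column in zip(*samples)]


def is_same_speaker(profile, utterance, threshold=0.8):
    """Trust the speaker only if the new utterance is close to the profile."""
    return cosine_similarity(profile, utterance) >= threshold
```

The threshold trades convenience against security: a lower value rejects the real user less often but accepts impostors more easily.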

Temporary Passwords

Relying on a good old authentication mechanism, the voice-driven application would generate temporary tokens or passwords that are associated with the user’s identity. Users can obtain these through, say, an RSA SecurID-like token device or an app. This method can be useful for short-duration transactions such as making a payment, changing an appointment or preparing a reservation.
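One well-known way to produce such short-lived codes is a time-based one-time password in the style of RFC 6238, which authenticator apps commonly use. The sketch below is one possible approach rather than a prescribed one; it assumes each user has a shared secret provisioned in advance, & `verify_spoken_code` is a hypothetical helper name.

```python
import hashlib
import hmac
import struct
import time


def totp(secret: bytes, at_time: float, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (HMAC-SHA1, RFC 6238 style)."""
    counter = int(at_time // step)
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation offset
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify_spoken_code(secret: bytes, spoken: str, now=None, step: int = 30) -> bool:
    """Check a code the user read aloud, tolerating one previous time window."""
    now = time.time() if now is None else now
    # Keep only ASCII digits; speech-to-text may add spaces or punctuation.
    candidate = "".join(ch for ch in spoken if ch in "0123456789")
    for drift in (0, -1):
        expected = totp(secret, now + drift * step, step)
        if hmac.compare_digest(expected, candidate):
            return True
    return False
```

Accepting the previous time window gives the user a little slack to finish speaking the digits, at the cost of a slightly longer validity period.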

Image Recognized

In human interactions, people recognize & trust what they see.

This would involve an add-on device, triggered when the user interacts with, say, Google Home or Echo, that captures a picture which is then used to recognize the user, or captures their motion, to establish trust.

Like the voice-based method, this would also require training a model on users' images. But unlike voice-based training this would be simpler, given that it does not burden users with recording audio & there is enough data available via social media or other photo applications. This method provides a greater degree of trust, given that the application the user is conversing with can trigger continued checks, & it can be used for long-running transactions.
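The idea of continued checks during a long-running transaction could be sketched as a session whose trust lapses unless the camera keeps confirming the same user. The class name, the 60-second interval & the assumption that some image recognizer returns a user ID (or `None`) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TrustSession:
    """Trust that must be refreshed by periodic image checks."""
    user_id: str
    last_verified_at: float
    recheck_interval: float = 60.0  # seconds; illustrative value

    def is_trusted(self, now: float) -> bool:
        # Trust holds only while the last successful check is recent enough.
        return (now - self.last_verified_at) <= self.recheck_interval

    def record_check(self, recognized_user_id, now: float) -> bool:
        """Feed in the result of an image check made at time `now`."""
        if recognized_user_id == self.user_id:
            self.last_verified_at = now
            return True
        # A different or missing face ends trust immediately.
        self.last_verified_at = float("-inf")
        return False
```

The application would call `record_check` whenever the camera produces a result & consult `is_trusted` before each sensitive step of the conversation.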


These are suggestions on how security could be approached in voice-driven applications. As a new medium, voice opens up many challenges & opportunities in contrast to applications driven by visual UX.