How We Built a Smart Voice Activity Detection System Using Adaptive Custom Language Models

Dario Cazzani
Cisco Emerge
3 min read · May 31, 2017

The problem

Digital assistants, bots, and other voice-controlled devices need to know when a command has been spoken in its entirety before they can process it and give the user proper feedback.

Voice Activity Detection (VAD) systems determine when a user has finished speaking. However, VAD systems are oblivious to the actual words being spoken and instead base their determination on recognizing whether the analyzed sound is speech (as opposed to non-speech noise).
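
As a rough illustration of this word-oblivious behavior, here is a minimal sketch of an energy-based VAD; the frame length and threshold are illustrative assumptions, not the settings of any particular engine:

```python
# Minimal sketch of a conventional, word-oblivious VAD: it only checks the
# short-term energy of each audio frame, not what is being said.
# FRAME_MS and ENERGY_THRESHOLD are illustrative assumptions.
import numpy as np

FRAME_MS = 20            # analysis frame length in milliseconds
ENERGY_THRESHOLD = 1e-3  # frames above this RMS energy count as speech

def is_speech(frame: np.ndarray) -> bool:
    """Return True if this audio frame looks like speech (high energy)."""
    rms = np.sqrt(np.mean(frame.astype(float) ** 2))
    return rms > ENERGY_THRESHOLD

def trailing_silence_ms(frames: list) -> float:
    """Duration of the most recent run of non-speech frames."""
    silent_frames = 0
    for frame in reversed(frames):
        if is_speech(frame):
            break
        silent_frames += 1
    return silent_frames * FRAME_MS

# The utterance is treated as finished once trailing_silence_ms(frames)
# exceeds a fixed VAD_PATIENCE, regardless of whether the sentence is complete.
```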

A bit of background: Language Models

Formally, but without any math, we can define a language model as a function that assigns a probability to a sequence of words.
If you are not sure what a function is, picture a language model as a black box that takes any sequence of words as input and returns a number describing how likely that particular sequence is to occur in an English conversation.

A language model can also be used to determine whether a given sentence is complete.
As an example, Sentence 1 below is a complete sentence, while Sentence 2 is not:

  1. The cat ran down the street.
  2. The dog walks down the.
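
To make this concrete, here is a minimal sketch of a bigram language model built on a toy corpus. The corpus, the add-one smoothing, and the threshold are illustrative assumptions, but they show how the probability of an end-of-sentence token after the last spoken word separates Sentence 1 from Sentence 2:

```python
# Minimal sketch: a bigram language model that judges sentence completeness
# by how likely an end-of-sentence token is after the last spoken word.
# The toy corpus, smoothing, and threshold are illustrative assumptions.
from collections import defaultdict

START, END = "<s>", "</s>"  # sentence boundary tokens

corpus = [
    "the cat ran down the street",
    "the dog walks down the road",
    "the cat sat on the mat",
]

# Count bigram occurrences over the toy corpus.
bigram_counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = [START] + sentence.split() + [END]
    for prev, cur in zip(words, words[1:]):
        bigram_counts[prev][cur] += 1

def next_word_probability(prev: str, cur: str) -> float:
    """P(cur | prev) with add-one smoothing."""
    total = sum(bigram_counts[prev].values())
    vocab_size = len(bigram_counts) + 1
    return (bigram_counts[prev][cur] + 1) / (total + vocab_size)

def looks_complete(text: str, threshold: float = 0.1) -> bool:
    """A sentence looks complete if '</s>' is a likely next token."""
    last_word = text.lower().rstrip(".").split()[-1]
    return next_word_probability(last_word, END) >= threshold

print(looks_complete("The cat ran down the street"))  # True
print(looks_complete("The dog walks down the"))       # False
```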

The current solution

Conventional techniques take advantage of the language model used in speech-to-text engines to detect whether the spoken utterance is a complete sentence.

The amount of time the system waits for non-speech before processing the utterance (VAD_PATIENCE) is shortened or lengthened accordingly. However, these techniques do not account for the fact that different users issue commands in different ways, and above all they do not adapt to an individual user’s common sentences and way of speaking.

Illustration of a conventional solution as described above. “STT Engine” stands for “speech-to-text engine.”
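
As a rough sketch of that adjustment (the timeout values and the completeness predicate are assumptions, not the parameters of any real engine), the patience value might be chosen like this after every partial transcription:

```python
# Minimal sketch of adjusting VAD_PATIENCE from a completeness estimate.
# The timeout values and the toy predicate are illustrative assumptions.
from typing import Callable

SHORT_PATIENCE = 0.3  # seconds of non-speech to wait when the utterance looks complete
LONG_PATIENCE = 1.2   # seconds to wait when the utterance still looks unfinished

def vad_patience(partial_transcript: str,
                 is_complete: Callable[[str], bool]) -> float:
    """How long to wait for non-speech before processing the utterance."""
    if is_complete(partial_transcript):
        return SHORT_PATIENCE  # commit the command quickly
    return LONG_PATIENCE       # the user is probably mid-sentence; keep listening

# Toy predicate standing in for the STT engine's language model score.
def toy_is_complete(text: str) -> bool:
    return text.lower().split()[-1] not in {"the", "a", "to", "of", "and"}

print(vad_patience("turn off the lights", toy_is_complete))  # 0.3
print(vad_patience("turn off the", toy_is_complete))         # 1.2
```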

A flexible and customizable solution

When building a voice interface it is possible to use a third-party speech-to-text engine. However, this alone does not enable a smart VAD that can adjust VAD_PATIENCE, because the language model used inside a third-party speech-to-text engine cannot be customized or modified.

The solution we’ve come up with is to include a customizable language model in the client itself, so that every time the user issues a command the model can be shaped around the domain of commands the voice-activated device needs to respond to.

State-of-the-art language models use recurrent neural networks, while more portable solutions are based on n-grams and Markov models (e.g., the text predictor on mobile phones). Our portable solution lets the voice client carry out the adaptation locally, whereas a recurrent neural network would likely require the adaptation to run as a separate service.

Illustration of the solution described herein. In contrast to the conventional solution, the present solution takes advantage of the adaptive language model included in the client. This enables an adaptive and customized language model for each client/user.
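
Here is a minimal sketch of that client-side adaptation, assuming a simple bigram model; the class and method names are hypothetical, not our production API. Every confirmed command updates the counts, so the user's frequent phrasings are recognized as complete more quickly:

```python
# Minimal sketch of on-device adaptation with a bigram model: every command
# the user actually issues updates the counts, so the model gradually learns
# where that user's sentences tend to end. Names are illustrative assumptions.
from collections import defaultdict

START, END = "<s>", "</s>"

class AdaptiveBigramModel:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, command: str) -> None:
        """Update bigram counts from a command the user issued."""
        words = [START] + command.lower().split() + [END]
        for prev, cur in zip(words, words[1:]):
            self.counts[prev][cur] += 1

    def end_probability(self, partial: str) -> float:
        """How likely the sentence ends after the last transcribed word."""
        last = partial.lower().split()[-1]
        total = sum(self.counts[last].values())
        vocab_size = len(self.counts) + 1
        return (self.counts[last][END] + 1) / (total + vocab_size)

model = AdaptiveBigramModel()
for cmd in ["call alice", "call alice on video", "mute my microphone"]:
    model.observe(cmd)

# "call alice" ends with a word that often closes this user's commands,
# so its end-of-sentence probability is higher than for "call alice on".
print(model.end_probability("call alice"))
print(model.end_probability("call alice on"))
```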

Conclusion

It is possible to implement a smart and adaptive VAD even when using third-party speech-to-text services that do not allow language model customization. From a user experience point of view, the dialogue between the user and the voice-activated device becomes faster and more natural. Commands are processed sooner, and the whole experience improves as the device is used and learns which commands the user tends to issue.

About Cisco Emerge

At Cisco Emerge, we are using the latest machine learning technologies to advance the future of work.
Find out more on our website.

Dario Cazzani
Cisco Emerge

Violinist, Machine Learning Engineer and I also make good pasta