Voice Activity Detection for Voice User Interfaces

Rudy BARAGLIA
Jun 20, 2018 · 11 min read

As part of an R&D team at Linagora, I have been working on several speech-based technologies involving Voice Activity Detection (VAD): for OpenPaaS:NG, to develop an active speaker detection algorithm, and within the Linto project (the open-source intelligent meeting assistant), to detect the Wake-Up-Word and vocal activity.

Speech is the most natural and fundamental means of communication that we (humans) use every day to exchange information. Furthermore, with an average of 100 to 160 words spoken per minute, it is one of the most efficient ways to share information, far exceeding typing (~40 words per minute).

The fastest typists using a stenotype can reach up to 360 WPM, while the fastest speakers can reach up to 630 WPM (that's a lot of words!).

For a long time, from the beginning of the computer era until a few years ago, we interacted with computers only through our fingers. This paradigm changed once technology and computational power became sufficient, and affordable enough, to allow real-time processing of signals such as speech or video feeds.

Nowadays, Voice User Interfaces are spreading to most of our daily devices, such as smartphones (e.g. Cortana, Siri, Ok Google, …), personal assistants (e.g. Google Home, Amazon Echo, …) or Interactive Voice Response systems (banks, answering machines, …), with rather good performance. They all work pretty much the same way: when the device hears a specific Wake-Up-Word, it captures the audio feed, enhances it, determines whether it is speech or not, then transforms the raw signal into a more relevant representation to be processed by a Speech-To-Text engine, which outputs a transcription.

Let’s start with some basics in order to understand the difficulties of Voice Activity Detection.

Disclaimer: This is not a complete coverage of the matter, just some leads on existing approaches and methods that we successfully used in our projects.

Signal & Features

[Figure: Introduction of Don Juan bragging about the virtues of tobacco.]

Here it’s me pronouncing the first few sentences of Don Juan in French punctuated with various noises. We can notice that it is almost possible to guess what is speech from what’s not. This is because the background noise is very low which is the opportunity to introduce a metric: the Signal to Noise Ratio.

Signal to Noise Ratio.

Signal-to-noise ratio (abbreviated SNR or S/N) is a measure used in science and engineering that compares the level of a desired signal to the level of background noise.
SNR is defined as the ratio of signal power to the noise power, often expressed in decibels. A ratio higher than 1:1 (greater than 0 dB) indicates more signal than noise.
-Wikipedia

This metric is important and is used as a reference for the evaluation of VAD algorithms. Let's take a signal with a lower SNR:

[Figure: Don Juan with additive white noise.]

As you can see, it becomes a little bit trickier to guess where the speech parts are, and it becomes impossible when the SNR goes below 0 dB:

[Figure: Don Juan with more additive white noise.]

In that case there is still speech, and you can hear it, which means your brain can separate it from the noise … which means a computer should be able to do it as well.

The SNR is important to know as a specification for your VAD system because it may dictate the approach you should take to build an effective algorithm. If you know that the VAD will run in a quiet environment, you may use different features than if you are recording in a plane cockpit or in a car with open windows.

That being said let’s go back to the signal and take a look at the features we can extract from it.

Signal Analysis


Window’s width impacts the precision of the output, wider window means a better accuracy on the frequency scale (See below) when narrower window are more precise on the time scale.

Sliding windows with overlap are a way to mitigate the side effects for windows that straddle both speech and non-speech frames.

We mainly use windows of size 1024 samples (0.064 s at 16000 Hz) or 2048 samples (0.128 s), as they are powers of 2, with an overlap equal to half the window.
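To make this concrete, here is a minimal sketch of that framing step with NumPy (the function name is ours; the values match the ones above):

```python
import numpy as np

def frame_signal(signal, window_size=1024, overlap=512):
    """Split a 1-D signal into overlapping windows (hop = window_size - overlap)."""
    hop = window_size - overlap
    # Assumes len(signal) >= window_size.
    n_frames = 1 + (len(signal) - window_size) // hop
    frames = np.stack([signal[i * hop: i * hop + window_size]
                       for i in range(n_frames)])
    # A Hamming window attenuates the frame edges before any frequency analysis.
    return frames * np.hamming(window_size)
```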

Short-Term Energy.

[Figure: Short-term energy.]

Note: In green are the speech segments annotated by hand.

It is efficient for high-SNR signals but loses effectiveness as the SNR drops, until it becomes ineffective below 1 (0 dB). It also cannot discriminate speech from noises such as impact noise (dropping your pen on the table), typing, an air conditioner, or any noise as loud as or louder than the human voice.
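For reference, the feature itself is just the mean of the squared samples inside each window; a minimal sketch reusing the framing function sketched above:

```python
import numpy as np

def short_term_energy(frames):
    """Mean of the squared samples of each window (frames: array of shape (n_frames, window_size))."""
    return np.mean(frames ** 2, axis=1)
```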

From now on we will be working in another domain: the frequency domain.

Fourier Transform.

[Figure: Signal to spectrum.]

Spectrogram.

[Figure: Spectrogram.]

We can clearly see that the frequencies of the human voice are not random and follow a pattern.

Spectral Flatness Measure (SFM)

[Figure: Spectral flatness of both speech and non-speech.]
[Figure: Spectral flatness (inverted).]

Dominant frequency

The fundamental frequency of the human voice varies between 80 Hz and 180 Hz for males and between 160 Hz and 260 Hz for females.
The fundamental frequency of children is around 260 Hz, and that of a crying baby around 500 Hz.
This is the frequency produced by the vocal folds.

[Figure: Dominant frequency.]
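As a simple illustration, the dominant frequency of a window can be taken as the frequency of the strongest FFT bin (a sketch; assumes the 16000 Hz sampling rate and 1024-sample window used above):

```python
import numpy as np

def dominant_frequency(spectrum, sample_rate=16000, window_size=1024):
    """Frequency (in Hz) of the strongest bin of each window's magnitude spectrum."""
    bin_width = sample_rate / window_size             # Hz covered by one FFT bin
    return np.argmax(spectrum, axis=1) * bin_width
```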

Spectrum Frequency Band Ratio

[Figure: Spectrum frequency band ratio.]

MFCC, FBANK, PLP

These are chains of mathematical operations that aim to reduce and compress the amount of information while keeping the most relevant parts.

[Figure: MFCC processing chain.]

Unlike the previous features, which are single values, they return an array of values per window.
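As an example, libraries such as librosa expose these feature chains directly; a minimal sketch for MFCC (the file name and parameters are placeholders):

```python
import librosa

# Load the audio at 16 kHz and compute 13 MFCCs per analysis window.
y, sr = librosa.load("recording.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=1024, hop_length=512)
# mfcc has shape (13, n_windows): one vector of coefficients per window.
```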

Decision

Thresholds

  • How to determine the threshold?
  • How to accommodate context variation?
  • How many features?

Static threshold

Static thresholds are determined once and do not vary over time. They may be determined out of context for features that do not depend on the environment, such as frequency-wise features, or they can be determined at the beginning of the decoding on a few windows that are known to be non-speech.

For example, thresholds for features such as the Dominant Frequency or the Spectrum Frequency Band Ratio can be determined once, as speech frequencies do not vary depending on the recording context.
On the other hand, features such as Energy are bound to a specific context, and a fixed threshold may not fit real-life situations (e.g. if you turn on the air conditioner mid-recording, you may stay above the threshold for the rest of the recording).

Dynamic threshold

Dynamic thresholds are thresholds that adjust over time. The goal is to accommodate changes in the environment. To do so, the threshold is adjusted at every window so that it stays higher than the windows classified as silence and lower than the windows classified as speech.

Our implementation computes a weighted mean between the mean of the last x windows of silence and the mean of the last x windows of speech. For instance, we use a dynamic threshold for energy.
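A minimal sketch of that update rule (the weight `alpha` and the function name are assumptions):

```python
import numpy as np

def update_threshold(last_silence_energies, last_speech_energies, alpha=0.5):
    """Weighted mean between the mean energy of the last silence windows
    and the mean energy of the last speech windows."""
    silence_mean = np.mean(last_silence_energies)
    speech_mean = np.mean(last_speech_energies)
    return alpha * silence_mean + (1 - alpha) * speech_mean
```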


Decision rule using multiple features.

A decision rule can be set using AND or OR combinations. For example: classify a window as speech if both the energy and the fundamental frequency are higher than their respective thresholds. It can also be a 2-out-of-3 rule, and so on. Another approach is to set a weight for each feature and put a threshold on the weighted sum.
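For illustration, a 2-out-of-3 rule over the features described above could look like this (the thresholds and the frequency range are purely illustrative):

```python
def is_speech(energy, dominant_freq, flatness,
              energy_threshold, freq_range=(80, 500), flatness_threshold=0.3):
    """Tag a window as speech when at least 2 of the 3 feature tests pass."""
    votes = 0
    votes += energy > energy_threshold
    votes += freq_range[0] <= dominant_freq <= freq_range[1]
    votes += flatness < flatness_threshold   # voiced speech is less flat than noise
    return votes >= 2
```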

Machine Learning

The algorithm is fed with (a lot of) labeled audio. The audio, or audio features, goes in on one side and the label (0 or 1, as it is a binary classifier) on the other … and you know the drill.

(It will probably be the subject of another article, as we tend to use Artificial Neural Networks in more and more of our projects.)

Competing Signals

[Figure: Multiple inputs to …]
[Figure: … to a single output.]

Smoothing

One solution may be to increase the size of the analysis window. The other approach is to do an a posteriori smoothing.

Smoothing can be defined as a set of rules (here are the rules we use; a sketch in code follows the list):

  • To be considered speech, there must be at least 3 consecutive windows tagged as speech (192 ms). It prevents short noises from being considered speech.
  • To be considered silence, there must be at least 3 consecutive windows tagged as silence. It prevents too many cuts into speech, which would impact the speech rhythm.
  • If a window is considered speech, the previous 3 windows and the following 3 windows are considered speech as well. It prevents the loss of information at the beginning and at the end of a sentence.
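Here is a minimal sketch of these rules (the run-flipping implementation is our own way of expressing them):

```python
import numpy as np

def smooth(labels, min_run=3, padding=3):
    """Apply the three rules above to per-window labels (1 = speech, 0 = silence)."""
    labels = np.asarray(labels).astype(int)

    # Rules 1 and 2: a run of identical labels shorter than `min_run` windows
    # is flipped, so isolated blips of speech or silence disappear.
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            if i - start < min_run:
                labels[start:i] = 1 - labels[start]
            start = i

    # Rule 3: extend every speech window by `padding` windows on each side.
    padded = labels.copy()
    for i in np.where(labels == 1)[0]:
        padded[max(0, i - padding): i + padding + 1] = 1
    return padded
```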
[Figure: VAD with the 2 out of 3 rule.]

On the diagram above you can see the VAD process step by step. The first sub-diagram is the original signal annotated with the true speech segments. The next 3 diagrams are the features used, with their respective thresholds. The next one is the output of the 2-out-of-3 decision over the features, and the last one is the result after smoothing.

The downside is that the smoothing induces a delay between the recording and the actual detection. It is less than a second, but it can still be problematic.

Application

Detect a single word.

For the detection of the Wake-Up-Word we use the tool developed by Mycroft-AI to generate a GRU model: mycroft-precise. However, the built-in tool to collect audio samples didn't fit our requirements, as we aim to collect a lot of samples from a lot of different people spread across multiple offices. So we made our own, to be installed on a Raspberry Pi with a touch screen, mounted on a terminal.

The original method of recording was:
In a console: press a key, wait a few seconds, say the word, press a key, and repeat.
As it wasn't really user-friendly, we made a GUI with instructions, navigable with the touch screen. Furthermore, pressing a key to signal the end of the recording induced a lot of bad samples, as users tended to forget to press the key immediately. Moreover, pressing the key produced a sound that could be heard on the recording, which is bad. So we introduced VAD to prevent such problems from happening. Because the recording takes place in a quiet room, in that case we used only energy to detect the speech.

As a result, the number of wrong samples dropped significantly.

Detect a sentence.

After the detection of the Wake-Up-Word, the assistant listens for a command and must detect when the user stops talking in order to stop the process. Here, using energy alone wasn't working properly, because noises can happen during the recording and must not be considered speech. Likewise, the background noise is not known in advance. In that case we use two features to detect the end of speech:

  • Energy, with a dynamic threshold calculated prior to and during the process.
  • The frequency band ratio, computed during the recording as an additional condition.

With the decision rule: if silence is detected for more than 1 second, we consider that the sentence/command is over.
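Put together, that end-of-command rule boils down to counting consecutive silence windows; a sketch assuming the 512-sample hop at 16 kHz used earlier (the helper name is ours):

```python
def command_is_over(labels, hop_seconds=0.032, max_silence=1.0):
    """True once the trailing run of silence windows exceeds `max_silence` seconds."""
    trailing_silence = 0
    for label in reversed(labels):
        if label == 1:              # a speech window breaks the silence run
            break
        trailing_silence += 1
    return trailing_silence * hop_seconds > max_silence
```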

Conclusion

Thank you for reading. Cheers!

P.S.: An interesting article from which I started is "A simple but efficient real-time voice activity detection algorithm" (EUSIPCO 2009), which is worth the read.

