Machine Learning: Voice assistants can start responding to Whisper

Mihir Parmar
Published in ML Brew
8 min read · Apr 23, 2019

Speech technology is a type of communication technology that enables electronic devices to recognize, analyze, and understand spoken words or audio. Its subfields include speech processing and applications such as speech recognition, speech verification, Voice Conversion (VC), real-time speech-to-text conversion, interactive voice response (IVR), speech synthesis, and speech analytics.

Speech is more than just a signal.

A speech signal carries linguistic information together with many prosodic elements such as pitch, intonation, emotion, and accent. Because speech is a primary form of human communication, the growth of speech technology is an important step towards harnessing unstructured voice data. Machine Learning (ML) plays a vital role in developing the speech processing algorithms that make this possible. With the emergence of Artificial Intelligence (AI) solutions for enterprises, speech technology now has applications across sectors, including law, healthcare, security, finance, enterprise, and personal use. Personal-use voice assistants such as Siri, Google Home, and Amazon Alexa offer individualized speech technology experiences, and these Intelligent Personal Assistants (IPAs) have made our day-to-day lives easier and more comfortable. Speech processing and Natural Language Processing (NLP) using ML allow intelligent devices, such as smartphones, to interact with users through spoken language. Big companies like Apple, Google, Amazon, and Microsoft are doing remarkable work to improve these voice assistants. However, current IPAs still struggle to recognize whispered speech. We tried to address this problem, to some extent, using ML-based algorithms in our research work:

“Novel MMSE DiscoGAN for Cross-Domain Whisper-to-Speech Conversion”, published in Machine Learning in Speech and Language Processing (MLSLP), Google, Hyderabad, India, September 7, 2018. (See our research work here).

From my point of view, Machine Learning is all about manipulating data in such a way that the machine can learn what we want it to. Before applying ML-based algorithms, understanding the nature of the data is most important: if you understand the data, you can do wonders. In enterprises, real product applications only add to the excitement of solving such problems with ML. In this blog, I first discuss the applications of WHiSPer-to-SPeeCH (WHSP2SPCH) conversion. Second, I focus on the nature of the data we have, that is, the properties of whispered and normal speech. Then, I describe why and how we used an ML-based algorithm to solve this conversion problem.

Applications

Key reasons for whispering include keeping a conversation private, speaking in quiet environments such as a library, a hospital, or a meeting room, and various forensic applications. In addition, people suffering from vocal fold-related diseases can sometimes produce only whispered speech. In such cases, a WHSP2SPCH conversion technique is required to make the whispered speech more intelligible and improve the quality of communication (in person or over the telephone). To make this happen using ML, we first have to understand the properties of both types of speech.

Whispered vs. Normal Speech

Although whispered and normal speech are both modes of communication, they differ from a speech production and perception perspective. The differences between whispered and normal speech include the following:

  • A complete absence of periodic excitation or harmonic structure in whispered speech;
  • A shift of the lower formant locations;
  • A change in the overall spectral slope (becomes flatter, with a reduction from the low-frequency region);
  • A shifting of the boundaries of vowel regions in the F1-F2 frequency space;
  • A change in both energy and duration characteristics.

Noise excitation in whispered speech is distributed across the lower portion of the vocal tract, which results in roughly a 20 dB reduction in power compared to the equivalent normal speech. Moreover, it has been observed that the cortical hemodynamic response is more pronounced for whispered speech than for its normal counterpart, due to the weaker stimulus. Technically, whispered speech is unvoiced, or aperiodic. However, a sensation of pitch still exists, encapsulated in an intricate way in the whispered signal. Hence, we predict the fundamental frequency (F0) from the cepstral features (i.e., Mel Cepstral Coefficients) of the converted normal speech instead of predicting it directly from the cepstral features of the whispered speech, as sketched below.
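To make that cascaded strategy concrete, here is a minimal sketch; mcc_mapper and f0_predictor are hypothetical placeholders for the trained mapping models described later, not functions from our released code.

```python
import numpy as np

# Illustrative only: F0 is predicted from the *converted* cepstral features,
# not from the whispered cepstra directly. `mcc_mapper` and `f0_predictor`
# are placeholder callables standing in for the trained models.
def convert_utterance(mcc_whisper: np.ndarray, mcc_mapper, f0_predictor):
    """mcc_whisper: (n_frames, 25) MCC features of a whispered utterance."""
    mcc_converted = mcc_mapper(mcc_whisper)   # whispered MCC -> normal-speech MCC
    log_f0 = f0_predictor(mcc_converted)      # F0 predicted per frame from converted MCC
    return mcc_converted, log_f0
```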

Humans recognize cross-domain relationships quite naturally, thanks to an efficient perception mechanism; for machines, achieving the same ability is difficult. In other words, finding a mapping function from one domain to the other can be thought of as generating an image in one domain given an image in the other domain. Recently, Generative Adversarial Network (GAN)-based architectures have become popular for discovering such cross-domain relationships. However, traditional GANs fail to perform efficiently when it comes to cross-domain relationships. This makes it an interesting learning challenge in both domains, Computer Vision and speech technology.

Key Limitations and Solutions

The key limitation of a vanilla GAN-based system is that it may generate samples that do not correspond to the given input. To address this issue, we recently proposed using Mean Squared Error (MSE) as a regularizer for the vanilla GAN (i.e., MMSE-GAN) in this paper. Moreover, traditional GANs perform better on explicitly paired data and fail to maintain cross-domain relationships in the data; solving this requires a change in how traditional GANs are trained. Hence, we propose to apply MMSE-GAN and its extension, the Discover GAN (i.e., MMSE DiscoGAN), to learn the cross-domain relations (w.r.t. attributes of the speech production mechanism) between the whispered and the normal speech. A rough sketch of the regularized objective is shown below.
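As a rough, non-authoritative illustration of that idea (the precise formulation and the weight λ are defined in the MMSE-GAN paper), the generator objective adds an MSE term that ties each generated sample to its paired target:

```latex
% Illustrative sketch only: adversarial term plus MSE regularizer (weight \lambda assumed)
L_G \approx L_{adv}\big(D(G(X_w))\big) + \lambda \,\big\lVert G(X_w) - X_s \big\rVert_2^2
```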

Overview of the Algorithm

WHSP2SPCH conversion algorithm

We used whispered speech and its corresponding normal speech from a total of 40 speakers, drawn from two databases, namely, the CHAracterizing INdividual Speakers (CHAINS) Speech Corpus and the Electromyographic (EMG)-UKA Trial corpus. From both databases, a total of 1302 and 108 utterances were taken for training and testing, respectively. 25-D Mel Cepstral Coefficients (MCC) (including the 0th coefficient) and 1-D F0 per frame (with a 25 ms frame duration and 5 ms frame shift) were extracted using AHOCODER. However, one of the main issues before learning the mapping function for the WHSP2SPCH conversion system is the time alignment between the whispered speech and its corresponding normal speech. To that end, we used the Dynamic Time Warping (DTW) algorithm; a minimal alignment sketch is given below. For more architecture details, please refer to our paper.
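As a minimal sketch of that alignment step, assuming MCC matrices of shape (n_frames, 25) and using librosa's DTW purely for illustration (the actual pipeline extracts features with AHOCODER):

```python
import numpy as np
import librosa

# Illustrative only: frame-align a whispered/normal MCC pair with DTW
# before training, so that paired frames can be used as input/target.
def align_pair(mcc_whisper: np.ndarray, mcc_normal: np.ndarray):
    # librosa expects features with shape (n_dims, n_frames)
    _, wp = librosa.sequence.dtw(X=mcc_whisper.T, Y=mcc_normal.T, metric="euclidean")
    wp = wp[::-1]  # warping path ordered from the first frame to the last
    return mcc_whisper[wp[:, 0]], mcc_normal[wp[:, 1]]  # aligned frame pairs
```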

MMSE DiscoGAN

This architecture is used to learn 1) the mapping between the cepstral features of the whispered speech (Xw) and the normal speech (Xs), and 2) the mapping between the converted cepstral features and the corresponding F0 of the normal speech.

Discover Generative Adversarial Network (DiscoGAN). W: whispered speech, S: normal speech.

In particular, we extended MMSE-GAN into MMSE DiscoGAN by including two generators, Gws and Gsw. Gws converts the parameters corresponding to the whispered speech (Xw) into Xws = Gws(Xw), such that Xws is indistinguishable from the parameters of the normal speech (Xs). Our model also contains two discriminators, Dw and Ds. The discriminator Dw attempts to distinguish between Xw, drawn from the distribution of whispered speech parameters (i.e., pw), and the converted parameters Xsw = Gsw(Xs), obtained by passing Xs, drawn from the distribution of normal speech parameters (i.e., ps), through the generator Gsw. Ds performs the analogous operation for the normal speech parameters (Xs). To map the whispered speech parameters to normal speech parameters, we rely on a regularized adversarial objective function using an MSE loss, defined as:

Adversarial Loss:

L_G: Generator Loss, L_D: Discriminator Loss
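The equation image is not reproduced here, so the following is a hedged reconstruction based on the definitions above (the exact form and the regularization weight λ are given in the paper), written for the whisper-to-speech direction; an analogous pair exists for Gsw and Dw:

```latex
% Plausible form only; exact definitions are in the paper.
L_{D_s} = -\,\mathbb{E}_{X_s \sim p_s}\big[\log D_s(X_s)\big]
          -\,\mathbb{E}_{X_w \sim p_w}\big[\log\big(1 - D_s(G_{ws}(X_w))\big)\big]

L_{G_{ws}} = \mathbb{E}_{X_w \sim p_w}\big[\log\big(1 - D_s(G_{ws}(X_w))\big)\big]
             + \lambda \,\big\lVert G_{ws}(X_w) - X_s \big\rVert_2^2
```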

Here, Gws, Gsw, Dw, and Ds must be trained jointly, with one significant modification: the inclusion of two reconstruction losses, Lw and Ls. This can be represented mathematically as:

Reconstruction Loss:

L_W: loss for reconstructing whispered speech, L_S: loss for reconstructing normal speech
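Again as a hedged sketch (the exact form is in the paper), the two cycle-style reconstruction losses can be written as:

```latex
% Plausible form only; exact definitions are in the paper.
L_W = \big\lVert G_{sw}\big(G_{ws}(X_w)\big) - X_w \big\rVert_2^2
L_S = \big\lVert G_{ws}\big(G_{sw}(X_s)\big) - X_s \big\rVert_2^2
```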

Minimizing the reconstruction loss, when converting the whispered speech parameters into normal speech parameters, forces the reconstructed whispered speech parameters to remain close to the original whispered speech parameters. Minimizing the adversarial loss, on the same conversion, forces the generated normal speech parameters to be close to real normal speech parameters. These two properties together encourage a one-to-one mapping between the two domains (the detailed training process is described in the paper). A minimal training sketch is given below.
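For intuition, here is a minimal, hypothetical PyTorch sketch of how these terms could be combined for the whisper-to-speech direction; the layer sizes, loss weights, and network shapes are assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative only: small MLP generators/discriminator over 25-D MCC frames,
# and one generator objective combining adversarial, MSE, and reconstruction terms.
def make_generator(dim=25, hidden=512):
    return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, dim))

def make_discriminator(dim=25, hidden=512):
    return nn.Sequential(nn.Linear(dim, hidden), nn.LeakyReLU(0.2),
                         nn.Linear(hidden, 1), nn.Sigmoid())

G_ws, G_sw = make_generator(), make_generator()  # whisper->speech, speech->whisper
D_s = make_discriminator()                       # discriminator for normal speech

def generator_loss(x_w, x_s, lam_mse=1.0, lam_rec=1.0):
    """x_w, x_s: DTW-aligned batches of MCC frames, shape (batch, 25)."""
    x_ws = G_ws(x_w)                             # converted normal-speech parameters
    d_out = D_s(x_ws)
    adv = F.binary_cross_entropy(d_out, torch.ones_like(d_out))  # fool D_s
    mse = F.mse_loss(x_ws, x_s)                  # MMSE regularizer (paired target)
    rec = F.mse_loss(G_sw(x_ws), x_w)            # reconstruction loss L_W
    return adv + lam_mse * mse + lam_rec * rec
```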

You can find the implementation of our work at the GitHub link below:

Results

We analyze the MMSE DiscoGAN architecture in terms of objective and subjective results. The Root Mean Square Error (RMSE) of log(F0) is used as the objective measure, since the key goal is to predict F0 accurately from the whispered speech. A more detailed analysis is presented in our paper; a minimal computation sketch is given below.
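For reference, here is a minimal sketch of how such a measure can be computed; the voicing mask and frame handling are assumptions, and the exact evaluation protocol is described in the paper:

```python
import numpy as np

# Illustrative only: RMSE between predicted and reference log(F0),
# evaluated on frames where both contours are voiced (F0 > 0).
def rmse_log_f0(f0_pred: np.ndarray, f0_ref: np.ndarray) -> float:
    voiced = (f0_pred > 0) & (f0_ref > 0)
    err = np.log(f0_pred[voiced]) - np.log(f0_ref[voiced])
    return float(np.sqrt(np.mean(err ** 2)))
```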

Subjective results are presented on the site below:

Final Comments

  1. WHSP2SPCH conversion can make IPAs or voice assistants more effective at recognizing whispered speech.
  2. This idea can improve the lives of patients who can only produce whispered speech due to pathological conditions.
  3. This work also provides a brief overview of applying ML in the domain of speech technology.
  4. In the future, the use of high-quality vocoders, such as WORLD or WaveNet, can further improve the voice quality of the converted speech signal.

Acknowledgments

Special thanks to Kim et al. for bringing the idea of DiscoGAN to the Computer Vision (CV) domain. Thanks to Google for giving us a platform, Machine Learning in Speech and Language Processing (MLSLP), to publish our research work. The author would also like to thank the Speech Research Lab, DA-IICT, Gandhinagar for providing the resources and venue to conduct the experiments.
