How to Boost Emotion Recognition Performance in Speech Using Contrastive Predictive Coding

John Hughes
Speechmatics
Sep 2, 2020 · 14 min read

In this article I will take you through how I developed an emotion recognition system using speech as an input and then boosted performance using self-supervised representations trained with Contrastive Predictive Coding (CPC). Results have improved from a baseline of 71% to 80% accuracy when using CPC. This is a significant relative reduction in error of 30%.

View the full code here.

In addition, I benchmarked various architectures for the model trained on these representations, including simple multilayer perceptrons (MLPs), Recurrent Neural Networks (RNNs) and WaveNet-style models that use dilated convolutions.

I found that a bi-directional RNN model using pre-trained CPC representations as input features was the highest performing setup, achieving 79.6% frame-wise accuracy when classifying eight emotions in the RAVDESS dataset. To the best of my knowledge, this is a very competitive system compared to others trained on this data.

Introduction

Emotion recognition from speech involves predicting someone’s emotion from a set of classes such as happy, sad, angry, etc. There are many potential applications in businesses such as call centres, health care and human resources [1]. For example, in a call centre, it would allow an automated way of discovering the sentiment of potential customers to guide a sales representative towards a better sales approach.

Predicting an emotion from audio is challenging, since emotions are perceived differently from person to person and can often be difficult to interpret. In addition, many emotional cues come from areas unrelated to speech such as facial expressions, the person’s particular mentality and the context of the interaction. As humans we naturally take all of these signals into account as well as our past communication experiences before making a final judgment. Some authors improve performance using multimodal approaches where audio is combined with text [3,4] or video [5]. Ideally, a world model that understands the links between these areas and social interactions (see World Scopes in [2]) would be trained for this task. However, this is an ongoing area of research and it is currently unclear how to learn from social interactions, rather than just learning trends from the data itself. In this work, I boost the performance by using self-supervised representations trained with a Contrastive Predictive Coding (CPC [8]) framework rather than multimodal training.

In the field of representational learning for speech, phone and speaker identification are widely used to evaluate features generated from self-supervised learning techniques since they evaluate local and global structure in the audio respectively. In this article, I demonstrate that emotion recognition can also be used as a downstream task for gauging the quality of representations. Furthermore, classifying emotions supplements phone and speaker identification when benchmarking how good representations are, since emotions only loosely depend on the words being said or how a person’s voice sounds.

Related work

Emotion recognition

The majority of emotion recognition systems [3,4,6] have been trained using Mel-Frequency Cepstral Coefficients (MFCCs) which are popular audio features based on a frequency spectrogram. Fbanks, also known as Mel spectrograms, are similar to MFCCs and are widely used. Both capture the frequency content that humans are sensitive to. There has been little work showing the performance gains when using machine learned features through self-supervised learning on the emotion recognition task. It is worth noting that MFCCs and Fbanks can still be used as an input to the self-supervised task instead of raw audio and can often be a good starting point when extracting richer representations. I will talk more about that later.
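
To make the distinction concrete, the snippet below is a minimal sketch of extracting both feature types with torchaudio; the library choice, file path and parameter values are illustrative rather than taken from the project.

```python
import torch
import torchaudio

# Load a 16kHz mono waveform; "speech.wav" is a placeholder path.
waveform, sample_rate = torchaudio.load("speech.wav")

# 80-dimensional log-Mel filterbank (Fbank) features with a 10ms hop (100Hz frame rate).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=80
)
fbanks = torch.log(mel(waveform) + 1e-6)   # shape: (1, 80, num_frames)

# 13 MFCCs derived from a comparable Mel spectrogram.
mfcc = torchaudio.transforms.MFCC(
    sample_rate=sample_rate, n_mfcc=13,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 80},
)
mfccs = mfcc(waveform)                     # shape: (1, 13, num_frames)
```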

Self-supervised learning

There are a variety of self-supervised techniques for speech. Self-supervised learning is ‘unsupervised’ in the sense that it takes advantage of the inherent structure of the data to generate labels. The motivation is the ability to use vast quantities of unlabelled audio data on the internet to generate general representations, in a similar way that language models learn from unlabelled text data. Ideally, this leads to needing less human-labelled data to get the same performance on a downstream task compared to a fully supervised approach. Less human-labelled data means, for example, that companies can avoid using expensive transcribers to get accurate audio transcripts for automatic speech recognition (ASR).

Relying purely on supervised learning has the perils of task specific solutions where the models may struggle to generalise across different domains, such as TV broadcasts and telephone calls, or across different noisy environments. Furthermore, supervised learning tends to ignore the rich underlying structure of audio, which self-supervised learning takes advantage of.

There are two main forms of self-supervised learning [7]:

  • Generative — focuses on minimising a reconstruction error; the loss is therefore measured in the output space.
  • Contrastive — strives to single out a positive sample from a set of distractors that correspond to different segments of audio. The loss is measured in the representation space.

Popular approaches

A popular generative self-supervised approach is Autoregressive Predictive Coding (APC [9]). Once the raw audio is converted to Fbanks, the task is to predict the feature vector N time steps in the future given the features before that time step, where the range 1 ≤ N ≤ 10 gives rise to good representations. The past context is summarised by a recurrent neural network (RNN) or transformer [10] and the activations of the final layers are taken as the representations to be used. The loss is mean squared error with respect to the reference. Recent work adds a vector quantised layer [11] to improve results on phone/speaker identification further.
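
As a rough illustration of this generative objective, here is a minimal APC-style sketch assuming Fbank features of shape (batch, time, dim); the model and layer sizes are illustrative rather than the architecture from [9].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAPC(nn.Module):
    """Toy APC-style model: summarise the past with an RNN, predict a future frame."""
    def __init__(self, feat_dim=80, hidden=512):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)   # causal summary of the past
        self.proj = nn.Linear(hidden, feat_dim)                  # predict an Fbank frame

    def forward(self, feats, n=3):
        # feats: (batch, time, feat_dim) Fbank features
        context, _ = self.rnn(feats)
        preds = self.proj(context)
        # Compare the prediction made at time t with the reference frame at t + n.
        return F.mse_loss(preds[:, :-n], feats[:, n:])

apc = TinyAPC()
loss = apc(torch.randn(8, 128, 80), n=3)   # random tensors stand in for real Fbanks
loss.backward()
```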

One form of contrastive self-supervised learning, and the one used in this work, is CPC. The raw data is encoded to a latent space and the task is to classify positive and negative samples correctly in this space. The loss used here is called InfoNCE. A more detailed summary of CPC is in the next section.

Other competitive methods include Momentum Contrast (MoCo [12]) and Problem-Agnostic Speech Encoder (PASE [13]). The latter uses emotion recognition as one of the workers to push relevant information into the representations.

Contrastive Predictive Coding

An overview of CPC is given in Figure 1 and this section describes how it works. I have provided some screenshots of PyTorch code to illustrate the method — these are simplified compared to the full code given in the project repository.

Figure 1: Overview of CPC as a representational learning approach for audio. Figure from [8].

Firstly, raw audio samples, x, at 16kHz are passed through an encoder (g_enc), which down-samples the audio by a factor of 160 using multiple convolutional layers. Therefore, the output frequency of the encoder is 100Hz. Alternatively, the encoder can be replaced with a multi-layer perceptron that operates on Fbank features, which are already at 100Hz. The latter method is used as a baseline in [9] and is adopted in this work, since I found it gives small gains in performance when using the learnt features for downstream tasks.

The outputs of the encoder, z, in the latent space are fed into an autoregressive model g_ar (such as an RNN). This stage outputs c at each time step, which combines the information in all previous latents. The forward method in Figure 2 illustrates how this is implemented in PyTorch.

Figure 2: Initialisation and forward pass of the CPC model. View the full code here.
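
For a concrete picture of the g_enc/g_ar structure, here is a simplified CPC-style model, assuming the Fbank/MLP encoder variant described above; the layer sizes follow the Method section later in this article, but the exact implementation lives in the repository.

```python
import torch
import torch.nn as nn

class CPCModel(nn.Module):
    """Simplified CPC model: Fbank frames -> MLP encoder (z) -> GRU context (c)."""
    def __init__(self, feat_dim=80, z_dim=512, c_dim=256):
        super().__init__()
        # g_enc: a frame-wise MLP operating on Fbanks (instead of a strided conv encoder)
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, z_dim), nn.BatchNorm1d(z_dim), nn.ReLU(),
            nn.Linear(z_dim, z_dim), nn.BatchNorm1d(z_dim), nn.ReLU(),
            nn.Linear(z_dim, z_dim),
        )
        # g_ar: autoregressive model that summarises all previous latents
        self.g_ar = nn.GRU(z_dim, c_dim, batch_first=True)

    def forward(self, fbanks):
        # fbanks: (batch, time, feat_dim) at 100Hz
        b, t, d = fbanks.shape
        z = self.encoder(fbanks.reshape(b * t, d)).reshape(b, t, -1)   # latents z
        c, _ = self.g_ar(z)                                            # context vectors c
        return z, c
```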

Now, at a specific time step t, a linear transform of c is applied with a weight matrix associated with how far ahead you are predicting (shown by the dotted lines in Figure 1 and by the code on line 36 in Figure 3). Next, these linear transforms are multiplied by the actual future latent z to give a log density ratio. The density ratio is defined in the equation below:
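
f_k(x_{t+k}, c_t) = \exp\left( z_{t+k}^{\top} W_k \, c_t \right)

where W_k is the weight matrix associated with predicting k steps ahead.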

A Softmax layer is applied over the log density ratio from a positive sample and many negative samples. The aim is to increase the probability of the positive sample or, in other words, be able to understand that the positive latent is the one that is most related to the history. This is repeated k times as shown by the loop over each time step in Figure 3.

“The training objective requires identifying the correct quantized latent audio representation in a set of distractors” — excerpt from wav2vec 2.0 [14]

The loss in CPC is known as InfoNCE, given in the equation below, and it is the same as the categorical cross-entropy of classifying the positive sample correctly.
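
\mathcal{L}_N = -\, \mathbb{E}_{X} \left[ \log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)} \right]

where X is a set containing one positive sample and N − 1 negative samples [8].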

In practice the loss is calculated from the summed log probability of the positive samples within the batch (line 57 in Figure 3). The encoder, RNN and weight matrices are all trained jointly by minimising this loss.

Figure 3: Illustration of how the NCE loss is calculated given z and c. View the full code here.
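
For a concrete picture, here is a simplified sketch of how an InfoNCE-style loss can be computed for a single time step t and prediction offset k, using the other sequences in the batch as negatives; variable names are illustrative rather than taken from the repository.

```python
import torch
import torch.nn.functional as F

def info_nce_step(z, c, W_k, t, k):
    """InfoNCE for one time step t and one prediction offset k (illustrative sketch).

    z:   (batch, time, z_dim) encoder latents
    c:   (batch, time, c_dim) context vectors
    W_k: an nn.Linear(c_dim, z_dim), the transform for predicting k steps ahead
    """
    pred = W_k(c[:, t])                  # linear transform of c_t: (batch, z_dim)
    target = z[:, t + k]                 # the true future latents z_{t+k}: (batch, z_dim)
    logits = pred @ target.T             # log density ratios between all pairs: (batch, batch)
    # For row i the positive sample is column i; the other columns act as negatives,
    # so InfoNCE reduces to categorical cross-entropy with "diagonal" labels.
    labels = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, labels)
```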

The code given in Figure 4 shows an example of initialising the model, passing data through it and calculating the InfoNCE loss at multiple points in the sequence. It is worth understanding that there are loops over the time step ‘t’ in the sequence to predict from and also over the number of time steps ‘k’ to predict ahead.

Figure 4: Passing data through the model and calculating the final loss
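
Putting the two sketches above together, an unvectorised, illustrative training pass might look like this; a real implementation would typically vectorise these loops and sample time steps rather than using all of them.

```python
import torch
import torch.nn as nn

K = 12                                                   # how many steps ahead to predict
model = CPCModel()
predictors = nn.ModuleList([nn.Linear(256, 512) for _ in range(K)])   # one W_k per offset

fbanks = torch.randn(64, 128, 80)                        # a batch of 1.28s Fbank windows
z, c = model(fbanks)

losses = []
for t in range(z.size(1) - K):                           # time steps 't' to predict from
    for k in range(1, K + 1):                            # number of steps 'k' to predict ahead
        losses.append(info_nce_step(z, c, predictors[k - 1], t, k))
loss = torch.stack(losses).mean()
loss.backward()
```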

Datasets

The CPC model is pre-trained on a 100-hour subset of the Librispeech dataset [17], comprising 16kHz English speech.

The dataset used for the emotion recognition task is called “The Ryerson Audio-Visual Database of Emotional Speech and Song” (RAVDESS, [16]). Only the speech dataset is considered in my work. It comprises 24 speakers with an even split of male and female actors. There are eight emotions used when reading out specific sentences, namely: neutral, calm, happy, sad, angry, fearful, surprise, and disgust.

I chose to split the final two actors evenly between the validation and test sets. In addition, audio files from the other actors are randomly chosen and added to these sets to ensure 80% of the data remains in the training set, leading to a classic 80:10:10 split.

Related work in [6] uses the RAVDESS data and chooses to classify male and female emotions separately. Furthermore, they illustrate results when limiting the number of emotion classes, which makes the task easier. In this work, I kept the task true to the original data, so the models are trained to classify all eight emotions.

Method

CPC system

The self-supervised pre-training step used to generate features for emotion recognition training is outlined in [9] as the CPC baseline. This is used instead of the convolutional feature encoder that operates on the raw waveform as in [8], since I found it marginally improves results. The standard 80-dimensional Fbanks were used as input features, which are passed through a 3-layer MLP encoder with a hidden size of 512, batch norm and ReLU activations. The output from the feature encoder (z) is fed through a single GRU layer with an output size of 256 to give the contextual feature vectors (c). These are used as representations for training the emotion recognition model.

The CPC system was trained with a window size of 128 frames (corresponding to 1.28s, since Fbanks are at 100Hz), a batch size of 64 and 500k steps. This amounts to approximately 114 epochs of the Librispeech 100-hour dataset (also used in [8]). The RAdam optimizer was used with a flat learning rate of 4e-4 for the first two thirds of training, before being cosine annealed to 1e-6. A prediction horizon of 12 timesteps into the future (k = 12) was used, since this has been shown to give the highest accuracy when discriminating the positive from the negative samples in the CPC task [8].
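
For illustration, a hand-rolled version of this flat-then-cosine schedule could look like the sketch below; torch.optim.RAdam is assumed to be available (it is in recent PyTorch releases), and the exact schedule implementation in the repository may differ.

```python
import math
import torch

total_steps = 500_000
flat_steps = total_steps * 2 // 3            # flat for the first two thirds of training
base_lr, min_lr = 4e-4, 1e-6

# "model" here would be the CPC model (plus the prediction matrices) defined earlier.
optimizer = torch.optim.RAdam(model.parameters(), lr=base_lr)

def lr_at(step):
    """Flat learning rate, then cosine annealed down to min_lr."""
    if step < flat_steps:
        return base_lr
    progress = (step - flat_steps) / (total_steps - flat_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in range(total_steps):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
```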

Emotion recognition system

In addition, I used a variety of architectures for the emotion recognition model in order to investigate the accessibility of the representations as well as to push the performance of the system. The list below gives more details on the seven architectures used. All models have global normalisation applied to the input features and the code can be found here.

  • Linear — single linear layer.
  • MLP-2 — multilayer perceptron with 2 blocks. Each block contains a linear layer (hidden size 1024), batch norm, ReLU activation and dropout (prob 0.1).
  • MLP-4 — same as above but with 4 blocks.
  • RNN (uni-dir) — 2 layers, not bi-directional, hidden size 512, dropout prob 0.1
  • Convolutional — same structure as [6], with 6 convolutional layers, ReLU activations, dropout prob 0.1 and a max pooling layer.
  • WaveNet — dilated convolutional structure that grows exponentially as in [15]. Hyperparameters are hidden size 64, dilation depth 6, number of repeats 5, kernel size 2.
  • RNN (bi-dir) — same as the RNN above but bi-directional (see the sketch after this list).
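
The sketch below shows roughly what the best-performing RNN (bi-dir) classifier looks like; a GRU is assumed here and the sizes mirror the list above, but the exact implementation is in the repository.

```python
import torch
import torch.nn as nn

class BiRNNEmotionClassifier(nn.Module):
    """Sketch of the RNN (bi-dir) classifier over CPC features (a GRU is assumed)."""
    def __init__(self, feat_dim=256, hidden=512, num_emotions=8):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True,
                          bidirectional=True, dropout=0.1)
        self.head = nn.Linear(2 * hidden, num_emotions)    # 2x hidden for both directions

    def forward(self, feats):
        # feats: (batch, time, feat_dim) CPC context vectors (or Fbanks for the baseline)
        out, _ = self.rnn(feats)
        return self.head(out)                              # frame-wise logits: (batch, time, 8)
```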

Each model was trained using a window size of 1024 frames (10.24s), a batch size of 8 and a total of 40k steps. A frame-wise cross-entropy loss was used over the eight emotions. The optimizer and learning rate are unchanged compared to CPC training; however, the learning rate schedule is turned off.
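
A minimal sketch of the frame-wise objective is shown below (shapes illustrative), noting that PyTorch's cross-entropy expects the class dimension second for sequence targets.

```python
import torch
import torch.nn.functional as F

clf = BiRNNEmotionClassifier()
features = torch.randn(8, 1024, 256)        # a batch of 10.24s windows of CPC features
labels = torch.randint(0, 8, (8, 1024))     # one emotion label per frame

logits = clf(features)                      # (batch, time, 8)
# F.cross_entropy expects (batch, classes, time) when the targets are sequences.
loss = F.cross_entropy(logits.transpose(1, 2), labels)
loss.backward()
```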

A baseline emotion recognition model without CPC pre-training, that uses Fbanks as feature vectors, is used for comparison throughout my analysis.

Results

Impact of CPC

A linear architecture is often used in the self-supervised learning literature to illustrate the accessibility of representations. In this work, I wanted to show that there is a boost even for more complex architectures, such as a WaveNet-style model with dilated convolutions or a bi-directional RNN. Table 1 shows the frame-wise accuracy when using Fbanks and CPC features for each of the proposed architectures. In each case, CPC features result in an increase in accuracy when classifying the emotions in speech, irrespective of the architecture. The average relative reduction in error is 21.7%; in other words, over a fifth of the errors disappear.
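
For reference, the relative error reduction quoted here and in the introduction is computed as:

\text{relative error reduction} = \frac{E_{\text{Fbank}} - E_{\text{CPC}}}{E_{\text{Fbank}}}, \qquad E = 1 - \text{accuracy}

For the headline improvement from 71% to 80% accuracy this gives (0.29 − 0.20)/0.29 ≈ 0.31, i.e. roughly the 30% quoted in the introduction.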

Table 1: Frame-wise accuracy of classifying eight emotions from the RAVDESS dataset with a variety of model architectures when using CPC features instead of Fbanks.

It is worth noting that, since the CPC representations have a larger feature dimension compared to Fbanks, emotion recognition models trained with CPC have an increased parameter count. However, after running some tests matching parameter counts, I established that the trend of Fbanks being outperformed still holds and the gap only narrowed a small amount.

Impact of architecture

The three worst performing models in Table 1 do not utilise information across time — they attempt to classify the emotion given the representation of one frame only. Models that use a uni-directional RNN or convolutional layers can take extra context into account, which makes a large difference, especially when using Fbanks. The WaveNet-style model has a much larger receptive field compared to the normal convolutional model and this boosts performance further. One reason could be that it can look into the future, since the convolutions are not causally masked. Similarly to the WaveNet model, the bi-directional RNN can use context from the future, and this architecture leads to the best emotion recognition performance when coupled with the CPC features. The frame-wise accuracy was 79.6% on the RAVDESS test set. As far as I am aware, this is state of the art for this task when classifying all eight emotions.

Individual emotions

Table 2 shows the frame-wise F1 scores for each of the emotions classified in the test set. The model is best at recognising the actors speaking in disgusted and surprised voices. Happy and neutral are the emotions it performs worst on. This is likely because these emotions are less expressive, so the model finds them more difficult to classify.

Table 2: F1 scores for each emotion in the RAVDESS dataset achieved by using the RNN (bi-dir) model

Future work

Future work could include swapping out the RNN in the CPC system with a transformer (this is done in [14]). This would allow me to scale up the CPC model and utilise more unlabelled data from a range of sources beyond Librispeech. In addition, data augmentation could be added to the emotion recognition data to improve robustness and perhaps boost results further.

Conclusion

Self-supervised learning, such as CPC, can be used to significantly reduce errors in the domain of emotion recognition in speech. A variety of architectures were tested in my work and I found the bi-directional RNN — which can utilise future context — led to the best performing model.

This work is useful for benchmarking and improving speech representations trained using CPC, as well as boosting performance when classifying multiple emotions. This is exciting because it provides the building blocks for systems that can more reliably predict the sentiment of a person speaking. For example, it could lead to significantly improved analysis tools in call centres, used to help agents build their skills and improve the customer experience.

References

[1] Ghosal, D., Majumder, N., Poria, S., Chhaya, N. and Gelbukh, A., 2019. Dialoguegcn: A graph convolutional neural network for emotion recognition in conversation. arXiv preprint arXiv:1908.11540.

[2] Bisk, Y., Holtzman, A., Thomason, J., Andreas, J., Bengio, Y., Chai, J., Lapata, M., Lazaridou, A., May, J., Nisnevich, A. and Pinto, N., 2020. Experience grounds language. arXiv preprint arXiv:2004.10151.

[3] Cai, L., Hu, Y., Dong, J. and Zhou, S., 2019. Audio-Textual Emotion Recognition Based on Improved Neural Networks. Mathematical Problems in Engineering, 2019.

[4] Yoon, S., Byun, S. and Jung, K., 2018, December. Multimodal speech emotion recognition using audio and text. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 112–118). IEEE.

[5] Ortega, J.D., Cardinal, P. and Koerich, A.L., 2019, October. Emotion recognition using fusion of audio and video features. In 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC) (pp. 3847–3852). IEEE.

[6] Chu, R., 2019. Speech Emotion Recognition with Convolutional Neural Network. https://towardsdatascience.com/speech-emotion-recognition-with-convolution-neural-network-1e6bb7130ce3

[7] Anand A., 2020. Contrastive Self-Supervised Learning. https://ankeshanand.com/blog/2020/01/26/contrative-self-supervised-learning.html

[8] Oord, A.V.D., Li, Y. and Vinyals, O., 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

[9] Chung, Y.A., Hsu, W.N., Tang, H. and Glass, J., 2019. An unsupervised autoregressive model for speech representation learning. arXiv preprint arXiv:1904.03240.

[10] Chung, Y.A. and Glass, J., 2020, May. Generative pre-training for speech with autoregressive predictive coding. In ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3497–3501). IEEE.

[11] Chung, Y.A., Tang, H. and Glass, J., 2020. Vector-Quantized Autoregressive Predictive Coding. arXiv preprint arXiv:2005.08392.

[12] Ding, K., He, X. and Wan, G., 2020. Learning Speaker Embedding with Momentum Contrast. arXiv preprint arXiv:2001.01986.

[13] Pascual, S., Ravanelli, M., Serrà, J., Bonafonte, A. and Bengio, Y., 2019. Learning problem-agnostic speech representations from multiple self-supervised tasks. arXiv preprint arXiv:1904.03416.

[14] Baevski, A., Zhou, H., Mohamed, A. and Auli, M., 2020. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv preprint arXiv:2006.11477.

[15] Oord, A.V.D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. and Kavukcuoglu, K., 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

[16] Livingstone, S.R. and Russo, F.A., 2018. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE, 13(5), e0196391.

[17] Panayotov, V., Chen, G., Povey, D. and Khudanpur, S., 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5206–5210). IEEE.



John Hughes is a Machine Learning Engineer at Speechmatics. Speechmatics is a machine learning company specialising in speech recognition technology.