A simple technique for fairer speech emotion recognition

Sep 3, 2019

Cogito’s algorithms are used to generate real-time feedback and guidance for call center agents, helping them become more self-aware and empathetic in how they approach and handle customer conversations. This promotes positive behavior change, leading to improvements in key performance indicators for companies across various industries. In order to deliver such a system at scale, it is essential to ensure that the machine learning models powering the product are fair across a large set of speakers from a wide variety of backgrounds. As a result, the topic of mitigating unfairness, or bias, is of central concern to our machine learning research and development efforts.

This September, at the Interspeech 2019 conference in Graz, Austria, Cogito researchers will present a paper on gender bias in speech emotion recognition. We propose a simple model training procedure that is both effective at mitigating bias and more stable during training than a highly cited baseline method.

We use a standard neural network model, based on 2D convolutional layers applied to Mel frequency coefficients, trained to recognize emotional activation on a dataset of 33,000+ naturally occurring utterances from radio shows. With this model we demonstrate that performance is more favorable for male speakers than for female speakers. A popular de-biasing approach, previously proposed by researchers at Google and Stanford (see paper), is found to be effective at improving fairness across gender, but at the cost of highly unstable model training and reduced accuracy. In the figure below, this “adversarial” approach produces a wide variance in accuracy metrics across multiple training rounds. Our method (referred to as “non-adversarial” in the figure below), on the other hand, produces consistently high accuracy while also mitigating the effects of gender bias.

Figure: Distributions of validation set concordance correlation coefficient (CCC) for recognizing emotional activation plotted as a function of de-biasing method. Distributions are produced by simply varying the random seed used for weight initialization and stochastic training operations.
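For readers unfamiliar with the metric in the figure, here is a minimal sketch of the concordance correlation coefficient in its commonly used form (Lin's CCC with population variances). The function name `concordance_cc` is our own for illustration, and the exact variant used in the paper is an assumption.

```python
from statistics import fmean


def concordance_cc(y_true, y_pred):
    """Lin's concordance correlation coefficient for two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_t, mean_p = fmean(y_true), fmean(y_pred)
    # Population (divide-by-n) variances and covariance.
    var_t = fmean((t - mean_t) ** 2 for t in y_true)
    var_p = fmean((p - mean_p) ** 2 for p in y_pred)
    cov = fmean((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    # The squared mean difference penalizes systematic offsets,
    # which plain Pearson correlation would ignore.
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both sequences are the same constant")
    return 2 * cov / denom
```

Unlike Pearson correlation, CCC drops below 1 when predictions are perfectly correlated with the labels but shifted or rescaled, which is why it is a common choice for continuous emotion dimensions such as activation.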

A promising aspect of our proposed training procedure is that it can be applied to arbitrary variables, therefore serving as an effective strategy for mitigating bias against a host of sensitive demographic categories beyond gender.
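To make the idea of "arbitrary variables" concrete on the evaluation side, the sketch below measures how a score differs across the values of any categorical attribute. It does not reproduce our training procedure. The helper names (`per_group_scores`, `max_gap`, `mean_abs_error`), the toy data and the group labels are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def per_group_scores(y_true, y_pred, groups, metric):
    """Apply `metric` separately to the examples carrying each group label."""
    buckets = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    return {g: metric(ts, ps) for g, (ts, ps) in buckets.items()}


def max_gap(scores):
    """Largest difference between any two groups' scores."""
    values = list(scores.values())
    return max(values) - min(values)


def mean_abs_error(y_true, y_pred):
    # Stand-in metric for the demo; any (y_true, y_pred) -> float function works.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: "a" and "b" are placeholders for the values of a sensitive attribute.
y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.2, 0.4, 0.5, 0.6]
groups = ["a", "a", "b", "b"]

scores = per_group_scores(y_true, y_pred, groups, mean_abs_error)
gap = max_gap(scores)
```

Because the grouping key is just a label, the same check can be run for gender, age band, accent, or any other attribute for which annotations exist.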

Of course, machine learning techniques are only one part of the solution. Consequently, our research continues to focus on developing more holistic protocols which encompass data collection, human annotation and production monitoring of model accuracy, in order to deliver fairer machine learning models at scale.

Cogito at Interspeech

If you would like to learn more about the machine learning research and development at Cogito, our researchers at Interspeech will be happy to meet for a coffee and a chat. Cogito is also a proud sponsor of the Workshop for Young Female Researchers in Speech Science & Technology (YFRSW) again this year, which runs as a satellite workshop in Graz just before Interspeech and is well worth attending.

Related articles

● Attention-based Sequence Classification for Affect Detection — Interspeech 2019 paper

● Towards trustworthy signal processing and machine learning — ICASSP 2019 conference summary

● Redefining signal processing for audio and speech technologies — ICASSP 2018 conference summary

● Robots, Deep Neural Networks and the Future of Speech — Interspeech 2017 conference summary

● Adversarial de-biasing — GitHub repository

Written by Cogito Corporation