Emotion in the AIr
In his famous novel Le dernier jour d’un condamné, Victor Hugo writes that “words fail emotions”. Often subtle in their expression, sometimes difficult to define but essential, emotions are a driving force of non-verbal communication.
“The character of the human voice, under the influence of various emotions, has been discussed by Mr. Herbert Spencer in his essay on Music. He clearly shows that the voice alters much under different conditions, in loudness and in quality, that is, in resonance and timbre, pitch and intervals.” (Charles Darwin)
The above passage illustrates the variety of non-verbal language and all its possibilities. Non-verbal cues make communication rich in nuance, and being deprived of them makes interacting with others significantly harder.
Being able to identify your interlocutor’s emotional state makes it easier to get on the same page. It means understanding what the person is sensitive to. It gives you the opportunity to suggest an idea to them and let it germinate until it matures. This is an undeniable position of strength, but one that is sometimes lacking in jobs such as telemarketing, where the distance from the interlocutor makes the task more complex.
One branch of machine learning focuses on developing models capable of addressing this problem: SER (Speech Emotion Recognition). These models are evaluated on their ability to identify an emotional state in speech independently of its verbal content. The most common approach in the state of the art is supervised learning based on frequency analysis: training a model on spectrograms extracted from labeled audio files.
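In practice, that pipeline boils down to a few steps. Below is a minimal sketch, assuming librosa for the spectrograms and PyTorch for the classifier; the file path, label count, and tiny CNN are illustrative placeholders, not the models discussed in this article.

```python
# Minimal sketch of the common SER pipeline: labeled audio -> log-mel spectrogram -> classifier.
import numpy as np
import librosa
import torch
import torch.nn as nn

def log_mel_spectrogram(path: str, sr: int = 16000, n_mels: int = 64) -> np.ndarray:
    """Load an audio file and turn it into a log-scaled mel spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)   # shape: (n_mels, n_frames)

class TinyEmotionCNN(nn.Module):
    """A deliberately small CNN over spectrograms, standing in for a real SER architecture."""
    def __init__(self, n_emotions: int = 6):      # 6 labels is an assumption, not a standard
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_emotions)

    def forward(self, x):                          # x: (batch, 1, n_mels, n_frames)
        return self.classifier(self.features(x).flatten(1))

# Usage (hypothetical file name):
#   spec = log_mel_spectrogram("clip.wav")
#   logits = TinyEmotionCNN()(torch.tensor(spec)[None, None].float())
```

Training then proceeds as ordinary supervised classification, with a cross-entropy loss over the emotion labels attached to each clip.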
Some people talk more, others less. Some calls are made in quiet surroundings, others in the omnipresent hubbub of the great Parisian boulevards. It is therefore common to standardize the length of the audio clips processed by SER models and to use data augmentation to teach the models to handle noisy data: adding random noise to the spectrograms in order to train on a larger dataset that is less prone to bias.
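These two preprocessing steps can be sketched in a few lines; the target frame count and noise level below are assumed values chosen for illustration, not the settings used in any particular study.

```python
# Sketch of the preprocessing described above: fixing the clip length and
# injecting random noise into the spectrogram to create extra training samples.
import numpy as np

def fix_length(spec: np.ndarray, n_frames: int = 300) -> np.ndarray:
    """Pad (with the spectrogram's floor value) or truncate to a fixed number of frames."""
    if spec.shape[1] >= n_frames:
        return spec[:, :n_frames]
    pad = n_frames - spec.shape[1]
    return np.pad(spec, ((0, 0), (0, pad)), mode="constant", constant_values=spec.min())

def add_noise(spec: np.ndarray, noise_level: float = 0.05, seed: int | None = None) -> np.ndarray:
    """Sprinkle Gaussian noise over the spectrogram to simulate a noisier recording."""
    rng = np.random.default_rng(seed)
    return spec + rng.normal(0.0, noise_level * spec.std(), size=spec.shape)

# Each original clip can yield several noisy variants, enlarging the training set:
#   augmented = [add_noise(fix_length(s)) for s in spectrograms for _ in range(3)]
```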
However, this approach is not without its pitfalls. Research at Quant showed that SER models generalize poorly and struggle to do better than chance on new data. Because processing a large number of audio files is resource intensive, SER work commonly focuses on a single dataset. Yet, despite similar labeling, the most used SER datasets (CREMA-D, RAVDESS, TESS, SAVEE…) differ notably in how emotions are expressed: biases created by repetition of the same sentence (TESS, RAVDESS), a strong British accent (SAVEE), a time offset (RAVDESS), poor recognition rates by human listeners (CREMA-D), and overall different ways of speaking and expressing the same emotions. These biases lead to significant dataset-to-dataset differences in the extracted features, and applying correction techniques comes at the cost of substantial information loss. Speech emotion recognition needs to be approached from a different angle: state-of-the-art SER models simply fail to capture the core differences between emotions. It just doesn’t work.
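One way to make that generalization gap visible is a cross-corpus check: train on one dataset’s features and test on another’s. The sketch below assumes hypothetical feature arrays (X_crema_train, X_ravdess, and so on, e.g. flattened spectrograms) and uses a plain scikit-learn classifier as a stand-in; it is the evaluation protocol that matters here, not the model.

```python
# Cross-corpus evaluation sketch: a large gap between in-domain and cross-corpus
# accuracy is the failure mode described above.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def cross_corpus_accuracy(X_train, y_train, X_test, y_test) -> float:
    """Train on one corpus, evaluate on another sharing the same label set."""
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

# Hypothetical usage, comparing in-domain and cross-corpus scores:
#   in_domain = cross_corpus_accuracy(X_crema_train, y_crema_train, X_crema_test, y_crema_test)
#   cross     = cross_corpus_accuracy(X_crema_train, y_crema_train, X_ravdess,    y_ravdess)
```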
Nonetheless, they give promising results when predicting emotion intensity and nuance! Our testing showed that these models perform surprisingly well when used not to predict your emotional state, but to pick up subtle differences in the expression of similar emotions. And if someday you come back home just that much sadder than usual and slightly anxious, your family won’t notice, but an SER model absolutely will.