Using Machine Learning To Derive Emotions From Speech

Frederik Calsius · Published in jstack.eu · Sep 10, 2019

In a previous post we discussed how to use PoseNet and the tf-js Face-API to get an understanding of people's body postures and facial expressions. In this post we'll discuss how we can classify a person's emotions based on their word choice.

Project Introduction

As mentioned in the last post, the systems proposed in this series will be combined into one system that can assist people in preparing for their job interview. After body language, verbal communication is considered the most important form of communication. Therefore, it is important to know whether we are communicating our message in a positive manner.

In this post, we’ll discuss the approach we’ve taken to find out whether the message of the speaker is positive, negative, or neutral.

Positive, neutral, negative

Emotions From Speech

Deriving human emotions from speech has been a popular field of research in recent years, because humans tend to be very good at recognizing emotions in other people’s voices. Most often, researchers distinguish between happiness, sadness, fear, and anger. However, for the purpose of this project, we think it is sufficient to find out whether somebody is delivering his/her message in a positive, negative, or neutral manner. The classification into these three categories will happen based on the word choice of the human subject.

Difficulties In Deriving Emotions From Speech

Even though humans are good at picking up emotions from somebody’s voice, it is a lot harder for computers to do so. First, you’d need to know which features to look for; second, you’d need to find out how to measure them and what they mean. Since everybody has a unique vocal tract, speech signals can vary a lot. It is also more difficult to deal with noise, or to get a clear signal from just one person when multiple people in the same room are speaking.

To stay within the scope of job interviews, we will assume that our audio recording contains multiple speakers. To distinguish between different speakers, a method called speaker diarization is used. To make the system as portable as possible, it was important that the system did not need any training to recognize a specific person’s voice.

IBM speech-to-text

Since we didn’t want to reinvent the wheel if it wasn’t necessary, we looked online for pre-made solutions. IBM offers a speech-to-text solution that does exactly what we need: it takes care of both the speaker diarization and the speech-to-text transcription.
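To give an idea of what a transcription request looks like, here is a minimal sketch using IBM’s official ibm-watson Python SDK. The API key, service URL, and audio file name are placeholders for your own credentials and recording; the post itself doesn’t prescribe a specific setup.

```python
# Minimal sketch: request a transcript with speaker labels (diarization)
# enabled, using the ibm-watson SDK. Credentials and file are placeholders.
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator("YOUR_API_KEY")
speech_to_text = SpeechToTextV1(authenticator=authenticator)
speech_to_text.set_service_url("YOUR_SERVICE_URL")

with open("interview.wav", "rb") as audio_file:
    response = speech_to_text.recognize(
        audio=audio_file,
        content_type="audio/wav",
        speaker_labels=True,  # turns on speaker diarization
    ).get_result()
```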

Emotions From Text

Once we have received the text transcript through the IBM API, we can use this transcript to analyse whether the speaker is using positive, neutral, or negative words.
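Continuing the sketch above, pulling the plain text out of the response could look like this. The shape of the JSON follows IBM’s documented response format: one result per utterance, each with ranked alternatives.

```python
# Join the top alternative of every utterance into a single transcript.
# With speaker_labels=True, response["speaker_labels"] additionally maps
# word timings to speaker ids, which is what the diarization step uses.
transcript = " ".join(
    result["alternatives"][0]["transcript"].strip()
    for result in response["results"]
)
```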

To find this out, a logistic regression model is used. This model is trained on the Twitter US airline sentiment dataset. This isn’t a perfect dataset for our case, since people tend to tweet differently from how they speak. However, since the dataset came cleaned and prepared for sentiment analysis, it was sufficient for our proof of concept.
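As a rough sketch of this training step: the column names below follow the Kaggle release of the airline dataset (Tweets.csv), and the TF-IDF features are our own choice of text representation, not something the post prescribes.

```python
# Sketch: train a sentiment classifier on the Twitter US airline dataset.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("Tweets.csv")  # labels: positive / neutral / negative
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["airline_sentiment"], test_size=0.2, random_state=42
)

# Turn raw tweets into TF-IDF feature vectors.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)
print("accuracy:", model.score(X_test_vec, y_test))
```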

Logistic Regression

Even though logistic regression is actually a binary classifier, it works perfectly for our case, since we’re dealing with mutually exclusive classes.

If you want to know more about how logistic regression for multiple classes works, I’d highly recommend this video from Andrew - the legend - Ng. The approach we have taken for multi-class classification is the so-called one-vs-rest approach. With this approach, you train one binary classifier per class: that class provides the positive examples, and all the other classes provide the negative examples. At prediction time, every classifier scores the input, and the class whose classifier returns the highest probability wins, as the sketch below shows.
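Here is what one-vs-rest looks like when written out by hand, reusing the TF-IDF features and labels from the training sketch above (scikit-learn can also do this internally, but spelling it out shows the mechanics).

```python
# Explicit one-vs-rest: one binary logistic regression per class, each
# trained to separate its own class from all the others.
import numpy as np
from sklearn.linear_model import LogisticRegression

classes = ["negative", "neutral", "positive"]
classifiers = {}
for cls in classes:
    binary_labels = (y_train == cls).astype(int)  # this class vs. the rest
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train_vec, binary_labels)
    classifiers[cls] = clf

# Every classifier scores the input; the highest probability wins.
scores = np.column_stack(
    [classifiers[cls].predict_proba(X_test_vec)[:, 1] for cls in classes]
)
predictions = [classes[i] for i in scores.argmax(axis=1)]
```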

One vs. All Logistic Regression

Recap of emotional analysis from text

For this module, most of the heavy lifting is done by the IBM speech-to-text API. This API takes care of speaker diarization (recognizing who is speaking when) as well as the transcription of the audio file.

For the actual emotion analysis, we programmed a multi-class logistic regression model that uses the one-vs-all approach. Logistic regression is a simple algorithm, but it works very well for our case.
