
Do Machines Understand Human Emotions?šŸ¤Ø

Kuntal Das · Published in The Startup · Jan 9, 2021

Well, I would say ā€œThey canā€. You can argue, but can you prove that they donā€™t? In this post, Iā€™ll show you how to make a machine understand our emotions šŸ˜ƒ with PyTorch.

While Boston Dynamics is busy making their robots dance, I wanted to make my own AI that would play all the games for me so Iā€™d win every time, because I wasnā€™t good at any mobile game at that timešŸ˜…. So I started learning about AI, and somehow I ended up taking a free course on Jovian, ā€œDeep Learning with PyTorch: Zero to GANā€, which required me to do a project. I wanted to do something unique. Well, Iā€™m not quite there yet on an AI that plays games.

Boston Dynamics robots dancing to ā€œDo You Love Me?ā€ by The Contours on YouTube (source: Boston Dynamics)

What now? Miraculously, around that time Destin put out a video on his YouTube channel: How Sonar Works (Submarine Shadow Zone) ā€” Smarter Every Day 249. Thatā€™s when working with audio came to my mind, and the video actually helped me a lot more later on.

Choosing Data Sets

Where to start? To make an AI predict human emotion, we first need to show it different emotions associated with different audio clips; this is called a labeled data-set. I used Kaggle to get data-sets of audio with labeled emotions: I started with RAVDESS Emotional speech audio and later added RAVDESS Emotional song audio.

I had the training data, so now I just needed to build a model and pass it the data to train, right? Well, thatā€™s what I thought. Itā€™s gonna be easyā€¦

To my surprise, the data-set said ā€œNO, you canā€™t feed me directly into that model of yours.ā€

Class distribution before excluding any emotion classes
Class distribution after excluding neutral, disgust and surprised

This being my first real AI project from scratch, I spent 3 days learning how to process the data and make it ready to feed into the model. I used Python to get the labels in list format. On the last day of my project I also excluded the surprised, neutral and disgust emotions from the data-set, as they had very low occurrence and could have been dragging down the prediction accuracy.

Basically, the function gives us the audio file paths (where each file is located on disk) and all the associated labels as lists.
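The full code is in the notebook, but the idea of that function is roughly this (a minimal sketch; the function name, folder layout and exclusion set here are my illustration, and it relies on RAVDESS encoding the emotion as the third field of each filename):

```python
import os

# RAVDESS filenames encode the emotion in the 3rd field, e.g. 03-01-06-01-02-01-12.wav
EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
EXCLUDED = {"neutral", "disgust", "surprised"}  # the low-occurrence classes I dropped

def load_paths_and_labels(root_dir):
    """Walk the data-set folder, return parallel lists of audio paths and emotion labels."""
    paths, labels = [], []
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.endswith(".wav"):
                continue
            emotion = EMOTIONS.get(name.split("-")[2])
            if emotion is None or emotion in EXCLUDED:
                continue
            paths.append(os.path.join(dirpath, name))
            labels.append(emotion)
    return paths, labels
```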

Splitting the Data for training and validating

Then I split the data-set in two: one part for training and one for validating the AI. I did it with raw Python code first, then switched to the scikit-learn library to randomly split the data and save the splits as *.csv files, so that I could fairly compare other models trained on the same data against the same validation set.
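In code, the scikit-learn version boils down to something like this (a sketch; the 80/20 ratio, random_state and file names are just the kind of values I used, not necessarily the exact ones):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# paths and labels come from the loading step above
df = pd.DataFrame({"path": paths, "label": labels})

# Stratify on the label so every emotion shows up in both splits,
# and fix random_state so the validation set stays the same between runs.
train_df, val_df = train_test_split(df, test_size=0.2,
                                    stratify=df["label"], random_state=42)

train_df.to_csv("train.csv", index=False)
val_df.to_csv("val.csv", index=False)
```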

Now, the coding part may be unfamiliar to some readers, so Iā€™ll try to explain my process of developing the AI with the minimal code needed.

The Model

To build the model I used another open-source library, PyTorch. I started with one-dimensional convolutional (1D conv) models with 4ā€“5 layers and trained them on the raw audio data. It was disastrous: the task turned out to be ā€œnot so easyā€ rather than ā€œeasyā€. At that point I was also working with only the first data-set and all the emotion classes, and the accuracy was at most 14.58%, which is horrible.
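For the curious, here is a minimal sketch of what I mean by a 4ā€“5 layer 1D conv model on raw waveforms (the layer sizes are illustrative, not my exact architecture):

```python
import torch.nn as nn

# A rough sketch of a small 1D-conv network that takes raw waveform
# samples shaped (batch, 1, num_samples) and outputs class scores.
class RawAudioCNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=9, stride=2), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=3), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(64, 64, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),            # collapse the time axis
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.features(x).squeeze(-1)        # (batch, 64)
        return self.classifier(x)
```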

A Bit About 2D Representations of Audio

I looked for any blog posts or prior research on this same problem. After a lot of reading and watching YouTube videos, I found that people are obsessed with spectrograms when they work on classifying audio. From what I learned, I saw two ways to feed the data to the model. Actually, the spectrogram representation was in Destinā€™s video too; here he is explaining it:

  1. Spectrogram: It is a visual representation of the audio we hear. Since it is essentially an image of the audio, I can use a basic image-classification model to classify the audio! Hurray šŸ„³
  2. Mel-Spectrogram: It is also a spectrogram, but on the ā€œmel scaleā€, which focuses on the frequencies in the human hearing range. It is made by applying a mel filter bank to the Fourier-transformed audio. The main thing for my project is that it reduces the amount of data the model has to train on, so it requires less time too.

How did I make these spectrograms, you ask? Here is an awesome, funny blog post that helped me process the audio into this format using another library, librosa: Getting to Know the Mel Spectrogram.
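Roughly, with librosa the two representations come down to something like this (a sketch; the sampling rate and n_mels values are typical defaults, not necessarily what my final notebook uses):

```python
import numpy as np
import librosa

def audio_to_spectrograms(path, sr=22050, n_mels=128):
    """Return a plain (dB-scaled) spectrogram and a mel-spectrogram for one audio file."""
    y, sr = librosa.load(path, sr=sr)

    # Plain spectrogram: magnitude of the short-time Fourier transform, in decibels
    spec = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

    # Mel-spectrogram: same idea, but with frequencies mapped onto the mel scale
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels), ref=np.max
    )
    return spec, mel
```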

Reconsidering The Model

After considering all this, I started building the model and processing the data again. The different models I tried were not performing well, so I switched to resnet18 with some tweaks for my audio classification. All thanks to PyTorch for making the pre-trained model available to us.
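The exact tweaks are in the notebook, but the spirit of them is something like this (a sketch; the helper name is mine): change the first convolution to accept a single-channel spectrogram and swap the final layer for our emotion classes.

```python
import torch.nn as nn
from torchvision import models

def make_audio_resnet(num_classes, pretrained=True):
    """resnet18, tweaked to accept single-channel spectrogram 'images'."""
    model = models.resnet18(pretrained=pretrained)
    # Spectrograms have one channel, not three like RGB photos
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # Swap the 1000-class ImageNet head for our emotion classes
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```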

While making all these modifications, I came to know about another representation of audio: MFCC (mel-frequency cepstral coefficients). I considered this as well, since my model at the time was getting only about 38% accuracy. I made a simple function to generate the different 2D representations based on the mode given to it.
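That mode-based function looked roughly like this (a sketch reusing the earlier helper; the mode names and n_mfcc value are my illustration):

```python
import librosa

def make_feature(path, mode="spectrogram", n_mfcc=40):
    """Pick the 2D representation by mode: 'spectrogram', 'mel' or 'mfcc'."""
    if mode == "mfcc":
        y, sr = librosa.load(path, sr=22050)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    spec, mel = audio_to_spectrograms(path)   # helper from the earlier sketch
    return spec if mode == "spectrogram" else mel
```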

In the days before the final submission I was working about 12 hours per day to make it work. As it happened, the day before the deadline I managed to get above 66% accuracy on all three representations by tweaking the function that generated the 2D representations of the audio. This was also the day I introduced the second data-set into the project and removed the rarely occurring classes. Not to mention, I completely stopped using PyTorchā€™s modules to generate spectrograms and went all-in with librosa.

Results

To be precise, on the same validation set I got 75.63%, 66.87% and 65.94% accuracy using the Spectrogram (1 hr 45 min to train), Mel-Spectrogram (37 min 10 s to train) and MFCC (22 min 43 s to train) models respectively.

The MFCC-fed model surprised me: in some runs it got to 67% without the pre-trained resnet18 weights, while the others needed the fully pre-trained model, and it also took the least time to train.

The Test

But for predicting emotions I chose the Spectrogram-fed model, simply because it was the most accurate and I had spent the most time training it.

I also made a function that, given the data (an audio file) and the model, predicts the emotion. A rough sketch follows, and then letā€™s see what the AI predicted after all this training.
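This is only a sketch of that prediction helper; the names, and the assumption that the model and the list of class names are passed in, are mine rather than the notebookā€™s exact code.

```python
import torch

def predict_emotion(audio_path, model, classes, device="cpu"):
    """Turn one audio file into a spectrogram, run it through the model, return the label."""
    spec = make_feature(audio_path, mode="spectrogram")        # helper from earlier
    x = torch.tensor(spec, dtype=torch.float32).unsqueeze(0).unsqueeze(0)  # (1, 1, H, W)
    model.eval()
    with torch.no_grad():
        logits = model(x.to(device))
    return classes[logits.argmax(dim=1).item()]
```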

My AI also fails to predict correctly in some cases, like the following:

So, can we say machines understand our emotions?

Not really; they are just able to classify emotions, in this case from audio. But building on this kind of AI, developers can design their applications so that interaction between machine and human becomes easier.

Before ending, I want to ask a question: did you predict the last one correctly? Did it match the label?

All the source code is openly available as a Jupyter Notebook on Kaggle and on Jovian.

This was written in a hurry, so if you spot any typos feel free to point them out in the comments. If you notice any mistakes I made, please comment as well, and Iā€™m open to suggestions too.
