Automated Emotion Recognition in Songs

Christina Yu
Bucknell AI & CogSci
Mar 27, 2021

Authors: Christina Yu, Swarup Dhar, Hannah Shin, Lawrence Li

Music is like a universal language, capable of conveying emotions, thoughts, and feelings. No matter the culture, the place, or the time period, music seems to penetrate all facets of society. As a group, we are fascinated by music and the emotions it evokes in people. We explored whether we could train an automated agent to detect emotional features in songs that match, as closely as possible, those attributed by human listeners. Originally, our idea was to create a recommender system based on a user's current mood and preferences. However, due to time and technical constraints we pivoted to the current idea, which we also found to be a much more interesting problem to tackle than a recommendation system.

Our Dataset

In order to help us implement an automated system for detecting emotion, we found a dataset (the IMAC dataset) with pre-labeled songs. The dataset can be found here. IMAC leverages the Million Song Dataset to gather songs and uses a global list of words to assign each song a score based on whether its lyrics match any of the words on that list. The dataset maps each song to a vector of positive, neutral, and negative scores. This gave us a good starting point, since having pre-labeled data makes our lives a lot easier. It also simplifies the problem: instead of detecting any arbitrary emotion in a given song, we only need to detect its positivity, neutrality, and negativity. The dataset contains 4,159 rows of labeled song data, which is more than enough for us to train an agent.
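For illustration, here is roughly how such pre-labeled data could be loaded and collapsed into a single emotion label per song. The file name and column names below are hypothetical placeholders and may not match the actual IMAC files.

```python
import pandas as pd

# Hypothetical file and column names -- the actual IMAC layout may differ.
songs = pd.read_csv("imac_labels.csv")  # e.g. columns: song_id, positive, neutral, negative

# Each row maps a song to a vector of emotion scores.
print(songs.head())

# Collapse the score vector into one categorical label per song
# (the class with the highest score), which is how we frame the task.
songs["label"] = songs[["positive", "neutral", "negative"]].idxmax(axis=1)
print(songs["label"].value_counts())
```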

Research & Existing Systems

Figure: A mel spectrogram of a 3 second audio file

We should first note that the IMAC dataset was created for a research study on crossmodal emotional analysis, in which researchers tried to find emotional correspondence between images and music (Verma et al., 2019). In that study, the authors train a neural network to match the emotions of a song with the emotions evoked by images. Since the dataset goes hand-in-hand with their approach to emotion identification, we decided to incorporate many of their strategies, such as using mel spectrograms, to identify emotions in music. We also found a substantial body of existing research in this field, with many open questions still being explored. For example, back in 2012, with the rise of Spotify and the growing popularity of music streaming services, Yi-Hsuan Yang and Homer Chen published a review of possible methods for working with emotion annotations, training models, and visualizing results. In their paper, they proposed two viewpoints for recognizing emotions in music: a categorical conceptualization of emotion and a dimensional approach to emotion. We gravitated towards the categorical conceptualization since we thought its technical challenges were easier to overcome and it matched the intended usage of our dataset. Even in 2012, Yang and Chen recognized that machine recognition of emotions in songs was in its infancy (Yang and Chen, 2012). In contrast, Verma et al. (2019) use feature extraction methods, such as mel spectrograms, to obtain an image representation of the music and then apply a slew of established image processing and recognition techniques to identify emotion in songs with supervised learning.
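As a concrete, hedged illustration of the mel spectrogram step, the snippet below turns a short clip into an image-like representation using librosa. The file name and parameters are illustrative, not the exact settings used by Verma et al. or by us.

```python
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

# Load a short audio clip (librosa resamples to 22,050 Hz by default).
y, sr = librosa.load("example_clip.wav", duration=3.0)

# Compute a mel spectrogram and convert power to decibels,
# giving a 2D, image-like representation of the clip.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Mel spectrogram of a 3 second clip")
plt.tight_layout()
plt.show()
```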

The Neural Agent

Figure: UML class diagram of our implementation

Our neural agent consists of a five-layer convolutional network with a MaxPooling2D layer, an input layer, a dropout layer, and a Flatten layer. This architecture is appropriate because, as mentioned above, we follow the approach laid out by the creators of the IMAC dataset: we first create an image representation of the music through mel spectrograms and then apply supervised classification to those images. One deviation from the IMAC dataset is that we treat each emotional category (positive, neutral, negative) as a categorical variable with a single level rather than five levels per category. This simplifies both the classification process and the preprocessing of the data.
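The sketch below shows, in Keras, the general shape of such a network. The number of convolutional layers and filters, the dropout rate, and the input size are illustrative placeholders rather than our exact configuration.

```python
from tensorflow.keras import layers, models

# Illustrative input shape: a mel spectrogram treated as a single-channel image.
INPUT_SHAPE = (128, 128, 1)
NUM_CLASSES = 3  # positive, neutral, negative

model = models.Sequential([
    layers.Input(shape=INPUT_SHAPE),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.3),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```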

Figure: Artist’s rendition of our neural network architecture

Overview of the Entire System

Figure: System pipeline

As shown in the figure, the main "production cycle" of our project revolves around a feedback loop between the user and the neural agent. The loop lets us correct false classifications and allows the user to provide their own input about a song. The user can either ask to listen to a song of a certain emotion from our existing database or supply their own song for the neural agent to classify. The user can then respond with their own classification of the song's emotion, which can later be used to improve the system as a whole; a rough sketch of one iteration of that loop follows.
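In the sketch below, extract_mel_spectrogram and the file names are hypothetical stand-ins for our actual preprocessing and storage code; it only illustrates the flow of prediction, user correction, and logging.

```python
EMOTIONS = ("positive", "neutral", "negative")

def feedback_loop(model, song_path):
    """Predict a song's emotion, then record the user's own label for later retraining."""
    spectrogram = extract_mel_spectrogram(song_path)       # hypothetical preprocessing helper
    scores = model.predict(spectrogram[None, ..., None])[0]  # add batch and channel axes
    predicted = EMOTIONS[scores.argmax()]
    print(f"Predicted emotion: {predicted}")

    user_label = input("Your label (positive/neutral/negative, Enter to accept): ").strip()
    final_label = user_label if user_label in EMOTIONS else predicted

    # Store the (song, label) pair so future training runs can use the correction.
    with open("user_feedback.csv", "a") as f:
        f.write(f"{song_path},{final_label}\n")
```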

Results

Our model is definitely not perfect, but it is sufficient for an MVP. The training accuracy is 98.69%, while the validation accuracy is only 80.53%. However, if we were to continue this project and allow users to provide feedback, we believe that over time our model could be made more robust and better fit each user's music emotion preferences.

Figure: The train accuracy and loss through the first 80 epochs

The training loss decreases steadily as the number of epochs increases, which is the behavior we expect from a network that is learning. However, the training and validation accuracies begin to diverge after roughly 10 epochs, and the widening gap between them suggests that the model is overfitting the training data.
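For reference, curves like those in the figure can be produced directly from the Keras training history. The snippet below is a generic sketch and assumes X_train, y_train, X_val, and y_val already hold the preprocessed spectrograms and labels.

```python
import matplotlib.pyplot as plt

# `history` is returned by model.fit when validation data is supplied.
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=80, batch_size=32)

plt.plot(history.history["accuracy"], label="train accuracy")
plt.plot(history.history["val_accuracy"], label="validation accuracy")
plt.plot(history.history["loss"], label="train loss")
plt.xlabel("epoch")
plt.legend()
plt.show()
```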

Ethical Analysis

During the implementation of our project, we went through many different ideas and adjusted our project as we went. Our project consisted of five main sprints, and in each sprint there were minor adjustments to our ethical analysis.

By the end of the project we had hoped to finish accommodating corrections and feedback from the user (in our frontend implementation), but even if we had, there are a few things we would need to be careful about. If we take in user feedback, we need to be conscious of the demographics of those who give it, especially since, in the current state of the project, one person's feedback could affect the database for all users. As noted in the article "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?":

“large datasets based on texts from the Internet overrepresent hegemonic viewpoints and encode biases potentially damaging to marginalized populations” (Bender et al., 2021).

Although our project does not accept free-form text, it was still planned to take in user input, which could lead to the same kind of bias. In the other case, where we do not take in user feedback, we could incorrectly rate songs, which could in turn cast a negative light on certain songs and their singers. One thing we wish we could have done was to categorize our data and test for any apparent bias in the music emotion scores, but given the way the data was organized, there was no efficient way to do so.

During the creation of our project, we had to find a way to play the songs back to the user. The method we chose was to download songs from YouTube, whose content is subject to copyright. To use this content, we relied on the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license to make sure we are able to download and share it. Our project is not for commercial purposes and is only meant to be used within the confines of our class project. We credit the original authors, who use YouTube URL downloads for the IMAC dataset, as well as the Million Song Dataset (Verma et al., 2019; Bertin-Mahieux et al., 2011). We do not support downloading copyrighted content from YouTube URLs without permission. However, this project is still vulnerable to content abuse. To follow the ACM Code of Ethics, we only allow the current AI to predict emotions, and we keep the downloaded URL sources and all downloaded .wav files hidden from potential abusers. Concretely, we hide all downloaded files from the public, and the user can only play a song from within the program; users cannot obtain the .wav file itself. We acknowledge that doing so also hides the URL download procedure and the neural network training and prediction details from the public in general. We believe that respecting the copyright of the content is more important than full transparency with the public.

Following the ACM Code of Ethics, our project did its best to be as transparent as possible, aside from the copyright constraints mentioned in the paragraph above. Our project does not support any form of discrimination, and had there been more time to organize the data better, we would have tried to find any indication of discrimination and act against it in any way we could. If our project had completed personalized features for users, we would have made sure to honor the users' privacy and to ensure that their data was used for no purpose other than their own benefit (and we would have been transparent with users about this as well). Our group also made sure to communicate well with each other, so that we could point out any cases where ethics should be taken into consideration.

Conclusion

In summary, our neural agent recommends songs of a specified mood to the user, based on our training dataset from IMAC. Recognizing that everyone experiences music and moods differently, future improvements would take the user's current music taste into consideration and invite user feedback to improve our predictions for different users. In addition, we'd expand the mood categories beyond "positive", "neutral", and "negative" to include moods such as calm, distress, and acceptance. Working on such systems can reveal further trends in how different people experience music, and could showcase the mechanisms artists employ to provoke certain emotions in listeners.

References

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. doi:10.1145/3442188.3445922

Bertin-Mahieux, T., Ellis, D. P., Whitman, B., & Lamere, P. (2011). The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR).

Brownlee, J. (2020, August 27). How to save and load your keras deep learning model. Retrieved March 22, 2021, from https://machinelearningmastery.com/save-load-keras-deep-learning-models/

Huq, A., Bello, J. P., & Rowe, R. (2010). Automated music emotion recognition: A systematic evaluation. Journal of New Music Research, 39(3), 227–244.

Juthi, J. H., Gomes, A., Bhuiyan, T., & Mahmud, I. (2020). Music emotion recognition with the extraction of audio features using machine learning approaches. In Proceedings of ICETIT 2019 (pp. 318–329). Springer, Cham.

Kim, Y. E., Schmidt, E. M., Migneco, R., Morton, B. G., Richardson, P., Scott, J., … & Turnbull, D. (2010, August). Music emotion recognition: A state of the art review. In Proc. ismir (Vol. 86, pp. 937–952).

Soleymani, M., Caro, M. N., Schmidt, E. M., Sha, C. Y., & Yang, Y. H. (2013, October). 1000 songs for emotional analysis of music. In Proceedings of the 2nd ACM international workshop on Crowdsourcing for multimedia (pp. 1–6).

Vaidya, K. (2020, December 09). Music genre recognition using convolutional neural networks (CNN) — Part 1. Retrieved March 22, 2021, from https://towardsdatascience.com/music-genre-recognition-using-convolutional-neural-networks-cnn-part-1-212c6b93da76

Vaidya, K. (2021, February 11). Music genre recognition using convolutional neural networks- part 2. Retrieved March 22, 2021, from https://towardsdatascience.com/music-genre-recognition-using-convolutional-neural-networks-part-2-f1cd2d64e983

Verma, G., Dhekane, E. G., & Guha, T. (2019, May). Learning affective correspondence between music and image. In ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3975–3979). IEEE.

Yang, Y. H., & Chen, H. H. (2011). Music emotion recognition. CRC Press.
