Week 2 — Moodify: Detecting the Mood of Music
Hello and welcome to our second blog post for our BBM406 Project. In our last post, we gave a broad overview of the subject and our motivation. This time, before diving into our implementation, we would like to explain the details of emotion recognition and introduce the dataset we plan to use.
Emotion Recognition From Music Audio
Describing the concept of emotion is not straightforward. Emotions are subjective experiences that depend on many factors, including culture, education, and personal experience. Research has shown that mood perception differs between respondents from distant cultural backgrounds [1]. Although this subjectivity makes it difficult to classify the mood of music, some studies [2] have shown that musical sounds with certain structures usually carry an acceptable degree of common emotional expression.
Moreover, there is a wide variety of methodological choices to make. Different choices result in completely different evaluation metrics, which makes the accuracies of different algorithms hard to compare directly. Figure 1 shows the different data annotation and representation choices in the form of a labyrinth.
For our implementation, we prefer to classify each song with a single label according to a categorical model. Categorical models involve several distinct emotion labels, which keeps classification simple. Dimensional models, on the other hand, place emotions along several independent axes in a continuous space. For example, in Russell’s widely used valence/arousal model, valence stands for the polarity of the emotion (negative versus positive states) while arousal represents its intensity [3], as shown in figure 2. We stick with a categorical model in order to evaluate our accuracy more precisely. The small sketch below illustrates the difference between the two representations.
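To make the distinction concrete, here is a minimal sketch (in Python, not taken from any actual codebase) of how one song could be annotated under each scheme. The class names and example values are purely illustrative.

```python
from dataclasses import dataclass

# Categorical annotation: one discrete label per song.
MOOD_CLASSES = ["cluster_1", "cluster_2", "cluster_3", "cluster_4", "cluster_5"]

@dataclass
class CategoricalAnnotation:
    song_id: str
    label: str        # one of MOOD_CLASSES

# Dimensional annotation (Russell-style): a point in valence/arousal space.
@dataclass
class DimensionalAnnotation:
    song_id: str
    valence: float    # negative (-1.0) to positive (+1.0)
    arousal: float    # calm (-1.0) to intense (+1.0)

# The same (hypothetical) song under the two schemes:
categorical = CategoricalAnnotation("song_042", label="cluster_2")
dimensional = DimensionalAnnotation("song_042", valence=0.8, arousal=0.6)
```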
The categorical model we adopt is also employed for audio mood classification in the Music Information Retrieval Evaluation eXchange (MIREX), a well-known contest for the annual evaluation of music information retrieval (MIR) algorithms [4]. This model defines emotion classes as clusters of adjectives instead of single words. The adjectives in each cluster have close meanings and are expected to imply the same emotional state. The five clusters defined by the model are shown in figure 3 below [5].
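For illustration, such a cluster-based labeling scheme could be represented as a simple mapping from cluster labels to sets of mood adjectives. The adjective lists below are abbreviated and only indicative; figure 3 and [5] give the exact definitions.

```python
# Illustrative sketch of MIREX-style mood clusters (abbreviated adjective lists).
MOOD_CLUSTERS = {
    "cluster_1": {"passionate", "rousing", "confident"},
    "cluster_2": {"cheerful", "fun", "sweet"},
    "cluster_3": {"poignant", "wistful", "brooding"},
    "cluster_4": {"humorous", "silly", "whimsical"},
    "cluster_5": {"aggressive", "intense", "volatile"},
}

def cluster_of(adjective):
    """Return the cluster label for a mood adjective, or None if it is unknown."""
    for cluster, adjectives in MOOD_CLUSTERS.items():
        if adjective.lower() in adjectives:
            return cluster
    return None

print(cluster_of("Wistful"))  # -> cluster_3
```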
Dataset
In music emotion recognition (MER) studies, datasets are rarely made public and reused due to audio copyright restrictions, so there are few alternatives to choose from. Creating a new dataset by crawling tags from social music websites is possible, but it would be costly and the labels would not be reliable enough. We also want to be able to compare our results with other studies that use the same dataset.
Therefore, we decided to use a dataset [6] organized in a similar way to the MIREX Mood Classification task. It contains 903 audio clips labeled with the same five clusters used by MIREX. The dataset was built on the AllMusic database, which is advantageous because annotations on AllMusic are made by professionals, whereas most other databases rely on annotations by ordinary listeners.
Another advantage of this dataset is that it contains lyrics for 764 of the 903 audio clips. Although we primarily want to focus on audio features, this gives us a chance to improve our model in later stages by combining it with features extracted from the lyrics.
The most important reason we prefer this dataset is that it was used by several competitors in the Audio Music Mood Classification task of MIREX, so we can compare our results against those implementations to judge whether our model’s accuracy is good enough. A rough sketch of how we expect to load the clips and their labels follows below.
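The snippet below is only a rough sketch, not our final pipeline: it assumes the clips sit in a folder and that a CSV file maps each clip’s filename to one of the five clusters. The directory layout, file names, and column names ("clips/", "labels.csv", "filename", "cluster") are our own assumptions, not the dataset’s actual structure, and the mean-MFCC descriptor is just a placeholder feature.

```python
import csv
import os

import librosa  # audio loading and feature extraction


def load_dataset(root="dataset", clips_dir="clips", labels_file="labels.csv"):
    """Return a list of (feature_vector, cluster_label) pairs."""
    samples = []
    with open(os.path.join(root, labels_file), newline="") as f:
        for row in csv.DictReader(f):  # assumed columns: filename, cluster
            path = os.path.join(root, clips_dir, row["filename"])
            y, sr = librosa.load(path, sr=22050, mono=True)
            # 20 MFCCs averaged over time: a simple fixed-length descriptor.
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
            samples.append((mfcc.mean(axis=1), row["cluster"]))
    return samples
```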
Additionally, it is worth mentioning that we initially planned to use Google Research’s AudioSet ontology [7], which includes 16,955 music videos labeled with 7 different moods based on data collected from human listeners. We even obtained the whole dataset in audio format and prepared it for feature extraction. However, we changed our mind because we could not find any other research to compare our results against, and the other dataset has the advantages mentioned above. Still, we think this dataset is also valuable and we may reconsider using it as a secondary dataset for our model.
Future Plans
Next week, we are planning to work on choosing the most suitable model and features. That’s it for now; stay tuned for our next post and have an amazing week!
Emir Kaan Kırmacı Tuna Karacan Cihad Özcan
References
[1] Hu, Xiao & Lee, J. H. (2012). “A Cross-cultural study of music mood perception between American and Chinese listeners”. Proceedings of the 13th International Society for Music Information Retrieval Conference, ISMIR 2012, 535–540.
[2] Krumhansl, C. L. (2002). “Music: A link between cognition and emotion”. Current Directions in Psychological Science, vol. 11, no. 2, pp. 45–50.
[3] Russell, James (1980). “A Circumplex Model of Affect”. Journal of Personality and Social Psychology, 39, 1161–1178. doi:10.1037/h0077714.
[4] Available at: https://www.music-ir.org/mirex/wiki/MIREX_HOME
[5] Hu, Xiao & Downie, J. & Laurier, Cyril & Bay, Mert & Ehmann, Andreas (2008). “The 2007 MIREX audio mood classification task: Lessons learned”. Proceedings of the 9th International Conference on Music Information Retrieval, ISMIR 2008.
[6] Panda, Renato & Malheiro, Ricardo & Rocha, Bruno & Oliveira, António & Paiva, Rui Pedro. (2013). “Multi-Modal Music Emotion Recognition: A New Dataset, Methodology and Comparative Analysis.”
Available at: http://mir.dei.uc.pt/downloads.html
[7] https://research.google.com/audioset/ontology/music_mood_1.html