Exploring the Potential of AI for Music Emotion Recognition and Generation

Daniel · Published in d*classified · 8 min read · May 24, 2023

Through analysis of instrumental audio content, Lim Xi Chen Terry explored various models and techniques to classify audio clips by the emotional response they evoke in listeners. He also explored AI-generated audio as a way to improve the performance of the classifier. He was mentored by Daniel Lee, Head Engineering (Information) at the Defence Science and Technology Agency. This post is an abridged, derivative version of Terry’s original report.

TLDR: The sound in a video can shape the listener’s emotions and perception, making it a strategic tool in information campaigns; analyzing and categorizing audio content can aid in understanding and countering its impact. Our study compared different feature sets and models and found that the SVC model with Wav2Vec2 embeddings performed best. We also explored using AI-generated audio to create more training data for future analysis.

Photo by Anthony DELANOIX on Unsplash

Background

Every morning, I am greeted by the melodious chirping of birds and the rhythmic hum of buses rushing by. Prior to working on this project, I had never truly appreciated the beauty of these daily phenomena. However, I began to wonder — what exactly is sound?

In its raw, non-digital form, sound is simply variations in air pressure that the human ear can detect. It consists of two things:

  1. A combination of wave frequencies at various intensities
  2. A time factor (sound is continuous over a duration, as opposed to a snapshot at a single instant)
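
To make these two components concrete, here is a minimal sketch (not from the original report) of what digitized sound looks like in code, using librosa; the file name and 22.05 kHz sample rate are placeholder assumptions.

```python
# A minimal sketch of digitized sound: an array of samples (frequencies at
# various intensities) plus a sample rate that supplies the time factor.
# The file name and sample rate are illustrative assumptions.
import librosa
import numpy as np

y, sr = librosa.load("example_song.mp3", sr=22050)   # y: waveform samples, sr: samples per second

duration = len(y) / sr                                # the time factor: clip length in seconds
spectrum = np.abs(np.fft.rfft(y[:sr]))                # frequencies present in the first second
freqs = np.fft.rfftfreq(sr, d=1 / sr)                 # corresponding frequency bins in Hz

print(f"{duration:.1f} s of audio; strongest frequency ~ {freqs[spectrum.argmax()]:.0f} Hz")
```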

It is important to note that sound data does not conform to a structured format, unlike tabular or image data, so processing it can be a formidable challenge. The difficulty did not stop there, however, as I was assigned not just one project, but TWO. How cool is that?

The two projects are:

  1. Project Emolysis (Audio Emotion Recognition through Machine Learning)
  2. Project MoodGenius (Music Generative AI)

Project Emolysis Overview

Videos are often used by terrorists to spread propaganda and influence the public. Beyond persuasive imagery, the audio itself can be used to influence viewers; for example, the strategic use of melancholic music can evoke sympathy towards a specific agenda. Given the increasing prevalence of short-form videos on online platforms, which often feature instrumental music as the background audio, it is worth exploring tools to analyze and categorize audio content.

Dataset

Using Python packages, I gathered a selection of music compilations from online sources and curated them into a dataset with a total duration of 20 hours and 17 minutes. Each song is labelled with one of five emotions it can evoke: Angry, Calm, Fear, Happy and Sad. The data is then passed into the following pre-processing pipeline.
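
The report does not name the exact Python packages used for collection, so the following is only a hedged sketch of one common approach, using yt-dlp to pull the audio track of a compilation; the URL is a placeholder.

```python
# Hypothetical download step: grab the best audio stream of a compilation and
# convert it to mp3. The package choice and URL are assumptions, not the
# report's actual tooling.
import yt_dlp

options = {
    "format": "bestaudio/best",
    "outtmpl": "compilations/%(title)s.%(ext)s",
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "mp3"}],
}

with yt_dlp.YoutubeDL(options) as ydl:
    ydl.download(["https://example.com/placeholder-music-compilation"])
```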

Pre-processing

Firstly, the raw compilations are broken down into individual songs by detecting pauses where the audio stays below a 40 dB loudness threshold for at least 1 second. This approach targets the transitions between songs, where a quiet pause of about a second is usually present. The 40 dB threshold was chosen because it is roughly the noise level of a quiet library.
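
A sketch of how this splitting step could look with pydub’s silence detection; the values mirror those above, but note that pydub measures loudness in dBFS (relative to full scale), so the exact numbers are assumptions rather than the report’s code.

```python
# Split a compilation into songs wherever there is a quiet pause of >= 1 second.
from pydub import AudioSegment
from pydub.silence import split_on_silence

compilation = AudioSegment.from_file("compilation.mp3")   # placeholder path

songs = split_on_silence(
    compilation,
    min_silence_len=1000,   # a pause of at least 1 second marks a song boundary
    silence_thresh=-40,     # "quiet" = roughly 40 dB below full scale (assumed mapping)
    keep_silence=200,       # keep a short buffer so songs do not start abruptly
)

for i, song in enumerate(songs):
    song.export(f"song_{i:03d}.mp3", format="mp3")
```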

Next, because some metadata was absent from the downloaded media, there was a chance of identical songs being downloaded across the numerous compilations. This could lead to data leakage during model training, where the same song appears in both the training and testing datasets. To tackle this issue, several techniques were employed to compare the audio clips.

Overall, Method 1 was chosen as the primary checker, with Method 3 serving as a sanity check on the results through comparison of the metadata.
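
The table of comparison methods is not reproduced here, so the sketch below shows just one illustrative way to flag near-duplicate clips: comparing averaged MFCC “fingerprints” with cosine similarity. The approach and threshold are assumptions, not the report’s Method 1 or Method 3.

```python
# Illustrative near-duplicate check (assumed approach, not the report's own):
# summarise each clip as a mean MFCC vector and compare with cosine similarity.
import librosa
import numpy as np

def fingerprint(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)                      # one compact vector per clip

def is_duplicate(path_a: str, path_b: str, threshold: float = 0.99) -> bool:
    a, b = fingerprint(path_a), fingerprint(path_b)
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cosine > threshold                     # threshold chosen arbitrarily
```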

Next, because there were only a limited number of unique songs among the downloads, there was not enough data to train the model. This led to the idea of splitting each song into 5-second intervals to increase the number of data points, under the assumption that every segment of a song conveys the same emotion throughout.

To ensure accurate song segmentation, it was also necessary to remove the extraneous data points caused by silences at the beginning and end of each song. Leading and trailing silences were therefore stripped from the audio samples, using the same 40 dB loudness threshold.
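
A sketch of the trimming and 5-second chunking with librosa; the paths and sample rate are assumptions.

```python
# Trim leading/trailing silence, then cut the song into 5-second clips.
import librosa
import soundfile as sf

y, sr = librosa.load("song_001.mp3", sr=22050)           # placeholder path, assumed rate
y_trimmed, _ = librosa.effects.trim(y, top_db=40)        # strip silences quieter than 40 dB below peak

chunk = 5 * sr                                           # 5 seconds' worth of samples
for i in range(0, len(y_trimmed) - chunk + 1, chunk):
    sf.write(f"song_001_clip_{i // chunk:03d}.wav", y_trimmed[i:i + chunk], sr)
```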

Modelling

Two types of features were targeted: prosodic and spectral features. Prosodic features refer to the musical elements that convey meaning and emotion beyond the literal notes played, such as pitch and energy. Spectral features describe the frequency content that distinguishes different instruments and notes, such as Mel Frequency Cepstral Coefficients (MFCCs) and Zero-Crossing Rate. The following table showcases the four feature sets explored in this project.
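
The report’s table of feature sets is not reproduced here, but the sketch below shows how representative prosodic (pitch, energy) and spectral (MFCC, zero-crossing rate, spectral centroid) features can be extracted with librosa; the specific choices are illustrative only.

```python
# Representative prosodic and spectral features for one 5-second clip
# (illustrative choices; the report's exact four feature sets are not shown).
import librosa
import numpy as np

y, sr = librosa.load("clip.wav", sr=22050)

features = {
    # prosodic: pitch (fundamental frequency) and energy
    "pitch_mean": float(np.mean(librosa.yin(y, fmin=65, fmax=2093, sr=sr))),
    "energy_mean": float(librosa.feature.rms(y=y).mean()),
    # spectral: MFCCs, zero-crossing rate, spectral centroid
    "mfcc_mean": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1),
    "zcr_mean": float(librosa.feature.zero_crossing_rate(y).mean()),
    "centroid_mean": float(librosa.feature.spectral_centroid(y=y, sr=sr).mean()),
}
```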

In the modelling process, two approaches were explored: Traditional Machine Learning and Deep Learning.

Under Traditional Machine Learning, various categories of models were explored, namely Linear models, Trees, Support Vector Machines, Naive Bayes, and Discriminant Analysis.

Under Deep Learning, the HuBERT model was explored. Although originally intended for speech representation learning, the model is versatile and can be applied to this project. By employing transfer learning, it can be customized for the new task at hand; specifically, its 12-layer BERT-style encoder can be unfrozen and fine-tuned for optimal performance.
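
A minimal sketch of that transfer-learning setup using the Hugging Face HuBERT implementation; the checkpoint, label count, and exact layers unfrozen are assumptions rather than the report’s configuration.

```python
# Load a 12-layer HuBERT base model with a 5-class classification head,
# freeze the convolutional feature extractor, and leave the transformer
# encoder layers trainable for fine-tuning. (Assumed setup, for illustration.)
from transformers import HubertForSequenceClassification

model = HubertForSequenceClassification.from_pretrained(
    "facebook/hubert-base-ls960",
    num_labels=5,                       # Angry, Calm, Fear, Happy, Sad
)

model.freeze_feature_encoder()          # keep the CNN front-end fixed
for layer in model.hubert.encoder.layers:
    for param in layer.parameters():
        param.requires_grad = True      # fine-tune the 12 encoder layers
```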

The dataset is initially split into training and testing sets using an 80:20 ratio. Augmented data is present in the training set, but not in the testing set. To identify the top-performing models, I performed 5-fold cross-validation, where the training data is further split into training and evaluation data using an 80:20 ratio for each fold.

To prevent overlaps between clips from the same song and ensure that there are roughly equal numbers of classes in the evaluation and testing data, stratified group split was used for both 80:20 splits.

After obtaining the top 5 performing models, these models are refitted with the entire training data (including its augmented counterparts) and tested on the hold-out test set.
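
A sketch of this split-and-validate scheme with scikit-learn; the feature matrix X, label array y, and per-clip song IDs (groups) are assumed to exist already, and the SVC pipeline stands in for whichever model is being evaluated.

```python
# Grouped, stratified 80:20 split plus 5-fold cross-validation, so clips from
# the same song never appear on both sides of a split. (X, y, groups assumed.)
from sklearn.model_selection import StratifiedGroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 80:20 train/test split that keeps every song's clips together.
outer = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
train_idx, test_idx = next(outer.split(X, y, groups))

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# 5-fold cross-validation on the training portion, again grouped by song.
inner = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    clf, X[train_idx], y[train_idx],
    groups=groups[train_idx],
    cv=inner,
    scoring="recall_macro",   # macro recall, i.e. UAR
)

# Refit on the full training set, then predict on the hold-out test set
# and evaluate with the metrics described below.
clf.fit(X[train_idx], y[train_idx])
y_pred = clf.predict(X[test_idx])
```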

The performance of the models was compared based on several metrics, such as Unweighted Average Recall (UAR), Macro F1 score, and Weighted F1 score.
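
These metrics are all available in scikit-learn; a quick sketch, with y_true and y_pred standing in for the hold-out labels and predictions:

```python
# UAR is simply the unweighted (macro) average of per-class recall.
from sklearn.metrics import recall_score, f1_score, confusion_matrix

uar = recall_score(y_true, y_pred, average="macro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
cm = confusion_matrix(y_true, y_pred)   # used below for per-class error analysis
```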

Model Evaluation

Results for the top ten best-performing models from both approaches, sorted from the highest UAR score to the lowest.

It is apparent that the SVC and XGBoost models displayed superior performance, with UAR scores exceeding 0.854. Notably, the models using a larger number of features performed better. Furthermore, the UAR, Macro F1, and Weighted F1 scores are relatively similar, indicating that these models predict the target emotion consistently.

As seen from the confusion matrix, the best-performing model classifies each class quite accurately, except that fearful songs are sometimes predicted as sad, and sad songs as fearful or calm. During error analysis, even I as a human sometimes found it difficult to tell these songs apart, as they share slow tempos and repetition.

Project MoodGenius Overview

Project Emolysis requires more varied and customizable audio content for training its model. Getting new data is difficult and expensive, and existing audio libraries may not be helpful. This project aims to explore AI-generated audio to create more training data for future analysis.

I explored two models and obtained favorable results from both: GANSynth and Riffusion. The data used for GANSynth was sourced from Project Emolysis, using the full-length audio clips rather than the split 5-second clips.

GANSynth

GANSynth, a model developed by Google AI, is capable of synthesizing input audio through Generative Adversarial Networks (GANs) trained on the NSynth dataset, which contains approximately 305k musical notes. By modelling log magnitudes and instantaneous frequencies with sufficient frequency resolution in the spectral domain, GANSynth can generate high-fidelity and locally coherent audio.

How GANSynth works: a MIDI file is fed into the model, and a duration (in seconds) per randomly sampled instrument is chosen. The model then interpolates between these random instruments, with larger values producing slower, smoother transitions. The generated audio for each note is then combined into the final synthesized clip.

Problem: one issue with this approach is that an external tool was used to convert the mp3 audio from Project Emolysis into MIDI files. Errors can occur during this conversion, where the melody is not fully captured in the MIDI file, leading to off-tune results.

Riffusion

Riffusion is a neural network, designed by Seth Forsgren and Hayk Martiros, that generates music using spectrograms (i.e. images of sound) rather than raw audio waveforms. By fine-tuning Stable Diffusion, the result is a model that generates spectrograms from text prompts, which can then be converted back into audio by inverting the spectrogram (an approximate inverse Fourier transform, since the phase must be estimated). Although the generated audio files are short, typically lasting only a few seconds, the model can also use the latent space between outputs to blend different files together harmoniously.

With the assistance of a musical prompt generator, any given prompt can be extended to up to 30 words by incorporating relevant musical terms that enrich the description.
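
A heavily simplified sketch of generating a clip with the publicly released Riffusion checkpoint through the diffusers library; the model ID is the community release on Hugging Face, and the spectrogram-to-audio step below (mel inversion via librosa) is a rough approximation of Riffusion’s own converter, with assumed scaling constants.

```python
# Generate a spectrogram image from a text prompt, then approximately invert it
# back to audio. The dB scaling and image-to-mel mapping are assumptions.
import numpy as np
import librosa
import soundfile as sf
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("riffusion/riffusion-model-v1")
image = pipe("happy music, upbeat ukulele, bright piano").images[0]

spec = np.array(image.convert("L"), dtype=np.float32)[::-1, :]   # flip so low frequencies sit at the bottom
mel_power = librosa.db_to_power(spec / 255.0 * 80.0 - 80.0)      # rough dB-to-power mapping (assumed)
audio = librosa.feature.inverse.mel_to_audio(mel_power, sr=44100, n_fft=2048, hop_length=512)
sf.write("riffusion_clip.wav", audio, 44100)
```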

Samples

The following are some tunes generated by the various models:

GANSynth generated tune

Riffusion generated tune — prompt: “happy music”

Here are some of the poorly generated tunes:

GANSynth generated tune

Riffusion generated tune — prompt: scary haunted fearful music

Putting it Together

A web service was built using FastAPI to demonstrate the output of the classifier. By posting the file path of an audio clip to a locally hosted API, the model processes the clip and classifies it into one of the five emotions.
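
A sketch of what such a service could look like; the endpoint name, model file, and feature extraction below are illustrative placeholders rather than the report’s actual code.

```python
# Minimal FastAPI service: POST a file path, get back a predicted emotion.
# The model file name and feature extraction are assumed placeholders.
import joblib
import librosa
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("emotion_classifier.joblib")    # assumed: the trained SVC pipeline

class ClipRequest(BaseModel):
    path: str                                        # local file path of the audio clip

def extract_features(y, sr):
    # placeholder: mean MFCCs (a real service would mirror the training features)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

@app.post("/classify")
def classify(req: ClipRequest):
    y, sr = librosa.load(req.path, sr=22050)
    features = extract_features(y, sr)
    prediction = model.predict([features])[0]
    return {"emotion": str(prediction)}
```

Such a service would typically be launched with uvicorn, matching the locally hosted API described above.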

Conclusion

In conclusion, I am thankful for the opportunity to apply the knowledge and skills I have gained in school to a variety of complex problems during this internship. The performance of the models in both projects could be improved in future work by exploring more emotions and experimenting with other deep learning models.
