Music Mood Classification using Neural Networks and Spotify’s Web API

Understanding best practices in a Multi-Class Classification Machine Learning workflow


The relationship between music and emotion is well documented, but if you’re more than a casual listener of music, you probably already have an idea of how deeply connected these two notions are. Be it listening to Energizing music in the morning to kick-start the day, Ambient music while working, or googling about sad songs when going through a heartbreak, many of us, at some point in our lives, have used music to elicit an emotion or to find closure about an existing emotion.

In this article, we’ll create a classification model to predict the emotion invoked by a song: it will take audio features of the song as input and output the corresponding mood label. Spotify’s Web API provides the audio features, of which the following will be used as inputs (the explanations of what each feature stands for are taken from the API documentation):

  • Energy: Represents a perceptual measure of intensity and activity
  • Liveness: Detects the presence of an audience in the recording
  • Tempo: The overall estimated tempo of a track in beats per minute (BPM)
  • Speechiness: Detects the presence of spoken words in a track
  • Acousticness: A confidence measure of whether the track is acoustic
  • Instrumentalness: Predicts whether a track contains no vocals
  • Danceability: Describes how suitable a track is for dancing
  • Duration: The duration of the track in milliseconds
  • Loudness: The overall loudness of a track in decibels (dB)
  • Valence: Describes the musical positiveness conveyed by a track

Most of these attributes take values between 0.0 and 1.0; tempo, duration, and loudness are measured on different scales (BPM, milliseconds, and dB respectively), which is one more reason to standardize the data before modeling. The moods that we’ll be using are: Energetic, Relaxing, Dark, Aggressive, and Happy. The scikit-learn library will help us create and analyze our models, and split our data.

This article is inspired by Ting Neo’s and Cristóbal Veas’s respective works on Music Mood Classification. I hope that this tutorial helps beginners get more clarity on the ML workflow, at least regarding switching among various model choices for multi-class classification problems with numeric input features.

I’d like the main takeaway from this article to be that ML is not always about running models and evaluating accuracy. Interpreting and analyzing our classifier will give us important insights on how we can change our dataset to suit our needs.

Getting the Data

The data is obtained from Spotify’s user-created mood-based playlists using the spotipy Python library. To start with, let’s get the audio features of one playlist’s worth of songs for each mood. For now, our dataset has 484 tracks.
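As a minimal sketch of that fetching step, assuming an authenticated spotipy.Spotify client `sp` (the helper names `feature_row` and `playlist_rows` are mine, not part of the original code):

```python
# The ten audio features used as model inputs, keyed as Spotify's API names them
FEATURE_KEYS = ("energy", "liveness", "tempo", "speechiness", "acousticness",
                "instrumentalness", "danceability", "duration_ms",
                "loudness", "valence")

def feature_row(features):
    """Extract the ten model inputs from one Spotify audio-features dict."""
    return [features[k] for k in FEATURE_KEYS]

def playlist_rows(sp, playlist_id, mood):
    """Given an authenticated spotipy.Spotify client `sp`, return
    (feature_row, mood) pairs for every track in the playlist."""
    items = sp.playlist_items(playlist_id)["items"]
    ids = [it["track"]["id"] for it in items if it["track"]]
    # audio_features returns None for tracks it cannot analyze, so filter those out
    return [(feature_row(f), mood) for f in sp.audio_features(ids) if f]
```

Calling `playlist_rows` once per mood playlist and concatenating the results gives the dataset used below.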

Now, we’ll split the data into training and test sets. Later we’ll use the training set to cross-validate. The test set will be used to evaluate our model after we have optimized it in cross-validation. The training set has 324 tracks and the test set has 160 tracks.

from sklearn.model_selection import train_test_split

trainx, testx, trainy, testy = train_test_split(data, moods, test_size=0.33, random_state=42, stratify=moods)

Using the stratify parameter ensures that the class proportions in the train and test sets match those of the full dataset.

Exploratory Analysis

Let’s take a look at the class distributions in our training and test set.

Train class distribution:
Dark          67
Relaxing      67
Energetic     67
Happy         66
Aggressive    56

Test class distribution:
Relaxing      33
Happy         33
Dark          33
Energetic     33
Aggressive    28

We seem to have an almost equal distribution of all moods in our training and test set.

Now let’s look at the mean value of all our attributes for each mood:

As expected, Aggressive and Energetic songs have high energy, as opposed to Relaxing songs, which have low energy. Relaxing music also has a much slower tempo and tends to be more instrumental than other moods. Scanning across the features, you’ll notice that Relaxing is a mood that can easily be discriminated from the others, but we’ll have to work a little harder to discriminate among the rest.

Exploring our data has thus given us an interesting insight, which we can later use to tweak our predictions. For example, if our model gives poor results for Relaxing songs, we can add heuristics for classifying a song as “Relaxing” instead of relying on our classifier’s predictions.

Creating our First model

It's better to start simple and get a baseline model in place before optimizing it. So let's try Logistic Regression. For preprocessing, we’ll scale the data to a standard normal distribution.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_scaled = scaler.fit_transform(trainx)
logreg = LogisticRegression(max_iter=2000)
scores = cross_val_score(logreg, train_scaled, trainy, cv=5)
print(scores.mean())

We get an accuracy of 65% using cross-validation. Here, our earlier “training” set of 324 tracks has been split so that 260 of those are used for training in a particular iteration of cross-validation and 64 are used for validation. Logistic regression has a regularization parameter called C which can be optimized using GridSearchCV. This gives a minor improvement in accuracy to 66%.
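The C search can be sketched as follows; here synthetic data stands in for our scaled features, and the grid values are illustrative, not the ones used in the article:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for our 10-feature, 5-mood training data
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=5, random_state=42)
X = StandardScaler().fit_transform(X)

# C is the inverse of regularization strength: smaller C = stronger regularization
search = GridSearchCV(LogisticRegression(max_iter=2000),
                      {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```

GridSearchCV refits the best model on the full training set, so `search.best_estimator_` can be used directly afterwards.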

Interpreting the model

We can interpret our model by looking at the importance given to each feature (which differs for each mood). This can be obtained by exponentiating the coefficients of our logistic regression (e^coef), which converts the log-odds coefficients into odds ratios. The following shows the variable with the highest importance for each of our moods.

Aggressive    speechiness
Dark          acousticness
Energetic     energy
Happy         valence
Relaxing      instrumentalness
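A sketch of how these importances can be computed; random data stands in for our features here, so the rankings it prints are not the article’s:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Illustrative stand-in data; in the article, trainx/trainy come from Spotify
rng = np.random.default_rng(42)
cols = ["energy", "liveness", "tempo", "speechiness", "acousticness",
        "instrumentalness", "danceability", "duration_ms", "loudness", "valence"]
X = pd.DataFrame(rng.normal(size=(300, 10)), columns=cols)
y = rng.choice(["Aggressive", "Dark", "Energetic", "Happy", "Relaxing"], size=300)

logreg = LogisticRegression(max_iter=2000).fit(X, y)

# exp(coef) turns each class's log-odds coefficients into odds ratios;
# the largest value in each row is that mood's most influential feature
importance = pd.DataFrame(np.exp(logreg.coef_), index=logreg.classes_, columns=cols)
print(importance.idxmax(axis=1))
```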

It’s no surprise that the “energy” attribute is a good discriminator of whether a song is “Energetic” or not. Aggressive songs tend to have a high level of speechiness (it’s a bit difficult for instrumental songs to be aggressive). Dark songs are best discriminated by a high level of acousticness. For Happy songs, valence is the dominant mood-deciding feature (consistent with valence’s representation of a track’s musical positiveness), while an instrumental song has a higher chance of being “Relaxing”.

Note that this interpretation is for our current dataset of 484 songs. While it seems to be consistent with the general intuition we have of music, a more accurate and general prediction can be obtained by increasing the size of our dataset.

The second model — Neural Network

We now have a baseline accuracy (66%) using cross-validation and can move on to neural networks. The first decision we have to make is the architecture of our NN. A common rule of thumb is to start with one hidden layer and take the number of neurons in it as the average of the number of units in the input and output layers. Our input has 10 units and the output layer has 5, so it is suitable to start with 8 units in our hidden layer.

NN architecture generated using NN SVG

This particular architecture gives us a CV accuracy of 66%, the same as that of the logistic regression classifier.

We can optimize the hyperparameters “alpha”, denoting the amount of regularization, and the number of neurons in the only hidden layer of our NN. We get a cross-validation accuracy of 67% with 10 neurons in the hidden layer and alpha = 0.1.
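As a sketch of that search (synthetic data stands in for our scaled training set, and the grid values shown are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for our scaled 10-feature, 5-mood training data
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=5, random_state=42)
X = StandardScaler().fit_transform(X)

param_grid = {"hidden_layer_sizes": [(8,), (10,)],  # neurons in the single hidden layer
              "alpha": [0.1, 1.0]}                  # L2 regularization strength
search = GridSearchCV(MLPClassifier(max_iter=1000, random_state=42),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```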

Analyzing our model

In order to figure out the next steps, it makes sense to plot the training and validation accuracy as we vary the size of our dataset. This plot is called a learning curve, and it helps us determine whether adding more data will improve accuracy.

Based on the learning curve, we can conclude that adding more data would seem to help since the training score and validations score curves have not yet converged.
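A learning curve like this can be generated with scikit-learn’s learning_curve; here synthetic data stands in for ours:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for our scaled 10-feature, 5-mood training data
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=5, random_state=42)
X = StandardScaler().fit_transform(X)

# Fit the model on growing fractions of the data, cross-validating each time
sizes, train_scores, val_scores = learning_curve(
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000, random_state=42),
    X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

# If the two mean curves have not converged, more data is likely to help
print(sizes)
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))
```

Plotting the two mean-score arrays against `sizes` gives the learning curve itself.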

So now let’s get two playlists’ worth of tracks for each mood. This gives us a total of 914 tracks, with 490 used for training, 122 for validation, and 301 for testing. Hyperparameter optimization gives us a 69% accuracy with alpha = 1.0 and 100 neurons in the hidden layer.

However, having 100 neurons in the hidden layer makes us question whether we are overfitting on our dataset. And on comparing the training and validation accuracies, we find that is indeed the case.

Train      : 79%
Validation : 69%

Let's switch back to a model with 8 hidden neurons.

Train      : 71%
Validation : 66%

This is a more general classifier, since the gap between training and validation accuracy is smaller.

Another point to note is that although our current CV accuracy (66%) is less than that obtained on a smaller dataset (67%), we can expect the current model to have better generalization since it is trained on a dataset that is almost double in size. And we can verify that by comparing the results on the test set: 70% using our current model and 61% using the earlier model.

So our final classifier has an accuracy of 70% on the test set.

Error Metrics

In addition to accuracy, it would be useful to look at other error metrics like Precision and Recall, which in the case of multi-class classification can be represented as a confusion matrix.
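A sketch of computing that confusion matrix, with hypothetical labels standing in for our test set and its predictions:

```python
from sklearn.metrics import classification_report, confusion_matrix

labels = ["Aggressive", "Dark", "Energetic", "Happy", "Relaxing"]
# Illustrative true/predicted moods in place of testy and model.predict(test_scaled)
y_true = ["Happy", "Dark", "Happy", "Relaxing", "Energetic", "Dark"]
y_pred = ["Happy", "Happy", "Dark", "Relaxing", "Energetic", "Dark"]

# Rows are true moods, columns are predicted moods; off-diagonal cells
# show which moods get confused with each other
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# classification_report adds per-class precision and recall
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```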

The major area of interest is Energetic tracks being classified as “Dark” since these have the highest number of incorrect classifications. The most obvious thing that could be fixed, on the other hand, is “Happy” being classified as “Dark” and vice versa.

Conclusion and Future Work

We went through the workflow of an ML process in this article. We started by exploring our data and figuring out which attributes help in discriminating among the moods. During modeling, we took a simple Logistic Regression classifier and then moved on to Neural Networks. Having performed hyperparameter optimization, we observed how it can sometimes cause overfitting. And finally, we analyzed our model to figure out if adding more data will help it generalize better.

There are a number of improvements that can be done:

  1. Vary our training dataset and include more “Energetic” songs since they are the ones being misclassified the most.
  2. Try out other models like SVMs and Random Forest.
  3. Perform feature engineering to create features that are better at discriminating moods.

I hope this article helps you in getting a better understanding of Multi-Class Classification. The associated code is available at this GitHub repository.





Karan Singh

Music and Tech enthusiast
