Predicting Spotify Song Popularity with Machine Learning

What AI has to say about what makes a song a smash hit

Code AI Blogs
CodeAI
4 min read · Aug 24, 2021


Introduction

Songs released by popular singers tend to be chart-toppers, but what about the occasional track put out by a lesser known artist that becomes a smash hit seemingly at random?

My goal is to create a machine learning model that can predict a song’s popularity.

To do this, I’ll look at features like danceability, energy, and even speechiness.

Data Source

This is the dataset that I’ll be using:

Note: the original source is now a dead link; the above link is the new source. As the dataset is updated over time, there may be differences between the dataset used in this article and the one available at the above link.

This dataset contains data on tracks released from 1921 to 2020. The data entries include the following features:

  • id: ID of track generated by Spotify
  • id_artists: ID of artist generated by Spotify
  • acousticness: ranges from 0 to 1
  • danceability: ranges from 0 to 1
  • energy: ranges from 0 to 1
  • duration_ms: duration of track in milliseconds; integer typically ranges from 200k to 300k
  • instrumentalness: ranges from 0 to 1
  • valence: how happy the song is; ranges from 0 to 1
  • popularity: ranges from 0 to 100
  • tempo: float typically ranging from 50 to 150
  • liveness: ranges from 0 to 1
  • loudness: float typically ranging from -60 to 0
  • speechiness: ranges from 0 to 1
  • year: ranges from 1921 to 2020
  • mode: 0 = minor, 1 = major
  • explicit: 0 = no explicit content, 1 = explicit content
  • key: all keys of the octave encoded as values ranging from 0 to 11, starting with C as 0, C# as 1, etc.
  • artists: artist’s name
  • release_date: date of release
  • name: name of song

This is the same dataset I used for my previous article on Spotify data exploration, which you can check out for some data analysis:

Setup

First things first, I’ll import the Python libraries that we’ll need throughout this project and load the dataset into a Pandas DataFrame:
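
A minimal sketch of this setup (the tracks.csv filename is a placeholder for wherever you've saved the dataset):

```python
# Core libraries for data handling, modeling, and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

# Load the Spotify tracks dataset into a DataFrame
df = pd.read_csv("tracks.csv")  # placeholder path
print(df.shape)
df.head()
```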

For this project, I’ll be using the k-nearest neighbors (KNN) algorithm. This algorithm first finds the k closest points to the data point of interest; the predicted popularity for that point is then the average of the popularities of those k neighbors.
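
Here’s a toy illustration of that idea, using made-up feature values and popularities rather than the real dataset:

```python
# Toy data: two made-up features per track and a popularity score
X_toy = np.array([[0.2, 0.7], [0.3, 0.6], [0.8, 0.1], [0.9, 0.2]])
y_toy = np.array([40, 50, 80, 90])

# With k = 2, the prediction for a new point is the mean popularity
# of its two nearest neighbors in feature space
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X_toy, y_toy)
print(knn.predict([[0.25, 0.65]]))  # -> [45.], i.e. (40 + 50) / 2
```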

Data Preprocessing

With everything set up, we’ll move on to preparing our dataset for our machine learning model!

To start, we’ll drop all of the non-quantitative attributes, as the k-nearest neighbors regressor algorithm requires quantitative features. We’ll also remove all songs released before 2016 so that our dataset is more reflective of current trends.
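
A sketch of this step; the exact set of dropped columns is my guess based on the feature list above:

```python
# Keep only tracks released in 2016 or later
df = df[df["year"] >= 2016]

# Drop non-quantitative columns; KNN distances need numeric features only
df = df.drop(columns=["id", "id_artists", "artists", "release_date", "name"])
```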

We’ll also normalize the dataset so that every variable falls in the range from 0 to 1. This keeps features with large scales, like duration in milliseconds, from dominating the distance calculation, so all of our features carry comparable weight in the predictor.
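
One simple way to do this is a column-wise min-max rescaling (this assumes every remaining column is numeric):

```python
# Min-max rescaling: every column, including popularity, ends up in [0, 1]
df = (df - df.min()) / (df.max() - df.min())
```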

Now, we’ll randomly split our data into training and testing sets in an 8:2 ratio. This will allow us to determine the final error of our model when predicting on previously unseen data. We’ll also make a validation set from the training set. This validation set will be used to determine the optimal value of k for our model later on.
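
A sketch of the splits; the validation split fraction and the random seeds are my own choices:

```python
# Hold out 20% of the data as the final test set
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Carve a validation set out of the remaining training data
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42)
```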

We’ll next separate the data into X and Y (a code sketch follows the list below).

  • X: all the columns that our model will use to predict popularity (i.e. valence, acousticness, loudness, etc.)
  • Y: popularity column
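
A small helper along these lines (split_xy is my own naming) handles the separation for all three sets:

```python
# Separate a DataFrame into features (X) and the popularity target (Y)
def split_xy(frame):
    X = frame.drop(columns=["popularity"])
    y = frame["popularity"]
    return X, y

X_train, y_train = split_xy(train_df)
X_val, y_val = split_xy(val_df)
X_test, y_test = split_xy(test_df)
```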

Finally, we’ll define our error function: the mean squared error between the predicted popularity and the actual popularity.
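
A minimal version of this error function:

```python
# Mean squared error between predicted and actual (normalized) popularity
def mse(y_pred, y_true):
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    return np.mean((y_pred - y_true) ** 2)
```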

Training the Model

Now we’re ready to train our predictor!

We first have to determine the optimal value of k using the training and validation sets:
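
A sketch of the search loop (the exact range of candidate k values is my assumption, chosen so that it covers the k = 49 discussed below):

```python
# Try each candidate k: fit on the training set, score on the validation set
k_values = range(1, 50)
val_errors = []
for k in k_values:
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train, y_train)
    val_errors.append(mse(knn.predict(X_val), y_val))
```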

Using our results from the previous code cell, we’ll plot the k values against their respective errors.
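
A possible Matplotlib snippet for this plot (labels and styling are my own):

```python
# Validation error as a function of k
plt.plot(list(k_values), val_errors, marker="o")
plt.xlabel("k (number of neighbors)")
plt.ylabel("Validation MSE")
plt.title("Choosing k for the KNN regressor")
plt.show()
```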

From the plot, we can see that the error consistently decreases as the value of k goes up, with the lowest error corresponding to a k value of 49. If we simply chose this lowest-error value, however, we would risk overfitting our model to our dataset.

Overfitting means that we’re training our model to be overly specific to our dataset, to the point that our model is a poor predictor for unseen data.

A rule of thumb for determining the best k value for the k-nearest neighbors algorithm is to look for the elbow of the plot. This elbow is where the error stops decreasing rapidly.

In our case, the elbow is approximately at k = 8, as the plot subsequently levels off. k = 6 is also a good candidate for the elbow, but either candidate should perform well for our purposes.

Going with k = 8, we’ll now train our final model using the training set and determine our model’s final error using the test set.
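
A sketch of the final training and evaluation step:

```python
# Final model: k = 8, trained on the training set, evaluated on the test set
final_knn = KNeighborsRegressor(n_neighbors=8)
final_knn.fit(X_train, y_train)

test_error = mse(final_knn.predict(X_test), y_test)
print(f"Test MSE: {test_error:.5f}")
```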

Our predictor’s final error on the test set is 0.03357 (on the normalized 0 to 1 popularity scale), which makes for a pretty good model!
