CLASSIFYING SPOTIFY MUSIC PREFERENCES: A COMPARISON OF ALGORITHMS
What if there was a way to get your computer to tell if you would like a song or not based on the particular features of that song? Most music streaming apps use algorithms that recommend songs based on the artist and genre, but what if we could recommend a song based on its intricate attributes, like the tempo, or the song’s ‘danceability’? It just might be that songs picked with precision would be more appealing to a listener’s specific tastes.
Here, I explore different machine learning algorithms on the Kaggle “Spotify Song Attributes” dataset and try to form a model that can decide if a user would like a song.
This dataset contains 2017 different songs, each with associated attributes and a label. The song attributes include features like ‘energy’, ‘loudness’, and ‘valence’, which are all used in the algorithms below for classification. Each song is either classified ‘1’ (to mean a user liked the song) or ‘0’ (to mean a user didn’t like it). Our goal is to be able to predict if a user likes or doesn’t like a song based on its attributes. I do so testing out four different algorithms below — k-nearest neighbors, logistic regression, lasso regression, and gradient boosting — and compare their scores on the test data.
The training data is defined to have feature vectors x1, x2, …, xN ∈ R13 and corresponding target values y1, y2, …, yN ∈ {0, 1}. To prepare the data, I removed the ‘song title’, ‘artist’, and ID columns, normalized the leftover data, and converted it into an array. I split the data into an 80% training set and 20% validation set for testing. In each method below, I test my predictions from scratch by dividing the number of correct predictions with the total number of predictions.
The first classification algorithm used is K-nearest neighbors. K-nearest neighbors is both a classification and regression algorithm used for predicting a test observation’s target value in a simple and intuitive way. The algorithm works by taking a test observation x0 and identifying the K points in the training data that are closest to x0. In a regression problem, the average of the y-values of those K points become x0’s predicted value. In a classification problem, the K-nearest neighbors’ y-values take a vote so that x0 is classified to the class with the greatest number of votes.
Here, I implement K-nearest neighbors from scratch using K = 10. The algorithm starts by sweeping through the song test data row-by-row, then it calculates the Euclidean distance between that test example and each song in the training data, then sorts those distances from smallest to largest in terms of their index in the training data, and finally has the first 10 take a vote to decide which class the test example belongs to. The predictions for all of the text examples are stored in a list.
The algorithm produced an accuracy score of 69.8%, which isn’t too bad, but this score could surely be improved from at least one of the algorithms below. After testing multiple values for K, 10 appears to be best; anything above or below it results in a lower score.
The second classification algorithm used is binary logistic regression with stochastic gradient descent. In logistic regression, we model the relationship between two qualitative variables using a logistic model to perform classification between two classes (0 and 1). We define:
where x^i is an augmented feature vector. Unlike K-nearest neighbors, the logistic model outputs the probability that a test example belongs to class 1. For a feature vector xi, we call its predicted probability f(xi) and define it to be:
σ(u) is our logistic function used for converting a scalar into a probability. We hope f(xi) is consistent with the ground truth probability yi for i = 1, 2,…, N. In order to do so, we must find β for which
is smallest. (We use the cross-entropy loss function to measure the difference between our predicted probability and our ground truth probability.)
In this case, we minimize L(β) using stochastic gradient descent, which incrementally moves β towards the direction of steepest descent (the opposite direction of the gradient of L(β)). Stochastic gradient descent starts by initializing a value for β called β0 . Then, for every sweep through the training dataset, or epoch, we randomly select an i at random and update β to be:
for t = 1, 2, 3,… up to however many epochs we’d like to perform. Here, α is our learning rate and the gradient of Li(β) is just the gradient of l(yi,f(xi)). Using ∇Li(β) instead of ∇L(β) is less taxing on our algorithm compared to normal gradient descent, and introduces randomness to the model. This helps to prevent overfitting since it inhibits the model’s ability too follow the noise in our training data too closely.
When we have reached our final β value, we can now classify our test data as follows: for each test example xi, if our predicted probability f(xi) ≥ 12 then we classify it as class 1, otherwise we classify xi to class 0.
In the code, I use scikit-learn’s implementation for logistic regression using both normal and stochastic gradient descent, as well as my own implementation from scratch. β is initialized to be all zeros, alpha to be 0.001, and the number of epochs to be 150. The code also computes the cost function at every iteration, and plots the resulting values. To compare differences in accuracy over the number of epochs, I test the algorithm using anywhere from 50 to 500 epochs, and plot each accuracy score versus the number of iterations.
In my comparison of the accuracy versus number of iterations of SGD, you can see how more iterations caused lower accuracy in Figure 1 below. The decrease in accuracy is a sign of overfitting — with more iterations, our model was more prone to following errors, thus creating a fit only desirable for the training data and not the test data. To curb this, the model stops after 150 iterations, since it seems to be both the point of convergence in Figure 2, and a high accuracy point in Figure 1.
Unexpectedly, both scikit-learn’s implementation and my own did not beat the score of K-nearest neighbors. My SGD implementation also received a higher score than scikit-learn’s implementation (😊). Sci-kit learn’s regular logistic regression algorithm yielded 53.8% accuracy, their SGD algorithm produced 51.2% accuracy, and mine got 58.5% accuracy.
In this section, we introduce a shrinking method called Lasso which attaches a regularization term, or penalty, to the original logistic regression model to further reduce overfitting. Before, we chose β with the goal of minimizing L(β). Now, we will choose β to minimize L(β) + λ∥β∥1. This second term is called the l1-regularization term R(β), and represents the number of nonzero β coefficients. We want this number to be small so we have very little nonzero β coefficients to promote sparsity, and let the model focus on features that are actually helpful for performing accurate predictions.This is called feature selection, and it naturally reduces overfitting.
Adding R(β), however, requires a new minimization procedure since the term is not differentiable. Gradient descent won’t work here, so we look to the proximal gradient method. This method has us initialize β as before, but update beta as follows:
for t = 0, 1, 2, … where the proximal operator of the l1 norm is defined to be
The clip function cuts β^ down to αλ if it is outside the range of |αλ|.
Otherwise, β is left as is. Like gradient descent, β is continuously updated until the equation above converges to its minimum.
I implement the algorithm from scratch setting λ = 20, α = 0.0001, and the number of iterations to 60. At every iteration, β is updated as described above, and the cost function values of L(β) at each iteration are calculated and stored in a list to be plotted in a graph later.
The accuracy improves to 67.3% after performing Lasso, but it isn’t much. Figure 3 below shows L(β) converging after about 30 iterations, but to be on the safe side I let the algorithm run through 60 iterations. In Figure 4, about 7 coefficients of β can be seen to have values nearly equal to 0, demonstrating its sparsity. β could be made more sparse by increasing λ, but having too large a λ could make the model prone to underfitting, and therefore reduce the accuracy. We can see, however, that the model finds the danceability, duration, energy, key, mode, tempo, and valence of a song to be the most important features for classifying whether a person likes it or not.
The last model we use is called gradient boosting. Gradient boosting is a regression and classification method that takes the form of an ensemble of decision trees. A decision tree is itself a prediction model that takes an input variable, makes observations about it (represented by branches), and makes a conclusion about its target value (represented by leaves) in a flow-chart-like structure. An ensemble of decision trees is merely a combination of all of these models so that the final target value of an input variable is the average of its target value in each tree. A decision tree is represented as T(xi) where xi is the input variable.
Gradient boosting for binary classification starts by defining the training set to be feature vectors x1, x2, …, xN ∈ R13 and target values y1, y2, …, yN ∈ {0,1}. Our predicted target value for an observation is
where Tj is a regression tree. Similar to logistic regression, our goal is to choose Tm(xi) which minimizes the loss function
This can be further rewritten using Newton’s Approximation:
If we set the residual ri = α(yi − σ(fm−1(xi))), then one could observe that the algorithm merely finds Tm by fitting a regression tree to the residuals ri.
For simplicity, I use the gradient boosting algorithm for classification provided by XGBoost. XGBoost includes a regularization term in its computation to control overfitting and is known to out-perform the gradient boosting machine both in speed and accuracy. After testing out multiple learning rates and multiple values for n_estimators (the number of trees in the model), a learning rate of 0.01 and n_estimators of 1500 worked best.
XGBoost proved to be the best algorithm for this example, which isn’t surprising since it’s known to be a particularly great classification and regression algorithm. The highest accuracy score with this model so far is 80.4%.
Overall, if I were to rank the algorithms on which best predicted a person’s song preferences in this dataset, it would be:
(1) Gradient Boosting: 80.4%
(2) K-Nearest Neighbors: 69.8%
(3) Logistic Regression with Lasso: 67.3%
(4) My Logistic Regression with SGD: 58.5%
(5) Sci-Kit Learn’s Logistic Regression with GD: 53.8%
(6) Sci-Kit Learn’s Logistic Regression with SGD: 51.2%
I find it funny that this dataset is best modelled by the simplest classification algorithm and the most complicated classification algorithm. Even then, we were able to predict 80% of a person’s song preferences correctly and that’s a solid B-. Of course, there’s room for improvement here and I invite the next data analyst to do just that, whether it’s by upgrading the algorithms used above, or by creating a completely different model.