Modeling data for a Spotify Recommender System

David Hernandez
10 min read · Dec 15, 2021


The Data

For this project we are using the Million Playlist Dataset (MPD) released by Spotify. As its name implies, the dataset consists of one million playlists; each playlist contains a variable number of songs along with additional metadata such as the playlist title, duration, number of songs, number of artists, etc.

This dataset was created by sampling playlists from the billions of playlists that Spotify users have created over the years. Playlists that meet the following criteria were selected at random:

  • Created by a user that resides in the United States and is at least 13 years old
  • Was a public playlist at the time the MPD was generated
  • Contains at least 5 tracks
  • Contains no more than 250 tracks
  • Contains at least 3 unique artists
  • Contains at least 2 unique albums
  • Has no local tracks (local tracks are non-Spotify tracks that a user has on their local device)
  • Has at least one follower (not including the creator)
  • Was created after January 1, 2010 and before December 1, 2017
  • Does not have an offensive title
  • Does not have an adult-oriented title if the playlist was created by a user under 18 years of age

As you can imagine, a million of anything is too large to handle locally, so we are going to use 2% of the data (20,000 playlists) for the analysis, modeling and deployment. Once we can prove everything works the way we want, we will scale it up on an AWS instance.
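To make the sampling step concrete, here is a minimal loading sketch, assuming the standard MPD dump layout of 1,000 JSON slice files (1,000 playlists each) extracted into a local folder; the data/ path and the slice count are placeholders.

```python
import json
from pathlib import Path

MPD_DIR = Path("data")   # placeholder path to the extracted MPD slice files
N_SLICES = 20            # 20 slices x 1,000 playlists = 20,000 playlists (2%)

playlists = []
for slice_file in sorted(MPD_DIR.glob("mpd.slice.*.json"))[:N_SLICES]:
    with open(slice_file) as f:
        playlists.extend(json.load(f)["playlists"])

print(f"Loaded {len(playlists)} playlists")
```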

Pipeline

Below is the proposed pipeline, showing the inputs, processes and outputs of our recommendation system. It gives a big-picture view of the steps needed to build the system.

Pipeline

Enhancing the data

Since this dataset is released by Spotify, it includes a track_id for every song that can be used to make API calls and retrieve the information Spotify provides about a given song, artist or user. For these API calls we are using Spotipy, a lightweight Python library for the Spotify Web API.

These are some of the audio features available for each song. I will use them to enhance the dataset, to match the user’s favorite playlist and to build the model (a retrieval sketch follows this list). Most of them are measured on a scale of 0–1:

  • Danceability: a measurement of how “danceable” a given song is.
  • Energy: perceptual measure of intensity and activity.
  • Instrumentalness: whether a song contains no vocals (pure instrumental).
  • Liveness: presence of an audience in the recording, e.g. at a concert.
  • Loudness: how loud a song is, in dB.
  • Mode: Minor or Major mode.
  • Speechiness: presence of words in a song.
  • Tempo: Beats per minute (BPM).
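To show how this retrieval could look with Spotipy, here is a minimal sketch; the credentials come from the SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET environment variables, the helper name and the FEATURE_KEYS selection are my own, and the batches of 100 match the per-call limit of the audio-features endpoint.

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Reads SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET from the environment
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

FEATURE_KEYS = ["danceability", "energy", "instrumentalness", "liveness",
                "loudness", "mode", "speechiness", "tempo"]

def audio_features_for(track_uris):
    """Fetch audio features for a list of track URIs, 100 at a time."""
    rows = []
    for i in range(0, len(track_uris), 100):
        batch = sp.audio_features(track_uris[i:i + 100])
        rows.extend({k: feat[k] for k in FEATURE_KEYS} for feat in batch if feat)
    return rows
```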

Shaping the data

The 20,000 playlists were run through the Spotify API to retrieve the audio features for every song in every playlist. This in itself is a very time-consuming operation due to the latency of the API calls: to give you an idea, it took about 40 hours to collect the features for all songs in this sample dataset.

Once the features were collected, they were averaged per playlist to obtain playlist-level audio features. With this, the dataset was reduced to a single row per playlist representing that playlist's audio features.

Playlist audio features data frame

In a similar way, the user's favorite songs were collected and the same process was applied: the mean audio features across those favorites form a single vector y that serves as the target for the recommendations.

User’s favorite songs audio features
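A minimal sketch of this averaging step, assuming the per-track features sit in pandas DataFrames with a pid column identifying the playlist each track belongs to (the column and function names are illustrative):

```python
import pandas as pd

FEATURE_COLS = ["danceability", "energy", "instrumentalness", "liveness",
                "loudness", "mode", "speechiness", "tempo"]

def playlist_vectors(song_features: pd.DataFrame) -> pd.DataFrame:
    """One mean feature vector per playlist (one row per pid)."""
    return song_features.groupby("pid")[FEATURE_COLS].mean()

def user_target_vector(favorite_features: pd.DataFrame) -> pd.Series:
    """Mean audio features of the user's favorite songs: the target vector y."""
    return favorite_features[FEATURE_COLS].mean()
```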

Modeling

I decided to go with an unsupervised approach: cluster the data, predict the cluster for a given user, and draw recommendations only from that cluster.

From here, different clustering algorithms will be trained and evaluated to decide which one performs best on the data at hand.

Eight different clustering algorithms were selected. They can be split into two groups: one that does not take the number of clusters as an input parameter (density models) and another that takes a number of clusters k as an input parameter (centroid-based models).

Density Models (models without K number of clusters as parameter):

  • Affinity Propagation
  • DBSCAN
  • OPTICS

Centroid Models (models with K number of clusters as parameter):

  • KMeans
  • Birch
  • Agglomerative
  • Gaussian Mixture
  • Spectral Clustering

For the latter group, I am going to fit each model with k=2 to k=100 and score the resulting clusters with 3 different metrics: Silhouette, Davies-Bouldin and Calinski-Harabasz.

As always, the data is scaled before doing any processing.

Data Projection

With clustering, it is a good idea to project the data into 2D or 3D to see whether the clustering results make sense. For this data, t-SNE is used to project the data into a 2D space.
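A minimal projection sketch with scikit-learn's TSNE; the random matrix below is only a stand-in for the scaled playlist feature matrix so the snippet runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in for the scaled playlist feature matrix (one row per playlist)
X = np.random.default_rng(0).normal(size=(500, 8))

embedding = TSNE(n_components=2, random_state=42).fit_transform(X)

plt.scatter(embedding[:, 0], embedding[:, 1], s=5, alpha=0.5)
plt.title("t-SNE projection of the playlist feature vectors")
plt.show()
```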

TSNE projection of the dataset

This big blob is a 2D representation of all the playlists in the dataset. Let's start with the clustering to see what we can find out.

Clustering

For the density models, the clustering did not work as expected even after fine-tuning the parameters. I decided not to move forward with this family of models for two main reasons: parameter tuning gets complicated, and they do not scale with the data. Training these models can take a considerable amount of time and memory, even on a big AWS instance (days of compute and double-digit GB of RAM).

Density models

Each of the centroid-based models was trained with k=2 to k=100, and each iteration was scored with Silhouette, Davies-Bouldin and Calinski-Harabasz. Spectral Clustering was discarded because it does not scale well with the size of our dataset.

The Silhouette score ranges from −1 to +1, where a high value indicates that an object is well matched to its own cluster and poorly matched to neighboring clusters.

The Davies-Bouldin index is defined as the average similarity of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances. The minimum score is zero, and lower values indicate better clustering.

The Calinski-Harabasz index is the ratio of the between-cluster dispersion to the within-cluster dispersion for all clusters; the higher the score, the better the clustering.
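To make the scoring loop concrete, here is a hedged sketch for KMeans only (the same loop applies to Birch, AgglomerativeClustering and GaussianMixture); the random matrix stands in for the real playlist features, and the scaling step mentioned earlier is included.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

rng = np.random.default_rng(42)
X = StandardScaler().fit_transform(rng.normal(size=(500, 8)))  # stand-in features

scores = {}
for k in range(2, 101):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = {
        "silhouette": silhouette_score(X, labels),                # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    }

best_k = min(scores, key=lambda k: scores[k]["davies_bouldin"])
print(f"Best k by Davies-Bouldin: {best_k}")
```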

From the scores below, it can be seen that the Davies-Bouldin index discriminates best between the models across different values of k. All the models achieve good clustering scores at around 15–20 clusters.

Clustering Scores

To visualize each model with its best k, the cluster assignments are projected onto the earlier t-SNE representation. From these results it can be seen that KMeans creates the best clusters: it has the lowest Davies-Bouldin score, and the visualization shows a better separation between clusters than the rest of the algorithms.

It is fascinating that one of the simplest and most straightforward models is the best performer. This is good news for our analysis and our data, since KMeans scales well to the full one million playlists.

Clustering scores
Model Clustering

Cluster Exploration

To explore each cluster, the playlist titles were extracted to discover what type of music it contains. Below are the top 10 most common playlist titles for each cluster.

Top 10 most common playlist titles in each cluster
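A small sketch of how such a per-cluster title count could be produced, assuming aligned sequences of playlist titles and predicted cluster labels (all names here are illustrative):

```python
from collections import Counter
import pandas as pd

def top_titles_per_cluster(titles, labels, n=10):
    """Return the n most common (normalized) playlist titles for each cluster."""
    df = pd.DataFrame({"title": [t.strip().lower() for t in titles],
                       "cluster": labels})
    return {cluster: Counter(group["title"]).most_common(n)
            for cluster, group in df.groupby("cluster")}
```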

The clusters formed make perfect sense: examining them, classical music is in Cluster 4, gospel and religious music in Cluster 9, Latino music in Cluster 12, rock and heavy metal in Cluster 3, and so on.

The model is powerful enough that, out of 20,000 playlists, it was able to separate the playlists that contain stand-up comedy. Even more surprising, there are only 8 such playlists in the whole dataset, yet the model gave them their own cluster. This is mind-blowing!

Word clouds for each cluster

Two clusters caught my attention right away: Cluster 9 and Cluster 15. They were interesting and unexpected findings, yet in hindsight they make sense, because they are different enough to separate from the main blob and form their own categories.

Cluster 9: Gospel and Cluster 15: Stand-up comedy

Music Recommendation based on Clustering

After clustering the data and exploring the results, the clusters make sense and are of good quality.

The music recommendation is going to be based on clustering. We know that each cluster contains similar music, and this is a key part of our recommender system: each cluster serves as a hard boundary, which lets us work freely with all the information inside a given cluster.

With this hard boundary in place, we can use different metrics and distances without the risk of recommending music that is unrelated to the user's taste, while also alleviating the issue of always recommending music that is too similar to the user's preferences.

The procedure for this recommender system is to gather the user's favorite songs and compute their mean features. With our previously trained model, we can then predict which cluster the user falls into, and compute similarity only against the playlists belonging to that cluster.

There are six different ways to retrieve relevant playlists, built from three similarity metrics (Euclidean, Manhattan and cosine): for each metric, either the top n most similar playlists or the top n most “dissimilar” playlists can be suggested to the user. With the dissimilar approach, we can surface playlists that are far from the user but still within the hard boundary of the cluster.

Once the top n playlists in the cluster are selected, the audio features are extracted for each song in those playlists. Each song is then compared to the target vector y and its variance from y is calculated. With this information a new playlist is generated containing the top n songs with the least variance from the target vector y (the user's favorite songs).
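Here is a hedged end-to-end sketch of that procedure; the array names, the dissimilar flag and the use of mean squared deviation as the per-song "variance" are my own assumptions about the implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def recommend(kmeans, playlist_vecs, playlist_ids, song_vecs, song_ids,
              song_pids, y, metric="cosine", n_playlists=10, n_songs=30,
              dissimilar=False):
    """Cluster-bounded recommendation sketch.

    playlist_vecs: (n_playlists, n_features) mean features per playlist (the KMeans training data)
    song_vecs:     (n_tracks, n_features) per-track features; song_pids maps each track
                   to the playlist it belongs to. y is the user's mean feature vector.
    """
    # 1. Predict the user's cluster from the target vector y.
    user_cluster = kmeans.predict(y.reshape(1, -1))[0]

    # 2. Hard boundary: keep only playlists assigned to that cluster.
    in_cluster = kmeans.labels_ == user_cluster
    candidates = playlist_vecs[in_cluster]
    candidate_ids = np.asarray(playlist_ids)[in_cluster]

    # 3. Rank candidates by distance to y ("euclidean", "cityblock" or "cosine");
    #    reverse the order for the "dissimilar but still in-cluster" variant.
    dists = cdist(y.reshape(1, -1), candidates, metric=metric)[0]
    order = np.argsort(dists)[::-1] if dissimilar else np.argsort(dists)
    top_playlists = set(candidate_ids[order[:n_playlists]])

    # 4. Keep the songs from those playlists with the smallest deviation from y.
    mask = np.isin(song_pids, list(top_playlists))
    deviation = ((song_vecs[mask] - y) ** 2).mean(axis=1)
    return np.asarray(song_ids)[mask][np.argsort(deviation)[:n_songs]]
```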

As an example, here is the prediction for a given user's favorite songs, together with the most similar and most dissimilar playlists in the 2D vector space.

Similar and Dissimilar playlists for a given user
Successfully created and pushed playlist to Spotify

Bias

So far I have shown you the whole process: gathering the data, enhancing it, shaping it, training and evaluating the model, and giving the user recommendations. Everything works as expected, and the multiple times I have tried this with inputs from other users it has been successful. However, there is a big problem with this dataset: it is heavily biased towards the US market. As explained at the beginning, it contains only playlists from US users.

This is an issue when trying to give recommendations to users who listen to music that differs considerably from the US market. A user whose music is not represented in this dataset will be disappointed with the recommendations, because none of the recommended music will come from the artists and scenes they actually listen to.

This might be obvious to some: how can you provide recommendations from data you don't have? It is easy to see now, after evaluating the system with people from different parts of the world, but at the beginning we never thought this would be an issue.

This problem is not a deal breaker for the product we built, because at the end of the day we are recommending music, and nothing bad happens if a recommendation is not 100% tailored to your taste. However, I now know that if we want to build a product that can serve people worldwide, more data is needed.

It is imperative for us, and for every data scientist out there, to understand the data and the bias it carries, because tomorrow we will be building products that can affect people's lives, and we definitely don't want to leave anyone behind or be unfair.

Future work

This dataset has too much potential to stop here, so I would like to list the possible extensions and future work that can be done with it.

Supervised learning:

  • The data has labels in the form of playlist names; a classifier could be built with this information so that a user inputs a playlist name and the classifier selects candidate playlists to recommend.
  • An NLP approach could be developed in a similar way, processing all the playlist titles and predicting relevant playlists from a title entered by the user.

Unsupervised learning:

  • By collecting the lyrics for all the songs, a model like BERT can be used to retrieve those songs based on a word or phrase input by the user.
  • Read the playlists a user follows, extract their titles, run topic modeling to find the top 3 topics, and recommend a playlist based on the results.

Deep learning:

  • A deep learning approach could be applied by constructing an RNN or a CNN to generate new playlists from the playlists in the dataset; there is potentially a relationship in the order songs were added to each playlist.
