Muse: A Music Recommendation System

Dec 16, 2018

By: Brendan Ngo, Charlie Yeng, Nick Amaya, and Nick Nieman

Abstract

With tens of thousands of new songs released each year, users often find it difficult to discover music that matches their interests. Streaming music services have attempted to tackle this issue with recommendation systems based on machine learning techniques. While those companies have made notable progress, we hope to further improve the quality and diversity of recommendations. Our approach involved implementing several models, including k-nearest neighbors, item-item collaborative filtering, and matrix factorization; we evaluated each individually and then ensembled their results.

Our work is loosely based on the Million Song Dataset Kaggle competition listed in the reference section. We used the competition as a starting point to obtain data and structure the output of our models.

Background/Summary

GOAL

We aim to identify the best solutions for music recommendation. Building on the work of previous researchers, we intend to build and test a variety of models and determine which performs best with little user information. The models are trained on half the listening history of many users and attempt to predict the remaining half.

DATA

We utilized the Million Song Dataset, which offers a large variety of features. The core dataset consists of millions of triplets, each representing a user, a song they listened to, and the number of times they listened to that song. Partner datasets link in additional properties, such as song attributes (energy, danceability, tempo, etc.).

METHOD/APPROACH

Before working with the data or building any models, we researched the field of recommendation systems and identified a few key categories: collaborative filtering, content filtering, model-based systems, memory-based systems, user-based systems, and item-based systems. See the hierarchy in Figure 1 to better understand how the categories relate to each other. We chose a variety of models covering as many of those categories as possible (boxes in burnt orange). Most of our models focus on collaborative filtering, since such models tend to perform much better than their content filtering counterparts.

Figure 1. Recommendation System Model Categories

Once we determined our models, we built an elementary version of each. After a model successfully output a list of recommendations, we tuned it and added more complex functionality; in matrix factorization, for example, we added bias terms and a regularization parameter. Once we had a full series of models, we ensembled them.

Each model output a list of 500 recommendations per user. These lists were then used to calculate a variety of metrics to determine which models scored best under which scenarios.

DEFINITIONS

Content Filtering: Content-based filtering uses the similarity between items in a system, which in our case are songs. The system analyzes items a user has rated highly and recommends similar ones. For example, if a user enjoys rock music, a content filtering model will continue to recommend rock music. While this is effective at ensuring recommended items match a user’s tastes, the model cannot recommend new types or genres of items to the user.

Collaborative Filtering: Collaborative filtering mimics how recommendations generally travel: word of mouth. In essence, collaborative filtering algorithms recommend items to a user based on items that other, similar users prefer, where similar users are users with similar item rankings. For example, if user A likes items X, Y, and Z and user B likes items W, X, Y, and Z, a collaborative filtering model will recommend W to user A. Because of this community-based approach, collaborative filtering models often outperform their content filtering counterparts.

Memory-Based System: Memory-based recommendation systems rely on previous data to determine recommendations for new users. They use relatively simple techniques, such as Pearson correlation coefficients or cosine similarity, to measure similarity to previous data points and make a recommendation (e.g., KNN).

Model-Based System: Model-based systems model each user and item, such that the probability a user will like an item can be determined from the model. This is true even if other users have not seen the item before, which is useful for systems where items are frequently added to the item list (i.e. new song releases). This method only directly uses the data set for training and not for making recommendations. These models can also be faster and save storage space compared to memory-based systems.

User-Based System: In a user-based system, the items a user has already liked are used by the model to find a community of users that have similar interests. Users that like similar items will be in the same community. Once communities are established, item recommendations for a user in the community are pulled from items other users in the community liked.

Item-Based System: In an item-based system, communities of items are formed rather than communities of users. These communities are created based on users’ preferences, but ultimately become independent of any single user’s preferences; the system therefore works only if we assume users’ preferences vary very little over time. Once communities of items are established, users that like a song in a specific community are recommended other songs in that community.

APPLICATION

Our proposed recommendation system easily lends itself to modern applications. Major music corporations like Apple, Spotify, and Pandora have long been using some form of machine learning to assist in their recommendation process. Our system could be used to supplement these systems.

More than likely, robust recommendation systems are ensembles of various models and techniques. As a result, our model’s recommendations could easily fold into existing recommendation models.

PERFORMANCE

We evaluated the performance of our models with two metrics. The first is recall, which Kaggle used to score submissions. It is calculated by dividing the number of correctly predicted songs (those that appear in the second half of a user’s listening history) by the total number of songs in that second half. We also measured precision, which uses the same numerator but divides by the total number of recommended songs.

We calculated these metrics for each of our models with various numbers of recommendations (ranging from 10 to 500).
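As a concrete illustration, here is a minimal sketch of how the two metrics can be computed per user (the function and variable names are ours, not from the project code):

```python
def precision_recall(recommended, actual):
    """Score one user's recommendations.

    recommended: ranked list of recommended song IDs
    actual:      set of song IDs in the held-out second half of the
                 user's listening history
    """
    hits = sum(1 for song in recommended if song in actual)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall

# Example: 3 of 5 recommendations appear in a 6-song held-out half.
print(precision_recall(list("abcde"), set("acexyz")))  # (0.6, 0.5)
```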

DATA

Our data comes from the Million Song Dataset (MSD) provided by The Echo Nest. It consists of both song and user data. The song dataset is composed of 28 datasets containing audio features and metadata for a million contemporary popular music tracks. We also had access to the entire listening history of one million users for training and half the listening history of 110,000 users for testing, with the goal of predicting the other half of their listening history.

SIZING THE DATASET

The entire dataset contains about 280 GB of data, which is too large to download directly. To work around this, we used a subset of the dataset, containing 10,000 songs, for local testing. We also utilized cloud resources to gain full access to the data: the full dataset is hosted on AWS as part of its Public Dataset Program, so we used an AWS EC2 instance and attached the dataset as an EBS volume.

Most of the data is composed of the songs and their features. These include song information (artist, release year, duration, etc.), song features (tempo, time signature, pitch, lyrics, etc.), and derived features (hotness, danceability, etc.). The data is embedded in a collection of compressed HDF5 files, where each file contains all the information for a single song. We did not plan on using all the features provided, and we found that iterating over the dataset files was extremely slow. As a result, we created a SQLite database containing a subset of the features for each song: the track and song IDs, title, release year, artist name, duration, artist familiarity and hotness, danceability, energy, tempo, loudness, and song hotness. With this database, we could query the information we needed much more efficiently, converting 280 GB of data into a 1.3 MB database.
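A minimal sketch of that conversion step, assuming the MSD’s per-song HDF5 layout (metadata, analysis, and musicbrainz tables; the triple-t “hotttnesss” spelling is the dataset’s own) and an illustrative input path:

```python
import glob
import sqlite3

import h5py  # the MSD ships one compressed HDF5 file per song

db = sqlite3.connect("songs.db")
db.execute("""CREATE TABLE IF NOT EXISTS songs (
    track_id TEXT PRIMARY KEY, song_id TEXT, title TEXT, year INTEGER,
    artist_name TEXT, duration REAL, artist_familiarity REAL,
    artist_hotttnesss REAL, danceability REAL, energy REAL,
    tempo REAL, loudness REAL, song_hotttnesss REAL)""")

def fill(x):
    # NaN (x != x) marks a missing feature; store it as -1 so each
    # downstream model can decide how to treat missing data.
    return -1.0 if x != x else float(x)

for path in glob.glob("MillionSongSubset/**/*.h5", recursive=True):
    with h5py.File(path, "r") as f:
        meta = f["metadata"]["songs"][0]   # title, artist, hotness, ...
        ana = f["analysis"]["songs"][0]    # duration, tempo, loudness, ...
        year = int(f["musicbrainz"]["songs"][0]["year"]) or -1  # 0 = unknown
        db.execute(
            "INSERT OR IGNORE INTO songs VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)",
            (ana["track_id"].decode(), meta["song_id"].decode(),
             meta["title"].decode(), year, meta["artist_name"].decode(),
             fill(ana["duration"]), fill(meta["artist_familiarity"]),
             fill(meta["artist_hotttnesss"]), fill(meta["danceability"]),
             fill(meta["energy"]), fill(ana["tempo"]), fill(ana["loudness"]),
             fill(meta["song_hotttnesss"])))
db.commit()
```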

ANALYZING THE DATA

We first looked at the dataset in aggregate and computed basic statistics (count, mean, standard deviation, min, max) for each of the numerical fields, summarized in the following tables.

Table 1. Basic statistics on the 10,000 song subset
Table 2. Basic statistics on the full 1 million song dataset

We then plotted some of the fields to get a sense of the distribution of the data. As shown in Figure 2, key fields such as artist hotness, tempo, artist familiarity, song hotness, and song duration all follow a roughly Gaussian distribution, which can become important depending on the learning model. The only feature that isn’t Gaussian is year, which is expected given the increase in music production over the past few decades.

Figure 2. Histograms of key numerical fields

We also looked at the correlation between some of the variables we thought might be important (Figure 3). We found artist familiarity and song hotness to be slightly correlated, which makes sense, as popular artists generally create popular songs that reach a large audience.

Figure 3. Song Hotness Correlation Graphs

DATA PROCESSING

We processed much of our data during the creation of our database using feature selection. We omitted certain features based on intuition and kept the ones we thought would have the most impact, such as song and artist hotness, artist familiarity, and danceability. We also found many missing values in the dataset, such as NaNs and a year of 0. When creating the database, we accounted for this by filling in missing values with -1, so we could change how we handle missing data on a per-model basis. Due to the sheer size of the dataset and the number of features available to us, we were not able to fully analyze the data. We used intuition and basic analysis to process it, but, given more time, we would like to revisit this step.

LEARNING/MODELING

The large size of the Million Song Dataset lends itself well to machine learning techniques. We tried a wide range of techniques in the collaborative filtering and content filtering realms to analyze which performed best. A common issue we faced was model training time: our models took a substantial amount of time to run, even on a small subset of the data. The models we used included logistic regression, a user-user model (k-nearest neighbors), an item-item model, matrix factorization, and ensembling. Additionally, we ran models using random song selection and popularity ranking as baselines for evaluation.

RANDOM MODEL

As a baseline model to compare against, the random model recommended songs to a user by randomly selecting songs from the whole set. Including this model in our analysis allowed us to paint a better picture of our other models. If another model failed to significantly beat the random model, we knew the model was ineffective.

POPULARITY MODEL

Our first non-random approach was simple: to maximize the probability of finding a song in the second half of a user’s listening history, recommend the songs most listened to in the first half. We therefore created a model that maintains a list of all songs sorted by how frequently each song appears in users’ listening histories, then recommends the top songs on that list that the user hasn’t listened to. We dubbed this the popularity model, since a song’s popularity is the only criterion it considers.
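A minimal sketch of this model, assuming the training triplets fit in memory (the names are ours):

```python
from collections import Counter

def popularity_recommend(train_triplets, user_history, n=500):
    """Recommend the n most popular songs the user hasn't heard.

    train_triplets: iterable of (user, song, play_count) tuples
    user_history:   set of songs in the user's known listening half
    """
    listeners = Counter()
    for _, song, _ in train_triplets:
        listeners[song] += 1  # rank by how many histories a song appears in
    ranked = (song for song, _ in listeners.most_common())
    return [song for song in ranked if song not in user_history][:n]
```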

LOGISTIC REGRESSION

Next, we tried a content-based model that predicts the probability a user will listen to a song based only on the song’s features (artist, album, tempo, etc.). To do this, we trained a logistic regression model for each user on the song features, where the label was one if the user had listened to the song and zero otherwise. We then used these per-user models to recommend songs by predicting the probability that the user will listen to each song.
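A sketch of the per-user model using scikit-learn; the feature matrix and names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def recommend_for_user(song_features, listened_idx, n=500):
    """Content-based recommendations via one user's logistic regression.

    song_features: (num_songs, num_features) array of song features
                   (tempo, duration, hotness, ...), missing values at -1
    listened_idx:  indices of songs in the user's training history
    """
    y = np.zeros(len(song_features))
    y[listened_idx] = 1  # label: 1 if the user listened, 0 otherwise
    clf = LogisticRegression(max_iter=1000).fit(song_features, y)
    scores = clf.predict_proba(song_features)[:, 1]  # P(user listens)
    scores[listened_idx] = -np.inf  # don't re-recommend known songs
    return np.argsort(scores)[::-1][:n]
```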

USER-USER MODEL (KNN)

Figure 4. User-to-User Connection Graph

We implemented a collaborative algorithm based on nearest neighbors. Intuitively, if two users have similar listening histories, they likely have similar tastes in music and can recommend new songs to each other. Using this assumption, we made recommendations by finding closely related users based on their listening histories and recommending songs from each other’s histories. For each user, we sort all other users in decreasing order of how many songs the two users have in common. Then, starting with the most similar users, we recommend songs the original user has not listened to until we have made all our recommendations.
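A minimal sketch of that procedure, with each listening history held in a set (a brute-force version; at full dataset scale this would need indexing to be practical):

```python
def user_user_recommend(histories, target_user, n=500):
    """histories: dict mapping user -> set of songs they listened to."""
    target = histories[target_user]
    # Rank the other users by how many songs they share with the target.
    neighbors = sorted(
        (u for u in histories if u != target_user),
        key=lambda u: len(histories[u] & target),
        reverse=True,
    )
    recs, seen = [], set(target)
    for neighbor in neighbors:  # most similar users first
        for song in histories[neighbor]:
            if song not in seen:
                recs.append(song)
                seen.add(song)
                if len(recs) == n:
                    return recs
    return recs
```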

SONG-SONG MODEL (KNN)

Our next model analyzes similarities between songs, derived from user listening histories. For example, if several users all like the same five songs, our model dubs those songs similar. The approach utilizes a co-occurrence matrix with the user’s songs along the rows and all songs in the database along the columns; each cell is populated with a similarity score based on the listeners the two songs share. The scores are then summed and sorted from greatest to least, and the top recommendations are selected. The problem with this approach is the sheer number of calculations required to generate each user’s recommendations: each song is compared against roughly 300,000+ other songs, so the computational cost grows rapidly. Our solution was to precompute the similarity matrix and perform simple lookups by song index, reducing the amount of computation.
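A dense-matrix sketch of the idea; the write-up does not name the similarity score, so this uses the Jaccard index over shared listeners as one plausible choice (practical only when the matrix fits in memory):

```python
import numpy as np

def item_item_recommend(R, user_songs, n=500):
    """R: binary user-song matrix (num_users x num_songs), 1 = listened.

    user_songs: column indices of the target user's songs.
    """
    co = R.T @ R                 # co[i, j] = listeners shared by songs i, j
    listeners = np.diag(co)      # listeners per song
    # Jaccard similarity: shared listeners / listeners of either song.
    union = listeners[:, None] + listeners[None, :] - co
    sim = co / np.maximum(union, 1)
    scores = sim[user_songs].sum(axis=0)  # aggregate over the user's songs
    scores[user_songs] = -np.inf          # don't re-recommend known songs
    return np.argsort(scores)[::-1][:n]
```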

MATRIX FACTORIZATION

At its core, matrix factorization (MF) is a collaborative filtering, model-based recommendation system. It uses similarities between various users’ song preferences to recommend new songs: two users with similar tastes will each be recommended music the other has liked. While this resembles memory-based models such as KNN, the model-based nature of MF generally results in increased accuracy and other benefits.

Before beginning, it’s important to describe the structure the model operates under. It assumes there exists a matrix A in which each row is a user and each column is a song; the cell at a given (user, song) coordinate holds the rating that user has given that song. Since the training data consists of hundreds of thousands of users and songs, the matrix is very large but very sparse. MF decomposes the information in this matrix into user vectors and song vectors; see Figure 5 below, which visualizes the decomposition structure.

Figure 5. Matrix Factorization Structure

Each entry in a user or song vector is a latent factor, and together the latent factors characterize that user or song. Even with billions of users and songs, it is improbable that two users or two songs would have equivalent latent factors. In Figure 5, there are two latent factors (1.2 and 0.8 for user A). It’s also important to note that each latent factor means the same thing across users and songs: the first latent factor of user A may represent how much that user likes high-tempo songs, while song W’s first latent factor describes how high-tempo that song is. To calculate how much a user likes a song, take the dot product between the latent factors of the user and the song. For example, the degree to which user C likes song Z is 1.5 * 0.8 + 1.0 * 0.4 = 1.6.

Once this same score is calculated for every user-song combo, the MF model just selects the highest rated songs for each user.

The difficult part is creating these vectors. MF determines the latent factors in the user and song vectors using coordinate descent. For the ratings provided by users, a cost function measures how well our model predicts the rating a user gives a song. For example, in Figure 5, user C rates song Z at 2.0, but our model predicts 1.6. With this information, we can create the cost function below, where A_ij is the actual rating of a (user, song) pair, Â_ij is our model’s predicted rating, i is the user index, and j is the song index.
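In standard notation, a squared-error cost of the following form matches this description (our reconstruction; Ω is the set of observed (user, song) pairs, and u_i and v_j are the user and song vectors):

$$J = \sum_{(i,j) \in \Omega} \left( A_{ij} - \hat{A}_{ij} \right)^2, \qquad \hat{A}_{ij} = u_i^\top v_j$$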

The algorithm begins by filling the user and song vectors with random noise, then iteratively updates both using the loss function and coordinate descent. The update equations for the user and song vectors are given below.
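Alternating least-squares updates of the following standard form are consistent with this procedure (again our reconstruction; Ω_i is the set of songs rated by user i, and Ω_j the set of users who rated song j):

$$u_i \leftarrow \Big( \sum_{j \in \Omega_i} v_j v_j^\top \Big)^{-1} \sum_{j \in \Omega_i} A_{ij} \, v_j, \qquad v_j \leftarrow \Big( \sum_{i \in \Omega_j} u_i u_i^\top \Big)^{-1} \sum_{i \in \Omega_j} A_{ij} \, u_i$$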

To increase the accuracy of our model, we added two bias terms, one for users and one for songs. These account for a user who gives abnormally low ratings to most songs, or a song that garners abnormally high ratings. We also included a regularization parameter in our final code. Like the user and song vectors, the user bias and song bias vectors are updated iteratively. The final set of equations is given below, where u is the user vector, v is the song vector, b is the user bias vector, c is the song bias vector, and lambda is the regularization parameter.
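With biases and regularization, predictions and updates of the following standard form are consistent with the description (our reconstruction, not necessarily the exact equations used):

$$\hat{A}_{ij} = u_i^\top v_j + b_i + c_j$$

$$u_i \leftarrow \Big( \sum_{j \in \Omega_i} v_j v_j^\top + \lambda I \Big)^{-1} \sum_{j \in \Omega_i} \left( A_{ij} - b_i - c_j \right) v_j, \qquad b_i \leftarrow \frac{\sum_{j \in \Omega_i} \left( A_{ij} - u_i^\top v_j - c_j \right)}{\lvert \Omega_i \rvert + \lambda}$$

with symmetric updates for v_j and c_j.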

These updates must be repeated many times (epochs) before the model converges. In our experience, 1,000 epochs were not enough to converge to a point; Figure 6 depicts recall still increasing with the number of epochs, so we can conclude the model is converging but needs more time to do so. The recall score was calculated with 500 recommendations per user. We are continuing to look into ways to speed up convergence, such as training on only the 100,000 most popular and 100,000 least popular songs.

Figure 6. Matrix Factorization Recall Score

ENSEMBLING

To improve our results, we attempted a couple of different ensembling techniques. The first uses a majority voting system but weights the importance of each model’s vote by its recall score, so more accurate models count for more. Since we generated several model types, we tried various combinations; the best performing was a weighted majority vote of our top three models: user-user, item-item, and popularity. We perform the weighted majority by taking the different intersections of the three models’ recommendation lists: we first add songs recommended by all three models, then songs in the intersections involving the user-user (KNN) model, and then the intersection of item-item and popularity. Finally, we fill the rest of the recommendations with songs from the user-user model that have not been recommended yet. The results are detailed in the results section below.
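A minimal sketch of that tiered intersection logic, assuming each model’s output is an ordered list of song IDs (the names and tie-breaking order are our reading of the description):

```python
def ensemble(user_user, item_item, popularity, n=500):
    """Weighted-majority ensemble of three ranked recommendation lists.

    Tiers follow the order described above: all three models first, then
    intersections involving the user-user (KNN) model, then item-item
    with popularity, then leftover user-user songs as filler.
    """
    uu, ii, pop = set(user_user), set(item_item), set(popularity)
    tiers = [uu & ii & pop, uu & ii, uu & pop, ii & pop, uu]
    ranked = user_user + item_item + popularity  # preserves each model's order
    recs, seen = [], set()
    for tier in tiers:
        for song in ranked:
            if song in tier and song not in seen:
                recs.append(song)
                seen.add(song)
                if len(recs) == n:
                    return recs
    return recs
```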

We intend to implement Borda count and other ensembling techniques in future work.

RESULTS

For the music recommendation problem, we want models that are good at predicting the next song a user will listen to given a small number of recommendations. In other words, we want a model to be able to predict as many relevant songs as possible in as few recommendations as possible. Bearing this in mind, we evaluated our models on two competing metrics: mean precision and mean recall.

Mean precision measures how well a model recommends relevant songs relative to the number of recommendations it returns. For example, if a model recommends 20 songs to a user and correctly predicts 15 of them, the model’s precision is 15 / 20 for that user. Mean precision is simply the average precision the model scored across all users.

In contrast to mean precision, mean recall is a measure of coverage. It measures a model based on how many songs a model recommends correctly in relation to the total number of correct songs the model could possibly recommend. In our problem, this means that if a model predicted 8 songs correctly for a given user and the user has 12 songs in the second half of his/her listening history, then the model covered 8 / 12 songs for that user. Again, mean recall is just the average recall the model scored across all users.

We scored our models on both of these metrics and plotted the results.

Figure 7. Model Precision Scores
Figure 8. Model Recall Scores

We first noticed that a model’s precision score decreases as the number of recommendations increases, while its recall score increases. This lines up with our intuition: precision penalizes a model for every extra recommendation it makes, while recall is measured against the fixed set of correct songs and can only benefit from additional recommendations.

Furthermore, both figures show that the neighborhood-based methods (user-user nearest neighbors and item-item) scored the highest, with user-user narrowly edging out item-item. The fact that both methods outperformed all of the other models tells us that our data is well suited to collaborative filtering, and future work on this data should consider building on top of these models. It’s also worth noting that the item-item model performed better than the other models when recommending only 10 items; we may therefore prefer item-item when recommending a few songs and user-user when recommending many.

Lastly, Figures 7 and 8 illustrate the strength of ensembling: in both metrics, the ensemble takes the top score in almost all cases. This is unsurprising given the well-known power of ensembling. Our ensemble took into account three models: user-user, item-item, and popularity. In future work, we’ll continue to try various combinations of models for ensembling, in addition to other techniques.

FUTURE WORK

Our work thus far has shown that collaborative filtering methods hold a lot of promise in the music recommendation problem domain. Combining these methods with current state-of-the-art recommendation models seems like the clear path forward. We are currently looking at combining a learning to rank model such as LambdaRank with our current lineup of models.

LambdaRank attempts to solve the song recommendation problem by learning an optimal ordering of songs and recommending the top songs of that ordering. The criterion LambdaRank optimizes is a standard information retrieval evaluation metric, such as mean average precision or mean reciprocal rank. In our case, we would optimize mean average precision, since that is the scoring metric of the Kaggle competition.

We can combine LambdaRank with our models by using our nearest neighbor model to grab the M most similar songs to a user. We map each song to its song embedding vector taken from our matrix factorization model, then match each embedding with a relevancy rating of that song for the user, which is simply the number of times the user has listened to it. Next, we feed those vectors and relevancy ratings into our LambdaRank model, which outputs the optimal ranking of those songs. Finally, we recommend the top N songs from this ranking to the user (where N < M).

CONCLUSION

The music recommendation field is still relatively new, but data scientists have devised several inventive techniques. In our project, we attempted to add to the research in this field by implementing and ensembling various models. We began by feature-engineering our large dataset, reducing it to a more manageable size, and utilizing an AWS instance to host the data. After researching and implementing several models, it’s evident that our collaborative filtering models generally outperform our content-based models. In particular, our neighborhood-based models (user-user and item-item) performed best individually, but it was our ensemble of several models that performed best overall. In future work, we will continue to tune our current models and implement more complex ones, such as LambdaMART.

PROJECT LINKS

https://github.com/bngo97/MuseMusicRecommender.git

REFERENCES

https://labrosa.ee.columbia.edu/millionsong/

https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf

https://towardsdatascience.com/how-to-build-a-simple-song-recommender-296fcbc8c85

https://lazyprogrammer.me/tutorial-on-collaborative-filtering-and-matrix-factorization-in-python/

https://www.kaggle.com/c/msdchallenge
