Building a Song Recommendation System using Cosine Similarity and Euclidean Distance

Mark Rethana
6 min read · Aug 15, 2018


Streaming services have given us an almost unlimited number of music options from many different eras. Personally, this has made it overwhelming to decide what to listen to when exploring different generations. To fix this, I set out to build a program that would add the songs I was most likely to enjoy, by any artist I wanted, to my Spotify library.

Connecting to Spotify Account and Creating User Profile

My first step was to create a user profile for myself based on my listening history. I could then compare this profile to every song by a given artist and add the most similar songs to my library. The Spotify API lets you access your most-listened-to songs over three time ranges: long term, medium term and short term. Each list is about 100 songs long, so I decided to use these 300 songs to create my user profile, since they would be an accurate depiction of the type of music I enjoy most. I used a Python library called Spotipy that simplifies the process of working with the Spotify API. Below is the code I used to connect to my personal Spotify account through Spotipy, create a dictionary of all my most-listened-to songs and add the feature values defined by Spotify to each song.
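The original code screenshot isn't reproduced here, but the step can be sketched roughly as follows. The Spotipy calls (`prompt_for_user_token`, `current_user_top_tracks`, `audio_features`) and the `user-top-read` scope are part of the Spotify/Spotipy API; the function names and dictionary layout are my own illustration.

```python
# Sketch of connecting to Spotify and building a top-tracks dictionary.
# Spotipy is imported lazily so the pure helper below runs without it.

def connect_spotify(username, client_id, client_secret, redirect_uri):
    """Authorize against the Spotify Web API with the user-top-read scope."""
    import spotipy            # third-party: pip install spotipy
    import spotipy.util as util
    token = util.prompt_for_user_token(
        username, scope='user-top-read', client_id=client_id,
        client_secret=client_secret, redirect_uri=redirect_uri)
    return spotipy.Spotify(auth=token)

def top_tracks(sp):
    """Gather most-played tracks across Spotify's three time ranges."""
    tracks = {}
    for time_range in ('long_term', 'medium_term', 'short_term'):
        page = sp.current_user_top_tracks(limit=50, time_range=time_range)
        for item in page['items']:
            tracks[item['id']] = {'name': item['name']}
    return tracks

def attach_features(tracks, feature_rows):
    """Merge audio-feature dicts (from sp.audio_features) into the tracks."""
    for row in feature_rows:
        if row and row.get('id') in tracks:
            tracks[row['id']].update(row)
    return tracks
```

`sp.audio_features` accepts up to 100 track IDs per call and returns one feature dict per track (valence, energy, key, tempo and so on), which `attach_features` folds back into the track dictionary.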

Using a similar method to the one shown above, I created a function that takes in an artist’s name and creates a dataframe of all of that artist’s songs. Now that I had all the data together, I needed to decide which features were most important to analyze. Every user has different music preferences, so I wanted to make sure that my model looked at the most relevant features for a specific user. For example, some people may care more about tempo, while others may care most about a song’s danceability. To account for this, I ran a linear regression on each feature to find out which features most influenced the songs I liked. I wrote a function that runs a regression on each feature and measures how much it influences the rank of a song in my most-listened-to tracks. The results of a given regression can be seen below. The three features with the highest R-squared values would be the features used in the user profile. For me the most important features were valence (positivity), key and energy.
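The per-feature regression step might be sketched like this (the function names and the `songs` data layout are my own illustration; for a single predictor, the regression R-squared equals the squared Pearson correlation, which keeps the sketch dependency-light):

```python
import numpy as np

def feature_r_squared(ranks, feature_values):
    """R-squared of a simple linear regression of track rank on one feature.
    With one predictor this equals the squared Pearson correlation."""
    r = np.corrcoef(ranks, feature_values)[0, 1]
    return r ** 2

def top_features(songs, feature_names, k=3):
    """Rank features by how strongly they predict a song's play rank.
    `songs` is a list of dicts with a 'rank' key plus one key per feature."""
    ranks = np.array([s['rank'] for s in songs], dtype=float)
    scores = {name: feature_r_squared(
                  ranks, np.array([s[name] for s in songs], dtype=float))
              for name in feature_names}
    # keep the k features that explain the most variance in rank
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

For the author's profile this selection produced valence, key and energy; a different user's listening history would surface different features.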

Euclidean Distance vs Cosine Similarity for Recommendations

From there I just needed to pull recommendations out of a given artist’s list of songs. I decided to test this out using John Lennon. The below chart shows a plot of every John Lennon song as an orange point in 3-D space, with my three features as the x, y and z axes (valence, energy and key, respectively). The blue dot in the plot represents the average feature value of all of my top songs. To find which Lennon songs were most similar to my profile, I decided to test out both Euclidean distance and cosine similarity.

I could have used a model-based collaborative filtering method, as most recommendation systems do. However, I wanted to get a deeper understanding of cosine similarity and Euclidean distance before moving on to those more complex methods. If you are interested in the various collaborative filtering methods, this article does a great job of explaining them: https://towardsdatascience.com/various-implementations-of-collaborative-filtering-100385c6dfe0. The image to the left comes directly from that article and does a great job of breaking down the possible options.

Cosine Similarity

Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them: cos(θ) = (A · B) / (‖A‖ ‖B‖), where A · B is the dot product and ‖A‖ is the magnitude of A. A simple visualization can be found below.

For my model, vector A remained constant as my user profile (the blue point from the earlier scatter plot). I then looped through all the John Lennon songs, plugging each in as vector B to calculate its cosine similarity to my profile. The concept shown above is the same for my example, except that my vectors are in a 3-D space; we are still calculating the cosine of the angle between two vectors. To do this in Python I used the code below.
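The original snippet isn't shown here, but a minimal version of that calculation might look like the following (the `most_similar` helper and the song-dictionary layout are my own illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|)"""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(profile, songs):
    """Sort song names by cosine similarity to the profile vector,
    most similar first. `songs` maps a name to its (valence, energy,
    key) feature vector."""
    return sorted(songs,
                  key=lambda name: cosine_similarity(profile, songs[name]),
                  reverse=True)
```

Note that scaling a vector leaves its cosine similarity unchanged: `cosine_similarity([1, 0], [5, 0])` is still 1.0, because only the angle matters.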

Basically, the cosine similarity is the dot product of two vectors divided by the product of their magnitudes. The dot product alone takes both the angle difference and the magnitudes into account; dividing it by the product of the magnitudes normalizes the vectors so that only the angle difference is measured. Cosine similarity is therefore the better measure when we want to ignore magnitude, while the raw dot product is appropriate when magnitude should count.

The cosine of a 0-degree angle is 1, so the closer the cosine similarity is to 1, the more similar the items are. Based on cosine similarity, the Lennon song closest to my profile was ‘Champions Suite: Grand National’.

Euclidean Distance

Euclidean distance is the straight-line distance between two points in Euclidean space: d = sqrt(Σ (aᵢ − bᵢ)²). This is the formula I used to compare each Lennon song with my profile. Line d in the below chart shows the Euclidean distance between the two points, while the angle between the vectors corresponds to their cosine similarity. As you can see, Euclidean distance takes the difference in magnitude into account. If only the x value of point B increased, the Euclidean distance would change, but the cosine similarity would stay the same. Based on Euclidean distance, the Lennon song closest to my profile was ‘Just Because’.
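In Python the distance calculation is a one-liner over the three feature dimensions; a sketch (the function name is my own):

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance: sqrt of the sum of squared
    coordinate differences between points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Unlike cosine similarity, this changes when a vector is scaled:
# (1, 0) and (2, 0) point the same way but are distance 1 apart.
```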

Conclusion

In my example magnitude does matter, so I decided to use Euclidean distance for my recommendations. Magnitude matters in this case because I want the songs closest to my user-profile averages, and the magnitude of each feature value drastically changes how a song sounds. Therefore, the five songs with the smallest Euclidean distance from my profile were the ones my model added to my Spotify library.
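The final selection step reduces to sorting the candidate songs by their distance from the profile and keeping the closest few; a sketch (the `recommend` helper and its data layout are my own illustration):

```python
import math

def recommend(profile, candidate_songs, n=5):
    """Return the n candidate songs closest (by Euclidean distance)
    to the profile vector. `candidate_songs` maps a song name to its
    feature vector."""
    def dist(name):
        return math.sqrt(sum((p - q) ** 2
                             for p, q in zip(profile, candidate_songs[name])))
    return sorted(candidate_songs, key=dist)[:n]
```

The returned names would then be passed back to the Spotify API to add the tracks to the library.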

Cosine similarity is more useful when you do not want magnitude to skew the results. This comes up most often in word vectorization, where normalizing the vectors makes a long document comparable to a short one. The Euclidean distance between documents of very different lengths will be large, which would skew the results.


Mark Rethana

Data scientist with a passion and curiosity for solving problems through analytics.