Searching for Similar Songs on Spotify — Data Science

Leonardo Mauro P. Moraes
Published in Sinch Blog
4 min read · Jan 12, 2021
Photo by C D-X on Unsplash

Idealized by Guilherme Muzzi.

Case Study

Have you ever wondered how we can recommend similar songs based on a song someone has listened to? In this article, we describe a simple method to do that.

Similarity is one answer. A similarity measure quantifies how alike two songs are, which lets us recommend songs to users based on what they listened to previously. In this study, we compared songs using broad features, such as acousticness and danceability.

Disclaimer. This is a simple case study of similarity. There are many state-of-the-art algorithms for song recommendation. However, this algorithm is an interesting and generic approach that can be used in many other situations. It can also serve as a baseline for your experiments.

Dataset

First, we need the data. We used the Spotify Song Attributes dataset, which contains 2,017 songs from GeorgeMcIntire’s playlist, along with song attributes from Spotify’s API.

Each song has a set of interesting attributes, such as acousticness, danceability, energy, instrumentalness, and loudness. A song is therefore represented by its set of attributes, i.e., its characteristics.
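To make this concrete, a song can be sketched as a plain feature vector. The attribute names below are the ones mentioned above, but the song title, the values, and the helper `to_vector` are made up for illustration and are not taken from the dataset.

```python
# Attribute names from the article; their order is an arbitrary choice for this sketch.
FEATURES = ["acousticness", "danceability", "energy", "instrumentalness", "loudness"]

# Hypothetical song record; the values are illustrative, not real Spotify data.
song = {
    "title": "Example Song",
    "acousticness": 0.10,
    "danceability": 0.80,
    "energy": 0.70,
    "instrumentalness": 0.00,
    "loudness": -5.0,
}

def to_vector(record, features=FEATURES):
    """Return the record's attributes as a tuple of floats, in a fixed order."""
    return tuple(float(record[name]) for name in features)

vector = to_vector(song)
print(vector)
```

Keeping the attribute order fixed matters: two songs can only be compared if position i means the same attribute in both vectors.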

Similarity

Okay, now we have the data. But what is similarity?

Similarity is a measure of how alike two objects are, for example in shape, in value, or in distance. If we treat our songs as data points, we can measure the similarity of two songs using a distance function. Wait… what? A distance function is a function f(x1, x2) that measures how close x1 and x2 are: the smaller the distance, the more similar they are.

Distance function in a Cartesian space

See the image above. The yellow ball is most similar to the blue ball because the blue ball is closer to it than any of the green balls. So, a distance function is used to compare the distances between the balls to find the most similar one. There are many distance functions; in our case study, we used Euclidean distance, represented by:

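Euclidean distance formula: d(x_1, x_2) = \sqrt{\sum_{i=1}^{n} (x_{1,i} - x_{2,i})^2}, where n is the number of attributes and x_{1,i} is the i-th attribute of x_1.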
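As a minimal sketch, the Euclidean distance can be written in a few lines of Python. The function name `euclidean_distance` and the two sample points are assumptions made for this example; the article does not show its own implementation.

```python
import math

def euclidean_distance(x1, x2):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(x1) != len(x2):
        raise ValueError("both points must have the same number of attributes")
    # Square the difference of each attribute pair, sum them, take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

# Two hypothetical points in a two-attribute space.
point_a = (0.0, 0.0)
point_b = (3.0, 4.0)
print(euclidean_distance(point_a, point_b))  # 5.0
```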

Algorithm

How can we search for similar songs in our dataset?

In this case, we applied the k-nearest neighbors (k-NN) algorithm. Given a query point, the k-NN searches for the elements most similar to it; for example, “find the songs most similar to this one”. Basically, the k-NN:

  1. Measures the distance from the query point to every song.
  2. Sorts the songs by proximity.
  3. Returns the k most similar songs.
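The three steps above can be sketched as a brute-force search. This is an illustrative version under stated assumptions, not the code used in the study: the function `k_nearest`, the song titles, and the two-attribute vectors are all hypothetical.

```python
import math

def euclidean_distance(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def k_nearest(query, songs, k):
    """Return the k (title, distance) pairs closest to the query vector.

    `songs` maps a title to its feature vector.
    """
    # 1. Measure the distance from the query point to every song.
    distances = [(title, euclidean_distance(query, vec)) for title, vec in songs.items()]
    # 2. Sort the songs by proximity (smallest distance first).
    distances.sort(key=lambda pair: pair[1])
    # 3. Return the k most similar songs.
    return distances[:k]

# Hypothetical songs as (danceability, energy) vectors; not real data.
songs = {
    "Song A": (0.9, 0.8),
    "Song B": (0.2, 0.1),
    "Song C": (0.8, 0.9),
    "Song D": (0.5, 0.5),
}
query = (0.9, 0.85)

for title, dist in k_nearest(query, songs, k=2):
    print(title, round(dist, 3))
```

If the query is itself one of the songs in the dataset, it comes back first with distance 0, so you may want to skip it or ask for k + 1 results.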

Song Recommendation

Now, we have a set of ~2000 songs and a k-NN algorithm that searches for the k most similar songs, using Euclidean distance as the similarity function. Let’s try!

We randomly selected Avril Lavigne — Complicated as a query point.

Avril Lavigne — Complicated

And the k-NN returned the three most similar songs: (1) DOLF — Fuck It All Up; (2) Sam F — Limitless; and (3) ASTR — Blue Hawaii; respectively. If you check the query point and the other songs, they are pretty similar — they have similar acousticness, energy, and rhythm.

Discussion. However, the third (ASTR — Blue Hawaii) is not so similar to the query point. This can happen when the dataset is small, so no closer match exists, or when the available attributes do not capture enough of what characterizes the song.

Still, we successfully built a simple song recommendation algorithm.

Other Queries

Is that it? No, we can also apply our algorithm to answer other queries, such as: (1) “What is the most active, cheerful song?” or (2) “What is the least active, least energized song?”. To do so, we have to create a synthetic query point.

To create a synthetic query point, we only have to set the attributes we care about. For example, to answer the first question, we can set the values of danceability, energy, and valence to the highest value possible. We then use this synthetic point as the query and search for the songs closest to it.
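A synthetic query can be sketched as follows. Two assumptions are made here that the article does not state: the distance is computed over the three chosen attributes only, and "highest" and "lowest" are taken to be 1.0 and 0.0. The songs, their values, and the helper `nearest_song` are hypothetical.

```python
import math

# Attributes used for the "most active, cheerful" query, as named in the article.
QUERY_FEATURES = ("danceability", "energy", "valence")

# Hypothetical songs; values are illustrative and assumed to lie in [0, 1].
songs = [
    {"title": "Song A", "danceability": 0.95, "energy": 0.90, "valence": 0.92},
    {"title": "Song B", "danceability": 0.20, "energy": 0.10, "valence": 0.05},
    {"title": "Song C", "danceability": 0.60, "energy": 0.70, "valence": 0.40},
]

def nearest_song(query, songs, features=QUERY_FEATURES):
    """Return the song whose selected attributes are closest to the query point."""
    def distance(song):
        return math.sqrt(sum((song[f] - query[f]) ** 2 for f in features))
    return min(songs, key=distance)

# Synthetic query points: every selected attribute at its assumed highest or lowest value.
most_cheerful_query = {f: 1.0 for f in QUERY_FEATURES}
least_energized_query = {f: 0.0 for f in QUERY_FEATURES}

print(nearest_song(most_cheerful_query, songs)["title"])
print(nearest_song(least_energized_query, songs)["title"])
```

The synthetic point does not need to correspond to any real song; it only marks the corner of the attribute space we want to search near.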

This query returned Gwen Stefani — Hollaback Girl as the most cheerful song.

Gwen Stefani — Hollaback Girl

And the opposite query (danceability, energy, and valence set to the lowest value possible) returned Nikolaus Harnoncourt — Mozart: Requiem in D Minor, K. 626: VIII. Lacrimosa as the least energized song.

Nikolaus Harnoncourt — Mozart: Requiem in D Minor, K. 626: VIII. Lacrimosa

Conclusion

Similarity functions are a powerful way to query a dataset and retrieve the elements most similar to a query point. In this way, we can use similarity to create a simple recommendation algorithm. Furthermore, we can explore many other things, such as (1) changing the distance function, (2) modifying the algorithm, or (3) creating other synthetic query points.
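As one example of the first idea, the distance function can be passed to the search as a parameter. The sketch below swaps in Manhattan distance, which is our choice for illustration and was not part of the case study; the songs and values are hypothetical, picked so that the two distances rank them differently.

```python
import math

def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def manhattan(x1, x2):
    # Sum of absolute differences per attribute.
    return sum(abs(a - b) for a, b in zip(x1, x2))

def k_nearest(query, songs, k, distance=euclidean):
    """Return the titles of the k songs closest to the query under `distance`."""
    ranked = sorted(songs, key=lambda title: distance(query, songs[title]))
    return ranked[:k]

# Hypothetical two-attribute songs.
songs = {"Song A": (0.0, 0.5), "Song B": (0.3, 0.3), "Song C": (1.0, 1.0)}
query = (0.0, 0.0)

print(k_nearest(query, songs, k=1))                      # closest under Euclidean distance
print(k_nearest(query, songs, k=1, distance=manhattan))  # closest under Manhattan distance
```

Here the nearest song changes from Song B to Song A when the distance function changes, which shows that the choice of distance is part of what "similar" means.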

Leonardo Mauro P. Moraes is a Machine Learning Engineer and Team Leader… working with data-related products