Finding similar contemporary songs

Published in INST414: Data Science Techniques

Apr 30, 2024

The question I set out to answer in this analysis is how we can identify similar songs. Major music streaming services such as Spotify, Apple Music, YouTube Music, and others likely rely on an answer to this question to decide which new music to recommend to their listeners. Data that could answer it would be a list of songs and their attributes, such as genre, artists, and measurements of how the songs sound.

I spent a number of hours trying to acquire this type of data. Many sources offer it, but some of the datasets are in inconvenient formats, have APIs with little documentation, or are hundreds of gigabytes in size. I tried to find a subset of the Million Song Dataset but eventually settled for a dataset of Spotify songs found on Kaggle (Spotify 1.2M+ Songs (kaggle.com)). The ground truth labels were created by Spotify and are included in the dataset; the same values can be found through the Spotify API.

This dataset was already very clean, since it seems to have been pulled directly from Spotify’s API, and I only needed to clean one part. The artist names for each song were stored as a single string formatted like a list, so I needed to convert them to actual lists in order to compare them with Jaccard similarity later.

row.artists = row.artists[1:-1].split(',')
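One thing to watch with the slice-and-split approach is that each piece keeps its inner quotes and any leading space, so the same artist can end up as two different strings. A minimal sketch of an alternative, assuming the column holds a Python-style list literal such as "['John Dowland', 'Dorothy Linell']"; `parse_artists` is a hypothetical helper name and is not part of the original notebook:

```python
import ast

def parse_artists(raw):
    """Turn a string like "['A', 'B']" into a set of clean artist names."""
    # literal_eval parses the list literal without executing arbitrary code
    names = ast.literal_eval(raw)
    return {name.strip() for name in names}

raw = "['John Dowland', 'Dorothy Linell']"

# Slice-and-split keeps the quotes and the space after the comma
naive = raw[1:-1].split(',')
print(naive)

# Parsing the literal yields plain names that compare cleanly
clean = parse_artists(raw)
print(sorted(clean))
```

Clean names matter for Jaccard similarity, because set intersection only counts strings that match exactly.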

Due to the size of the dataset, I took a random sample of just ten thousand rows using Pandas instead of working with all 1.2 million rows in the dataset.

The features that I chose to use for this analysis were tempo and acousticness. These metrics make sense for this task since songs in similar genres are likely to have similar sounds. Since both of these values are numerical and continuous, this is a regression problem, and the goal is to get each prediction as close as possible to the actual numerical value.

To run the analysis, I first collected the fields that I wanted to compare into a dictionary keyed by song ID.

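# Maps each Spotify track ID to the fields used in the comparison.
song_artist_map = {}

def createMaps(row):
    # artists arrives as a string like "['A', 'B']": drop the brackets, split on commas.
    # Caveat: the pieces keep their inner quotes and leading spaces.
    row.artists = row.artists[1:-1].split(',')

    song_artist_map[row.id] = ({
        # Caveat: row.name is the row's index label in pandas, not the "name"
        # column; row['name'] would return the song title instead.
        'song_name' : row.name,
        'artists' : set(row.artists),
        'tempo': row.tempo,
        'energy': row.energy,
        'acousticness': row.acousticness
    })

# music is the sampled DataFrame; apply calls createMaps once per row.
music.apply(createMaps, axis=1)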

This provided me with the following dictionary structure for each item:

{'song_name': 84493,
 'artists': {" 'Dorothy Linell'", "'John Dowland'"},
 'tempo': 89.397,
 'energy': 0.034,
 'acousticness': 0.952}
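The post does not show the comparison step itself, so the following is only a sketch of how it could look under stated assumptions: songs are compared by the Jaccard similarity of their artist sets, and a song's feature is predicted as the average over its most similar songs. The function names `jaccard` and `predict_feature`, the parameter `k`, and the toy data are all hypothetical and are not taken from the original notebook.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    if not union:
        return 0.0  # two empty sets share nothing to compare
    return len(a & b) / len(union)

def predict_feature(song_id, songs, feature, k=2):
    """Average `feature` over the k songs whose artist sets are most similar."""
    target = songs[song_id]['artists']
    scored = [
        (jaccard(target, other['artists']), other[feature])
        for other_id, other in songs.items()
        if other_id != song_id
    ]
    if not scored:
        raise ValueError("need at least one other song to compare against")
    # Highest similarity first; the sort is stable, so ties keep dictionary order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    return sum(value for _, value in top) / len(top)

# Toy data, invented for illustration only
songs = {
    'a': {'artists': {'X', 'Y'}, 'tempo': 100.0},
    'b': {'artists': {'X'}, 'tempo': 90.0},
    'c': {'artists': {'X', 'Y', 'Z'}, 'tempo': 110.0},
    'd': {'artists': {'Q'}, 'tempo': 180.0},
}
print(predict_feature('a', songs, 'tempo'))
```

In the toy data, song 'a' is closest to 'c' (similarity 2/3) and 'b' (similarity 1/2), so its predicted tempo is the mean of 110 and 90. A song that shares no artists with any other song gets similarity 0 everywhere, in which case the "nearest" songs are effectively arbitrary.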

Finally I was able to compare my predictions to the actual values.

For some reason all of my tempo predictions were far off: the mean squared error (MSE) that I used to track accuracy was 1190.37, which is very high. I am not sure what caused that much inaccuracy.

The acousticness mean squared error was much lower at just 0.13, but I am not sure whether that is simply because the values for acousticness are already between 0 and 1.
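Mean squared error is expressed in squared units of the feature, so the two numbers above are not directly comparable: tempo is measured in beats per minute, while acousticness runs from 0 to 1. Taking the square root (RMSE) puts each error back on the feature's own scale. A small sketch of that arithmetic; `mse` is a hypothetical helper, the toy values are made up, and only the two reported MSE figures come from the analysis:

```python
import math

def mse(actual, predicted):
    """Mean of the squared differences between paired values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Toy example (values invented for illustration): two misses of 10 BPM each
toy_error = mse([120.0, 90.0], [110.0, 100.0])
print(toy_error)  # 100.0

# RMSE for the two errors reported in the post
tempo_rmse = math.sqrt(1190.37)       # about 34.5 BPM
acousticness_rmse = math.sqrt(0.13)   # about 0.36 on a 0-1 scale
print(round(tempo_rmse, 1), round(acousticness_rmse, 2))
```

Read this way, a typical tempo miss is roughly 35 BPM, and a typical acousticness miss is roughly 0.36, which is more than a third of that feature's entire range. That suggests the smaller MSE reflects the smaller scale of acousticness at least as much as it reflects better predictions.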

Five of the songs that I was not able to predict correctly, and that I took a closer look at, were:

Una Despedida (con Un Traje Blanco) (57uRQ33TSw90jI3vMhULNW)
Grateful (0gEZ1v0rsSayZBi7qzrRgF)
Petite Fleur (1R4kdWYterXchflThmuFYE)
Eklabati (6K3nTRrQi61vE2MBBI0UDk)
Sonata nr.9 E flat Major: Allegro (5Vwvg2bR3NWghnJZcEkswj)

Looking at these results, I realized that there were a few things I had not accounted for. First, I went into this analysis considering only American music and did not expect music from other countries to be included in the dataset. I also realized that grouping simply by artist might not be a perfect system, since artists can cross genres for collaborations, producing songs that sound vastly different than their normal music. Unfortunately, I am not familiar with these individual songs or their artists, so beyond basic googling I was not able to find specifics for them.

In addition to these factors, another major limitation is that I did not use the other fields included in the dataset when grouping similar songs. I think that if more of that data were included, the predictions would be much more accurate. Limiting the analysis to a single country's style of music would also likely work a lot better.

GitHub: https://github.com/not-senate/module6_assignment

Data: https://www.kaggle.com/datasets/rodolfofigueroa/spotify-12m-songs?resource=download

Written by Jdavitz