Spotify audio features — preliminary data analysis

Do audio features correlate with/affect one another? And can you leverage these features to cluster songs, aiding in song recommendations?

Meehir Bhalla
INST414: Data Science Techniques
4 min read · Feb 17, 2022


The insight I want to extract from the Spotify API is whether the audio features of songs correlate with or influence one another. This can inform decisions by artists, producers, or Spotify developers about which features to adjust to generate more streams (an increase in popularity), and it can be used to cluster similar songs together, which aids song recommendations. For example, if tempo has a strong positive correlation with popularity, then producing faster songs might raise popularity (though not necessarily, since correlation does not imply causation). Further, a faster tempo, high speechiness (Spotify’s indicator of spoken words present in a song), and explicit lyrics can help identify rap/hip-hop songs, while high instrumentalness, high energy, and low speechiness can help identify electronic dance music (EDM) songs. Groupings based on these features can also feed other applications, like recommending songs similar to ones already in a playlist.
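The genre heuristics described above could be written as a simple rule-based labeler. This is only a rough sketch: the function name `guess_genre` and every threshold in it are hypothetical values chosen for illustration, not cutoffs published by Spotify or tuned on the playlist in this post.

```python
def guess_genre(track,
                fast_tempo=120.0,            # assumed cutoff, beats per minute
                high_speechiness=0.25,       # assumed cutoff
                low_speechiness=0.10,        # assumed cutoff
                high_instrumentalness=0.50,  # assumed cutoff
                high_energy=0.70):           # assumed cutoff
    """Return a rough genre guess for one track's audio features.

    `track` is a dict with the keys used below. All thresholds are
    illustrative assumptions and would need tuning on real data.
    """
    # Fast tempo + lots of spoken words + explicit lyrics -> rap/hip-hop
    if (track["explicit"]
            and track["tempo"] >= fast_tempo
            and track["speechiness"] >= high_speechiness):
        return "rap/hip-hop"
    # Mostly instrumental + high energy + few spoken words -> EDM
    if (track["instrumentalness"] >= high_instrumentalness
            and track["energy"] >= high_energy
            and track["speechiness"] <= low_speechiness):
        return "EDM"
    return "unknown"
```

Hand-picked rules like these are brittle, which is part of the motivation for clustering songs on their features instead of hard-coding cutoffs.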

The data that could answer these questions is available through Spotify’s public API, which exposes audio features and song data (artist, song name, album name, etc.) and can be queried from Python with the spotipy library. Using my personal Spotify account, I accessed a playlist containing a wide variety of genres and extracted audio features and song data. These features are relevant to my question because they can be compared across songs to test hypotheses about correlation, and the features of similar songs can be compared to see whether genre can be determined from audio features alone. For example, what makes song A more popular than song B? Is it the faster tempo? The higher danceability?

Spotify Developer Dashboard

To access the Spotify API and the features of my playlist, I used the spotipy library with SpotifyClientCredentials to authorize my credentials. After authorizing with my client ID and client secret, I retrieved a playlist containing genres ranging from R&B to K-pop as JSON. I converted the JSON into a pandas DataFrame by first creating a dictionary of song information such as the album name, song name, and artist. I then used the sp.audio_features method to populate another dictionary of audio features keyed by song ID. Finally, I merged the two to create a single DataFrame in which each song is matched with its features.
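The steps in this paragraph could look roughly like the sketch below. It is not my exact code: the helper names (`make_client`, `playlist_tracks`, `audio_features`, `build_dataframe`) and the chosen columns are assumptions for illustration. It also assumes every track in the playlist is available on Spotify, and whether the audio features endpoint is open to a given app may depend on that app's access.

```python
import pandas as pd


def make_client(client_id, client_secret):
    """Create a Spotify client using the client-credentials flow."""
    # Imported here so the helpers below can be tried without spotipy installed.
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    auth = SpotifyClientCredentials(client_id=client_id,
                                    client_secret=client_secret)
    return spotipy.Spotify(auth_manager=auth)


def playlist_tracks(sp, playlist_id):
    """Return one dict of song data per track, following pagination."""
    rows = []
    page = sp.playlist_items(playlist_id)
    while page:
        for item in page["items"]:
            track = item["track"]  # assumes the track is available on Spotify
            rows.append({
                "id": track["id"],
                "song": track["name"],
                "artist": track["artists"][0]["name"],
                "album": track["album"]["name"],
                "popularity": track["popularity"],
                "explicit": track["explicit"],
            })
        page = sp.next(page) if page.get("next") else None
    return rows


def audio_features(sp, track_ids, batch_size=100):
    """Fetch audio features in batches (the endpoint caps IDs per call)."""
    features = []
    for start in range(0, len(track_ids), batch_size):
        features.extend(sp.audio_features(track_ids[start:start + batch_size]))
    return features


def build_dataframe(sp, playlist_id):
    """Merge song data and audio features on the shared track ID."""
    songs = pd.DataFrame(playlist_tracks(sp, playlist_id))
    feats = pd.DataFrame(audio_features(sp, songs["id"].tolist()))
    return songs.merge(feats, on="id", how="inner")
```

With real credentials, `build_dataframe(make_client(my_id, my_secret), playlist_id)` would return one row per song, where `my_id`, `my_secret`, and `playlist_id` are placeholders for your own values.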

pandas DataFrame containing song data and audio features
A slight positive correlation can be seen between Popularity and Tempo
Differences in clean and explicit songs in my playlist
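Comparisons like the ones in these figures could be computed with pandas along the following lines. This is a sketch on made-up toy values, not the data from my playlist; the column names are assumed to match the merged DataFrame.

```python
import pandas as pd

# Toy values for illustration only; not taken from the playlist in this post.
df = pd.DataFrame({
    "popularity":   [40.0, 55.0, 60.0, 72.0, 80.0],
    "tempo":        [90.0, 100.0, 128.0, 110.0, 140.0],
    "danceability": [0.45, 0.60, 0.72, 0.58, 0.81],
    "explicit":     [False, True, False, True, True],
})

features = ["popularity", "tempo", "danceability"]

# Pairwise Pearson correlation between the numeric features.
corr = df[features].corr(method="pearson")

# Correlation coefficient between popularity and tempo specifically.
r_pop_tempo = corr.loc["popularity", "tempo"]

# Average feature values for clean (False) versus explicit (True) songs.
by_explicit = df.groupby("explicit")[features].mean()

print(corr.round(2))
print(by_explicit.round(2))
```

A coefficient near 0 means little linear relationship, while values near 1 or -1 mean a strong one. Even a strong value only shows association, not that changing tempo would change popularity.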

Some problems I ran into while accessing the Spotify API and creating my DataFrame came from songs in my playlist that had been removed by Spotify or were local files. Spotify has no data for these songs, which caused many errors in my code. For example, when I tried to create a DataFrame of audio features, the unsupported songs had no values, so the audio features dictionary could not be converted into a pandas DataFrame. I fixed this by removing those songs, so that the playlist was made up entirely of songs currently available on Spotify.

“molly” by jxmper. is greyed out since it has been removed by Spotify
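Instead of editing the playlist by hand, the unavailable songs could also be filtered out in code. The sketch below assumes that removed songs come back with no track object and that local files have no Spotify ID. Both the response shape shown in `example_items` and the helper names are assumptions for illustration.

```python
import pandas as pd


def split_available(items):
    """Separate playlist items Spotify can describe from those it cannot."""
    available, skipped = [], []
    for item in items:
        track = item.get("track")
        # Assumed shape: removed songs have no track object,
        # and local files carry no Spotify ID.
        if track is None or track.get("is_local") or track.get("id") is None:
            skipped.append(item)
        else:
            available.append(track)
    return available, skipped


def features_to_frame(features):
    """Drop missing (None) feature entries before building the DataFrame."""
    return pd.DataFrame([f for f in features if f is not None])


# Hypothetical playlist items, shaped like the assumed API response.
example_items = [
    {"track": {"id": "abc", "name": "available song", "is_local": False}},
    {"track": {"id": None, "name": "local file", "is_local": True}},
    {"track": None},
]
available, skipped = split_available(example_items)
print(len(available), "usable,", len(skipped), "skipped")
```

Keeping the skipped items in their own list makes it easy to report which songs were left out of the analysis.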

One limitation of my scraping approach is that I was not able to obtain each song’s genre. This is not a major problem, since I want to group songs by audio features, but genre would be a helpful reference when evaluating my results. Another limitation is that the features are on very different scales: some have small ranges or negative values, while others, like popularity, range from 0 to 100. Scaling the data will put all features on a comparable scale (standardized or normalized), which should make comparisons between features more accurate and easier to interpret.
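As a sketch of what that scaling step could look like, the two helpers below apply z-score standardization and min-max normalization with plain pandas. The function names and toy values are assumptions for illustration, and I have not yet decided which of the two methods to use.

```python
import pandas as pd


def standardize(df, columns):
    """Z-score: subtract each column's mean, divide by its standard deviation."""
    out = df.copy()
    # A constant column has zero deviation and would become NaN here.
    out[columns] = (df[columns] - df[columns].mean()) / df[columns].std(ddof=0)
    return out


def min_max(df, columns):
    """Rescale each column to the range 0 to 1."""
    out = df.copy()
    span = df[columns].max() - df[columns].min()
    out[columns] = (df[columns] - df[columns].min()) / span
    return out


# Toy values for illustration only; not taken from the playlist in this post.
toy = pd.DataFrame({
    "popularity": [40.0, 55.0, 60.0, 72.0, 80.0],
    "tempo":      [90.0, 100.0, 128.0, 110.0, 140.0],
    "loudness":   [-12.0, -8.5, -6.0, -7.2, -4.1],
})
cols = ["popularity", "tempo", "loudness"]

z_scaled = standardize(toy, cols)
mm_scaled = min_max(toy, cols)
```

Standardization is a common choice before distance-based clustering, since it stops a feature with a large range like tempo from dominating the distance. Min-max scaling keeps every value between 0 and 1, which can be easier to read.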

Meehir Bhalla
INST414: Data Science Techniques

Undergrad @ UMD — College Park ~ Passionate about big data, data science, data visualization, and machine learning. I enjoy cooking, running, and music! 🌀