Predicting Song Popularity

Alison Salerno · Published in Analytics Vidhya · May 8, 2020

A linear regression project using Spotify song data

This project idea came to me recently after participating in a bit of Zoom quarantine fun: a Zoom-facilitated music bracket. The week prior, each participant was tasked with nominating four songs that they felt the group did not know but would enjoy. All participants spent a week listening to the choices and prepared to cast their votes for each matchup of songs. Spoiler alert: my songs did not go far, even the ones I was so sure of, the ones I had personally listened to over and over again. My failed choices left me seeking to understand whether song popularity can be predicted, and what that prediction looks like. And so my quest to build a prediction model for song popularity began…

The Dataset

I started by sourcing a Spotify dataset from Kaggle that contained the data of 2,000 songs. It included my target variable, a popularity score for each song. It also included the bulk of my explanatory variables: audio features such as BPM, valence, loudness, and danceability, as well as more general characteristics such as genre, title, artist, and year released. I was mostly content with my set of possible features, but as an avid Spotify user, I knew that Spotify keeps a follower count for each artist. I felt that this could be a great addition to my predictors of song popularity, so I used Python to make API requests to the public Spotify API to gather this count for all of my songs.
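
As a rough illustration, a follower-count lookup along these lines can be done with the requests library and the client-credentials flow of the Spotify Web API; the environment variable names and the match-the-top-search-hit approach here are assumptions for the sketch, not necessarily the exact code I ran.

```python
import os
import requests

# Get an app token via the client-credentials flow
# (placeholder environment variable names)
token_resp = requests.post(
    "https://accounts.spotify.com/api/token",
    data={"grant_type": "client_credentials"},
    auth=(os.environ["SPOTIFY_CLIENT_ID"], os.environ["SPOTIFY_CLIENT_SECRET"]),
)
headers = {"Authorization": f"Bearer {token_resp.json()['access_token']}"}

def get_follower_count(artist_name):
    """Return the follower count of the top artist search hit, or None."""
    resp = requests.get(
        "https://api.spotify.com/v1/search",
        headers=headers,
        params={"q": artist_name, "type": "artist", "limit": 1},
    )
    items = resp.json()["artists"]["items"]
    return items[0]["followers"]["total"] if items else None
```

One caveat of matching on a name search is that the top hit isn't always the right artist, which is the kind of thing that leaves a few counts needing manual fixes.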

I merged my two datasets on artist name and began cleaning the data for modeling with pandas (a rough sketch of these steps follows the list). My main points of cleaning were:

  • Filling in all NaNs for follower count (my API requests were mostly successful, but I had to manually look up and hard-code a few)
  • Consolidating 190 ‘unique’ genres down to around 30
  • Creating dummy variables for each genre and removing the original genre column
  • Deleting title and artist columns
  • Creating a new feature for the total number of words in each title (I thought this might be impactful)
  • Creating a new feature, ‘years since release’, in place of year
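
In pandas, those steps might look roughly like the sketch below; the file name, column names, and mapping entries are placeholders rather than my actual values.

```python
import pandas as pd

df = pd.read_csv("spotify_songs.csv")  # placeholder file name

# Fill the follower counts my API requests missed (looked up by hand)
manual_counts = {"Some Artist": 1234567}  # placeholder entries
df["follower_count"] = df["follower_count"].fillna(df["artist"].map(manual_counts))

# Collapse ~190 raw genre labels into ~30 broader buckets
genre_map = {"album rock": "rock", "dance pop": "pop"}  # illustrative subset
df["genre"] = df["genre"].replace(genre_map)

# Dummy variables for genre (get_dummies drops the original column)
df = pd.get_dummies(df, columns=["genre"], prefix="genre")

# New features: title word count and years since release
df["title_word_count"] = df["title"].str.split().str.len()
df["years_since_release"] = 2020 - df["year"]

# Drop identifiers and the raw year column
df = df.drop(columns=["title", "artist", "year"])
```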

EDA & Statistical Testing

The next step in my process was to use exploratory data analysis and statistical testing to gain further insight into my dataset. I used matplotlib, seaborn, and pandas for the EDA; for statistical testing, I used scipy and statsmodels.

Before getting into modeling, my goal was to get a deeper understanding of the relationship between my target and feature variables, as well as a better grasp on how my features related to one another.

A correlation heat map of all my variables, plotted with Seaborn.

As you can see from the above heat map, my correlations were pretty low across the board and in every direction.

A scatter matrix to showcase the relationships between different features and my target variable.

Again, as shown above, the relationships between each of my features and target variable were largely non-linear. I began to suspect that I would need to transform my variables and create interactions to deal with the non-linear relationships and low correlations.
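
For reference, plots and tests of the kind described above can be produced with a few lines; the column names here are assumptions based on the features mentioned earlier.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy import stats

# Correlation heat map across all numeric variables
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), cmap="coolwarm", center=0)
plt.show()

# Scatter matrix for a handful of features against the target
pd.plotting.scatter_matrix(
    df[["popularity", "follower_count", "danceability", "loudness"]],
    figsize=(10, 10),
)
plt.show()

# A simple significance test for one pairwise correlation
r, p = stats.pearsonr(df["follower_count"], df["popularity"])
print(f"r = {r:.2f}, p = {p:.3f}")
```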

Feature Engineering: Polynomial Transformation

After my EDA and a baseline linear regression model, I applied a second-degree polynomial transformation to all of my song audio features. This created interaction terms among the different song elements, which in hindsight makes sense: it’s the combination of elements that makes up a song. A song is never just one audio feature.
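
A minimal sketch of that transformation with scikit-learn's PolynomialFeatures, assuming hypothetical audio-feature column names, could look like this:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical audio-feature column names
audio_cols = ["bpm", "energy", "danceability", "loudness",
              "valence", "acousticness", "speechiness"]

poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(df[audio_cols])

# Keep readable names like "bpm danceability" for the interaction terms
poly_df = pd.DataFrame(
    expanded,
    columns=poly.get_feature_names_out(audio_cols),
    index=df.index,
)
df = df.drop(columns=audio_cols).join(poly_df)
```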

Modeling

I trained and tested linear regression models using statsmodels and scikit-learn.

A snapshot of my modeling process and how I narrowed down the results.

For my first model, I used the one feature that seemed to have the highest correlation with popularity: artist follower count. I thought this feature would impact the popularity score the most, but as you can see above, it wasn’t very predictive, with an R-squared value of 0.09.
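
A baseline like this single-feature model takes only a few lines with statsmodels; the split parameters here are illustrative.

```python
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

X = df[["follower_count"]]  # hypothetical column name
y = df["popularity"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# statsmodels OLS needs the intercept added explicitly
baseline = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(baseline.summary())  # R-squared came out around 0.09
```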

My second model used all of my original features as well as all of the interaction features created via the polynomial transformation. It performed significantly better. However, with 85 features against a dataset of only 2,000 songs, I knew that I needed to cut down my features and keep only those that really had an impact, to avoid multicollinearity and overfitting.

After testing out a few different selection methods, such as RFECV, VIF, and Lasso, my model utilizing Lasso feature selection performed the best, with an R-squared value of 0.28 and my explanatory variables narrowed down to 34.
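
A sketch of Lasso-based selection with scikit-learn's LassoCV is below; X_train_full and feature_names stand in for the full 85-feature training matrix and its column names.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Lasso is scale-sensitive, so standardize first; LassoCV picks the
# regularization strength by cross-validation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train_full)  # hypothetical 85-feature matrix

lasso = LassoCV(cv=5, random_state=42).fit(X_scaled, y_train)

# Features whose coefficients Lasso shrinks to exactly zero are dropped
selected = np.array(feature_names)[lasso.coef_ != 0]
print(f"{len(selected)} features kept")
```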

Conclusions

My final model wasn’t as predictive as I had hoped, explaining only 28% of the variation in song popularity. However, after analyzing my coefficients, there were a few takeaways worth noting. The following features had the most positive and most negative impact on popularity (a sketch of how to rank the coefficients follows the lists):

Most positive:

  • Number of years since release
  • Artist follower count
  • Danceability

Most negative:

  • Indie Genre
  • Acousticness & Speechiness
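
For completeness, here is one way the surviving coefficients could be ranked to surface those takeaways; lasso and feature_names refer to the hypothetical objects from the selection sketch above.

```python
import pandas as pd

# Rank the nonzero Lasso coefficients (the features were standardized,
# so their magnitudes are roughly comparable)
coefs = pd.Series(lasso.coef_, index=feature_names)
coefs = coefs[coefs != 0]
print(coefs.sort_values(ascending=False).head(3))  # most positive
print(coefs.sort_values().head(2))                 # most negative
```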

Final Thoughts

All in all, this was a fun and somewhat insightful project. To increase the predictive power of my model, I would like to try higher-degree polynomial transformations to find better interactions. I would also like to consider other explanatory variables that could be added to my dataset. Additionally, it might be worth exploring other types of models better suited to this data.
