Predicting song popularity using Multiple Linear Regression

Anastasia Zhivilo
7 min read · Aug 29, 2023


Photo by Austin Neill on Unsplash

Link to full project code

Setup

This research explores the different attributes of a song on Spotify, together with the decade in which it was released, to create a model that could help predict a song’s popularity from these features.

The Spotify dataset was downloaded from Kaggle. It rates different attributes of a song from 0 to 100 as well as the song’s popularity. It includes songs from 1900 until April 2021. The column descriptions can be found on the Spotify website.

Approach

For this analysis we will use a Multiple Linear Regression from sklearn. First, the songs released in 2021 are removed from the dataset, because only the first 4 months of 2021 are available. The remaining data is then split into train (70%) and test (30%) sets, and the model is trained on the train set. Once the model has been trained and evaluated on the test set, we use it to predict the popularity of the 2021 songs.
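The steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project code: the column names, the two features and the generated popularity values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Spotify data; all column names are assumptions.
rng = np.random.default_rng(0)
n = 1000
songs = pd.DataFrame({
    "release_year": rng.integers(1990, 2022, size=n),  # upper bound is exclusive
    "energy": rng.uniform(0, 1, size=n),
    "danceability": rng.uniform(0, 1, size=n),
})
songs["popularity"] = (
    30 + 20 * songs["danceability"] - 10 * songs["energy"]
    + rng.normal(0, 5, size=n)
)

features = ["energy", "danceability"]

# Hold out the incomplete final year, then split the rest 70/30.
holdout_2021 = songs[songs["release_year"] == 2021]
history = songs[songs["release_year"] < 2021]
train, test = train_test_split(history, test_size=0.3, random_state=42)

# Train on the train set only, then evaluate on the unseen test set.
model = LinearRegression().fit(train[features], train["popularity"])
test_r2 = model.score(test[features], test["popularity"])
print("test R-squared:", round(test_r2, 3))

# Only after evaluation is the model used to predict the held-out year.
predicted_2021 = model.predict(holdout_2021[features])
```

Fixing `random_state` keeps the split reproducible, so that later versions of the model are compared on the same test songs.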

To better understand the influence of each release year on a song’s popularity, the years were converted into dummy variables, so that each of these variables received its own coefficient and p-value. The same was done for song duration: it was divided into ten buckets at the deciles, so that each bucket holds a similar number of songs. The buckets were then used to analyse more closely whether people prefer longer or shorter songs.
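As a rough sketch of this encoding, pandas can build both sets of dummy variables. The data and column names below are invented, and dropping the first category of each set is one common choice for avoiding perfect collinearity with the intercept, not necessarily what the project did.

```python
import numpy as np
import pandas as pd

# Invented data; the column names are assumptions.
rng = np.random.default_rng(1)
songs = pd.DataFrame({
    "release_year": rng.integers(2015, 2021, size=500),
    "duration_ms": rng.uniform(60_000, 400_000, size=500),
})

# One indicator column per release year (the first year becomes the baseline).
year_dummies = pd.get_dummies(songs["release_year"], prefix="year", drop_first=True)

# Ten duration buckets cut at the deciles, so each holds a similar number of songs.
songs["duration_bucket"] = pd.qcut(songs["duration_ms"], q=10, labels=False)
duration_dummies = pd.get_dummies(
    songs["duration_bucket"], prefix="duration", drop_first=True
)

design = pd.concat([year_dummies, duration_dummies], axis=1)
bucket_counts = songs["duration_bucket"].value_counts().sort_index()
print(bucket_counts)
```

Each remaining dummy coefficient is then read relative to the dropped baseline category.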

After the initial model training, the model was further improved by adding non-linear terms, such as squared and logarithmic terms, to capture any non-linearity in the variables, and by removing variables that were not statistically significant. The predicted popularity scores were then plotted on a heat map against the realised popularity to visually evaluate the performance of the model.

Once the model was trained, it was applied to the 2021 data and the predicted popularity scores were again plotted against the realised popularity scores. The coefficients and p-values of the model were then used to analyse the data.
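Adding such terms could look roughly like the sketch below. The two feature names and their ranges are assumptions, and `np.log1p` (the log of 1 + x) is used here only so that a value of zero does not produce an infinite logarithm.

```python
import numpy as np
import pandas as pd

# Invented data; the feature names and ranges are assumptions.
rng = np.random.default_rng(2)
songs = pd.DataFrame({
    "tempo": rng.uniform(50, 200, size=300),
    "energy": rng.uniform(0, 1, size=300),
})

for col in ["tempo", "energy"]:
    # The squared term lets the fitted effect bend instead of being a straight line.
    songs[f"{col}_sq"] = songs[col] ** 2
    # log1p(x) = log(1 + x) stays finite when x is 0.
    songs[f"{col}_log"] = np.log1p(songs[col])

print(songs.columns.tolist())
```

The model stays linear in its coefficients, so the same linear regression can be fitted on the extended set of columns.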

Training the Model on Pre-2021 data

First Model

The first run of the model gave an R-squared score of 0.395, which means that the model explains 39.5% of the variance in popularity. Most variables were highly statistically significant, with p-values reported as 0, except for ‘key’, ‘liveness’ and a few release years.

Plotting the predicted popularity scores against the realised scores on a heat map, we can see a general linear trend, which suggests that the model is on the right track. The scores, however, are clustered, which suggests that there are some variables we are not taking into account.
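For reference, R-squared compares the model’s squared errors with the total variation of the target around its mean. The small example below uses made-up numbers and checks the manual calculation against sklearn’s `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up popularity scores, purely for illustration.
actual = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 18.0, 33.0, 37.0])

ss_res = np.sum((actual - predicted) ** 2)      # variation the model leaves unexplained
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation around the mean
r_squared = 1 - ss_res / ss_tot

print(r_squared)                     # 1 - 26/500
print(r2_score(actual, predicted))   # same quantity computed by sklearn
```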

Real vs predicted popularity — First model
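A heat map of this kind is essentially a two-dimensional histogram. The sketch below only computes the cell counts with numpy, on invented scores; the 10 x 10 grid is an assumption, and a plotting library can draw the resulting grid (matplotlib’s `hist2d` does the binning and drawing in one step).

```python
import numpy as np

# Invented scores with a loose linear relationship, clipped to the 0-100 scale.
rng = np.random.default_rng(3)
actual = rng.uniform(0, 100, size=2000)
predicted = np.clip(0.4 * actual + 30 + rng.normal(0, 8, size=2000), 0, 100)

# Count songs in each cell of a 10 x 10 grid over the (actual, predicted) plane.
counts, actual_edges, predicted_edges = np.histogram2d(
    actual, predicted, bins=10, range=[[0, 100], [0, 100]]
)

# counts[i, j] is the number of songs with actual popularity in bin i
# and predicted popularity in bin j; bright cells correspond to large counts.
print(counts.astype(int))
```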

Second Model

The second model included non-linear terms of the attributes and was cleaned further by removing statistically insignificant variables. A ‘decade’ variable was also used instead of ‘release_year’, as preferences may not change every year.

With these changes the R-squared improved to 0.405, i.e. the model explains 40.5% of the variance in popularity. The variables are statistically significant, most with a p-value reported as 0. In the plot of predicted against real popularity scores the pattern is smoother, although the same clusters remain. The clusters around a real popularity of zero are likely to be dragging the R-squared down.

Real vs predicted popularity — Second model

Applying the trained model to 2021 data

The model was then applied to the 2021 data and the predicted popularity scores were plotted against the realised scores. The model gave a lower R-squared score of 0.244. The graph still retains a vague linear shape, with the songs of medium popularity being captured accurately by the model.

Findings

There are undiscovered songs with good ‘popularity’ features

During model training on pre-2021 data the heat map showed a small cluster of songs that the model ranks at mid-popularity but whose real popularity is zero. This could be an interesting group of songs that have the attributes of a medium-popularity song but were not created by a popular artist, or were simply lost in the depths of Spotify.
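Pulling such songs out of the scored data is a simple filter. In the sketch below the song names, scores and the 30 to 50 "mid-popularity" band are all illustrative assumptions.

```python
import pandas as pd

# Invented songs with their real and predicted popularity.
scored = pd.DataFrame({
    "name": ["song_a", "song_b", "song_c", "song_d"],
    "popularity": [0, 0, 45, 2],
    "predicted": [38.0, 5.0, 41.0, 36.0],
})

# Mid-range predicted popularity combined with a real popularity of zero.
overlooked = scored[scored["predicted"].between(30, 50) & (scored["popularity"] == 0)]
print(overlooked)
```

Here only `song_a` qualifies: `song_b` is predicted to be unpopular anyway, and the other two have a real popularity above zero.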

New songs may take some time to be discovered

As with the pre-2021 data, the 2021 heat map has a cluster of songs that are predicted to be somewhat popular but whose real popularity is almost zero. This cluster is much larger than in the pre-2021 data. Having been released four months or less before the data was captured, these songs have probably not had enough time to be found by listeners. This suggests that a good song can take more than 4 months to become popular.

Some new songs are hyped up above predicted popularity

In the 2021 heat map there is a large cluster where the predicted popularity is 40 but the actual popularity is 70. These songs are likely to be by popular artists who have marketed them, so they were heard by more people and became popular sooner. There may be a lot of hype around these new songs, but it may die down after some time, with the songs eventually settling at the predicted popularity.

Songs from 2000’s decade were the most popular, 1970’s second popular

Calculating the coefficients for the categorical feature ‘decade’ gives the graph below. The less negative the coefficient, the more popular the songs from that decade:

Graph above shows the effect of a decade on the popularity of a song. Higher coefficient (less negative) means the song is more popular from that decade

In general, older songs have a stronger negative impact on the popularity score. Breaking the trend, though, the decade of 2000 had a less negative impact than 2010 and 2020, and the least negative of all (i.e. songs from that decade are the most popular). This could be due to a nostalgia effect, with people having favourite songs from their formative years (or maybe the 2000’s just had better songs).

The next decade that stood the test of time was 1970’s where the trend went up a little bit compared to 1960’s and 1980’s.

People prefer to dance to sad, high tempo songs with explicit language

Looking at the other coefficients, evaluated at the min, mean and max of each variable in the pre-2021 data, the following attributes are preferred:

  • songs with less energy
  • high danceability
  • low valence, i.e. sad songs
  • medium length
  • high tempo
  • songs with explicit language
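One way to read a preference such as "medium length" from a model with squared terms is to evaluate a feature’s combined contribution at its min, mean and max. The sketch below assumes this interpretation; the coefficients and observations are hypothetical and not taken from the fitted model.

```python
import numpy as np

# Hypothetical coefficients for one feature and its squared term.
coef_linear, coef_squared = 30.0, -25.0

def contribution(x):
    """Contribution of the feature to predicted popularity at value x."""
    return coef_linear * x + coef_squared * x ** 2

# Made-up observations of the feature on a 0-1 scale.
feature_values = np.array([0.1, 0.3, 0.6, 0.9, 1.0])

effects = {
    "min": contribution(feature_values.min()),
    "mean": contribution(feature_values.mean()),
    "max": contribution(feature_values.max()),
}
for label, effect in effects.items():
    print(label, round(effect, 2))
```

With these made-up numbers the contribution is largest at the mean, which is the pattern one would describe as a preference for medium values.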

The model is very good at predicting unpopular songs

In the pre-2021 graph there is a very bright yellow, narrow cluster where the predicted and the actual popularity are both zero.

Recommended Further Improvements to the Model

Adding interaction variables

Patterns in the residuals in the graphs tell us that there are variables missing from our model that would describe those patterns. This could be due to interaction variables not being included in the model. For example, a specific decade might have preferred more energetic songs, or the use of explicit language.
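An interaction between decade and energy could be built roughly as shown below, by multiplying each decade indicator by the energy score. The data and column names are invented for the example.

```python
import pandas as pd

# Invented songs; the column names are assumptions.
songs = pd.DataFrame({
    "decade": [1980, 1980, 2000, 2000],
    "energy": [0.2, 0.9, 0.4, 0.7],
})

decade_dummies = pd.get_dummies(songs["decade"], prefix="decade", dtype=float)

# Multiplying each decade indicator by energy gives one energy slope per decade.
interactions = decade_dummies.mul(songs["energy"], axis=0).add_suffix("_x_energy")
print(interactions)
```

Each new column is zero outside its own decade, so its coefficient measures how much energy matters for songs of that decade specifically.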

Adding artists data

Adding artists to the model would also improve it, as some songs may be popular just because they were created by a popular artist, even if they would not otherwise have been so popular. However, it is difficult to put artists into the model directly because there are so many of them. A different approach should be taken: either categorising the artists as well known or not, or counting how many popular songs they have released.
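Both options can be derived from a count of popular songs per artist, as in the sketch below. The data, the popularity threshold of 50 and the cut-off of two hits are arbitrary assumptions. Because the feature is built from the target itself, it should be computed on the training set only, otherwise the test scores would leak into the model.

```python
import pandas as pd

# Invented training data; the artist names and scores are made up.
train = pd.DataFrame({
    "artist": ["a", "a", "a", "b", "b", "c"],
    "popularity": [70, 65, 10, 5, 55, 0],
})

# Count each artist's songs above an illustrative popularity threshold of 50.
hits_per_artist = (train["popularity"] > 50).groupby(train["artist"]).sum()

# Attach the count to every song, then derive a simple well-known flag.
train["artist_hits"] = train["artist"].map(hits_per_artist)
train["well_known"] = train["artist_hits"] >= 2
print(train)
```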

Removing unpopular songs from the model

Another improvement would be to remove unpopular songs from the model, as the model already seems to be good at predicting zero popularity. If the lowest-popularity songs were removed while the rest of the model stayed the same, the R-squared could potentially improve significantly, as the clusters at the bottom are currently weighing the R-squared down.

Conclusion

This has been an interesting data set which can help answer a lot of questions. This exercise has shown that it is important to remember that the variables may not be linear and their non-linear terms should also be included in the model.

Overall, our model has done a fairly good job of predicting the popularity scores despite the low R-squared. Further refinements, such as adding interaction variables and artist information and focusing on the higher popularity scores by removing the lower ones, would help to tune the model further.

There are still a lot of things the model can help discover. For example, it would be very interesting to examine those songs that scored high in our model but low in reality as there could be some hidden gems of good songs that just haven’t been discovered yet.

Link to project code
