Predicting the Next Hit Song!

Using Spotify’s Audio Features to predict whether a song will be a Billboard Hot 100 Hit

Chris Chan
Nerd For Tech
Mar 26, 2021


Have you ever listened to a song for the first time and you couldn’t help but say to yourself…

This song is going to be a massive hit!

…Or maybe it already is and you’re just 2000-and-late…

Well, either way, did you stop and ponder why you thought it might become (or already is) a hit song? Was it a catchy tune? Or maybe it just hit all the right notes given life’s circumstances.

It’s moments and questions like these that set in motion my 3rd project in the 12-week Immersive Data Science Bootcamp at Metis. That is, I set out to answer the question…

Can we predict what the next hit song will be by analyzing former hit songs?

The rest of this blog post describes the general approach I took to answer this question, but here’s the high-level workflow:

Let’s provide some context before diving further into the data and analysis.

What is a hit song?

For purposes of this analysis, a “hit” song is defined as any song that made it to the Billboard Hot 100 Weekly Charts. Billboard’s rankings are based on sales, radio play and online streaming in the United States.

What data did we use?

We had roughly 9,000 Billboard Hot 100 songs from the years 1990–2018. These served as the positive class (“hits”) for classification modeling (more on that below). To build an effective classification model we also needed non-hit songs, so we obtained a sample of roughly 5,000 non-hits from the Million Song Dataset, which is a collection of…(drumroll)…a million songs! As described on its website, the “Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.” An extra shout-out to the amazing data scientists before me who made this data publicly available (Billboard, Million Song Dataset).

For each song in our dataset (roughly 14,000 tracks) we pulled audio features from Spotify using their handy API. Spotify computes metrics that measure a song’s characteristics, such as how danceable a song is (Danceability) or how instrumental it is (Instrumentalness). Most of these features are on a 0-to-1 scale, so songs can be compared directly.
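The post doesn’t show the scraping code, but as a rough sketch in Python, the spotipy client is one common way to pull these features. The credentials, helper function, and example track below are illustrative assumptions, not the project’s actual pipeline:

    # Minimal sketch: pulling Spotify audio features with the spotipy client.
    # Credentials and the helper below are illustrative assumptions.
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    sp = spotipy.Spotify(
        auth_manager=SpotifyClientCredentials(
            client_id="YOUR_CLIENT_ID",
            client_secret="YOUR_CLIENT_SECRET",
        )
    )

    def get_audio_features(track_name, artist_name):
        """Look up a track and return its Spotify audio-features dict (or None)."""
        results = sp.search(q=f"track:{track_name} artist:{artist_name}",
                            type="track", limit=1)
        items = results["tracks"]["items"]
        if not items:
            return None
        track_id = items[0]["id"]
        return sp.audio_features([track_id])[0]

    features = get_audio_features("Better Now", "Post Malone")
    # e.g. features["danceability"], features["energy"], features["acousticness"], ...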

Let’s take a look at an example.

Spotify Audio Features: Hits vs. Non-Hits

This first chart compares the mean scores of 10 audio features between “hit songs” and “non-hit songs”. Billboard hits are in blue and non-hits are in orange; where the two don’t overlap, the groups differ. The main takeaway is that, on average, “hits” have higher Danceability and Energy, whereas “non-hits” score higher on Acousticness, Instrumentalness, Liveness and Loudness.

A closer look at a few features highlights how the distributions can be quite different between “hits” and “non-hits” even when their means are close.

Distribution of Specific Audio Features: Hits vs. Non-Hits

Modeling Techniques

As I moved into modeling, I considered which evaluation metrics mattered most. Music labels, and even independent musicians, ultimately want to maximize their ability to predict a hit song correctly. With that in mind, precision became the main evaluation metric for this analysis. Why precision? In our context, precision measures how trustworthy a “hit” prediction is:

  • Precision = True positives / (True positives + False positives)

Basically, of all of our predicted hit songs, how many were actually hit songs?

Recall measures how many actual hits we predicted correctly.

  • Recall = True positives / (True positives + False negatives)

If a song becomes a hit even though we predicted it wouldn’t (a false negative), we haven’t really lost much: the song became a hit anyway, and in our case we can afford to miss a few.
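As a quick toy illustration of the difference between the two metrics (the labels below are made up for demonstration, not project data):

    # Toy example: precision vs. recall on made-up labels (1 = hit, 0 = non-hit).
    from sklearn.metrics import precision_score, recall_score

    y_true = [1, 1, 1, 0, 0, 1, 0, 1]   # actual outcomes
    y_pred = [1, 0, 1, 0, 1, 1, 0, 1]   # model predictions

    # Precision: of the 5 songs we called hits, 4 really were hits.
    print(precision_score(y_true, y_pred))  # 0.8

    # Recall: of the 5 actual hits, we caught 4.
    print(recall_score(y_true, y_pred))     # 0.8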

Moving on to the modeling, I ran a set of baseline classification models for comparison.
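The specific baselines aren’t listed in the text, so purely as an illustration, a comparison loop over a few common classifiers might look like the sketch below. XGBoost is the only model the post names, so the other model choices and the synthetic stand-in data are my assumptions:

    # Illustrative baseline comparison loop; model choices (other than XGBoost)
    # and the synthetic stand-in data are assumptions for demonstration only.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, precision_score, recall_score
    from xgboost import XGBClassifier

    # Stand-in for the real matrix of ~14,000 songs x Spotify audio features,
    # with y = 1 for a Billboard hit and 0 for a non-hit.
    X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    baselines = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "KNN": KNeighborsClassifier(),
        "Random Forest": RandomForestClassifier(random_state=42),
        "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
    }

    for name, model in baselines.items():
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        print(f"{name}: accuracy={accuracy_score(y_test, preds):.3f}, "
              f"precision={precision_score(y_test, preds):.3f}, "
              f"recall={recall_score(y_test, preds):.3f}")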

Model Results:

I thought it was important to look at the basic evaluation metrics as a whole, so I weighed the balance of accuracy, precision and recall to make a first-round cut of model choices. Based on this, I chose the XGBoost model to tune further.

Model Tuning

Comparing my precision and recall scores between the training and test sets, it was evident my initial model was overfitting quite a bit, so I decided on two things for model tuning:

  1. Examine the relative feature-importance scores
  2. Find the best parameters using a grid search (a sketch of this step follows the list)
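The actual parameter grid isn’t shown in the post; a minimal sketch of the grid-search step with scikit-learn’s GridSearchCV, scoring on precision and reusing the train/test split from the earlier sketch, might look like this (the grid values are assumptions):

    # Illustrative grid search over a few common XGBoost hyperparameters;
    # the real grid isn't shown in the post, so these values are assumptions.
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    param_grid = {
        "n_estimators": [100, 300],
        "max_depth": [3, 5, 7],
        "learning_rate": [0.05, 0.1],
        "subsample": [0.8, 1.0],
    }

    grid = GridSearchCV(
        XGBClassifier(eval_metric="logloss", random_state=42),
        param_grid,
        scoring="precision",  # optimize for the metric we care about most
        cv=5,
        n_jobs=-1,
    )
    grid.fit(X_train, y_train)   # X_train, y_train from the earlier split
    print(grid.best_params_)
    best_model = grid.best_estimator_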

As a result of the tuning, the model was no longer overfitting and I ended up with a precision score of 0.795. Since I wanted to increase precision a bit more without giving up too much recall, I raised my probability threshold from 0.5 to 0.65. Below is a comparison of the confusion matrix results at threshold levels 0.5 (left) and 0.65 (right). As a refresher, the confusion matrix helps us gauge the performance of a classification model and offers a visual of how precision and recall change with the probability threshold tuning mentioned earlier:

Confusion Matrix:

This boosted my precision to around 0.84 while maintaining a reasonable recall of 0.8. This was my final model; these scores are based on the test data.
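For reference, here’s how that threshold adjustment translates into code: instead of using the default predict(), take the predicted hit probability and compare it to 0.65. This sketch continues from the earlier ones, and the variable names are assumptions:

    # Applying a custom probability threshold (0.65) instead of the default 0.5.
    from sklearn.metrics import confusion_matrix, precision_score, recall_score

    THRESHOLD = 0.65

    # Probability that each test song is a hit (class 1), from the tuned model.
    hit_probs = best_model.predict_proba(X_test)[:, 1]
    preds_at_threshold = (hit_probs >= THRESHOLD).astype(int)

    print(confusion_matrix(y_test, preds_at_threshold))
    print("precision:", precision_score(y_test, preds_at_threshold))
    print("recall:", recall_score(y_test, preds_at_threshold))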

Analysis Results

So what does all of this really mean? How well is this model really working? In order to help answer that I thought we’d look at a few examples of:

  • Not-so-good predictions: a song that was an actual hit but that we predicted as a non-hit (predicted probability < 0.65)
  • Edge cases: same as above, but we almost predicted it as a hit based on the probability score (a score just under 0.65)
  • Good predictions: a song that was correctly predicted to be a hit (predicted probability ≥ 0.65)

For each of the examples below, I used Post Malone’s “Better Now” as the gold standard of what a “hit song” looks like (it received a predicted probability of 0.91), and compared three songs against it to show how the model performed. The first is an example of a not-so-good prediction:

Not-so-good prediction

In the graph above, the blue portion represents the audio features of “Better Now” by Post Malone. Again, this was a correct “hit” prediction with a predicted probability of about 0.91.

The orange portion represents the audio features of a song by an artist named Burl Ives. If you haven’t heard of him, you’ve almost certainly heard his music: he sang “A Holly Jolly Christmas”. Ring a bell? Unsurprisingly, and as seen in the graph, the two songs don’t share many audio-feature similarities, yet the Burl Ives track made it to the Billboard Hot 100 and my model did not predict it to be a hit (predicted probability of 0.03). How did this happen?

Recall that this analysis focused on songs that hit the Billboard Hot 100 between 1990 and 2018. A song written well before the 1990s, such as a seasonal tune like “A Holly Jolly Christmas”, can get frequent airplay in more recent years and thus land on the Billboard list. My model doesn’t account for this type of seasonality yet, so it can miss hit songs that don’t resemble the prototypical pop song of the current moment.

Let’s take a look at another missed-hit prediction, but one that was fairly close:

Edge Case

Recall that I raised my probability threshold from 0.5 to 0.65 to obtain more precision. This song, “Uprising” by Muse, shown in the orange portion of the graph, had a probability score of 0.64, so it just missed the cut for being a predicted hit. One observation that explains the near miss: the song has practically no Acousticness, and we know that was a strong feature in the model. And if you listen to the track, you can hear that it’s more rock-oriented than the pop/dance sound mainstream audiences tend to gravitate towards. Accounting for things like artist popularity is not currently built into this model either.

Lastly, here’s an example of a good prediction:

Good Prediction

This is a Taylor Swift song called “Teardrops on My Guitar”, and it had a predicted probability of 0.88. Its audio features are very much in line with those of the Post Malone song, and hence we correctly predicted it as a hit.

I do want to point out, however, that if you listen to both songs, you’ll notice how different they are in overall sound and perhaps even sentiment. This should give aspiring artists confidence: even when the audio features Spotify captures are similar between songs, there is still plenty of room for creativity and for their unique “voice/sound” to shine, rather than having to fit a “voice/sound” that follows a particular formula.

Future Improvements

As seen in the missed-hit examples, I hope to account for song release year in the Billboard data by limiting it to more recent years. Perhaps that will help exclude some of the pre-90s holiday songs and account for seasonality. I’d also like to include data from other sites like TikTok or YouTube, which would let me put some weight on artist profiles and popularity rather than audio features alone.

Thank you for reading this post! For more detail, code, and other information, or to connect, please visit my GitHub repository or LinkedIn page.
