# Predicting Popular Spotify Songs

# Abstract

Music is an important part of our lives, and for most people it is part of their daily routine. The outbreak has only increased how much we listen. As a music producer myself, I am curious to see what the charts will look like this year, especially during the outbreak. Unfortunately, there isn’t enough data for 2020 yet, since the year is still going on. This project therefore attempts to predict the top 10 songs of 2019 and compares the predictions with the actual results for that year. The research has three parts: visualizing the data, testing a hypothesis about it, and creating a regression model from it.

## Choosing a Data set

For this project, I used a data set of the top 50 songs of each year from 2010 to 2019, with the 2019 songs cut off so that we can use them as the test set. Each row has the song title, as well as the bpm, energy level, popularity, acoustic level, and more. The link is right here:
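The hold-out split described above can be sketched as follows. This is a minimal illustration, not the project's actual loading code: the rows and the column names `title`, `year`, `bpm`, and `energy` are assumptions, and the values are made up.

```python
# Hypothetical rows; the real data set has more columns and about 50 songs per year.
songs = [
    {"title": "Song A", "year": 2017, "bpm": 120, "energy": 80},
    {"title": "Song B", "year": 2018, "bpm": 100, "energy": 72},
    {"title": "Song C", "year": 2018, "bpm": 95, "energy": 65},
    {"title": "Song D", "year": 2019, "bpm": 110, "energy": 60},
]


def split_by_year(rows, test_year):
    """Return (train, test): earlier years for training, test_year held out."""
    train = [row for row in rows if row["year"] < test_year]
    test = [row for row in rows if row["year"] == test_year]
    return train, test


train, test = split_by_year(songs, 2019)
print(len(train), "training songs,", len(test), "test songs")
```

Keeping 2019 out of everything that follows (plots, tests, and model fitting) is what makes the final comparison with the real 2019 chart meaningful.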

# Visualizing and Modeling Data Accurately

Before we create a regression model, we need to visualize our data to understand which variables matter enough to factor into the model. For that, we need to choose a good type of plot.

## Choosing a Visualization Method

Choosing a method to accurately portray our data is crucial to moving on. It is imperative that we don’t choose a plot that misleads us to incorrect assumptions and hypotheses.

Let’s start with the two most well-known plots: the bar graph and the dot plot. We will use the song bpm as our y-axis and the year of the song as our x-axis.

The bar graph looks like this:

As you can see, a bar plot is an extremely poor visualization in this scenario. It tells us nothing about how bpm changes from year to year; it just collapses all of a year’s bpms into one bar, with no sense of direction.

Let’s look at the dot plot:

A dot plot looks a little better than the bar plot as we can slightly see the concentration of the bpms per year. However, it is still difficult to tell where most of the bpms lie each year and therefore lacks a clear direction. There is something better we can use.

When we try using a box & whisker plot it looks like this:

This looks more like it. The thick black line in each box summarizes all the dots for that year as a single value, the median. That way, we can see a clear direction of where the bpm is generally going from year to year. It seems to drift lower as the years go by, but we will get to that later on.
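To make concrete what a box & whisker plot summarizes, here is a small sketch that computes the per-year numbers behind each box using only the Python standard library. The rows and the `year` and `bpm` field names are made up for illustration, and plotting libraries may place whiskers differently (for example, at 1.5 times the interquartile range instead of at the minimum and maximum).

```python
from collections import defaultdict
from statistics import quantiles


def box_stats_by_year(rows, value_key):
    """Group rows by year and return the summary numbers a box plot is built from."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row["year"]].append(row[value_key])

    stats = {}
    for year, values in sorted(by_year.items()):
        # quantiles(..., n=4) returns the three quartile cut points.
        q1, median, q3 = quantiles(values, n=4, method="inclusive")
        stats[year] = {
            "min": min(values),
            "q1": q1,
            "median": median,
            "q3": q3,
            "max": max(values),
        }
    return stats


# Made-up bpm values for two years.
rows = [{"year": 2010, "bpm": b} for b in (100, 110, 120, 130, 140)]
rows += [{"year": 2011, "bpm": b} for b in (90, 100, 110, 120, 130)]

for year, summary in box_stats_by_year(rows, "bpm").items():
    print(year, summary)
```

Reading the medians across years is what gives the plot its sense of direction, which the bar and dot plots lacked.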

## Choosing a Variable to Visualize

Now that we have a good way to visualize our data, we need to find one or more variables that show a trend, so that we can formulate our hypothesis and create our model. For that, we take the mean of a variable in each year and look at how it moves over time. I used trial and error here, testing each variable and seeing which had the most noticeable trend.

What I found was that song energy had the most noticeable trend. It shows a slight downward trend from year to year, which was quite interesting. From this, we can hypothesize that song energy decreases as the years go by. However, a trend this slight needs to be tested, so we will put our data through hypothesis testing.
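The per-year comparison can be sketched like this: compute the mean of a variable for each year, then the change from one year to the next. The field names `year` and `energy` and the values are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean


def mean_by_year(rows, value_key):
    """Return {year: mean of value_key for that year}, ordered by year."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row["year"]].append(row[value_key])
    return {year: mean(values) for year, values in sorted(by_year.items())}


def year_over_year_change(yearly_means):
    """Return {year: mean minus the previous listed year's mean}."""
    years = list(yearly_means)
    return {
        later: yearly_means[later] - yearly_means[earlier]
        for earlier, later in zip(years, years[1:])
    }


# Made-up energy values.
rows = [
    {"year": 2017, "energy": 80},
    {"year": 2017, "energy": 70},
    {"year": 2018, "energy": 70},
    {"year": 2018, "energy": 60},
]

means = mean_by_year(rows, "energy")
print(means)
print(year_over_year_change(means))
```

Running the same two functions over every numeric column is one way to do the trial-and-error search for the variable with the clearest trend.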

# Hypothesis Testing

We want to test whether our hypothesis is valid before we move on to creating our model. Otherwise, our model would be inaccurate, pointless, and based on chance. Simply comparing the means doesn’t tell us much; we need p-values to measure significance.

For this, we will use multiple hypothesis testing. Specifically, we will conduct multiple permutation tests to obtain multiple p-values. Note that since our hypothesis is that each year brings a decrease in energy, our null hypothesis must be that there is no change in energy between the years.

Choosing the type of test was important too. Between a one-sample and a two-sample hypothesis test, the choice was clear: since each comparison involves statistics from two different years, we need a two-sample test. There is also the matter of a one-tailed versus a two-tailed test. Since we are looking for an effect in one direction only, namely a decrease in song energy, and not a difference in either direction, it made more sense to choose a one-tailed test.
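A one-tailed, two-sample permutation test of this kind can be sketched as below. This is a minimal version under stated assumptions, not the exact test used in the project: the test statistic is the difference in mean energy between an earlier and a later year, the number of permutations and the seed are arbitrary choices, and the sample values are made up.

```python
import random


def permutation_p_value(earlier, later, n_permutations=10_000, seed=0):
    """One-tailed permutation test of 'later has a lower mean than earlier'.

    Null hypothesis: the year labels are interchangeable (no change in mean).
    """
    rng = random.Random(seed)
    n_earlier = len(earlier)
    n_later = len(later)
    observed = sum(earlier) / n_earlier - sum(later) / n_later

    pooled = list(earlier) + list(later)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values and re-split them into two fake "years".
        rng.shuffle(pooled)
        diff = sum(pooled[:n_earlier]) / n_earlier - sum(pooled[n_earlier:]) / n_later
        if diff >= observed:
            at_least_as_extreme += 1

    # The add-one correction keeps the estimate from being exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up energy values for two years.
energy_earlier = [90, 88, 85, 92, 87, 91]
energy_later = [60, 62, 58, 65, 61, 59]
print(permutation_p_value(energy_earlier, energy_later))
```

A small p-value means that a drop this large rarely appears when the year labels are shuffled at random, which is evidence against the null hypothesis. Repeating the call for each pair of consecutive years gives the multiple p-values mentioned above.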

When I ran the hypothesis tests, the year-by-year difference in energy turned out to be relatively significant. Energy rejected the null hypothesis more often than any other variable I tested. Of course, there were a few ups and downs, but out of all the variables it was the most significant and showed the most promise.

It is a good thing that we checked our hypothesis. Otherwise, we might have ended up with results that didn’t mean anything. Now, we can go on to creating our regression model.

# Creating The Regression Model

Now it is time to create our regression model. Revisiting our graph, a linear regression model seems like a sensible way to model the trend. Since we are predicting a single year, the model gives us one predicted value, and we then pick the 2019 songs whose values are closest to it.
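As a rough sketch of this step, the code below fits a least-squares line of energy against year and evaluates it at 2019. It is not the project's actual model code: the fit is written out by hand with the standard library, and the (year, energy) points are made up, so the printed prediction has nothing to do with the real result.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


# Made-up training points: one (year, energy) pair per song.
years = [2016, 2016, 2017, 2017, 2018, 2018]
energy = [82, 78, 80, 76, 78, 74]

slope, intercept = fit_line(years, energy)
predicted_2019 = predict(slope, intercept, 2019)
print(slope, predicted_2019)
```

With year as the only input, every 2019 song gets the same prediction, which is why the model yields a single target value for the whole year.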

What we see is an RMSE of 17.12, which in all honesty is not too bad in this situation. The data contains a lot of very high and very low energy levels, so considering that, it was a good result. I also tried other regression models (exponential, log-log, and quadratic), but their errors were either too large or larger than that of linear regression, so it made no sense to use them.
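For reference, RMSE (root mean squared error) is the square root of the average squared gap between actual and predicted values, so it is in the same units as energy. A minimal sketch with made-up numbers:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up energy values and predictions.
actual_energy = [80, 55, 70, 62]
predicted_energy = [68, 68, 68, 68]
print(rmse(actual_energy, predicted_energy))
```

Computing the same quantity for each candidate model (linear, exponential, log-log, quadratic) on the same data is what makes their errors directly comparable.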

## Results

What we got was a predicted mean energy level of ~64. From here, we pick the songs whose energy is closest to 64. Using a while loop, I printed out all of the results with the following code:
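The sketch below shows one possible version of this selection step. It is not the original while loop: it sorts the songs by their distance from the target energy and keeps the first `k`. The song titles and energy values are placeholders, not rows from the data set.

```python
def closest_songs(songs, target, k=10):
    """Return the titles of the k songs whose energy is nearest to target.

    songs is a list of (title, energy) pairs. sorted() is stable, so songs
    at the same distance keep their original order in the data set.
    """
    ranked = sorted(songs, key=lambda song: abs(song[1] - target))
    return [title for title, _energy in ranked[:k]]


# Placeholder (title, energy) pairs.
songs_2019 = [
    ("Song A", 90),
    ("Song B", 65),
    ("Song C", 40),
    ("Song D", 62),
    ("Song E", 64),
    ("Song F", 70),
]

for title in closest_songs(songs_2019, target=64, k=3):
    print(title)
```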

Since the loop only grabbed the first songs it found that were closest to that value, the order of the list does not matter in this case. The results are here:

- Beautiful People (feat. Khalid)
- Find U Again (feat. Camila Cabello)
- South of the Border (feat. Camila Cabello & Cardi B)
- Truth Hurts
- How Do You Sleep?
- Higher Love
- I Don’t Care (with Justin Bieber)
- Call You Mine
- Sucker
- Señorita

In general, the results were not too shabby. Of the 10 predicted songs, 4 were in the actual top 10 list (Señorita, How Do You Sleep?, South of the Border, Truth Hurts). In addition, the other 6 predictions all lie around the top 20 range of the entire 2019 data set. Considering that this was a basic model that used linear regression on a single variable, the results were pretty nice.

# Wrapping Up

Ultimately, this was a fun project to work on. It was interesting to see and understand working patterns in our music and how our tastes change gradually as years go by. In addition, it’ll be interesting to see how 2020’s data will be affected since these times of distress might introduce an anomaly in our data.

This data showcases how we approached music back then, how we approach it now and how we will do so in the future. There were a bunch of other trends that I noticed regarding the types of music we like and it was quite astonishing.