Spotify: Analyzing and Predicting Songs


If there’s one thing I can’t live without, it’s not my phone or my laptop or my car — it’s music. I love music and getting lost in it. My inspiration for this project is finding out what it is about a song that I enjoy so much.


  • What features of a song do I like most/least?
  • How does music I like compare to music I don’t like?
  • Can I build a model that predicts whether I will like or dislike a song?


I compare 2 of my playlists from Spotify:

  • Liked playlist (630 songs)
  • Disliked playlist (537 songs)

After some data wrangling in Python, I end up with a data frame that I use for exploratory data analysis (EDA). It has the following columns:

1. BPM — Beats per minute. The tempo of the song.

2. Energy — The higher the value, the more energetic the song.

3. Dance — The higher the value, the easier it is to dance to this song.

4. Loud — The higher the value, the louder the song.

5. Valence — The higher the value, the more positive mood for the song.

6. Acoustic — The higher the value, the more acoustic the song is.

7. Popularity — The higher the value, the more popular the song is.

8. Year — Release year of the song.

9. Duration — Length of song in seconds.
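
As a rough sketch (not the exact code behind this post), the two playlists can be stacked into one labeled data frame with pandas. The lowercase column names, the `liked` target column, and the `label_and_combine` helper are assumptions made for illustration, and the rows are made-up placeholders rather than real songs.

```python
import pandas as pd

# Column names follow the feature list above; they are illustrative choices.
FEATURES = ["bpm", "energy", "dance", "loud", "valence",
            "acoustic", "popularity", "year", "duration"]


def label_and_combine(liked: pd.DataFrame, disliked: pd.DataFrame) -> pd.DataFrame:
    """Stack the two playlists and add a binary `liked` target column."""
    liked = liked.assign(liked=1)
    disliked = disliked.assign(liked=0)
    songs = pd.concat([liked, disliked], ignore_index=True)
    # Drop rows with missing feature values so they do not break later steps.
    return songs.dropna(subset=FEATURES)


# Placeholder rows, not songs from the actual playlists.
liked_df = pd.DataFrame(
    [[120, 55, 70, -7, 40, 35, 60, 2015, 210],
     [98, 48, 65, -9, 30, 60, 72, 2018, 185]],
    columns=FEATURES,
)
disliked_df = pd.DataFrame(
    [[150, 90, 45, -3, 65, 1, 30, 2009, 240]],
    columns=FEATURES,
)

songs = label_and_combine(liked_df, disliked_df)
print(songs["liked"].value_counts())
```

Keeping the label in the same frame as the features makes it easy to group, plot, and split the data later on.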

Exploratory Data Analysis (EDA)

Again using Python, I created this visualization comparing the distribution of each feature for my Liked (blue) and Disliked (red) songs.

Looking at the distributions of each feature, there are clear distinctions between my Liked and Disliked songs, especially in the ENERGY, DANCE, LOUD, and ACOUSTIC features.

  • ENERGY: The songs I like are roughly normally distributed in energy, while the songs I dislike skew toward higher energy
  • DANCE: I prefer songs that are more danceable
  • LOUD: I dislike songs that are very loud
  • ACOUSTIC: I dislike songs that are not acoustic at all
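
A comparison like the one described above could be drawn with overlaid histograms, one subplot per feature. This is only a sketch: the `plot_feature_distributions` helper, the column names, and the randomly generated data are assumptions, so the shapes it draws do not reproduce my actual playlists.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_feature_distributions(songs, features, label_col="liked"):
    """Overlay Liked (blue) and Disliked (red) histograms, one subplot per feature."""
    fig, axes = plt.subplots(1, len(features),
                             figsize=(4 * len(features), 3), squeeze=False)
    for ax, feature in zip(axes[0], features):
        # Shared bin edges keep the two histograms directly comparable.
        bins = np.histogram_bin_edges(songs[feature], bins=20)
        ax.hist(songs.loc[songs[label_col] == 1, feature], bins=bins,
                color="blue", alpha=0.5, density=True, label="Liked")
        ax.hist(songs.loc[songs[label_col] == 0, feature], bins=bins,
                color="red", alpha=0.5, density=True, label="Disliked")
        ax.set_title(feature.upper())
        ax.legend()
    fig.tight_layout()
    return fig


# Synthetic stand-in data, sized like the two playlists (630 and 537 songs).
rng = np.random.default_rng(0)
songs = pd.DataFrame({
    "energy": np.concatenate([rng.normal(55, 15, 630), rng.normal(80, 10, 537)]),
    "acoustic": np.concatenate([rng.normal(40, 20, 630), rng.normal(10, 8, 537)]),
    "liked": [1] * 630 + [0] * 537,
})
fig = plot_feature_distributions(songs, ["energy", "acoustic"])
```

Using `density=True` normalizes each histogram, which matters here because the two playlists are not the same size.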

Predictive Model

Now that I’ve determined that there are clear differences between songs I like and songs I dislike, I create a predictive model.

I use supervised classification algorithms to predict whether I like or dislike a song. The 3 models I use are: k-Nearest Neighbors, Logistic Regression, and Random Forest.

Running the Models

After balancing the data and splitting it into training and testing sets, I run the 3 models on the data. I decided to use the following metrics to score the quality of each model: ROC AUC, Accuracy, Precision, Recall.
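
One way to balance and split the data is sketched below. It assumes balancing means randomly downsampling the larger class and uses an 80/20 stratified split; both choices, the `balance_by_downsampling` helper, and the synthetic features are illustrative assumptions rather than the exact steps behind my results.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def balance_by_downsampling(songs, label_col="liked", seed=42):
    """Randomly drop rows from the larger class until both classes are the same size."""
    n_min = songs[label_col].value_counts().min()
    parts = [group.sample(n=n_min, random_state=seed)
             for _, group in songs.groupby(label_col)]
    # Shuffle so the classes are not stored in two contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Synthetic stand-in with the same class sizes as the two playlists.
rng = np.random.default_rng(0)
songs = pd.DataFrame({
    "energy": rng.uniform(0, 100, 1167),
    "dance": rng.uniform(0, 100, 1167),
    "liked": [1] * 630 + [0] * 537,
})

balanced = balance_by_downsampling(songs)
X = balanced.drop(columns="liked")
y = balanced["liked"]

# Stratify so the training and testing sets keep the same class mix.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(balanced["liked"].value_counts().to_dict(), X_train.shape, X_test.shape)
```

Downsampling throws away some Liked songs. Oversampling the smaller class is an alternative if every song needs to be kept.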

Below are the results, using the default parameters for each classifier:

Looking at the metrics and ROC curves, the Random Forest classifier is the clear winner. With an ROC AUC score of 91.94% and an accuracy score of 83.87%, the model performed fairly well on the test set using just the default parameters.
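
For readers who want to reproduce this kind of comparison, here is a minimal sketch that fits the three classifiers and scores each with the four metrics. The data comes from scikit-learn's `make_classification`, so the numbers it prints are unrelated to my results. The `evaluate` helper is hypothetical, and `max_iter` and `random_state` are set only for convergence and repeatability; everything else is left at its default.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the nine song features and the liked/disliked label.
X, y = make_classification(n_samples=1000, n_features=9,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "k-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
}


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit one classifier and return ROC AUC, accuracy, precision, and recall."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # ROC AUC needs a score per sample, here the predicted probability of class 1.
    y_score = model.predict_proba(X_test)[:, 1]
    return {
        "roc_auc": roc_auc_score(y_test, y_score),
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    }


results = {name: evaluate(model, X_train, y_train, X_test, y_test)
           for name, model in models.items()}
for name, scores in results.items():
    print(name, {metric: round(value, 4) for metric, value in scores.items()})
```

Note that ROC AUC is computed from predicted probabilities, while the other three metrics are computed from the hard class predictions.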

In the next section, I tune the hyperparameters of the Random Forest classifier to see if the model can be improved.

Tuning the Best Model — Random Forest

The Random Forest classifier has many parameters available, but I tune only the following, over these ranges and values:

n_estimators: np.arange(10,200,10)
min_samples_leaf: np.arange(1,100,10)
max_features: ['auto','sqrt','log2']

After using Scikit-Learn’s GridSearchCV() to tune the parameters, the optimized parameters are as follows:

n_estimators: 80
min_samples_leaf: 1
max_features: 'sqrt'
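
The grid search described above might be set up roughly as follows. Several things here are assumptions: the `tune_random_forest` helper, ROC AUC as the scoring metric, 5-fold cross-validation, and the synthetic data. The sketch also leaves `'auto'` out of `max_features`, because newer scikit-learn releases no longer accept that value for this classifier. The demo runs a reduced grid so it finishes quickly; `FULL_GRID` holds the ranges listed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# The ranges from the post, minus 'auto' (see the note above this block).
FULL_GRID = {
    "n_estimators": np.arange(10, 200, 10),
    "min_samples_leaf": np.arange(1, 100, 10),
    "max_features": ["sqrt", "log2"],
}


def tune_random_forest(X, y, param_grid, scoring="roc_auc", cv=5):
    """Exhaustively try every parameter combination with cross-validation."""
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid=param_grid,
        scoring=scoring,  # assumed metric; the post does not say which one was used
        cv=cv,            # assumed number of folds
    )
    search.fit(X, y)
    return search


# Synthetic stand-in data with nine features.
X, y = make_classification(n_samples=300, n_features=9,
                           n_informative=5, random_state=0)

# Reduced grid for a quick demo; pass FULL_GRID to run the full search.
small_grid = {
    "n_estimators": [10, 50],
    "min_samples_leaf": [1, 11],
    "max_features": ["sqrt", "log2"],
}
search = tune_random_forest(X, y, small_grid)
print(search.best_params_, round(search.best_score_, 4))
```

The full grid has several hundred combinations, each fitted once per fold, so expect it to take noticeably longer than the demo.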

I then compare the results of the model using default parameters with those of the model using the optimized parameters:

There is a definite improvement when using the optimized parameters, especially in Recall, with a 2.42% improvement.


The goals of this project were to find out what features of a song I like/dislike and to predict whether I like or dislike a song. Through exploratory data analysis and machine learning, these goals were accomplished.

From the exploratory data analysis, I found that I like songs that are lower in ENERGY, lower in VALENCE (less positive), and more ACOUSTIC. I dislike songs that have a higher BPM, are less DANCEABLE, are LOUDER, and are less POPULAR.

After trying three different models to predict whether I will like or dislike a song, the best performer is the Random Forest with tuned hyperparameters. Overall, I am pleased with the results and believe the model can be useful for predicting whether I will like a song.