Using Spotify data to predict which “Novidades da semana” songs would become hits

Evaluation of the accuracy of Random Forest, Logistic Regression, and SVM.

TL;DR

Spotify is my favorite digital music service and I’m very passionate about the potential to extract meaningful insights from data. So I decided to write this article to consolidate my knowledge of some classification models and to contribute to the studies of other beginners in Data Science.

I constructed a dataset with 2755 hit and non-hit songs and extracted their audio features using the Spotipy library. I tested three classification models (Random Forest, Logistic Regression, and SVM) and chose the one with the best accuracy to predict which new songs would become hits.

1. Introduction

The Spotify API provides access to the music data available on Spotify. To use it, you have to register on the Spotify developer website, select “Create an App”, fill in your information, and get your CLIENT_ID and CLIENT_SECRET. The API documentation is easy to understand and well maintained, and the data include essential metadata.
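
For reference, here is a minimal sketch of how the API can be accessed through Spotipy once the credentials are in hand (CLIENT_ID and CLIENT_SECRET below are placeholders for your own values):

# Minimal Spotipy authentication sketch; replace the placeholders with the
# CLIENT_ID and CLIENT_SECRET obtained from the developer dashboard.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id='CLIENT_ID',
    client_secret='CLIENT_SECRET'))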

We will try to discover which five artists have the most hit songs, what kind of music is more successful (positive or negative), and predict which songs in “Novidades da semana” (the weekly new releases playlist) can become hits.

2. Dataset and Features

Using the Spotipy library, I created two datasets:

2.1 Dataset

Composed of songs considered hits around the world, i.e., the unique songs collected from the “Top 50 by country” playlists of all countries. These songs are labeled as hits (success = 1).
The dataset also contains the unique songs of random playlists from each genre (Sertanejo, Funk, Samba & Pagode, Rock, Jazz, Reggae, among others). These songs are labeled as non-hits (success = 0).
In total, the dataset has 2755 songs, between hits and non-hits.

2.2 Test set

The test set is composed of the songs in the “Novidades da semana” new releases playlist and will be used to predict which of these new songs can become hits.

More details about how I created the datasets can be found in my GitHub repository.
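
As a rough illustration, the tracks of a playlist and their audio features can be collected with Spotipy along these lines (a simplified sketch; the playlist_audio_features name and the handling of edge cases are assumptions, not the repository’s actual code):

import pandas as pd
import spotipy

# Simplified sketch: collect the audio features of every track in a playlist
# and label them with the given success value (1 = hit, 0 = non-hit).
def playlist_audio_features(sp, playlist_id, success):
    rows = []
    for item in sp.playlist_items(playlist_id)['items']:
        track = item['track']
        features = sp.audio_features(track['id'])[0]
        features['track_name'] = track['name']
        features['artist'] = track['artists'][0]['name']
        features['popularity'] = track['popularity']
        features['success'] = success
        rows.append(features)
    return pd.DataFrame(rows)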

2.3 Features

Each track contains features categorized into track, artist, and album information, as well as audio analysis features. See more about the features HERE. The most relevant features for this article are explained in greater detail in later sections.

Let’s get started!

3. Import the libraries

We will use pandas for data manipulation, NumPy for numerical computing, matplotlib and seaborn for data visualization, and scikit-learn for the machine learning models, evaluation, and dataset splitting.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.metrics import accuracy_score

The “Top 50 by country” playlists are updated daily, and the “Novidades da semana” playlist, besides being updated every week, can differ depending on your profile. For that reason, the CSV file names contain the date they were generated.

dataset = pd.read_csv('spotifyAnalysis-08022020.csv')
test = pd.read_csv('predictSpotifyAnalysis-08022020.csv')

4. Data overview

Let’s visualize the dataset and its features.

dataset.head()

Using pandas.DataFrame.describe, we can see the following statistics and analyze the central tendency, dispersion and shape of a dataset’s distribution.

dataset.describe()

We can observe that the tempo, key, duration_ms, loudness, and popularity features are not on the same scale as the other features, so we will rescale the data in the next section.

5. Data Cleaning

There are no missing data and there is no need to treat categorical variables.

5.1 Data Rescaling

We will use MinMaxScaler, which rescales each column independently so that the new values fall between 0 and 1 (the default range), while preserving the shape of the original distribution.

MinMaxScaler subtracts the column minimum from each value and then divides the result by the difference between the column’s maximum and minimum values.
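
As a quick sanity check, the same formula can be applied by hand to a single column (an illustrative snippet, not part of the original pipeline):

# Min-max formula applied manually to one column; the result matches
# what MinMaxScaler produces for that column.
tempo = dataset['tempo']
tempo_scaled = (tempo - tempo.min()) / (tempo.max() - tempo.min())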

# Rescaling tempo, key, duration_ms, loudness and popularity features.
scaler = MinMaxScaler()
scaled_values = scaler.fit_transform(dataset[['tempo', 'key', 'duration_ms','loudness', 'popularity']])
dataset[['tempo', 'key', 'duration_ms','loudness', 'popularity']] = scaled_values
scaled_values = scaler.fit_transform(test[['tempo', 'key', 'duration_ms','loudness', 'popularity']])
test[['tempo', 'key', 'duration_ms','loudness', 'popularity']] = scaled_values

6. Exploratory Data Analysis

6.1 Correlation

Correlation is a statistical technique to measure how variables are related.

Positive correlation: Indicates that the two variables move together.
Negative correlation: Indicates that the two variables move in opposite directions.

Source: Correlation Co-efficient [1]
plt.figure(figsize=(12,12))
corr = dataset.corr()
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask, 1)] = True
sns.heatmap(corr, mask=mask, annot=True, cmap="Greens")

The variable pairs with the strongest correlations are loudness x energy (strong and positive) and acousticness x energy (strong and negative).

Loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

Energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

Let’s visualize the correlation between the variables.

axis = ['ax0','ax1']
features = [['energy','loudness'],['energy','acousticness']]
colors = ['#48d66c', '#bd36d8']
titles = ['Energy x Loudness', 'Energy x Acousticness']
plot_dist_reg(1, 2, axis, features, colors, titles)
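
The plot_dist_reg call above uses a plotting helper defined in the GitHub repository rather than in this article. A minimal sketch of what such a helper could look like, assuming it draws one seaborn regression plot per feature pair:

# Possible minimal version of the plot_dist_reg helper (assumption: it draws
# one regression plot per feature pair; the `axis` argument is ignored here).
def plot_dist_reg(n_rows, n_cols, axis, features, colors, titles):
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5))
    for ax, (x, y), color, title in zip(np.ravel(axes), features, colors, titles):
        sns.regplot(x=x, y=y, data=dataset, color=color, ax=ax)
        ax.set_title(title)
    plt.show()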

It can be concluded that tracks with higher energy tend to have a higher volume in decibels (loudness), and tracks with less energy tend to be acoustic.

6.2 Class visualization

Let’s visualize the class distribution.

plt.figure(1 , figsize = (15 , 5))
ax = sns.countplot(y = 'success', data = dataset, palette="Greens")
ax.set_title('Number of success (1) and non success (0) songs')
show_values_on_bars(ax, "h", 10)
plt.show()
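
show_values_on_bars is another helper from the repository that writes each bar’s value next to it. A minimal sketch, assuming the third argument is a spacing offset for the labels:

# Possible minimal version of the show_values_on_bars helper (assumption:
# `space` is the offset between the end of each bar and its label).
def show_values_on_bars(ax, orientation="v", space=0):
    for p in ax.patches:
        if orientation == "v":
            ax.text(p.get_x() + p.get_width() / 2, p.get_height() + space,
                    str(int(p.get_height())), ha="center")
        else:
            ax.text(p.get_width() + space, p.get_y() + p.get_height() / 2,
                    str(int(p.get_width())), va="center")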

There are more non-hit songs than hit songs in the dataset.

6.3 Hit songs

The next step is to analyze the songs considered as hits.

# Get only hit songs
hits_df = dataset[dataset['success'] == 1]

Which five artists have the most hit songs?

top_artists = hits_df['artist'].value_counts()[:5]
name = top_artists.index.tolist()
amount = top_artists.values.tolist()
plt.figure(1 , figsize = (15, 5))
ax = sns.barplot(x = name, y = amount, palette="Purples_d")
ax.set_title('Artists with more hit songs')
show_values_on_bars(ax, "v", 10)
plt.show()

What kind of music is the most successful: positive or negative?

valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

To exemplify what a positive or negative song sounds like, the song with the lowest valence (0.0349) in the dataset is Maia (Kamilo Sanclemente) and the song with the highest valence (0.9770) is Corona (Minutemen).

valence = hits_df['valence'].value_counts()
valence_value = valence.index.tolist()
amount = valence.values.tolist()
i, high, low = 0, 0, 0
for v in valence_value:
    if (float(v) >= 0.5):
        high += amount[i]
    else:
        low += amount[i]
    i += 1
print('Positive tracks: ', high)
print('Negative tracks: ', low)

output >>> Positive tracks: 704
Negative tracks: 547
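
For reference, the same counts can also be obtained more directly with a boolean comparison on the valence column (an equivalent shortcut to the loop above):

# Equivalent count using a boolean mask on valence
positive = (hits_df['valence'] >= 0.5).sum()
negative = (hits_df['valence'] < 0.5).sum()
print('Positive tracks: ', positive)
print('Negative tracks: ', negative)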

So, most hit songs are positive (happy, cheerful, euphoric).

7. Machine Learning Modeling and Evaluation

The dataset was split into training (70%) and test (30%) sets.

# Split features and class data and drop irrelevant columns
X = dataset.drop(['success', 'artist', 'track_name'], axis=1).values
y = dataset[['success']].values
# Split train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

To predict whether a song will be a hit or not, we will use three different models (Random Forest, Logistic Regression and SVM) and select the best one based on the accuracy result.

Accuracy is the proportion of predictions the model gets right, i.e., the number of correct predictions divided by the total number of predictions.

7.1 Random Forest

The random forest model combines one hundred decision trees, each trained on a different random subset of the training data and considering a different random subset of the song features. To decide whether a song is a hit or a non-hit, each tree makes its own prediction, the predictions are counted as votes, and the class with the most votes becomes the final prediction [3].

# Create the classifier object
rf_model = RandomForestClassifier(n_estimators = 100)
# Train
rf_model.fit(X_train, y_train.ravel())
# Predict
y_pred = rf_model.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, y_pred))

output >>> Accuracy: 0.7315598548972189

7.2 Logistic Regression

The logistic regression model separates the data into two categories with a linear decision boundary: it estimates the probability of a binary event using the logit function, assigning a weight to each song feature, and then uses those weights to predict whether a song belongs to the “hit” or “non-hit” category [4].

# Create the classifier object
lg_model = LogisticRegression()
# Train
lg_model.fit(X_train, y_train.ravel())
# Predict
y_pred = lg_model.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, y_pred))

output >>> Accuracy: 0.6952841596130592

7.3 SVM

The SVM model selects the best “hyperplane” (i.e., the hyperplane with the maximum possible margin to the support vectors) that separates the data into two categories [5].

# Create the classifier object
svm_model = svm.SVC(kernel='linear')
# Train
svm_model.fit(X_train, y_train.ravel())
# Predict
y_pred = svm_model.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, y_pred))

output >>> Accuracy: 0.6977025392986699

7.4 Evaluation

The accuracies of the three models are:

Random Forest: 0.731
Logistic Regression: 0.695
SVM: 0.697

8. Result

As a result, the Random Forest model will be applied to predict the songs from “Novidades da semana” on Spotify.

# Drop irrelevant columns
df_test = test.drop(['artist', 'track_name'], axis=1).values
# Predict
test_predict = rf_model.predict(df_test)
# Get only predict hit songs
hits_predict = (test_predict == 1).sum()
print(hits_predict, "out of", len(test_predict), "were predicted as hits")

output >>> 13 out of 60 were predicted as hits

Which songs in “Novidades da semana” can become a hit? Let’s see the result.

df = pd.DataFrame({'Song': test['track_name'], 'Artist': test['artist'], 'Predict': test_predict})
df.sort_values(by=['Predict'], inplace=True, ascending=False)
df
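
To look only at the tracks predicted as hits, the Predict column can be filtered directly (a small convenience step on top of the DataFrame above):

# Keep only the songs predicted as hits
df[df['Predict'] == 1]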

9. Conclusion

Analyzing the Spotify data from August 2nd, 2020, we can conclude:

- The five artists with the most hit songs are Taylor Swift, KESI, Boza, Apache 207 and Bad Bunny.

- Most hit songs are positive (happy, cheerful, euphoric).

- The model with the best accuracy to predict what new songs will be hits is Random Forest.

- The songs from “Novidades da semana” that are likely to become hits, based on the characteristics of the hits in “Top 50 by country” (all countries), are Clap From Road To Fast 9 Mixtape (Don Toliver), Cuidado Que Eu Te Supero (Yasmin Santos), my future (Billie Eilish), My Oasis feat. Burna Boy (Sam Smith), I Should Probably Go To Bed (Dan + Shay), Who’s Laughing Now (Ava Max), WHAT YOU GONNA DO??? (Bastille), TOMA (Luísa Sonza), Lei Áurea (Borges), The Usual (Sam Fischer), By Any Means (Jorja Smith), Move Ya Hips feat. Nicki Minaj & MadeinTYO (A$AP Ferg) and Hawái (Maluma).

See the complete code HERE.

10. References
