Predicting Mandarin Song Popularity on Spotify — Part 2

Ryan Lu
Nov 22, 2022


Now that data exploration and feature engineering are finished, we can finally start training models on our dataset. If you haven’t checked out Part 1 of this project, here is the link.

Model Selection

Selecting a suitable machine-learning model has always been a complex topic. First, we need to understand what type of model we need. This project uses a labeled dataset to predict outcomes, which by definition is supervised machine learning. It aims to forecast track_popularity by estimating the relationship between the song attributes and the popularity score, so we will use regression models, which are best suited to predicting numeric results.

Machine learning models can be roughly divided into statistical models and neural networks, and I used both in this project to get a better sense of direction. When it comes to forecasting or classification, ensemble learning models usually produce better results than a single contributing model; I found this article helpful for understanding what ensemble learning is. I selected a few single contributing models and some ensemble learning models to make a simple comparison.

Statistical Single Contributing Models:

  • Automatic Relevance Determination Regression (ARD)
  • Bayesian Ridge Regression
  • Epsilon-Support Vector Regression (SVR)

Statistical Ensemble Models:

  • Gradient Boosting for Regression
  • AdaBoost Regressor
  • Histogram-based Gradient Boosting Regression Tree
  • Extra Trees Regression

Neural Networks:

  • Multi-layer Perceptron Regressor (MLPRegressor)

Model Training

Now that model selection is out of the way, I wrote a function to train each model and plot its results.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import ARDRegression, BayesianRidge
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              AdaBoostRegressor, HistGradientBoostingRegressor)
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score


def evaluate(X_train, X_test, y_train, y_test):
    # Names of models
    model_name_list = ['ARDRegression', 'BayesianRidge', 'Extra Trees', 'SVM',
                       'Gradient Boosted', 'AdaBoostRegressor',
                       'HistGradientBoostingRegressor', 'MLPRegressor', 'Baseline']

    # Instantiate the models
    # (n_iter was renamed to max_iter in newer scikit-learn releases)
    model1 = ARDRegression(compute_score=True, n_iter=50)
    model2 = BayesianRidge(compute_score=True, n_iter=50)
    model3 = ExtraTreesRegressor(n_estimators=50)
    model4 = SVR(kernel='rbf', degree=3, C=1.0, gamma='auto')
    model5 = GradientBoostingRegressor(n_estimators=20)
    model6 = AdaBoostRegressor(random_state=0, n_estimators=50)
    model7 = HistGradientBoostingRegressor()
    model8 = MLPRegressor(random_state=1, max_iter=100)

    # Dataframe for results
    results = pd.DataFrame(columns=['mae', 'rmse', 'r_square'],
                           index=model_name_list)

    # Train and predict with each model
    for i, model in enumerate([model1, model2, model3, model4,
                               model5, model6, model7, model8]):
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)

        # Metrics: mean absolute error, root mean squared error, adjusted R squared
        mae = np.mean(abs(predictions - y_test))
        rmse = np.sqrt(np.mean((predictions - y_test) ** 2))
        adj_r_square = (1 - (1 - r2_score(y_test, predictions))
                        * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1))

        # Insert results into the dataframe
        model_name = model_name_list[i]
        results.loc[model_name, :] = [mae, rmse, adj_r_square]

        # Plot the first 100 predictions against the actual values
        plt.figure(figsize=(10, 4))
        plt.plot(predictions[:100], label='predicted')
        plt.plot(y_test.reset_index(drop=True)[:100], label='actual')
        plt.title(model_name)
        plt.legend()
        plt.show()

    # Median-value baseline metrics
    baseline = np.full(len(y_test), np.median(y_train))
    baseline_mae = np.mean(abs(baseline - y_test))
    baseline_rmse = np.sqrt(np.mean((baseline - y_test) ** 2))
    baseline_r_square = (1 - (1 - r2_score(y_test, baseline))
                         * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1))

    results.loc['Baseline', :] = [baseline_mae, baseline_rmse, baseline_r_square]

    return results
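
To run it, here is a minimal usage sketch, assuming the feature matrix X and target y were built from the dataframe prepared in Part 1 (X, y, test_size, and random_state here are placeholders for illustration, not the exact values from my notebook):

from sklearn.model_selection import train_test_split

# Split the engineered features and target from Part 1 into train / test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results = evaluate(X_train, X_test, y_train, y_test)
print(results)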

Results

The following plots show the predicted results versus the actual results for each model.

[Plots: predicted vs. actual track_popularity for ARDRegression, BayesianRidge, ExtraTreesRegressor, SVR, GradientBoostingRegressor, AdaBoostRegressor, HistGradientBoostingRegressor, and MLPRegressor]

Let’s print out the MAE, RMSE, and adjusted R² of each model.

As you can see, the results are horrible! All models have an MAE greater than 8.2 and an RMSE greater than 10.1. It is also rare to see predictions with a negative R²: a negative R² means the chosen model does not follow the data trend and fits worse than a horizontal line. Based on the plots, only Extra Trees, Histogram-based Gradient Boosting Regression Tree, and MLPRegressor seem to follow the data trend at all.
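
To make the negative R² point concrete: R² compares a model’s squared error against that of a constant prediction at the mean of y_test, so any model that does worse than that flat line goes negative. A tiny illustration with made-up numbers (not from this dataset):

from sklearn.metrics import r2_score

y_true = [10, 20, 30, 40]  # mean is 25

print(r2_score(y_true, [25, 25, 25, 25]))  # 0.0, same as always predicting the mean
print(r2_score(y_true, [12, 22, 32, 42]))  # 0.968, follows the trend
print(r2_score(y_true, [40, 10, 40, 10]))  # -3.0, worse than the flat line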

Why is the result so bad?

So why exactly are the results so bad? I concluded that the two main reasons these models failed are low variable correlation and imbalanced data.

  1. Low variable correlation

I graphed a heatmap in Part 1 of this project to see the correlations between variables. The variable most correlated with track_popularity is liveness, at -0.118. Objectively speaking, even liveness is only weakly correlated, if at all, with track_popularity. Although variable correlations don’t fully determine a regression model’s predictive power, the fact that every feature used in this project is poorly correlated with the target undoubtedly plays a significant role in the poor prediction results.
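
For reference, the per-feature correlations behind that heatmap can be pulled out in one line; a sketch, assuming the cleaned dataframe from Part 1 is named df:

# Pearson correlation of every feature with the target, sorted by absolute strength
corr = df.corr()['track_popularity'].drop('track_popularity')
print(corr.sort_values(key=abs, ascending=False))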

  2. Imbalanced data

I also noticed that the dataset that I used is very imbalanced. Let’s recap the track_popularity distribution graph and details.

From this graph, we can see that the majority of our dataset has a track_popularity between 7 and 57, with a minimum of 10 and a maximum of 70. The distribution is positively skewed, and this unequal distribution has affected our prediction results. I looked at the datasets used by other Spotify popularity prediction projects on Kaggle, and they are much more evenly distributed, with a minimum of 0 and a maximum of 100.
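
That recap can be reproduced with a quick describe() and histogram; a sketch under the same df assumption as above:

import matplotlib.pyplot as plt

# Summary statistics and distribution of the target
print(df['track_popularity'].describe())

plt.figure(figsize=(8, 4))
plt.hist(df['track_popularity'], bins=30)
plt.xlabel('track_popularity')
plt.ylabel('number of tracks')
plt.show()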

I also found something interesting. Instead of predicting the numerical track_popularity value with regression models, some people divide track_popularity into bands such as unpopular (score 0 to 25), less popular (25 to 50), more popular (50 to 75), and very popular (75 to 100). This turns the regression problem into a classification problem. I would have used this approach if it weren’t for the unevenly distributed data.
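
For what it is worth, that banding is a one-liner with pd.cut; a sketch of how it could look (same hypothetical df as above, and the popularity_class column name is mine):

import pandas as pd

# Bucket the 0-100 popularity score into four labeled classes
bins = [0, 25, 50, 75, 100]
labels = ['unpopular', 'less popular', 'more popular', 'very popular']
df['popularity_class'] = pd.cut(df['track_popularity'], bins=bins, labels=labels, include_lowest=True)
print(df['popularity_class'].value_counts())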

What can I do to improve the result?

  1. Data Enrichment

One way to improve the result is data enrichment, which means collecting data that is more valuable (more highly correlated with track_popularity). The Spotify Web API exposes various data about a track. For example, it is possible to get the artist’s number of followers on Spotify and the artist’s popularity. Even though I did not run a correlation analysis with these variables, it is not hard to guess that an artist’s popularity plays a significant role in a track’s popularity.

Another example is the available_markets variable, a list of the countries in which the track can be played. Even if a track has attributes that predict high popularity, a limited set of available markets can leave it with a lower track_popularity than expected. Variables like this should be taken into consideration when making a prediction.
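
As a sketch of what that enrichment could look like, here is one way to pull those fields with the spotipy client (spotipy and the enrich_track helper are my assumptions for illustration, not something used in Part 1; any Spotify Web API wrapper would do):

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Client credentials flow; SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET must be set
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

def enrich_track(track_id):
    track = sp.track(track_id)
    artist = sp.artist(track['artists'][0]['id'])
    return {
        'artist_followers': artist['followers']['total'],
        'artist_popularity': artist['popularity'],
        'n_available_markets': len(track.get('available_markets', [])),
    }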

  2. Oversampling

Although oversampling is uncommon in regression problems, this dataset is too small to cover the extreme cases. We can use oversampling to get a more evenly distributed track_popularity: it duplicates existing examples, or creates new synthetic ones, in the under-represented ranges. For this project, that means oversampling tracks with a track_popularity higher than 50 or lower than 10.
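
A naive version of that oversampling simply duplicates the rare rows; a sketch, where the thresholds come from the paragraph above and the 3x duplication factor is an arbitrary choice for illustration:

import pandas as pd

# Rows in the under-represented tails of the popularity distribution
rare = df[(df['track_popularity'] > 50) | (df['track_popularity'] < 10)]

# Duplicate the rare rows a few times, then shuffle the combined dataframe
df_oversampled = pd.concat([df] + [rare] * 3, ignore_index=True)
df_oversampled = df_oversampled.sample(frac=1, random_state=42).reset_index(drop=True)

print(df_oversampled['track_popularity'].describe())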

Conclusion

Based on the results of these models, it is fair to say that it is hard to predict a track’s popularity based solely on audio features. Although the results are poor, there are some takeaways on viable methods to improve accuracy.

Thanks for taking the time to read this article. If you find it helpful, please leave a like and comment!

Here is the Python Notebook for this project.
