Predicting the Next Hit Song using Random Forest Classification with Python — Part 1

Punit Khiyara
5 min read · May 20, 2020


Photo provided by icrex.fi

Have you ever thought about what makes up a hit song?

Could it be its lyrics? Its social media response? Or is it as simple as its duration?

In an attempt to answer these questions, I will develop a Twitter bot in Python that uses a random forest classification model to predict a song’s popularity based on social media sentiment, lyric analysis, streaming data, and past Billboard chart data. I will collect social media data from Twitter through GetOldTweets3, lyric data from Genius using LyricsGenius, streaming data from Spotify through Spotipy, and past chart data from Billboard through Billboard.py.

Additionally, I will use Sean Miller’s Billboard Hot Weekly Charts Dataset to obtain past charting and streaming data for model training purposes. To validate my model at the end of the project, I will define a song’s popularity as its rank classification on the Billboard Hot 100 Chart. This project can be divided into six steps:

  1. Data Collection & Manipulation
  2. Lyric Retrieval & Sentiment Analysis
  3. Lyric LDA Topic Modeling
  4. Twitter Data Collection & Sentiment Analysis
  5. Random Forest Classification Model
  6. Twitter Bot Development & Deployment (Part 2)

Data Collection & Manipulation

The Billboard Hot Weekly Charts Dataset consists of two files, Hot Stuff.csv and Hot 100 Audio Features.csv, which contain Billboard Hot 100 Chart data and Spotify audio features for each song, respectively. I will merge these into a single dataset in which each song appears only once. To perform the necessary data collection and manipulation steps, follow the code below:
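
A minimal sketch of this step, assuming the two files share key columns named SongID, Song, and Performer and expose WeekID and Week Position for each chart week (adjust these to your copy of the dataset); the columns dropped at the end are illustrative:

```python
import pandas as pd

# Read the two datasets (file names from the Billboard Hot Weekly Charts Dataset)
hot_stuff = pd.read_csv('Hot Stuff.csv')
audio_features = pd.read_csv('Hot 100 Audio Features.csv')

# Merge chart entries with their Spotify audio features
songset = pd.merge(hot_stuff, audio_features,
                   on=['SongID', 'Song', 'Performer'], how='inner')

# Keep only chart weeks between January 1st, 2015 and December 28th, 2019
songset['WeekID'] = pd.to_datetime(songset['WeekID'])
songset = songset[(songset['WeekID'] >= '2015-01-01') &
                  (songset['WeekID'] <= '2019-12-28')]

# Record each song's debut position within this window, then keep one row per song
songset = songset.sort_values('WeekID')
songset['Debut Position'] = songset.groupby('SongID')['Week Position'].transform('first')
songset = songset.drop_duplicates(subset='SongID', keep='last')

# Drop columns that are irrelevant to the model (illustrative list)
songset = songset.drop(columns=['url', 'Instance', 'Previous Week Position'], errors='ignore')
```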

The code above reads in the two datasets as Pandas dataframes, merges them, and slices the merged dataset to include only songs that were on the Billboard Hot 100 Chart between January 1st, 2015 and December 28th, 2019. I decided to reduce the scope of the dataset to approximately five years’ worth of data in order to keep the model’s predictions relevant to current music tastes and trends. The code then drops some irrelevant columns before creating a new column recording each song’s debut position within the dataset’s time window.

Lyric Retrieval & Sentiment Analysis

To collect lyrical data for each song in my dataset, I will use the LyricsGenius library that was built to interact with the Genius API as follows:
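
A sketch of the retrieval loop, assuming a Genius client access token in place of 'GENIUS_TOKEN' and song/artist columns named Song and Performer:

```python
import lyricsgenius

# 'GENIUS_TOKEN' is a placeholder for your Genius API client access token
genius = lyricsgenius.Genius('GENIUS_TOKEN', skip_non_songs=True,
                             remove_section_headers=True, verbose=False)

songset['Lyrics'] = ''
for idx, row in songset.iterrows():
    try:
        result = genius.search_song(row['Song'], row['Performer'])
        if result is not None:
            songset.at[idx, 'Lyrics'] = result.lyrics
    except Exception:
        # Leave the lyrics blank if the lookup fails; such rows are dropped later
        continue
```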

The code above creates a new column for lyrics within the dataset and populates it iteratively. After the retrieval process is completed, songs with no lyric data are removed from the dataset, and the dataset as a whole is exported as a CSV file for further manual cleaning in Excel or OpenRefine. The following code handles that filtering and export step:
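
A sketch of that step (the output file name is arbitrary):

```python
# Drop songs for which no lyrics could be retrieved
songset = songset[songset['Lyrics'].str.strip() != '']

# Export for manual cleaning in Excel or OpenRefine, then read the cleaned file back in
songset.to_csv('songset_with_lyrics.csv', index=False)
songset = pd.read_csv('songset_with_lyrics.csv')
```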

Finally, to analyze the sentiment of the lyrics, I used the code below to add three columns for the percentage of positive, neutral, and negative sentiment, respectively, and iteratively analyzed each song’s lyrics to populate them. To perform these steps, I used the NLTK and langdetect libraries as shown below:
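
A sketch of the sentiment pass using NLTK’s VADER analyzer, skipping non-English lyrics via langdetect; the column names LyricPositive, LyricNeutral, and LyricNegative are placeholders:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from langdetect import detect

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

songset['LyricPositive'] = 0.0
songset['LyricNeutral'] = 0.0
songset['LyricNegative'] = 0.0

for idx, row in songset.iterrows():
    lyrics = row['Lyrics']
    try:
        # VADER is built for English, so skip lyrics detected as another language
        if detect(lyrics) != 'en':
            continue
    except Exception:
        continue
    scores = sia.polarity_scores(lyrics)
    songset.at[idx, 'LyricPositive'] = scores['pos']
    songset.at[idx, 'LyricNeutral'] = scores['neu']
    songset.at[idx, 'LyricNegative'] = scores['neg']
```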

Lyrics LDA TF-IDF Topic Modeling

For this step, I used the gensim, NLTK, string, and collections libraries to perform Latent Dirichlet Allocation (LDA) topic modeling on the lyrical data for each song in the dataset, building a topic-per-document (song) model and a words-per-topic model. The weight assigned to each topic for each song will be added to the dataset as a separate column per topic. For this project, I will train my LDA topic model on a TF-IDF corpus model of the dataset’s lyric data as follows:
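
A sketch of the topic modeling step, assuming 10 topics (the actual count is a modelling choice) and reusing the Lyrics column populated earlier:

```python
import string
import nltk
from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Lowercase, strip punctuation, tokenize, and drop stopwords and very short tokens
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    return [tok for tok in word_tokenize(text) if tok not in stop_words and len(tok) > 2]

docs = [preprocess(lyrics) for lyrics in songset['Lyrics']]

# Bag-of-words corpus, re-weighted with TF-IDF before fitting LDA
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

num_topics = 10  # assumed topic count
lda_model = models.LdaModel(corpus_tfidf, num_topics=num_topics,
                            id2word=dictionary, passes=10)

# Write each topic's weight for each song into its own column
for t in range(num_topics):
    songset[f'Topic{t}'] = 0.0
for pos, bow in enumerate(corpus_tfidf):
    for topic_id, weight in lda_model.get_document_topics(bow, minimum_probability=0.0):
        songset.iloc[pos, songset.columns.get_loc(f'Topic{topic_id}')] = weight
```

Training LDA on the TF-IDF-weighted corpus down-weights words that appear in almost every song, so the resulting topics lean on more distinctive vocabulary.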

Twitter Data Collection & Sentiment Analysis

I will first need to collect the text data of tweets from Twitter using the GetOldTweets3 library. To implement this library, I will use each song’s name and artist name(s) as the search query for collecting tweets, setting the collection range to the two weeks before its Billboard charting date and limiting the query to 100 tweets per song. You may increase this limit if you plan to collect data over a longer period of time; however, the script may then have to sleep for more than 500 seconds for every 15 tweets collected to prevent timeout errors and avoid overloading the library with too many consecutive HTTP requests. After each tweet is collected, its text is cleaned to remove usernames and URLs and inserted into a new Twitter dataframe.
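
A sketch of the collection loop, assuming the charting-date column is WeekID; the back-off handling here is simplified relative to the note above:

```python
import re
import time
import datetime
import pandas as pd
import GetOldTweets3 as got

tweet_rows = []
for _, row in songset.iterrows():
    chart_date = pd.to_datetime(row['WeekID'])
    since = (chart_date - datetime.timedelta(days=14)).strftime('%Y-%m-%d')
    until = chart_date.strftime('%Y-%m-%d')

    criteria = (got.manager.TweetCriteria()
                .setQuerySearch(f"{row['Song']} {row['Performer']}")
                .setSince(since)
                .setUntil(until)
                .setMaxTweets(100))
    try:
        tweets = got.manager.TweetManager.getTweets(criteria)
    except Exception:
        # Back off if Twitter starts refusing requests, then move on to the next song
        time.sleep(500)
        continue

    for tweet in tweets:
        # Remove usernames and URLs from the tweet text before storing it
        text = re.sub(r'@\w+|https?://\S+', '', tweet.text)
        tweet_rows.append({'SongID': row['SongID'], 'TweetText': text})

twitter_df = pd.DataFrame(tweet_rows)
```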

Moving forward, I will perform sentiment analysis on the tweets collected for each song in the Twitter dataframe, first removing the song or artist name wherever it is mentioned. I will then average the sentiment analysis results into three columns, TweetPositive, TweetNeutral, and TweetNegative, for the percentage of positive, neutral, and negative sentiment of each song’s tweets, respectively. Once complete, I will merge these results into the Songset dataset and remove songs that do not have tweets associated with them.
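
A sketch of the aggregation and merge, reusing the VADER analyzer from the lyric step:

```python
import re
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

def strip_names(text, song, artist):
    # Remove the song and artist names so they don't bias the sentiment score
    for term in (song, artist):
        text = re.sub(re.escape(str(term)), '', text, flags=re.IGNORECASE)
    return text

records = []
for song_id, group in twitter_df.groupby('SongID'):
    song_row = songset.loc[songset['SongID'] == song_id].iloc[0]
    scores = [sia.polarity_scores(strip_names(t, song_row['Song'], song_row['Performer']))
              for t in group['TweetText']]
    records.append({
        'SongID': song_id,
        'TweetPositive': sum(s['pos'] for s in scores) / len(scores),
        'TweetNeutral': sum(s['neu'] for s in scores) / len(scores),
        'TweetNegative': sum(s['neg'] for s in scores) / len(scores),
    })

tweet_sentiment = pd.DataFrame(records)

# An inner merge keeps only songs that have tweets associated with them
songset = songset.merge(tweet_sentiment, on='SongID', how='inner')
```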

Random Forest Classification Model

To develop the random forest classification model, I will use the PySpark library. First, I will copy the Songset dataset into a new dataframe called Featureset and remove any fields irrelevant to the model. Secondly, I will create a label column called Classification consisting of 7 labels from 0 to 6 based on each song’s peak position. These labels signify whether the song peaked at #1, within the top 5, the top 15, the top 30, the top 50, or the top 75, with each class excluding songs already captured by a better one; songs that peaked below the top 75 make up the final class.
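
A sketch of the labelling step, assuming the peak-position column is named Peak Position; the columns dropped here are illustrative, and any remaining non-numeric columns should be removed as well:

```python
# Copy the Songset dataset and remove fields the model should not see
featureset = songset.copy()
featureset = featureset.drop(columns=['Song', 'Performer', 'Lyrics', 'WeekID'], errors='ignore')

def rank_class(peak):
    # Map a peak chart position to one of 7 classes (0 = ranked #1, 6 = peaked below the top 75)
    if peak == 1:
        return 0
    elif peak <= 5:
        return 1
    elif peak <= 15:
        return 2
    elif peak <= 30:
        return 3
    elif peak <= 50:
        return 4
    elif peak <= 75:
        return 5
    return 6

featureset['Classification'] = featureset['Peak Position'].apply(rank_class)
```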

Thirdly, I will convert the Pandas dataframe into a Spark dataframe and map each row into a Spark RDD as a LabeledPoint consisting of the song’s label and a dense vector of its features. This RDD will then be split 80/20 into the classification model’s training and testing datasets, respectively. Finally, I will train the classifier on the training dataset and use it to predict labels for the testing dataset as follows:
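
A sketch of the training and evaluation steps using PySpark’s RDD-based MLlib API; the tree count and depth are assumed values:

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

spark = SparkSession.builder.appName('HitSongClassifier').getOrCreate()

# Convert the Pandas dataframe to a Spark dataframe, then map each row to a LabeledPoint
spark_df = spark.createDataFrame(featureset)
feature_cols = [c for c in featureset.columns if c != 'Classification']
data = spark_df.rdd.map(lambda row: LabeledPoint(
    float(row['Classification']),
    Vectors.dense([float(row[c]) for c in feature_cols])))

# 80/20 split into training and testing datasets
training_data, testing_data = data.randomSplit([0.8, 0.2], seed=42)

# Train the random forest classifier (numTrees and maxDepth are assumed values)
model = RandomForest.trainClassifier(training_data, numClasses=7,
                                     categoricalFeaturesInfo={}, numTrees=100,
                                     featureSubsetStrategy='auto', impurity='gini',
                                     maxDepth=10, maxBins=32)

# Evaluate accuracy on the held-out testing data
predictions = model.predict(testing_data.map(lambda lp: lp.features))
labels_and_preds = testing_data.map(lambda lp: lp.label).zip(predictions)
accuracy = labels_and_preds.filter(lambda lp: lp[0] == lp[1]).count() / float(testing_data.count())
print(f'Model accuracy: {accuracy:.3%}')
```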

Model accuracy: 67.146%

While the model’s prediction accuracy is not very high, it is good enough to be used to predict a song’s future Billboard Hot 100 Chart rank class. As shown by this metric, the model predicted the correct label for 67.1% of the songs in the testing dataset based on their features.

Area under Precision/Recall (PR) curve: 1
Area under Receiver Operating Characteristic (ROC) curve: 0.990

Additional performance metrics give a better idea of the model’s prediction capability: the area under the PR curve for this model is 1, while its area under the ROC curve is 0.990.
