Continuing our last work (a graphical look into the chart toppers of 2008 and 2018), we thought we would foray into the world of NLP and text classification by trying to predict a song’s genre from its lyrics. We had already made web scrapers to get songs’ information from Billboard and their lyrics from AZLyrics.com. We thought the Billboard year-end genre charts would make a great corpus of songs already classified by genre. We combined our scrapers to fetch songs and their lyrics from a year’s top rap (and R&B), pop, country and rock charts. Running this from 2013 to 2018 gave us over 2000 songs, a large enough dataset to try some classification on.
Obviously, there were some odds and ends, and the data did not arrive in the exact format we wanted. We wrote regular expressions in the scraper to remove artist names and the labels for the parts of a song (like chorus, verse, etc.); the full scraper and dataset are up on GitHub. Then, using some tools from NLTK and spaCy, we further processed the lyrics to remove stop words (words like ‘I’, ‘my’, ‘the’, ‘a’, ‘an’; the full list is in the Jupyter Notebook) and to lemmatize the remaining words (so that, for example, ‘trees’ and ‘tree’ would be considered the same word). Then we ran into the problem of removing duplicate rows. Some songs charted twice in the same year under different genres, and some songs showed remarkable staying power and charted in multiple years. Our approach to the former was to remove the pop instance (if a song charted under 2 different genres, one of the genres was always pop). For the latter, because we didn’t consider the year to be an important factor for predicting genre, we just kept one of the instances if a song charted in multiple years under the same genre. There were some more minor manipulations, but if you’re interested in the details, feel free to look at our notebook.
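To make the cleaning and de-duplication rules concrete, here is a minimal sketch in plain Python. It is not the code from the notebook: the bracketed section-marker format, the tiny stop-word set and the (title, artist, genre, year) row layout are all assumptions made for illustration, and lemmatization is left out because it needs NLTK or spaCy language data.

```python
import re

# Assumption: section labels look like "[Chorus]"; the real lyrics may differ.
SECTION_MARKER = re.compile(r"\[[^\]]*\]")
# Tiny illustrative subset, not the full stop-word list from the notebook.
STOP_WORDS = {"i", "my", "the", "a", "an"}


def clean_lyrics(text):
    """Strip section markers, lowercase, tokenize and drop stop words."""
    text = SECTION_MARKER.sub(" ", text)
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def deduplicate(rows):
    """Collapse repeated chart entries into one row per song and genre.

    rows: iterable of (title, artist, genre, year) tuples (hypothetical layout).
    """
    genres_by_song = {}
    for title, artist, genre, _year in rows:
        # Ignoring the year merges entries that charted in several years.
        genres_by_song.setdefault((title, artist), set()).add(genre)

    result = []
    for (title, artist), genres in genres_by_song.items():
        # A song that charted under two genres loses its pop instance.
        if len(genres) > 1:
            genres = genres - {"pop"}
        for genre in sorted(genres):
            result.append((title, artist, genre))
    return result


rows = [
    ("Song A", "Artist 1", "pop", 2017),
    ("Song A", "Artist 1", "country", 2017),  # cross-genre duplicate
    ("Song B", "Artist 2", "rock", 2015),
    ("Song B", "Artist 2", "rock", 2016),     # multi-year duplicate
    ("Song C", "Artist 3", "pop", 2018),
]
songs = deduplicate(rows)
print(songs)
print(clean_lyrics("[Chorus] I love the trees"))
```

Keying on (title, artist) rather than title alone is a design choice of this sketch, so that two different songs sharing a title are not merged.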
So now we had the songs in the format we wanted: each song was represented by a list of words and labelled as one of rock, rap, pop or country. However, we discovered that there was an unequal distribution of genres, as the pie chart shows:
This was because some Billboard year-end pop charts had only 50 songs, and in removing duplicates we lost another 20 pop songs (we’re going to run into problems later on because of this, so keep it in mind!). We also decided to group R&B and rap into a single genre on account of their lyrical similarities, with the added benefit of ending up with three evenly distributed classes (country, rock and R&B/Hip-Hop), plus about half as many pop songs.
Our next challenge was converting a song’s lyrics into a numerical vector that can be used in a classification model. We used the bag-of-words method (more on that here) with TF-IDF weighting to achieve this. In brief, the entire set of words used across all lyrics is the feature space, and for each word a song is given a TF-IDF score, where TF is the term frequency of the word in that particular song and IDF is the (logarithmically scaled) inverse of the fraction of songs in the dataset that contain the word. In this manner, the dataset was represented as an N×W matrix, where N is the number of songs and W is the number of words in the total word set.
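As a worked illustration of that definition, here is a minimal sketch that builds the N×W matrix by hand. It assumes the plain textbook formula, TF = count / song length and IDF = log(N / number of songs containing the word); the three toy "songs" are made up, and the weighting actually used in the notebook may differ.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """docs: list of token lists. Returns (vocabulary, N x W list of lists)."""
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: in how many songs does each word appear?
    df = Counter(w for doc in docs for w in set(doc))
    idf = {w: math.log(n_docs / df[w]) for w in vocab}

    matrix = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Words absent from a song get a score of 0; empty songs get all zeros.
        row = [(counts[w] / total) * idf[w] if total else 0.0 for w in vocab]
        matrix.append(row)
    return vocab, matrix


# Made-up token lists standing in for cleaned lyrics.
songs = [
    ["truck", "road", "beer", "road"],
    ["money", "club", "money"],
    ["road", "night", "guitar"],
]
vocab, X = tfidf_matrix(songs)
print(vocab)
for row in X:
    print([round(v, 3) for v in row])
```

Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its numbers will not match this sketch exactly.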
Choice of Model:
The most natural model to us was Naive Bayes. Its central assumption is that all the features are independent of one another given the class, and it uses Bayes’ theorem to estimate the probability that a song’s lyrics belong to a particular class, combining each class’s prior probability with the likelihood of the song’s words under that class. A more detailed explanation is available on its Wikipedia entry. Naive Bayes is commonly used in problems like e-mail spam classification, which is somewhat similar to our objective here. On prediction, we obtained a test accuracy of 0.66 and a train accuracy of 0.84.
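For readers who want to see the mechanics, here is a minimal from-scratch sketch of a Naive Bayes text classifier. It assumes the multinomial variant on raw word counts with add-one smoothing; the post does not say which variant was used, and the toy songs and labels below are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha smoothing."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {}
    for label, n in class_docs.items():
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            # Prior: fraction of training songs in this class.
            "log_prior": math.log(n / len(docs)),
            # Likelihood of each word given the class, smoothed.
            "log_lik": {
                w: math.log((word_counts[label][w] + alpha) / denom)
                for w in vocab
            },
        }
    return model


def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for w in doc:
            if w in params["log_lik"]:
                score += params["log_lik"][w]
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy data, not taken from the real dataset.
docs = [
    ["truck", "beer", "road"],
    ["beer", "whiskey", "truck"],
    ["money", "club", "chain"],
    ["club", "money", "money"],
]
labels = ["country", "country", "rap", "rap"]
model = train_nb(docs, labels)
print(predict_nb(model, ["truck", "road"]))
print(predict_nb(model, ["money", "club"]))
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long lyrics, and the "naive" independence assumption is what lets the per-word terms simply add up.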
The model did not do well at all initially, but we think this had to do with the extremely uneven distribution of genres. We decided to remove pop and try again (the other genres are almost equally distributed).
This time we achieved an accuracy of 77% in classifying our test data. Here’s the confusion matrix for a deeper look at how the model performed:
A closer look at the confusion matrix suggests that the model learned what we expected. Owing to the dissimilarity of the common words used in country and rap, country was almost never misclassified as rap: this happened only 1.4% of the time! And due to the similarity in their lyrics, rock was most often misclassified as country, with a probability of 0.259, as opposed to a mere 0.092 for rap. As the genre word clouds in our previous blog showed, songs in the rap genre had the most distinctive words, which led to the highest prediction precision, 0.918.
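The two kinds of numbers quoted here come from reading the matrix in two directions, which a small sketch can make explicit. It assumes rows are true genres and columns are predicted genres, and that the misclassification probabilities are row-normalized; the counts below are invented for illustration and are not the results reported in this post.

```python
def row_normalize(cm):
    """Divide each row by its total: entry [i][j] becomes P(predicted j | true i)."""
    return [[count / sum(row) for count in row] for row in cm]


def precision(cm, j):
    """Fraction of songs predicted as class j that truly belong to class j."""
    column_total = sum(row[j] for row in cm)
    return cm[j][j] / column_total


labels = ["country", "rock", "rap"]
# Hypothetical counts: rows = true genre, columns = predicted genre.
cm = [
    [80, 18, 2],
    [25, 65, 10],
    [3, 7, 90],
]

rates = row_normalize(cm)
rock, country, rap = labels.index("rock"), labels.index("country"), labels.index("rap")
print("P(predicted country | true rock):", rates[rock][country])
print("P(predicted rap | true rock):", rates[rock][rap])
print("precision for rap:", round(precision(cm, rap), 3))
```

Misclassification rates are computed along a row (what happens to songs of one true genre), while precision is computed down a column (how trustworthy a given prediction is).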
We also tried a linear Support Vector Machine classifier (implemented with sklearn’s SGDClassifier), and it produced marginally better results, with a test accuracy of 80%. A simple explanation of SVM is that it attempts to find a hyperplane separating the points of each class by maximizing the margin between classes. For problems like ours that involve multiple classes, SGDClassifier combines several binary classifiers in a one-vs-all scheme: one classifier per genre, each separating that genre from the rest. Note: stochastic gradient descent updates the weights from one training example at a time instead of summing the gradient of the SVM objective over the entire dataset, so this kind of training scales well if the training set grows; it is kernel SVMs that become expensive on large datasets. More information on SVM is available here.
However, the model was slightly overfit, as the training accuracy was at 95% but the test accuracy was at 80%.
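To show what training a linear SVM with stochastic gradient descent looks like, here is a minimal from-scratch sketch. It does not use sklearn’s SGDClassifier: it is a simplified stand-in that trains one hinge-loss classifier per genre (one-vs-rest) with a fixed learning rate, and the learning rate, regularization strength and toy feature vectors are all assumptions chosen for illustration.

```python
import random


def train_ovr_svm(X, y, epochs=50, lr=0.1, reg=0.01, seed=0):
    """Train one linear hinge-loss classifier per class with per-sample SGD."""
    rng = random.Random(seed)
    classes = sorted(set(y))
    n_features = len(X[0])
    weights = {c: [0.0] * n_features for c in classes}
    bias = {c: 0.0 for c in classes}
    order = list(range(len(X)))

    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:  # each update looks at a single training example
            for c in classes:
                target = 1.0 if y[i] == c else -1.0
                w = weights[c]
                score = sum(wj * xj for wj, xj in zip(w, X[i])) + bias[c]
                margin = target * score
                for j in range(n_features):
                    # L2 shrinkage on every step.
                    w[j] -= lr * reg * w[j]
                    # Hinge-loss step only when the margin is violated.
                    if margin < 1:
                        w[j] += lr * target * X[i][j]
                if margin < 1:
                    bias[c] += lr * target
    return weights, bias


def predict_ovr(weights, bias, x):
    """Pick the class whose classifier gives the highest score."""
    scores = {
        c: sum(wj * xj for wj, xj in zip(w, x)) + bias[c]
        for c, w in weights.items()
    }
    return max(scores, key=scores.get)


# Invented 3-feature vectors standing in for TF-IDF rows.
X = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.1, 0.9],
]
y = ["country", "country", "rock", "rock", "rap", "rap"]
weights, bias = train_ovr_svm(X, y)
print(predict_ovr(weights, bias, [1.0, 0.0, 0.0]))
```

In this sketch the `reg` parameter controls how strongly the weights are shrunk; stronger regularization is one common lever to try when training accuracy sits well above test accuracy.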
So there we have it: we tried a couple of different approaches and ended up with a model that is 80% accurate on our test data. We believe performance can be improved with a larger dataset. Feel free to check out our code on GitHub and do let us know what you think!