Owen Carey
Jul 9, 2018
  1. “All You Need Is Love” — John Lennon?

BACKGROUND

The Beatles are the greatest band in history; in my mind that’s indisputable. During their eight years of recording between 1962 and 1970 they released 13 albums and a number of tracks issued as standalone singles. The catalogue created in that short period has sold more than that of any other group in history, but the group’s significance stems from more than the huge sales figures. Their music has inspired generation upon generation of musicians, songwriters, and producers, including me! There was The Beatles …and then there was everyone else.

The Beatles are one of my all-time favorite bands. I love their expansive musical style, which spanned rock, pop, folk, blues, psychedelia, and Indian classical music. They could create anything. And the lyrics were always something special. My favorite Beatles lyrics were written by John Lennon and Paul McCartney in the song “Nowhere Man”:

He’s a real nowhere man
Sitting in his nowhere land
Making all his nowhere plans for nobody

Doesn’t have a point of view
Knows not where he’s going to
Isn’t he a bit like you and me?

Nowhere man please listen
You don’t know what you’re missing
Nowhere man, The world is at your command

As a fun side project, I wanted to explore the lyrics of The Beatles using machine learning! I tried to build a classifier that could predict the Beatle songwriter(s) based solely on the lyrics!

DATA

Songwriter: John Lennon, Paul McCartney, George Harrison, or Multiple Beatles (I dropped the one song written solely by Ringo Starr). Songwriter credits were taken from Wikipedia.

Song Lyrics: All 206 released, non-instrumental Beatle songs. Lyrics were scraped from many different sources on the web.

First Five Songs.

DATA PREPARATION

Before being able to do any modeling using the song lyrics, the lyrics needed to be cleaned and prepared. A clean dataset will allow a model to learn meaningful features and not overfit on irrelevant noise.

These are the steps I followed to clean and prepare the lyrics:

  1. Removed line breaks in the song lyrics, shown as “\n”.
  2. Removed all irrelevant characters, i.e. any non-alphanumeric characters.
  3. Converted all characters to lowercase, in order to treat words such as “hello”, “Hello”, and “HELLO” the same.
  4. Tokenized the lyrics by splitting them into n-grams, grid searching over unigrams (single words), bigrams (groupings of two words), and trigrams (groupings of three words) when modeling.
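The cleaning steps above can be sketched in a few lines of Python (an illustrative helper, not the code actually used for the project):

```python
import re

def clean_lyrics(text):
    """Drop line breaks, strip non-alphanumeric characters,
    and lowercase everything."""
    text = text.replace("\n", " ")                   # 1. remove line breaks
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())  # 2 & 3. strip punctuation, lowercase
    return re.sub(r"\s+", " ", text).strip()         # collapse leftover whitespace

print(clean_lyrics("He's a real nowhere man,\nSitting in his nowhere land"))
# hes a real nowhere man sitting in his nowhere land
```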
First Five Songs w/ Cleaned Lyrics.

DATA REPRESENTATION

Machine Learning models take numerical values as input. Our dataset is a list of sentences, so in order for our algorithm to extract patterns from the data, we first need to find a way to represent it in a way that our algorithm can understand, i.e. as a list of numbers.

A natural way to represent text for computers is to encode each character individually as a number (ASCII for example). If we were to feed this simple representation into a classifier, it would have to learn the structure of words from scratch based only on our data, which is impossible for most datasets. We need to use a higher level approach.
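To make the character-level idea concrete, here is what such an encoding looks like in Python:

```python
# Encode each character of a word as its ASCII/Unicode code point.
encoded = [ord(c) for c in "love"]
print(encoded)  # [108, 111, 118, 101]
```

A classifier fed these raw numbers would have to rediscover what a “word” even is, which is why we move to word-level representations instead.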

For example, we can build a vocabulary of all the unique words in our dataset, and associate a unique index to each word in the vocabulary. Each sentence is then represented as a list that is as long as the number of distinct words in our vocabulary. At each index in this list, we mark how many times the given word appears in our sentence. This is called a Bag of Words (BoW) model, since it is a representation that completely ignores the order of words in our sentence.

Bag of Words (BoW) Dataframe.

WHOLE SONG CLASSIFICATION

In the first iteration of this project I tried to train on entire Beatles’ songs, instead of individual Beatles’ lyrics (coming later). This proved to be fairly difficult as there were only 206 total songs to train and validate on. Let’s take a look at the class distribution:

Class Distribution Of Entire Songs Written.

Before jumping into modeling, let’s use Latent Semantic Analysis (LSA) to reduce the dimensions of the above Bag of Words dataframe, which ends up with a lot of dimensions since there are a lot of different words to count. We can use LSA to reduce the sparse (meaning mostly filled with 0s) Bag of Words matrix down to two dimensions and plot them!
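A sketch of this reduction, assuming scikit-learn’s TruncatedSVD (a standard way to perform LSA on a sparse matrix; the documents here are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["all you need is love",
        "love love me do",
        "i want to hold your hand",
        "a hard days night"]

bow = CountVectorizer().fit_transform(docs)  # sparse BoW matrix
lsa = TruncatedSVD(n_components=2)           # keep only 2 dimensions for plotting
coords = lsa.fit_transform(bow)
print(coords.shape)  # (4, 2): one (x, y) pair per document
```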

Two Component LSA On BoW Dataframe.

We can see that the data isn’t very well separated between the classes; this is an indication that classifying these separate songwriter(s) based on lyrics alone may be difficult. Let’s retry after applying a Term Frequency-Inverse Document Frequency (TF-IDF) transformation to the Bag of Words dataframe. This weight is a statistical measure used to evaluate how important a word is to a particular Beatles song in the collection. The importance increases proportionally to the number of times a word appears in the individual song but is offset by the frequency of the word across all songs.
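The TF-IDF reweighting can be applied directly on top of the Bag of Words counts; a sketch assuming scikit-learn’s TfidfTransformer, with placeholder documents:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["yellow submarine yellow submarine",
        "we all live in a yellow submarine",
        "let it be let it be"]

bow = CountVectorizer().fit_transform(docs)    # raw counts
tfidf = TfidfTransformer().fit_transform(bow)  # counts reweighted by rarity
print(tfidf.shape)  # same shape as the BoW matrix, different values
```

Words like “yellow” that appear in several documents receive a lower weight than words unique to a single document.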

Two Component LSA on BoW + TFIDF Dataframe.

There’s slightly more separation between the classes but no clear patterns. Let’s try some machine learning classifiers to see how well we can classify the songwriter(s) using our BoWs.

For this project I decided to try three machine learning classifiers. First, logistic regression, because it is a classic baseline classifier and often works surprisingly well with Bag of Words data. Second, multinomial Naive Bayes, because it is an extremely fast and simple classification algorithm that is often suitable for very high-dimensional datasets. Finally, k-nearest neighbors, because this classifier typically performs poorly on high-dimensional datasets like this one, so I was curious to see how it’d do!

For each of the three classifiers I used a hyper-parameter grid search, with 10-fold cross validation on 20% of the data, to find the best hyper-parameter combinations. To calculate model performance, a holdout set consisting of another 20% of the data was used.
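A sketch of that setup, assuming scikit-learn’s Pipeline and GridSearchCV (the data below is a tiny placeholder, and cv is reduced to 3 folds so the sketch runs on toy-sized classes; the project used 10):

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Placeholder lyrics/labels; the real dataset has 206 songs.
lyrics = ["love love me do", "all you need is love", "here comes the sun",
          "let it be", "hey jude", "come together", "help", "yesterday"] * 5
writers = ["Paul", "John", "George", "Paul", "Paul", "John", "John", "Paul"] * 5

X_train, X_holdout, y_train, y_holdout = train_test_split(
    lyrics, writers, test_size=0.2, stratify=writers, random_state=42)

pipeline = Pipeline([("bow", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("clf", LogisticRegression(max_iter=1000))])

param_grid = {"bow__ngram_range": [(1, 1), (1, 2), (1, 3)],
              "clf__C": [0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)                 # best hyper-parameter combination
print(search.score(X_holdout, y_holdout))  # accuracy on the holdout set
```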

Entire Song Classification Results Using BoW.

These results are decent. The majority class, “Multiple Beatles”, makes up 31.6% of the dataset, so by just predicting “Multiple Beatles” every time we’d achieve an accuracy of 31.6%. We beat that with logistic regression’s 50.0% and multinomial Naive Bayes’ 40.5%!

The grid search for logistic regression found bigrams (groupings of two words) to be among the best hyper-parameters. Here are the top unigrams and bigrams that classified each songwriter(s):

Top N-Grams For Each Songwriter(s) Classification In LR.

Let’s see if we can get better classification results by classifying individual song lyric lines rather than entire songs!

INDIVIDUAL SONG LYRIC CLASSIFICATION

Let’s take a look at the dataset now that we’ve divided the songs into individual lyric lines:

First Five Individual Lyric Lines w/ Cleaned Lyrics.

Here’s what the new class distribution looks like:

Class Distribution Of Individual Song Lyrics Written.

So again, the baseline accuracy score to beat is 34.1% because we could achieve that by predicting the lyric was written by “Multiple Beatles” every time! Here’s what the dataset looks like after similar LSA dimensionality reduction:

Two Component LSA on Bow, and BoW + TFIDF Dataframes.

Again, for each of the three classifiers I used a hyper-parameter grid search, with 10-fold cross validation on 20% of the data, to find the best hyper-parameter combinations. To calculate model performance, a holdout set consisting of another 20% of the data was used.

Individual Lyrics Classification Results Using BoW.

This time we definitely beat the baseline accuracy score of 34.1%! However, the increased accuracy is likely due in part to lyric lines being repeated throughout songs, meaning near-duplicate lines can appear in both the training and holdout sets. But still pretty cool!

The grid search for logistic regression found trigrams (groupings of three words) to be among the best hyper-parameters in this case! Here are the top unigrams, bigrams, and trigrams that classified each songwriter(s):

Top N-Grams For Each Songwriter(s) Classification In LR.

CONCLUSION

This was just a brief dive into The Beatles’ lyrics using basic Natural Language Processing (NLP) techniques with machine learning. Nothing too groundbreaking, but I had a lot of fun working with this subject, and I was definitely listening to The Beatles the whole time :)

If anyone wants to use the lyric dataset or my code it is available on my GitHub along with other metadata about the songs!

GITHUB

GitHub Beatles Lyric Classification