The Natural Language of Songs

Luke Falvey
5 min read · Aug 18, 2019

If I handed you the lyrics to a song and asked you to guess whether it came from the Jonas Brothers or 2Pac, how good would your guess be? It should be easy to tell the difference without even listening to the song. The two artists make music in different genres, so we would expect them to use different words. As English speakers, we might even be able to speculate as to which words would pop up more frequently. How about a non-English speaker: would it be possible for them to tell whether a song came from the Jonas Brothers or 2Pac?

Let’s start with how a non-English speaker might approach the task of differentiating music from the Jonas Brothers and 2Pac. One way could be to learn English and therefore conceptually understand the difference in the topics and ideas expressed. Although this approach is good, it would take a very long time and isn’t specific to the task. Another way could be to count the number of unique words used in a song. Although it’s not true in all cases, pop songs tend to use fewer words than rap songs. Therefore, if we saw that a song used 300 different words, we might guess 2Pac. On the other hand, if we saw that a song used only 100 different words, we might guess the Jonas Brothers. I’m sure we could think of many other useful features, but let’s see if this one would work in practice.

Another, more labour-intensive approach would be to count individual words. Take the word “gun”, for example. If our non-English speaker went through all 2Pac and Jonas Brothers songs, would they find that word appearing more frequently in 2Pac songs or Jonas Brothers songs? My bet would be 2Pac. Our non-English speaker won’t know what the word “gun” means, but because it appears more frequently in 2Pac songs, it is a strong signal that a song belongs to 2Pac. If we extended this idea to all the words that the artists use, I bet we’d have some good predictive power.

Let’s start our research. I’ve gone away and pulled the lyrics from all 2Pac and Jonas Brothers songs. Let’s begin with our idea of counting unique words.

A violin plot of the number of unique words used across 2Pac and Jonas Brothers songs. Lyrics were sourced from www.metrolyrics.com

The violin plot above shows the distribution of unique words across songs for each artist. The width of each violin indicates how many songs have that number of unique words. We can see that 2Pac songs use a larger vocabulary than Jonas Brothers songs. In fact, most 2Pac songs use around 300 unique words, which is 200 more than the majority of Jonas Brothers songs. We can also see that any song with more than ~170 unique words is most likely from 2Pac and anything below is probably from the Jonas Brothers. Without knowing anything else about the song, it looks like our non-English-speaking friend can accurately guess the artist by only counting unique words. In fact, if we were to apply this rule, we’d be right 98% of the time.
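The counting rule described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the plot: the function names, the tokenising rule (lowercase runs of letters, digits and apostrophes) and the toy lyrics are my own assumptions, and the cut-off of 170 is simply the approximate value read off the violin plot.

```python
import re

# Assumed cut-off, taken from the "~170 unique words" reading of the violin plot.
UNIQUE_WORD_THRESHOLD = 170


def count_unique_words(lyrics: str) -> int:
    """Return the number of distinct lowercase words in a song's lyrics."""
    words = re.findall(r"[a-z0-9']+", lyrics.lower())
    return len(set(words))


def guess_artist(lyrics: str, threshold: int = UNIQUE_WORD_THRESHOLD) -> str:
    """Guess 2Pac for large vocabularies and the Jonas Brothers otherwise."""
    if count_unique_words(lyrics) > threshold:
        return "2Pac"
    return "Jonas Brothers"


if __name__ == "__main__":
    # Made-up toy lyrics for illustration only, not real song data.
    toy_song = "la la la hey hey yeah yeah la la"
    print(count_unique_words(toy_song), guess_artist(toy_song))
```

Notice that nothing here needs to know what any word means, which is exactly the position our non-English-speaking friend is in.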

Another interesting approach to this problem is to count the frequency of words that appear in each song. Below I’ve created word clouds that depict the frequency of words used across songs by the two artists. I’m sure you’ll be able to guess which cloud belongs to which artist.

As you can see, the words 2Pac and the Jonas Brothers use are quite different. However, there are also words like “see”, “im” and “know” that we’d expect to be present in almost any song written in English. For our purposes, we can focus on the words that differ between the two clouds and use these as signals to figure out if a song belongs to 2Pac or the Jonas Brothers. For example, if we see “la la”, “hey hey” or “yeah yeah”, we can be pretty sure it’s the Jonas Brothers. On the other hand, if we see “thug life”, “homies” or “cop”, it’s more than likely 2Pac.
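One rough way to find the words that set the two artists apart is to compare how often each word occurs relative to everything else that artist sings. The sketch below is an assumption-laden illustration rather than how the word clouds were produced: the helper names and the toy lyrics are hypothetical, and it looks at single words only, whereas the clouds also pick up pairs such as “la la”.

```python
import re
from collections import Counter


def tokenize(lyrics):
    """Split lyrics into lowercase words (a simple, assumed tokenising rule)."""
    return re.findall(r"[a-z0-9']+", lyrics.lower())


def word_frequencies(songs):
    """Count how often each word appears across a list of lyric strings."""
    counts = Counter()
    for lyrics in songs:
        counts.update(tokenize(lyrics))
    return counts


def distinctive_words(counts_a, counts_b, top_n=3):
    """Rank artist A's words by how much larger their share is than in artist B."""
    total_a = sum(counts_a.values()) or 1
    total_b = sum(counts_b.values()) or 1
    difference = {
        word: counts_a[word] / total_a - counts_b[word] / total_b
        for word in counts_a
    }
    # Largest difference first; ties are broken alphabetically.
    return sorted(difference, key=lambda w: (-difference[w], w))[:top_n]


if __name__ == "__main__":
    # Made-up toy lyrics for illustration only.
    rap_songs = ["see the homies", "homies know the cop"]
    pop_songs = ["la la hey hey", "yeah yeah i know la la"]
    rap_counts = word_frequencies(rap_songs)
    pop_counts = word_frequencies(pop_songs)
    print(distinctive_words(rap_counts, pop_counts))
    print(distinctive_words(pop_counts, rap_counts))
```

A shared word like “know” ends up near the bottom of both rankings, which matches the intuition that common English words carry little signal.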

Once we’ve counted the frequencies of words in each song and laid them out in a big table, we can apply a couple of fancy math tricks to plot the songs.

t-SNE applied to our big table of word counts.

The graph above shows us the similarity of each song according to which words are used in the song and how often those words occur. We can see two distinct clusters of dots: one belonging to 2Pac and the other belonging to the Jonas Brothers. Interestingly, some of 2Pac’s songs invade the cluster of Jonas Brothers songs. Any guesses as to why this might happen? Although our non-English-speaking friend may not be able to use these results directly, the graph, like the chart of unique words, shows there is a quantitative difference between the words the two artists use in their songs.
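To make the “big table” concrete, here is a minimal sketch that builds one row of word counts per song and compares two rows with cosine similarity. It does not reproduce t-SNE itself; cosine similarity is swapped in as a simpler, related way to see which songs use similar words, and a table like this is the kind of input a t-SNE implementation (such as the one in scikit-learn) would take. The function names and toy lyrics are assumptions for illustration.

```python
import math
import re
from collections import Counter


def build_count_table(songs):
    """Return (vocabulary, rows), with one row of word counts per song."""
    counts = [Counter(re.findall(r"[a-z0-9']+", s.lower())) for s in songs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


def cosine_similarity(u, v):
    """Cosine of the angle between two count rows (0.0 if either row is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Made-up toy lyrics for illustration only.
    songs = ["la la hey", "hey la la la", "thug life homies"]
    vocabulary, rows = build_count_table(songs)
    print(vocabulary)
    for row in rows:
        print(row)
    print(cosine_similarity(rows[0], rows[1]))  # songs sharing most words
    print(cosine_similarity(rows[0], rows[2]))  # songs sharing no words
```

Songs that share many words score close to 1 and songs with no words in common score 0, which is the same intuition the two clusters in the graph express.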

Calculating word frequencies for songs by hand is a very labour-intensive and repetitive task, so it probably isn’t a good way for a person to tell 2Pac and Jonas Brothers songs apart. However, computers are very good at repetitive tasks, and using the exact same method we can train a computer to tell the difference between 2Pac and Jonas Brothers songs. In fact, by looking at word frequencies alone, my computer can achieve an accuracy of 98%. Although this is a somewhat contrived example, how well do you think a computer will generalise to recognise other artists? Do you think it could do a better job than you?
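As a rough idea of what training a computer on word frequencies can look like, here is a small multinomial Naive Bayes classifier written from scratch. This is an assumption on my part, chosen because it works directly on word counts; it is not necessarily the model behind the 98% figure. The `train` and `predict` names and the toy lyrics are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(lyrics):
    """Split lyrics into lowercase words (a simple, assumed tokenising rule)."""
    return re.findall(r"[a-z0-9']+", lyrics.lower())


def train(labelled_songs):
    """Fit word counts per artist from a list of (artist, lyrics) pairs."""
    word_counts = defaultdict(Counter)
    song_counts = Counter()
    for artist, lyrics in labelled_songs:
        word_counts[artist].update(tokenize(lyrics))
        song_counts[artist] += 1
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    return {
        "word_counts": word_counts,
        "song_counts": song_counts,
        "vocabulary": vocabulary,
    }


def predict(model, lyrics):
    """Return the artist with the highest log-probability for these lyrics."""
    total_songs = sum(model["song_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_artist, best_score = None, -math.inf
    for artist, counts in model["word_counts"].items():
        total_words = sum(counts.values())
        # Start from how common the artist is in the training data.
        score = math.log(model["song_counts"][artist] / total_songs)
        for word in tokenize(lyrics):
            if word in model["vocabulary"]:  # words never seen in training are skipped
                # Add-one (Laplace) smoothing avoids log(0) for words the artist never used.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_artist, best_score = artist, score
    return best_artist


if __name__ == "__main__":
    # Made-up toy lyrics for illustration only.
    training_data = [
        ("2Pac", "thug life homies cop"),
        ("2Pac", "homies see the gun"),
        ("Jonas Brothers", "la la hey hey yeah yeah"),
        ("Jonas Brothers", "yeah yeah i know la la"),
    ]
    model = train(training_data)
    print(predict(model, "homies and the cop"))
    print(predict(model, "la la yeah"))
```

To measure accuracy honestly, the songs used for testing would need to be kept out of the training data.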

Code on GitHub:
