Using NLTK to visualize my favorite albums’ lyrics

A few weeks ago I enrolled in Python for Data Science by UCSD on EdX.org. It is an introductory course, so it starts with the basics, but by the end of it you have worked with Twitter’s API, predicted weather using Machine Learning and even done some Natural Language Processing using NLTK.

The last one grabbed my attention because, before then, I had hardly thought of language as a source of information. I used to think of language as just a tool. A way to get information from my head into yours.

I was so wrong. So, so wrong.

My project: Números Fantasma

If you read Spanish, you can check out a more detailed version of this note on my website, elblogdehiphop.com.

Números Fantasma translates roughly to Phantom Numbers. The title derives from one of my all-time favorite albums, Luces Fantasma by La Banda Bastön. Their album’s title relates to the supernatural and “those things that are there all the time but you cannot see” (Noisey en Español).

This resonated with me instantly because, as a data analyst, my job is literally to show or highlight “things that are there all the time but you cannot see” right away.

My objective was (semi-)clear: What other information is there in this album? That information that one may not get right away. I loved the album instantly but I couldn’t point out exactly why. Could I analyze it differently (not just by listening to it) to find that out? Could I do this for other albums?

What I did

Using Python’s NLTK library I quickly cleaned the lyrics for the album and created a word count table. I had transcribed the album for Genius.com earlier this year so I already had Luces Fantasma but to make this more interesting I also looked at Kendrick Lamar’s “DAMN.” (I also used Genius.com to get its lyrics — I love Genius).
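For readers who want to try this step, here is a minimal sketch of a word count table. It is not the exact code used for this project: it sticks to Python's standard library so it runs without any downloads, and the names `tokenize`, `word_counts` and `STOP_WORDS` are hypothetical. The sample line is made up and is not a lyric from either album. NLTK's `FreqDist` is a subclass of `collections.Counter`, so the result should look much the same if you swap it in.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list, just for illustration.
STOP_WORDS = frozenset({"que", "no", "the", "a"})

# Runs of Unicode letters, optionally joined by a straight or curly apostrophe,
# so accented words ("méxico") and contractions ("don't") stay in one piece.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def tokenize(lyrics):
    """Lowercase the lyrics and return a list of words without punctuation."""
    return WORD_PATTERN.findall(lyrics.lower())


def word_counts(lyrics, stop_words=frozenset()):
    """Count how often each word appears, skipping any stop words."""
    return Counter(w for w in tokenize(lyrics) if w not in stop_words)


# Made-up sample text standing in for a full set of lyrics.
sample = "Luces, luces fantasma. Luces que no ves."
counts = word_counts(sample, STOP_WORDS)

# Print the table, most frequent words first.
for word, n in counts.most_common():
    print(f"{word}\t{n}")
```

Passing an empty `stop_words` set keeps every word, which is what you want if the goal is to measure repetition rather than topics.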

After I had created these tables I found the website WordCountTools.com, which does pretty much the same thing BUT also gives you a ton of other metrics, like the number of monosyllabic and multisyllabic words. I was very interested in those since these are hip hop albums.

Second, I looked at Spotify to grab other information that might be of interest. I manually grabbed the number of plays each track had and, using Spotify’s API, I grabbed each track’s audio features.

Third, I looked for the best way to show this information. These are just numbers; how could I make them tell me something I did not know before?

The Viz

I decided to go with Tableau for the visualizations. I had been working on developing my Tableau skills so this seemed like good practice.

Here I show 4 aspects of each track:

  1. The red diamond represents the percentage of the track rapped by the artists. Two skits in Luces Fantasma are performed by featured artists, so their diamonds are at 0%.
  2. The red dotted line shows the percentage of unique (non-repeated) words in a track. I chose to show this metric because I have been fascinated with rappers’ vocabularies since Matt Daniels’ The Largest Vocabulary in Hip Hop.
  3. The blue bar shows the number of monosyllabic words.
  4. The blue line shows the number of multisyllabic words.
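As a rough illustration of the unique-words metric, here is a small sketch. "Unique" can be read two ways, so it shows both: distinct words as a share of all words, and words that appear exactly once as a share of all words. Which one WordCountTools.com reports is not something I can confirm, and the function names are hypothetical.

```python
import re
from collections import Counter


def _words(lyrics):
    """Lowercase the text and keep only runs of letters (accents included)."""
    return re.findall(r"[^\W\d_]+", lyrics.lower())


def distinct_word_percentage(lyrics):
    """Distinct words divided by total words, as a percentage."""
    words = _words(lyrics)
    if not words:
        return 0.0
    return 100.0 * len(set(words)) / len(words)


def once_only_percentage(lyrics):
    """Words that are never repeated, divided by total words, as a percentage."""
    words = _words(lyrics)
    if not words:
        return 0.0
    counts = Counter(words)
    once = sum(1 for n in counts.values() if n == 1)
    return 100.0 * once / len(words)


# Made-up sample: 4 words, 3 distinct, 2 of them never repeated.
sample = "uno dos uno tres"
print(distinct_word_percentage(sample))  # 75.0
print(once_only_percentage(sample))      # 50.0
```

Both numbers drop as a track gets longer or leans on a repeated hook, so they are best compared between tracks of similar length.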

First impressions

Muelas de Gallo (La Banda Bastön’s MC) is an incredible MC. I knew this but I was never able to quantify it, mostly because American Hip Hop artists’ skills have been better documented and analyzed. Now, looking at his work side by side with Kendrick’s, I can see how amazing a lyricist he truly is.

In Hip Hop, content is as important as how you deliver it. Luces Fantasma talks about love, death, modern México and the struggles of its citizens. DAMN. is just as amazing and I believe that’s well established.

This analysis is not about the content (even though I tried doing some sentiment analysis). It is about the raw delivery (I’m not even looking at flow). In the raw numbers, Muelas de Gallo uses a “more complex” vocabulary. He has more multisyllabic and more non-repeated words overall. Muelas also repeats himself less: his most repeated words have a count of 51, while Kendrick’s go up to 104.


The Viz 2.0

Because I have been working on developing my data visualization skills I spent the next couple of days thinking about ways of presenting this information. I loved the bubbles and using red for Lamar’s DAMN. but Luces Fantasma’s album cover is beautiful and I wanted to incorporate as much of it as possible.

Luces Fantasma by La Banda Bastön

The left-most head is Muelas de Gallo (La Banda Bastön’s MC) and it’s morphing into Dr. Zupreeme’s head (La Banda Bastön’s DJ/Producer). Everything about it is amazing.

My first attempt at incorporating this was using purple for Muelas’ bubbles. Then I learned I could do a word cloud using this album cover as a mask.

I used WordArt.com for it, et voilà:

Luces Fantasma’s lyrics word cloud (doesn’t include features’ lyrics)

Not bad… not great but not bad. I still have lots to work on as a designer.

My second and last attempt for now is Kendrick’s DAMN.

Kendrick Lamar’s DAMN. lyrics

This one I really love.


What comes next

The next steps are:

  1. Developing Natural Language Processing skills. I want to be able to count syllables myself and not depend on WordCountTools.com
  2. Developing visualization skills in Python. There is a WordCloud library for Python I could use. I want to familiarize myself with it.
  3. Developing design skills. Throwing the lyrics on Kendrick Lamar’s album cover looks super cool but that’s because the lyrics and the cover themselves are cool. I want to get better at choosing colors and typefaces.
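
As a starting point for the first item, here is a very rough sketch of counting syllables without an outside tool. It assumes that each run of consecutive vowels is one syllable, which is only a heuristic: it undercounts Spanish hiatus ("poeta") and overcounts silent English vowels ("love"). The function names are hypothetical, and a pronunciation dictionary would be more accurate for English.

```python
import re

# One or more consecutive vowels, including the accented vowels used in Spanish.
VOWEL_GROUP = re.compile(r"[aeiouyáéíóúüö]+")


def estimate_syllables(word):
    """Estimate syllables as the number of vowel groups (never fewer than 1)."""
    return max(1, len(VOWEL_GROUP.findall(word.lower())))


def split_by_syllables(words):
    """Return (monosyllabic, multisyllabic) lists, keeping the original order."""
    mono = [w for w in words if estimate_syllables(w) == 1]
    multi = [w for w in words if estimate_syllables(w) > 1]
    return mono, multi


# Made-up word list for illustration.
mono, multi = split_by_syllables(["luces", "que", "damn", "fantasma"])
print(len(mono), "monosyllabic:", mono)
print(len(multi), "multisyllabic:", multi)
```

The two list lengths are the kind of per-track numbers that the blue bar and blue line in the chart show, although counts from this heuristic will not match another tool exactly.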