What exactly does Taylor Swift sing about?

5 min read · Jan 4, 2020

Performing Topic Modeling on Taylor Swift lyrics.

Introduction

Taylor Swift is a multimillion-dollar, Grammy-winning singer-songwriter with an enormous fanbase. People love her music because they can relate to what she is singing about. So what exactly is she singing about? Has it changed over time? Using topic modeling, I dive into the lyrics of each of Taylor Swift's songs to find the answers.

Dataset

The lyrics for all songs except those on the Lover album can be found here. Since this dataset was posted before her most recent album was released, I scraped the lyrics for the songs on her Lover album from Genius and stored the data in a .csv file. I completed this task using Selenium in Google Colab, as shown below:

!pip install selenium
!apt install -yq chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import pandas as pd
from pandas import DataFrame
import sys
from selenium import webdriver
from selenium.webdriver.common.by import By
from google.colab import drive

sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
#chromedriver was copied to /usr/bin above, so it is found on the PATH
driver = webdriver.Chrome(options=chrome_options)

title = []
lyrics = []

driver.get('https://genius.com/albums/Taylor-swift/Lover')

#loop over the 18 tracks listed on the album page
for i in range(1,19):
    #find_element(By.XPATH, ...) replaces find_element_by_xpath, which newer Selenium releases removed
    song_element = driver.find_element(By.XPATH, '/html/body/routable-page/ng-outlet/album-page/div[2]/div[1]/div/album-tracklist-row['+str(i)+']/div/div[2]/a/h3')
    #print(song_element.text)
    title.append(song_element.text)
    song_element.click()
    lyric_element = driver.find_elements(By.XPATH, '/html/body/routable-page/ng-outlet/song-page/div/div/div[2]/div[1]/div/defer-compile[1]/lyrics/div/div/section')[0]
    #print(lyric_element.text)
    lyrics.append(lyric_element.text)
    #go back to the album page before opening the next song
    driver.execute_script("window.history.go(-1)")

df = DataFrame()
df['song_title'] = title
df['lyrics'] = lyrics
df.to_csv("lover_lyrics.csv")

drive.mount('drive')

#export csv to google drive
!cp lover_lyrics.csv drive/My\ Drive/

To use the data I scraped, I needed to preprocess the text: strip the “Lyrics” suffix from the song titles and remove section markers such as [Verse] and [Intro] from the lyrics.

import re

#remove 'Lyrics' from song titles in Lover album
def strip_lyrics(x):
    x = x.replace('Lyrics', '')
    return x

#remove Verse, Intro, etc from lyrics (for lover_df dataframe)
def strip_words(x):
    return re.sub(r'\[[A-Za-z0-9:&\s-]*]\n', '', x)

Now the lyrics from the Lover album are in the same format as the other dataset, and both are ready for exploratory data analysis (EDA).
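As a quick sanity check, here is a minimal sketch that applies the two cleaning helpers to a made-up title and lyric snippet. The sample strings are invented for illustration, and the trailing `.strip()` on the title is an addition of this sketch, not part of the original helper.

```python
import re

def strip_lyrics(title):
    # Drop the 'Lyrics' label from a scraped song title and trim leftover spaces.
    return title.replace('Lyrics', '').strip()

def strip_words(lyrics):
    # Remove section markers such as "[Verse 1]" or "[Chorus]" plus the newline after them.
    return re.sub(r'\[[A-Za-z0-9:&\s-]*]\n', '', lyrics)

# Invented sample data, not taken from the scraped dataset.
sample_title = "Example Song Lyrics"
sample_lyrics = "[Verse 1]\nfirst line\nsecond line\n[Chorus]\nhook line\n"

clean_title = strip_lyrics(sample_title)
clean_lyrics = strip_words(sample_lyrics)
print(clean_title)
print(clean_lyrics)
```

Only the bracketed markers and their line breaks are removed; the lyric lines themselves are left untouched.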

EDA

Before topic modeling, I wanted to perform some EDA to get a brief overview of the dataset and make some initial hypotheses.

To start, I got the song count for each album. I find it interesting that Red is her longest album, since that is the album where she began to transition from country music to pop.

Number of songs for each Taylor Swift album

I also found the 10 most significant words in each album. Since singers (and writers in general) use a lot of filler and transition words (such as “to”, “the”, “I”, etc.) in their songs, I ranked words by their TF-IDF weight (labeled “Term Frequency” in the plots) instead of raw word count. I also created a tokenization function that drops punctuation and words shorter than five characters, and I removed stop words, since they are excess noise in our analysis.

import string

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
#assumed source of word_tokenize: NLTK (its tokenizer data must be downloaded first)
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize_body(comment):
    body = []
    comment = comment.lower()
    comment = comment.replace('\'', '')
    comment = comment.replace('-', ' ')
    value = word_tokenize(comment)
    for i in value:
        #skip punctuation and words shorter than 5 characters to get something significant
        if (i in string.punctuation) or (len(i) < 5):
            continue
        else:
            body.append(i)
    return body

fig, axs = plt.subplots(4,2, figsize=(15, 24))
fig.subplots_adjust(hspace = .3)
axs = axs.ravel()

tfidf = TfidfVectorizer(tokenizer = tokenize_body, stop_words = 'english')

#album_eda and finaldf are dataframes built earlier (not shown)
for i in range(len(album_eda)):
    album_lyrics = finaldf[finaldf['album'] == album_eda['album'][i]]['full_lyrics']
    lyr = list(album_lyrics.apply(pd.Series).stack())
    tf = tfidf.fit_transform(lyr)
    #use get_feature_names() instead on older scikit-learn versions
    words = tfidf.get_feature_names_out()
    tf_df = pd.DataFrame(tf.toarray(), columns= words)
    #sum each word's weight over all songs and keep the 10 largest
    head = tf_df.sum().sort_values(ascending = False)[:10]
    sns.barplot(x = head.values, y = head.index, orient = 'h', palette = 'BuPu', ax = axs[i])
    axs[i].set_title(album_eda['album'][i])
    axs[i].set_ylabel('Words')
    axs[i].set_xlabel('Term Frequency')

#deleting last plot bc odd number of albums
fig.delaxes(axs[i+1])
fig.show()

10 most significant words for each Taylor Swift Album. Notice the x-axis is not constant across the albums.

The word “you’re” appears at the top for almost every album, which leads me to believe that the storyline of most songs on each album involves other people. It is also interesting to see that her first three albums did not have as much variety in words as her last four. I hypothesize that this is because country songs are not as varied in topics as pop/mainstream music. Once she switched from country to pop, she diversified what she was singing about.
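To make the ranking step concrete, here is a minimal pure-Python sketch of TF-IDF weights summed over documents. It is a simplification and not exactly what `TfidfVectorizer` computes: it assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 and skips the per-document normalization. The three toy "songs" are invented for illustration.

```python
import math
from collections import Counter

# Toy "songs", already tokenized; invented for illustration only.
docs = [
    ["never", "never", "again", "story"],
    ["never", "dancing", "dancing", "dancing"],
    ["never", "story", "romeo"],
]

n_docs = len(docs)
# Document frequency: in how many songs does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))

def idf(word):
    # Smoothed inverse document frequency (assumed formula, see the lead-in).
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

# Sum each word's weight over all songs, similar in spirit to tf_df.sum() above.
scores = Counter()
for doc in docs:
    for word, count in Counter(doc).items():
        scores[word] += count * idf(word)

top_words = [word for word, _ in scores.most_common(3)]
print(top_words)
```

In this toy data "never" has the highest raw count (4), but it appears in every song, so "dancing" (3 occurrences in a single song) ranks first once the IDF weight is applied.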

Topic Modeling

As with any good story, it is likely that there are various topics in each of Taylor Swift’s songs: they can’t be put into a single box. Because of this, a clustering analysis, which would place each song in exactly one group, will not accurately represent the different topics Taylor is singing about. Topic modeling allows us to assign multiple topics to one document (or song, in this case). I used latent Dirichlet allocation (LDA) to identify the topics that Taylor is singing about, as well as what percentage of each song belongs to each topic.

LDA is very sensitive to noise, so it is important to remove excess information that may skew the results. I used the same TF-IDF vectorizer as above to account for this and to create the document-term matrix that I feed into the LDA model.

from sklearn.decomposition import LatentDirichletAllocation as LDA

X = tfidf.fit_transform(finaldf['full_lyrics'])
#use get_feature_names() instead on older scikit-learn versions
words = tfidf.get_feature_names_out()
dtm = pd.DataFrame(X.toarray(), columns= words)
dtm.head()

#LDA
n = 3
lda = LDA(n_components = n, random_state = 4)
lda.fit(dtm)

#this can help us determine the optimal number of topics
#want high log-likelihood, low perplexity
print("Log Likelihood: ", lda.score(dtm))
print("Perplexity: ", lda.perplexity(dtm))

num_words = 50
print('Topics')
for idx, topic in enumerate(lda.components_):
    print("\nTopic #" + str((idx+1)))
    #argsort is ascending, so take the last num_words indices in reverse order
    print(" ".join([words[i] for i in topic.argsort()[:-num_words - 1:-1]]))

Here are the top 50 words for each topic after performing LDA:

I categorized the topics as ‘Fantasy/Mystical’, ‘Happy-go-lucky/Bright’, and ‘Love’, respectively.

Topic distribution for first and last album

Number of documents for each topic. Love is the most common, followed by fantasy/mystical, with happy-go-lucky/bright a close third.

The distribution of topics is pretty even over all of the albums, which tells me that despite the crossover from country to pop, Taylor Swift is still singing about the same topics.
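The article does not show how the per-song topic shares were turned into the counts and per-album comparison above, so the following is only a sketch of one way to do it. It assumes a list of per-song topic weights, such as the rows returned by `lda.transform(X)`. The weights and the album labels "Album A" and "Album B" are invented placeholders, not results from this analysis.

```python
from collections import Counter, defaultdict

topic_names = ["Fantasy/Mystical", "Happy-go-lucky/Bright", "Love"]

# (album, topic weights) per song; placeholder values, NOT real model output.
songs = [
    ("Album A", [0.2, 0.1, 0.7]),
    ("Album A", [0.6, 0.3, 0.1]),
    ("Album B", [0.1, 0.2, 0.7]),
    ("Album B", [0.3, 0.5, 0.2]),
]

def dominant_topic(weights):
    # The topic with the largest weight is treated as the song's main topic.
    best = max(range(len(weights)), key=lambda k: weights[k])
    return topic_names[best]

# Number of songs whose dominant topic is each topic.
topic_counts = Counter(dominant_topic(w) for _, w in songs)

# Group the weight rows by album.
per_album = defaultdict(list)
for album, weights in songs:
    per_album[album].append(weights)

# Average topic share per album, expressed as percentages.
album_mix = {
    album: [round(100 * sum(col) / len(rows), 1) for col in zip(*rows)]
    for album, rows in per_album.items()
}

print(topic_counts)
print(album_mix)
```

Comparing the averaged percentages across albums is one simple way to check whether the topic mix shifts over time.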

Conclusion

My analysis resulted in 3 common topics among all Taylor Swift songs: fantasy, happy-go-lucky, and love. Over the years she did not steer away from these topics, and regardless of the genre they are sung in, these topics are what connect Taylor to her fans.

Written by Hannah Li