‘this is me trying’: Clustering Taylor Swift’s Discography with K-means

Megan Resurreccion
Web Mining [IS688, Spring 2021]
8 min read · May 4, 2021

Taylor Swift, to say the least, has had an immense impact on the modern music industry. Now I’m not just saying this as a Swiftie; even before I became one (soon after she released reputation in 2017), I couldn’t deny that Taylor Swift had some of the most iconic pop songs on the radio, nostalgic country jams, and overall, music for whatever mood you’re in. Having been an active artist since she was 15, Taylor has 10 studio albums so far, 1 of them being a re-recording of Fearless (aka Fearless (Taylor’s Version)). Her songwriting and music production are some of the best, in my opinion. She even has the awards to prove it, with 11 Grammys (3 of them for Album of the Year) [1].

Each of the covers for Taylor’s first nine studio albums | Source: https://i.pinimg.com/originals/47/d0/29/47d0296c1080cb8ba3697b2e8bf4d7d8.jpg
Cover of Fearless (Taylor’s Version) which includes songs ‘from the vault’ | Source: https://www.nme.com/reviews/album/taylor-swift-fearless-taylors-version-review-2916595

With Taylor’s wide range of songs, in both content and genre, I decided to cluster her discography with the k-means algorithm (more on that in a bit) to find which of her songs are considered similar based on their lyrics. For example, do breakup songs and love songs get clustered into their own categories? I’m certainly not the first one to do this (see [2], [3], [4], [5] for other similar studies), but in browsing other people’s analyses I noticed that folklore and evermore hadn’t been included yet, nor had any of Taylor’s newest songs ‘from the vault’ off of her re-recorded Fearless album. These are Taylor’s three newest studio albums, all released in about the past year.

Source: https://www.analyticsvidhya.com/blog/2021/04/k-means-clustering-simplified-in-python/

The k-means algorithm is a clustering method that uses a k value to determine the number of clusters into which a set of data points will be grouped. How the data points get grouped varies according to the k value. In the example above, we have a plot of data points; once a k value of 3 is chosen, each data point is assigned to one of the 3 clusters using a distance metric. This analysis uses this algorithm.
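To make the idea concrete, here is a minimal sketch of k-means with scikit-learn on toy 2-D data (the blobs and random seed here are purely illustrative, not the lyric data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three loose blobs of 2-D points, standing in for real data
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc, 0.3, size=(20, 2))
                    for loc in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])])

# Fit k-means with k=3; each point is assigned the label of its nearest centroid
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
```

After fitting, `km.labels_` holds the cluster index for each point and `km.cluster_centers_` holds the 3 learned centroids.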

The Process

The songs included in this analysis consist of all of the tracks from her studio albums:

  • Taylor Swift
  • Fearless Platinum Edition
  • Speak Now (Deluxe Edition)
  • Red (Deluxe Edition)
  • 1989 (Deluxe Edition)
  • reputation
  • Lover
  • folklore (deluxe version)
  • evermore (deluxe version)
  • Fearless (Taylor’s Version)

Additionally, no songs were included more than once, aka no remixes, acoustic/piano versions, live performances, etc. Taylor’s Christmas album and other single tracks or collabs (e.g., I Don’t Wanna Live Forever, Only The Young) were also excluded. Songs that were in both Fearless Platinum Edition and Fearless (Taylor’s Version) were only counted once.

1. Collect Data on Taylor’s Discography

Libraries Used In This Analysis: lyricsgenius, pandas, sklearn, matplotlib, seaborn

Table of the title, album, and lyrics of each of Taylor’s songs

To analyze and cluster Taylor’s music, I needed to collect the lyrics to her songs. I used the lyricsgenius library, a Python client for the Genius.com API that lets you obtain information on artists, albums, and tracks, to import lyrics into a pandas data frame in a Jupyter notebook. During this process, I came across a minor issue in which some of the retrieved lyrics were in a language other than English. This turned out to be because the lyrics obtained come from the top search hit, which may happen to be a translation of the original song. With a small enough dataset and sufficient Swiftie knowledge, I manually inputted the remaining missing lyrics myself.
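A minimal sketch of the collection step (the helper function and track list here are mine, and a real run needs your own Genius API token and network access):

```python
import pandas as pd

def build_lyrics_table(genius, tracks):
    """Fetch lyrics for (title, album) pairs into a DataFrame.

    `genius` is a lyricsgenius.Genius client; search_song returns None
    when Genius has no match, leaving a gap to fill in manually.
    """
    rows = []
    for title, album in tracks:
        song = genius.search_song(title, artist="Taylor Swift")
        rows.append({"title": title, "album": album,
                     "lyrics": song.lyrics if song else None})
    return pd.DataFrame(rows, columns=["title", "album", "lyrics"])

# Usage (requires a Genius API token and network access):
# import lyricsgenius
# genius = lyricsgenius.Genius("YOUR_GENIUS_TOKEN")
# df = build_lyrics_table(genius, [("Cruel Summer", "Lover")])
```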

2. Clean Up the Lyric Data

Before diving into the analysis, I needed to clean up the data by lowercasing all of the words and excluding stop words. In addition to the standard stop words provided in sklearn.feature_extraction, I also excluded “just”, “yeah”/“ya”, pronouns, possessive contractions, and words like “ooh”, “aah”, etc.
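The cleanup can be sketched like this (the extra stop-word list below is illustrative, not my exact one):

```python
from sklearn.feature_extraction import text

# Filler words to drop on top of sklearn's built-in English stop words
EXTRA_STOPS = {"just", "yeah", "ya", "ooh", "aah", "oh", "la"}
STOP_WORDS = text.ENGLISH_STOP_WORDS.union(EXTRA_STOPS)

def clean(lyric):
    # lowercase, then drop any token in the stop list
    # (punctuation handling is left to the vectorizer later on)
    return " ".join(w for w in lyric.lower().split() if w not in STOP_WORDS)

print(clean("Ooh I just KNEW you were trouble"))  # -> "knew trouble"
```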

3. Using The TF-IDF Vectorizer

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a way to measure how important a word is to a document. There are two parts to this: Term Frequency (how often a term appears in a document) and Inverse Document Frequency (how common or rare a term is across all documents). Together these form TF-IDF, a weighted value commonly used in text mining and information retrieval [6]. This Medium article [7] goes further into the mechanics of the TF-IDF vectorizer for anyone interested. Thus, I applied this method to each of Taylor’s song lyrics. The figure below shows a sneak peek of the TF-IDF table with each identified term and its weighted value in each of the songs. A value of 0.0 means that word doesn’t appear in the song.

TF-IDF table with each term and its weighted value in each song

4. Finding the Appropriate Number of Clusters or Value of k

Before implementing the k-means algorithm, I needed to find a k value to determine the number of clusters. The k-means algorithm requires a pre-determined k, the number of clusters into which a set of data points will be grouped. To find one, I used the Elbow Method, plotting the sum of squared errors (SSE) for a range of k values. The point at which the graph below bends like the point of an elbow should be the optimal value for k, because any number of clusters beyond it has diminishing returns (additional clusters don’t add much more value). Although the plot doesn’t look perfect, in this case I settled on 10 as the value for k.
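A sketch of the elbow computation, using the fact that sklearn’s KMeans exposes the SSE as `inertia_` (the helper name and k range are my own choices):

```python
from sklearn.cluster import KMeans

def elbow_sse(X, k_max=15):
    # inertia_ is the sum of squared distances to the nearest centroid (SSE)
    sse = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        sse.append(km.inertia_)
    return sse

# Plotting the curve to eyeball the elbow:
# import matplotlib.pyplot as plt
# plt.plot(range(1, 16), elbow_sse(X), marker="o")
# plt.xlabel("k"); plt.ylabel("SSE")
```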

5. Using NMF for Matrix Values and Topic Extraction

To gain further insight into how songs are clustered, I used NMF (non-negative matrix factorization) as a means of topic extraction. It’s an unsupervised learning technique also used in dimensionality reduction and source separation [8]. The function requires the number of topics (I use my k value here), the text, and the TF-IDF vectorizer. The image below shows the top 15 words associated with each of the first 5 topics. A more detailed, color-coded version of the table appears further down the article. I also calculated the values of an NMF matrix for each song.

6. Implementing t-SNE and K-Means to Cluster

Finally, I can implement the k-means algorithm on the values from the NMF matrix. Then, I use t-SNE to help visualize these songs and their clusters. t-SNE stands for t-distributed Stochastic Neighbor Embedding. It is a method for visualizing high-dimensional data by calculating a similarity metric between pairs of data points in the high-dimensional and low-dimensional spaces [9]. I also considered using PCA (principal component analysis), but after some test runs I stuck with t-SNE as the means of projecting the text data. In the plot below, the two t-SNE dimensions are what x1 and x2 represent. Data points (songs) are clustered together accordingly.
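The final step can be sketched like this (the helper name and the perplexity value are my own choices; W is a song-by-topic matrix like the one from the NMF step):

```python
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def cluster_and_project(W, k):
    # k-means on the NMF topic weights, then t-SNE down to 2-D for plotting
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(W)
    xy = TSNE(n_components=2, perplexity=5, random_state=42).fit_transform(W)
    return labels, xy  # xy[:, 0] and xy[:, 1] are x1 and x2 in the plot
```

Note that t-SNE’s perplexity must be smaller than the number of songs, and the projection only visualizes the clusters; the cluster labels themselves come from k-means on the full NMF matrix.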

Thoughts on the Results as a Swiftie

With the data clustered, I looked at what songs were in each cluster. The tables below are color-coordinated with the plot above to display which songs are associated with each topic cluster. If you’re a Swiftie, or otherwise have some familiarity with at least a handful of Taylor’s discography, then some songs clustered together may not make all that much sense. I can vouch for some songs being grouped together, but not others. For example, Should’ve Said No, Mean, and I Knew You Were Trouble. in cluster 8 are all songs that call out someone for their actions and/or express regret about being with that person. These make sense together, but not necessarily alongside Today Was A Fairytale or King Of My Heart in that same cluster (songs about love and meeting someone amazing). Similarly, in cluster 0, Teardrops On My Guitar, Fifteen, If This Was A Movie, Cruel Summer, and betty (songs about longing for someone or something) are grouped together, which again makes sense. However, it makes less sense with Bad Blood or seven. There’s also the issue of Blank Space and cowboy like me having their own cluster whereas other clusters have numerous songs in them. Generally, there is an imbalance in the number of songs per cluster, which leads me to think the methods need to be modified in some way, shape, or form.

Limitations and Conclusions

  • If some songs in some clusters fit together and not others, it may be an issue with the number of clusters or something else entirely.
  • As mentioned earlier, I used t-SNE instead of PCA for dimensionality reduction, which may have led to some inaccuracies.
  • A major limitation is that the analysis doesn’t take into account music production or genre; this is about songwriting and solely about the lyrical content of the tracks. Music production is a major part of a song, including whether it’s in a major or minor key.
  • There may be a different clustering algorithm (such as DBSCAN) that suits this data better than k-means.

I think k-means is much more useful when it is easier to tell in advance what the number of clusters should be, like if you’re clustering media according to genre. There are also other ways to visualize the results of clustering; t-SNE is only one of them.

Overall, this was meant to be a fun and exploratory analysis of using the k-means algorithm on data related to something I love. If I wanted to continue this project further, I would look into different clustering algorithms, ways to improve the TF-IDF vectorizer, test out different k values, and implement data about music production. If you’d like to try out messing with the lyric data, I’ve provided a link to the CSV file here.

(Finally, per the title of this article, this is me trying from folklore is definitely worth the listen. If you want my personal recommendations on what you should listen to from Taylor [if you don’t already] feel free to comment or message me.)

References

[1] https://en.wikipedia.org/wiki/Taylor_Swift

[2] https://github.com/lindsayread/NLP_Analysis_of_Taylor_Swift_Lyrics

[3] https://news.codecademy.com/taylor-swift-lyrics-machine-learning/

[4] https://www.promptcloud.com/blog/data-visualization-text-mining-taylor-swift-song-lyrics/

[5] https://medium.com/@ayesha.mendoza/pop-tay-tay-vs-country-taylor-swift-7e3612b5de8

[6] https://en.wikipedia.org/wiki/Tf%E2%80%93idf

[7] https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a

[8] https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html

[9] https://towardsdatascience.com/an-introduction-to-t-sne-with-python-example-5a3a293108d1



Hello! I’m a PhD student in Information Systems at NJIT. Feel free to connect with me through LinkedIn! https://www.linkedin.com/in/megan-resurreccion/