What Does BTS Sing About?
An attempt at applying topic modelling to understand BTS’s lyrics.
1. Introductions
BTS (방탄소년단) is many things, including
- A septet that debuted in 2013 under Big Hit Entertainment
- The best-selling artist in South Korean history
- 43rd in the Forbes Celebrity 100 (2019) as one of the world’s top-earning celebrities
Much of BTS’s success is credited to what they sing about: the group often embeds social issues into its lyrics and is frequently praised as one of the more authentic K-pop acts.
2. Motivation
“First time with BTS?” — Dope
- Extension of a side project to collect data on BTS songs
- Desire to attempt a text analysis project that did not involve the usual, more commonly explored datasets (news, social media, etc.)
- To better understand BTS’s progression as artists through an understanding of their lyrics
3. Dataset
“Just take ’em take ‘em” — Blood Sweat & Tears
- Self-procured and manually cleaned, using data from the lyrics website Genius and BTS’s official webpage
- Consists of 18 albums and 225 tracks from 2013 to 2021, with English-translated lyrics for each track
- More details regarding the dataset can be found on the Kaggle page
4. Data Pre-Processing
“snowflakes fall down and get away little by little” — Spring Day
The full code can be found here. This piece keeps code-block embeds to a minimum, so most steps are not shown; refer to the link for the full code.
Only Keeping Unique Tracks
Tracks that are excluded from analysis beyond exploratory purposes are:
- repackaged tracks (songs previously released but included again in a later album) — duplicates of existing tracks
- remix tracks — considered duplicates, as their lyrics rarely differ
- tracks that have a full version (has_full_ver = TRUE) — considered duplicates, as their lyrics are already represented in the full-version tracks
- “skit” tracks — BTS include snippets of conversations or soundbites, known as “skits”, in their tracklists
- “notes” tracks — for some albums, Genius also has translations of “notes”, the printed text brochures included with the physical albums; these are removed from the analysis as they are not lyrics
Normalize Lyrics
- pre_normalise function — replaces text before the usual text normalization methods (e.g. replaces phrases containing special characters)
- lyrics functions — methods that deal with various lyrical features (e.g. reducing contractions such as changing ‘dat’ to ‘that’; removing ‘la la la’ and other non-lexical vocables from the lyrics)
- text functions — methods that perform routine text data cleaning (remove stopwords, lemmatize words, etc.)
- replacements — a self-defined dictionary (stored in a separate file, replacements.py) of replacements to be made in addition to the already prescribed text data cleaning
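A minimal sketch of what this normalization pipeline might look like, assuming a toy REPLACEMENTS dictionary and stopword set (the real replacements.py and cleaning functions are far more extensive, and lemmatization is omitted here):

```python
import re

# Illustrative stand-ins for the real replacements.py dictionary and stopword list
REPLACEMENTS = {"dat": "that", "wanna": "want to", "gonna": "going to"}
STOPWORDS = {"the", "a", "an", "i", "you", "to", "and", "is"}

def pre_normalise(text):
    # replace phrases containing special characters before general cleaning
    return text.replace("&", "and")

def normalise_lyrics(text):
    text = pre_normalise(text.lower())
    # drop non-lexical vocables such as 'la la la'
    text = re.sub(r"\b(la|na|oh|eh)\b", " ", text)
    tokens = re.findall(r"[a-z']+", text)
    # expand informal contractions, then re-split ('wanna' -> 'want to')
    expanded = " ".join(REPLACEMENTS.get(t, t) for t in tokens).split()
    return [t for t in expanded if t not in STOPWORDS]

print(normalise_lyrics("La la la, I know dat you wanna run"))
# -> ['know', 'that', 'want', 'run']
```

The order matters: vocables are stripped before tokenisation so that a lone ‘la’ never reaches the stopword filter, and contractions are expanded before stopword removal so that the ‘to’ produced by ‘wanna’ is still caught.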
5. Exploratory Data Analysis
“did you see my bag?” — Mic Drop
- Code found here
General
Observations
- Although 2020 saw the highest number of albums released, the numbers of total and unique tracks released are not significantly higher than in 2016 and 2018, when half as many albums came out; this is likely because each single released in 2020 counted as a separate album (Dynamite Day; Dynamite Night)
- The number of unique tracks climbs for a short stretch before a sharp drop where total tracks > unique tracks (the drop usually occurs because an album is a repackage), suggesting a release pattern in which a repackage album can be expected after about three albums where total = unique tracks
- Leveraging personal knowledge to add to the previous point: BTS commonly release a repackaged album as the last in a series of albums sharing similar prefixes and themes (e.g. Young Forever as the epilogue to the Most Beautiful Moment in Life series; Answer as the epilogue to the Love Yourself series)
Lyrics
Observations
- The number of words in BTS lyrics roughly follows a normal distribution
- The median of 207 words supports the previous point that the distribution is more or less symmetrical
- The average number of words per track follows a general decreasing trend across the albums, which is interesting: early BTS releases were ‘grittier’, for lack of a better word (think: angrier, rap-heavy), while their recent songs are more vocal-driven, which may be what this observation reflects
Observations
- A lot of bigrams are the same words repeated, which is not unexpected given that they are lyrics (e.g. know know, love love, go go, bang bang, run run, want want)
- Most of the bigrams are natural pairings, so it is unsurprising that the words often appear together (e.g. let us, let go, one day, hip hop, feel like, one two)
- Most of the bigrams also contain common words, as shown in the word cloud
Observations
- After normalization, the track ‘Interlude: What Are You Doing Now’ has no words, so it is removed from the data used for further analysis
- All the cypher tracks (rap tracks) are among the tracks with the highest word counts
- Intro and outro tracks, predictably, have the lowest word counts
Lyrics: Word Significance
- Significance is measured using TF-IDF, which gives higher weight to less common (hence more meaningful and “interesting”) words
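As a quick illustration of the scoring, using the unsmoothed textbook formula on toy data (real toolkits smooth the idf, so ubiquitous words keep a small, non-zero weight):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document word scores: term frequency x inverse document frequency."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    return [
        {w: (c / len(doc)) * math.log(n / df[w]) for w, c in Counter(doc).items()}
        for doc in docs
    ]

# Toy 'albums' as bags of normalised words (illustrative, not the real data).
# 'love' appears in every album, so this unsmoothed idf zeroes it out;
# 'run', 'fire' and 'dream' are album-specific and score higher.
albums = [["love", "love", "run"], ["love", "fire"], ["love", "dream", "dream"]]
scores = tf_idf(albums)
```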
Observations
- The Love Yourself series lives up to its title: “love” is the top word of significance in all but one of its albums
- In fact, “love” is consistently a relatively significant word
- Earlier albums surface the word “girl”, which appears less significant after the noticeably lonelier album “The Most Beautiful Moment in Life pt. 2”
6. Topic Modelling
“try babbling into the mirror, who the heck are you” — Fake Love
- Code found here
- Model: Latent Dirichlet Allocation (LDA), using the gensim library (unsupervised)
- Model measure: Coherence score (c_v)
- Approach: experiment with various values for the number of topics (n), alpha (a), and eta (b) to find the LDA model that best groups BTS lyrics into a set of topics
Prepare Model Inputs
Utility Function
- The function compute_coherence_value will be used to reduce code repetition
Base Model
- Output: base model c_v: 0.2907
Model 1: Changing the Number of Topics (n)
- The range of the number of topics to test is set from 2 to 10, both to reduce training time and to avoid overfitting the model to the data
- For simplicity’s sake, the number of topics (n) that achieves the highest c_v, n=7, will be used from here on out
- Output: model 1 c_v: 0.2940 — an improvement over the base model
Model 2: Changing alpha values (a)
- The coherence score is highest when a=0.71
- Output: model 2 c_v: 0.2932 — an improvement over the base model, but slightly lower than model 1
Model 3: Changing eta (b) values
- The coherence score is highest when b=0.91
- Output: model 3 c_v: 0.3977 — an improvement over the base model and models 1 and 2
Final Model
Giving a name to the numerical topics, based on personal knowledge of BTS’s discography and history:
7. Conclusion
“drink it up. (creature of creation)” — Dionysus
- The topics are not evenly distributed, as shown in the scant numbers for topics ‘kid love’ and ‘party’
- Judged with more human (and fan) judgment, the topics extracted by the model are similar to one another
- The topic counts across the different albums make sense, e.g. the high number of ‘dreamy love’ tracks in ‘Skool Luv Affair’ and the peak in ‘missed love’ in the ‘WINGS’ album
- There are no huge areas of overlap between the different bubbles, which is a good sign that the topics are fairly different from each other
- At the same time, most of the bubbles are not large or prevalent enough for the model to be a good classifier (should one choose to use it that way); this is also reflected in the model’s low c_v score
Final model c_v : 0.3977
- About the c_v score: this LDA model gave the best c_v score within the time spent experimenting, suggesting that, semantically, it produces the most coherent topics
Reflections
- BTS sing in Korean the majority of the time; the dataset used English-translated lyrics, which may have introduced some linguistic loss
- The dataset only includes the Korean and recent English albums; it does not include Japanese tracks and misses some special albums (e.g. mixtapes, game OST albums) — additional data that might better inform the model’s topic extraction
- The parameters varied to derive the final LDA model (within specific ranges) are limited to the number of topics, alpha, and eta. Potentially, experimenting with varying other model parameters (e.g. number of passes, random state) might lead to a better-resulting model
- LDA might not be the best model to extract topics, other methods (e.g. NMF) might be a better approach
- Perhaps BTS just refuses to have detectable large groups of similar topics in their discography
References
“I’m curious about everything” — Boy with Luv
- Using Machine Learning to Analyze Taylor Swift’s Lyrics — uses NMF to understand Taylor Swift’s artistic progression
- What exactly does Taylor Swift sing about? — uses LDA to understand what Taylor Swift sings about
- Evaluate Topic Models: Latent Dirichlet Allocation (LDA) — explores different measures of evaluating an LDA model
- What is the meaning of a Coherence score of 0.4? Is it good or bad? [closed]
- Topic Modeling with Gensim (Python) — a guide to building LDA models with gensim
- The Mastermind Behind BTS Opens Up About Making a K-Pop Juggernaut — understanding the context
Edits: 16 May 2021, added Fig 19 and corrected some spelling and grammar.