What Does BTS Sing About?
An attempt at applying topic modelling to understand BTS’s lyrics.
1. Introductions
BTS (방탄소년단) is many things, including
- A septet that debuted in 2013 under Big Hit Entertainment
- The best-selling artist in South Korean history
- 43rd in the Forbes Celebrity 100 (2019) as one of the world’s top-earning celebrities
Much of BTS’s success is credited to what they sing about: the group often embeds social issues into its lyrics and is frequently praised as one of the more authentic K-pop acts.
2. Motivation
“First time with BTS?” — Dope
- Extension of a side project to collect data on BTS songs
- Desire to attempt a text analysis project that did not involve the usual, more commonly explored datasets (news, social media, etc.)
- To better understand BTS’s progression as artists through an understanding of their lyrics
3. Dataset
“Just take ’em take ‘em” — Blood Sweat & Tears
- Self-procured and manually cleaned, using data from the lyrics website Genius and BTS’s official webpage
- Consists of 18 albums and 225 tracks from 2013 to 2021, with English-translated lyrics for each track
- More details regarding the dataset can be found on the Kaggle page
4. Data Pre-Processing
“snowflakes fall down and get away little by little” — Spring Day
The full code can be found here. This piece keeps code-block embeds to a minimum, so most steps are not shown; refer to the link for the full code.
Only Keeping Unique Tracks
Tracks that are excluded from analysis beyond exploratory purposes are:
- repackaged tracks (songs previously released but included again in a later album) — duplicates of existing tracks
- remix tracks — considered duplicates, as their lyrics rarely differ
- tracks that have a full version (has_full_ver = TRUE) — considered duplicates, as their lyrics are already represented in the full-version tracks
- “skit” tracks — BTS include snippets of conversations or soundbites, known as “skits”, in their tracklists
- “notes” tracks — for some albums, Genius also has translations of “notes”, the printed text brochures included with the physical albums; these are removed from the analysis as they are not lyrics
Normalize Lyrics
- pre_normalise function — replaces text before the usual text normalization methods (e.g. replaces phrases containing special characters)
- lyrics functions — methods that deal with various lyrical features (e.g. reducing contractions such as changing ‘dat’ to ‘that’; removing ‘la la la’ and other non-lexical vocables from the lyrics)
- text functions — methods that perform routine text data cleaning (remove stopwords, lemmatize words, etc.)
- replacements — a self-defined dictionary (stored in a separate file, replacements.py) of replacements to be made in addition to the already prescribed text data cleaning
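A minimal sketch of what this normalization pipeline might look like, assuming a toy REPLACEMENTS dictionary and stopword set (the real replacements.py and cleaning functions are far more extensive, and lemmatization is omitted here):

```python
import re

# Illustrative stand-ins for the real replacements.py dictionary and stopword list
REPLACEMENTS = {"dat": "that", "wanna": "want to", "gonna": "going to"}
STOPWORDS = {"the", "a", "an", "i", "you", "to", "and", "is"}

def pre_normalise(text):
    # replace phrases containing special characters before general cleaning
    return text.replace("&", "and")

def normalise_lyrics(text):
    text = pre_normalise(text.lower())
    # drop non-lexical vocables such as 'la la la'
    text = re.sub(r"\b(la|na|oh|eh)\b", " ", text)
    tokens = re.findall(r"[a-z']+", text)
    # expand informal contractions, then re-split ('wanna' -> 'want to')
    expanded = " ".join(REPLACEMENTS.get(t, t) for t in tokens).split()
    return [t for t in expanded if t not in STOPWORDS]

print(normalise_lyrics("La la la, I know dat you wanna run"))
# -> ['know', 'that', 'want', 'run']
```

The order matters: vocables are stripped before tokenisation so that a lone ‘la’ never reaches the stopword filter, and contractions are expanded before stopword removal so that the ‘to’ produced by ‘wanna’ is still caught.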
5. Exploratory Data Analysis
“did you see my bag?” — Mic Drop
- Code found here
General
Observations
- Although 2020 saw the highest number of albums released, the numbers of total and unique tracks released are not significantly higher than in 2016 and 2018, when half as many albums came out; this is likely because each single released in 2020 counted as a separate album (Dynamite Day; Dynamite Night)
- The number of unique tracks climbs for a short stretch before a sharp drop where total tracks > unique tracks (the drop usually occurs because an album is a repackage), suggesting a release pattern in which a repackage album can be expected after about three albums where total = unique tracks
- Leveraging personal knowledge to add to the previous point: BTS commonly release a repackaged album as the last in a series of albums sharing similar prefixes and themes (e.g. Young Forever as the epilogue to the Most Beautiful Moment in Life series; Answer as the epilogue to the Love Yourself series)
Lyrics
Observations
- The number of words in BTS lyrics roughly follows a normal distribution
- The median of 207 words supports the previous point that the distribution is more or less symmetrical
- The average number of words per track follows a general decreasing trend across the albums, which is interesting: early BTS releases were ‘grittier’, for lack of a better word (think: angrier, rap-heavy), while their recent songs are more vocal-driven, which may be what this observation reflects
Observations
- A lot of bigrams are the same words repeated, which is not unexpected given that they are lyrics (e.g. know know, love love, go go, bang bang, run run, want want)
- Most of the bigrams are natural pairings, so it is unsurprising that the words often appear together (e.g. let us, let go, one day, hip hop, feel like, one two)
- Most of the bigrams also contain common words, as shown in the word cloud
Observations
- After normalization, the track ‘Interlude: What Are You Doing Now’ has no words, so it is removed from the data used for further analysis
- All the cypher tracks (rap tracks) are among the tracks with the highest word counts
- Intro and outro tracks, predictably, have the lowest word counts
Lyrics: Word Significance
- Significance is measured using TF-IDF, which gives higher weight to less common (hence more meaningful and “interesting”) words
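As a quick illustration of the scoring, using the unsmoothed textbook formula on toy data (real toolkits smooth the idf, so ubiquitous words keep a small, non-zero weight):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document word scores: term frequency x inverse document frequency."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    return [
        {w: (c / len(doc)) * math.log(n / df[w]) for w, c in Counter(doc).items()}
        for doc in docs
    ]

# Toy 'albums' as bags of normalised words (illustrative, not the real data).
# 'love' appears in every album, so this unsmoothed idf zeroes it out;
# 'run', 'fire' and 'dream' are album-specific and score higher.
albums = [["love", "love", "run"], ["love", "fire"], ["love", "dream", "dream"]]
scores = tf_idf(albums)
```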
Observations
- The Love Yourself series lives up to its title: “love” is the top word of significance in all but one of its albums
- In fact, “love” is consistently a relatively significant word
- Earlier albums surface the word “girl”, which appears less significant after the noticeably lonelier album “The Most Beautiful Moment in Life pt. 2”
6. Topic Modelling
“try babbling into the mirror, who the heck are you” — Fake Love
- Code found here
- Model: Latent Dirichlet Allocation (LDA), using the gensim library (unsupervised)
- Model measure: Coherence score (c_v)
- Approach: experiment with various values for the number of topics (n), alpha (a), and eta (b) to find the LDA model that best groups BTS lyrics into a set of topics
Prepare Model Inputs
Utility Function
- The function compute_coherence_value will be used to reduce code repetition
Base Model
- Output: base model c_v: 0.2907
Model 1: Changing the Number of Topics (n)
- The range of the number of topics to test is set from 2 to 10, both to reduce training time and to avoid overfitting the model to the data
- For simplicity’s sake, the number of topics (n) that achieves the highest c_v, n=7, will be used from here on out
- Output: model 1 c_v: 0.2940 — an improvement over the base model
Model 2: Changing alpha values (a)
- The coherence score is highest when a=0.71
- Output: model 2 c_v: 0.2932 — an improvement over the base model, but slightly lower than model 1
Model 3: Changing eta (b) values
- The coherence score is highest when b=0.91
- Output: model 3 c_v: 0.3977 — an improvement over the base model and models 1 and 2
Final Model
Giving a name to the numerical topics, based on personal knowledge of BTS’s discography and history:
7. Conclusion
“drink it up. (creature of creation)” — Dionysus
- The topics are not evenly distributed, as shown in the scant numbers for topics ‘kid love’ and ‘party’
- Judged with more human (and fan) judgment, the topics extracted by the model are similar to one another
- The topic counts across the different albums make sense, e.g. the high number of ‘dreamy love’ tracks in ‘Skool Luv Affair’ and the peak in ‘missed love’ in the ‘WINGS’ album
- There are no huge areas of overlap between the different bubbles, which is a good sign that the topics are fairly different from each other
- At the same time, most of the bubbles are not large or prevalent enough for the model to be a good classifier (should one choose to use it that way); this is also reflected in the model’s low c_v score
Final model c_v : 0.3977
- About the c_v score: this LDA model gave the best c_v score within the time spent experimenting, suggesting that, semantically, it produces the most coherent topics
Reflections
- BTS sing in Korean the majority of the time; the dataset used English-translated lyrics, which may have introduced some linguistic loss
- The dataset only includes the Korean and recent English albums; it does not include Japanese tracks and misses some special albums (e.g. mixtapes, game OST albums) — additional data that might better inform the model’s topic extraction
- The parameters varied to derive the final LDA model (within specific ranges) are limited to the number of topics, alpha, and eta. Potentially, experimenting with varying other model parameters (e.g. number of passes, random state) might lead to a better-resulting model
- LDA might not be the best model to extract topics, other methods (e.g. NMF) might be a better approach
- Perhaps BTS just refuses to have detectable large groups of similar topics in their discography
References
“I’m curious about everything” — Boy with Luv
- Using Machine Learning to Analyze Taylor Swift’s Lyrics — uses NMF to understand Taylor Swift’s artistic progression
- What exactly does Taylor Swift sing about? — uses LDA to understand what Taylor Swift sings about
- Evaluate Topic Models: Latent Dirichlet Allocation (LDA) — explores different measures of evaluating an LDA model
- What is the meaning of a Coherence score of 0.4? Is it good or bad? [closed]
- Topic Modeling with Gensim (Python) — a guide to building LDA models with gensim
- The Mastermind Behind BTS Opens Up About Making a K-Pop Juggernaut — understanding the context
Edits: 16 May 2021, added Fig 19 and corrected some spelling and grammar.