Lyric Mood Identifier

Joe Christian · Published in Analytics Vidhya · May 11, 2020 · 14 min read

An NLP classification case study: identifying a song’s positive or negative mood from its English lyrics, using the R language.

Music does more than affect your mood: according to researchers from the University of Groningen, listening to particularly happy or sad music can even change the way we perceive the world. In this modern world, we can easily choose what music we want to listen to. Music platforms such as Spotify are known for their recommender systems, which suggest music based on each customer’s listening history or genre preferences. It would be a new idea if music could also be enjoyed through its lyrics, with recommendations based on the mood those lyrics convey.


Background

This project is based on this Kaggle dataset. The dataset contains about 150k lyrics with valence values gathered using the Spotify API. Valence is a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track: tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). Our task in this article is to perform supervised NLP sentiment analysis to measure the positiveness of a song. Spotify itself could use this kind of analysis to improve its music recommender system with lyric-based (word-level) signals.

Limitation: language is wide and complex, and NLP is known for its high computational cost. So in this analysis I will only use English song lyrics, and I sampled the data down to 45k songs.

Note: all libraries and code are run in RStudio, an IDE for the R programming language.

Libraries used:

# You can load each package into your workspace using the `library()` function
library(dplyr)      # data manipulation
library(tidytext)   # tidy-style text mining
library(textclean)  # text normalization (contractions, slang, elongations)
library(tm)         # corpus and document-term matrix tools
library(SnowballC)  # word stemming
library(stringr)    # string manipulation
library(rsample)    # train/test splitting and resampling
library(cld2)       # language detection
library(caret)      # confusion matrices and model evaluation
library(e1071)      # Naive Bayes
library(tidymodels) # parsnip, tune, and friends for modeling

As I said before, the dataset contains about 150k lyrics and a handful of variables. Here’s a glimpse of the dataset:

Observations: 158,353
Variables: 5
$ X <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 2...
$ artist <fct> Elijah Blake, Elijah Blake, Elijah Blake, Elijah Blake, Elijah Blake, Elijah Blake, Eli...
$ seq <fct> "No, no\nI ain't ever trapped out the bando\nBut oh Lord, don't get me wrong\nI know a ...
$ song <fct> Everyday, Live Till We Die, The Otherside, Pinot, Shadows & Diamonds, Uno, Girlfriend (...
$ label <dbl> 0.6260, 0.6300, 0.2400, 0.5360, 0.3710, 0.3210, 0.6010, 0.3330, 0.5060, 0.1790, 0.2090,...

Data Wrangling

Here’s an example of the lyrics.

head(dat$seq, 1)
> [1] No, no\nI ain't ever trapped out the bando\nBut oh Lord, don't get me wrong\nI know a couple niggas that do\nI'm from a place where everybody knows your name\nThey say I gotta watch my attitude\nWhen they see money, man they all start actin' strange\nSo fuck with the ones that fuck with you\nThey can never say I'm brand new\n\nIt's everyday, everyday\nEveryday, everyday, everyday\nEveryday, everyday\nEveryday, everyday\nI've been talkin' my shit, nigga that's regular\nI've been lovin' 'em thick, life is spectacular\nI spend like I'ma die rich, nigga I'm flexin', yeah\nEveryday, that's everyday\nThat's everyday\nThat's everyday\nThat's everyday, everyday\n\nI see all of these wanna-be hot R&B singers\nI swear you all sound the same\nThey start from the bottom, so far from the motto\nYou niggas'll never be Drake\nShout out to OVO\nMost of them prolly don't know me though\nI stay in the cut, I don't fuck with no\nBody but I D, that's a pun on No I.D\nWhen nobody know my name\nRunnin' for my dream wasn't hard to do\nYou break bread, I swear they all pull out a plate\nEat with the ones who starved with you\nIf I'm winnin' then my crew can't lose\n\nIt's everyday, everyday\nEveryday, everyday, everyday\nEveryday, everyday\nEveryday, everyday\nI've been talkin' my shit, nigga that's regular\nI've been lovin' 'em thick, life is spectacular\nI spend like I'ma die rich, nigga I'm flexin', yeah\nEveryday, that's everyday\nThat's everyday\nThat's everyday\nThat's everyday, everyday\n\nI heard since you got money\nYou changed, you're actin' funny\nThat's why I gets on my lonely\nYou be lovin' when change is a hobby\nWho do you dress when you ain't got nobody?\n\nIt's everyday, everyday\nEveryday, everyday, everyday\nEveryday, everyday\nEveryday, everyday\nI've been talkin' my shit, nigga that's regular\nI've been lovin' 'em thick, life is spectacular\nI spend like I'ma die rich, nigga I'm flexin', yeah\nEveryday, that's everyday\nThat's everyday\nThat's everyday\nThat's everyday, everyday
135645 Levels: ''Do you want... to have... a tasty... mushroom?' ...

The lyrics are stored in the seq column. As you can see, they’ll need a lot of treatment before modeling. The simplest thing we can do first is remove the “\n” new-line breaks. The target column (label) is still in numeric format. As I said before, a higher valence (label) value means the song is considered positive-mood and a lower value means negative-mood, so I’ll convert the valence into a binary label, `positive` or `negative`, with 0.5 as the cutoff. I also want to keep English-only lyrics to make the NLP easier; I’ll use a function from the `cld2` package to detect each lyric’s language.

dat$seq <- str_replace_all(as.character(dat$seq), "\n", " ")
# valence > 0.5 is labelled positive, otherwise negative
dat$mood <- ifelse(dat$label > 0.5, "positive", "negative")
dat$lang <- cld2::detect_language(dat$seq)
# filter to English lyrics only (detect_language returns NA when unsure)
dat <- dat[!is.na(dat$lang) & dat$lang == "en",]

Let’s see how our data has changed:

head(dat$seq, 1)
> [1] "Who keeps on trusting you When you been cheating Spending your nights on the town? Who keeps on saying That she still wants you When you're through runnin' around? Who keeps on lovin' you When you been lyin' Sayin' things that ain't what they seem?  Well, God does But I don't  God will But I won't And that's the difference Between God and me  God does, but I don't God will, but I won't And that's the difference Between God and me   So, who says she'll forgive you Says that she'll miss you And dream of your sweet memory? Well God does But I don't God will  But I won't And that's the difference Between God and me  God does, but I don't God will, but I won't And that's the difference  Between God and me  God does, but I don't God will, but I won't And that's the difference Between God and me"

That’s only one of many text-cleaning steps we’ll do. Due to my machine’s limitations, I only use 45k songs for the analysis, selected by random sampling, as sketched below.
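The sampling itself isn’t shown in the original snippet; a minimal sketch of that step (the seed and exact count here are my assumptions) could look like this:

# randomly sample 45k of the remaining English-language songs
set.seed(1502)                          # assumed seed, for reproducibility
dat <- dat[sample(nrow(dat), 45000), ]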

In the text-cleaning process I’m more familiar with the tm and stringr packages, but I also want to learn the magic of textclean. I’ll use both to clean the text data before modeling. Here’s the code for the text-cleaning process; the explanation is written in comments (#) inside the code:

dat <- dat %>%
  mutate(text_clean = seq %>%             # select seq column
    str_to_lower() %>%                    # convert all strings to lower case
    replace_contraction() %>%             # replace contractions with their multi-word forms
    replace_internet_slang() %>%          # replace internet slang with normal words
    replace_word_elongation() %>%         # replace informal elongations ("heyyy" -> "hey")
    replace_number(remove = T) %>%        # remove numbers
    replace_date(replacement = "") %>%    # remove dates
    str_remove_all(pattern = "[[:punct:]]") %>% # remove punctuation
    str_squish() %>%                      # reduce repeated whitespace inside a string
    str_trim()                            # remove whitespace from start and end of string
  )

There’s a lot happening in the code above, yet we only need one chunk of code to do it. I’ll also convert the text data into a corpus and tokenize it using the tm package. After that, I’ll only keep words that appear in at least 850 songs. This limitation is needed so the model generalizes over commonly used words.

corp <- VCorpus(VectorSource(dat$text_clean))
corp_dtm <- corp %>%
  # use pre-built English stopwords
  tm_map(removeWords, stopwords("en")) %>%
  tm_map(stemDocument) %>%
  # convert corpus to document-term matrix
  DocumentTermMatrix()
# find terms that appear in at least 850 lyrics
freq_term <- findFreqTerms(corp_dtm, 850)
# 815 words are selected
dat.dtm <- corp_dtm[,freq_term]
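If you want to sanity-check the filtering, you can inspect the dimensions of the reduced matrix and peek at the retained terms:

# documents x retained terms; the second number should match the 815 selected words
dim(dat.dtm)
# peek at a few of the retained (stemmed) words
head(freq_term)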

Modeling

Naive Bayes

Modeling with NB needs special treatment of the training data. Each column represents a word and each row represents a single song. NB doesn’t need the exact count of each word; it only needs to know whether a word is present in the song or not. Thus we convert the value in each cell to either 1 or 0: 1 means the word is present in the song, 0 means it is not.

# split the data. 75% for train data, and 25% for test data
set.seed(1502)
index <- sample(1:nrow(dat.dtm), 0.75*nrow(dat.dtm))
train_x <- dat.dtm[index,]
test_x <- dat.dtm[-index,]
# subset label/target variable
train_label <- dat[index,"mood"]
test_label <- dat[-index,"mood"]

We’ll use a Bernoulli converter: any value above 0 becomes 1, and 0 remains 0. We’ll build a custom function for that, then apply it to the train and test data.

# build bernoulli converter function
bernoulli_conv <- function(x){
  as.factor(as.numeric(x > 0))
}
# apply bernoulli_conv funtion to train and test data
train_x <- apply(train_x,2,bernoulli_conv)
test_x <- apply(test_x,2,bernoulli_conv)

A 0 in a cell indicates that the song doesn’t contain a particular word. It also means the corresponding class-feature combination has a probability of 0 of occurring, which would ruin the NB algorithm: it computes the conditional a-posteriori probabilities of the class variable given the predictor variables using Bayes’ rule, and a single zero factor zeroes out the whole product. We can specify laplace = 1 to enable add-one smoothing.
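To see why the smoothing matters, here is a toy calculation (the counts are made up purely for illustration):

# Bernoulli NB estimates P(word present | class) from counts.
# Without smoothing, a word never seen in a class gets probability 0,
# which zeroes out the whole product of conditional probabilities.
word_in_class <- 0     # hypothetical: word never appears in "positive" songs
class_total   <- 1000  # hypothetical: 1000 "positive" songs in the train data
laplace       <- 1
(word_in_class + laplace) / (class_total + 2 * laplace)  # ~0.001 instead of 0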

Finally, we’ll build the Naive Bayes model, predict on the test data, and create a confusion matrix for later evaluation.

# train the model
mod.nb <- naiveBayes(train_x, as.factor(train_label), laplace = 1)
# predict on test data
pred.nb <- predict(mod.nb, test_x, type = "class")
# build dataframe for prediction result
pred.nb.x <- cbind(data.frame(pred.nb), test_label) %>%
  setNames(c("pred","actual"))
# create confusion matrix
cf.nb <- confusionMatrix(data = pred.nb.x$pred,
                         reference = pred.nb.x$actual,
                         positive = "positive")

Here’s the result:

cf.nb
> Confusion Matrix and Statistics

          Reference
Prediction negative positive
  negative     4363     2733
  positive     1464     2690

Accuracy : 0.6269
95% CI : (0.6179, 0.6359)
No Information Rate : 0.518
P-Value [Acc > NIR] : < 0.00000000000000022

Kappa : 0.2468

Mcnemar's Test P-Value : < 0.00000000000000022

Sensitivity : 0.4960
Specificity : 0.7488
Pos Pred Value : 0.6476
Neg Pred Value : 0.6149
Prevalence : 0.4820
Detection Rate : 0.2391
Detection Prevalence : 0.3692
Balanced Accuracy : 0.6224

'Positive' Class : positive

Decision Tree

Next we’ll build models using different algorithms: Decision Tree, MARS, and Random Forest. All the modeling (except Naive Bayes) uses the parsnip package (part of tidymodels). But before that, we need to make a data frame from the cleaned data. The token values will not be converted to 1 or 0 as with Naive Bayes; they’ll remain the original counts. And just like before, we’ll split the data into train and test with a 75/25 proportion.

dat.clean <- as.data.frame(as.matrix(dat.dtm), stringsAsFactors = F)
# we have 800+ word variables. I rename the label from `mood` to `labelY`
# to avoid clashing with a word column of the same name
new.dat <- cbind(dat.clean, data.frame(labelY = dat$mood))
# splitting dataset
set.seed(1502)
splitter <- initial_split(new.dat, prop = 0.75, strata = "labelY")
train <- training(splitter)
test <- testing(splitter)
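Since we stratified the split on labelY, the class balance should be roughly the same in both sets; a quick optional sanity check:

# class proportions should be similar in train and test after a stratified split
prop.table(table(train$labelY))
prop.table(table(test$labelY))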

Finally, let’s build the Decision Tree model, predict on the test data, and create a confusion matrix for later evaluation.

# train the model
mod.dt <- decision_tree(mode = "classification") %>%
  set_engine("rpart") %>%
  fit(labelY ~ ., data = train)
pred.dt <- predict(mod.dt, test, type = "class")
# build dataframe for prediction result
pred.dt.x <- as.data.frame(cbind(pred.dt, test$labelY)) %>%
  setNames(c("pred","actual"))
# create confusion matrix
cf.dt <- confusionMatrix(data = pred.dt.x$pred,
                         reference = pred.dt.x$actual,
                         positive = "positive")

Here’s the result from the Decision Tree model:

cf.dt
> Confusion Matrix and Statistics

          Reference
Prediction negative positive
  negative     3671     2404
  positive     2188     2986

Accuracy : 0.5918
95% CI : (0.5826, 0.6009)
No Information Rate : 0.5208
P-Value [Acc > NIR] : < 0.0000000000000002

Kappa : 0.1808

Mcnemar's Test P-Value : 0.00151

Sensitivity : 0.5540
Specificity : 0.6266
Pos Pred Value : 0.5771
Neg Pred Value : 0.6043
Prevalence : 0.4792
Detection Rate : 0.2654
Detection Prevalence : 0.4600
Balanced Accuracy : 0.5903

'Positive' Class : positive

MARS

Next, we build a third model using the MARS (Multivariate Adaptive Regression Splines) algorithm.

# train MARS model
mod.mars <- mars(mode = "classification") %>%
  set_engine("earth") %>%
  fit(labelY ~ ., data = train)
pred.mars <- predict(mod.mars, test, type = "class")
# build dataframe for prediction result
pred.mars.x <- as.data.frame(cbind(pred.mars, test$labelY)) %>%
  setNames(c("pred","actual"))
# create confusion matrix
cf.mars <- confusionMatrix(data = pred.mars.x$pred,
                           reference = pred.mars.x$actual,
                           positive = "positive")

Here’s the result from the MARS model:

cf.mars
> Confusion Matrix and Statistics

          Reference
Prediction negative positive
  negative     4403     2880
  positive     1456     2510

Accuracy : 0.6145
95% CI : (0.6055, 0.6236)
No Information Rate : 0.5208
P-Value [Acc > NIR] : < 0.00000000000000022

Kappa : 0.2195

Mcnemar's Test P-Value : < 0.00000000000000022

Sensitivity : 0.4657
Specificity : 0.7515
Pos Pred Value : 0.6329
Neg Pred Value : 0.6046
Prevalence : 0.4792
Detection Rate : 0.2231
Detection Prevalence : 0.3526
Balanced Accuracy : 0.6086

'Positive' Class : positive

Random Forest

One of my favorite algorithms, and also the most hated (a serious RAM killer). This love-hate model needs special treatment of the column names: names like break, for, next, and if are reserved words in R and will raise errors when building the random forest and tuning the model.

# store the train and test data in new variables so the originals stay reproducible
train_tune <- train
test_tune <- test
colnames(train_tune) <- make.names(colnames(train_tune))
colnames(test_tune) <- make.names(colnames(test_tune))
# build 3-fold cross-validation for tuning evaluation
set.seed(1502)
folds <- vfold_cv(train_tune, 3)

After we’ve changed the column names (read: escaped the reserved words), we build the model the same way as before.

# train Random Forest model
mod.rf <- rand_forest(trees = 500, mtry = 5, mode = "classification") %>%
  set_engine("ranger") %>%
  fit(labelY ~ ., data = train_tune)
pred.rf <- predict(mod.rf, test_tune, type = "class")
# build dataframe for prediction result
pred.rf.x <- as.data.frame(cbind(pred.rf, test_tune$labelY)) %>%
  setNames(c("pred","actual"))
# create confusion matrix
cf.rf <- confusionMatrix(data = pred.rf.x$pred,
                         reference = pred.rf.x$actual,
                         positive = "positive")

Here’s the result from the Random Forest model:

cf.rf
> Confusion Matrix and Statistics

          Reference
Prediction negative positive
  negative     4140     2019
  positive     1719     3371

Accuracy : 0.6677
95% CI : (0.6589, 0.6764)
No Information Rate : 0.5208
P-Value [Acc > NIR] : < 0.00000000000000022

Kappa : 0.3328

Mcnemar's Test P-Value : 0.000001006

Sensitivity : 0.6254
Specificity : 0.7066
Pos Pred Value : 0.6623
Neg Pred Value : 0.6722
Prevalence : 0.4792
Detection Rate : 0.2997
Detection Prevalence : 0.4525
Balanced Accuracy : 0.6660

'Positive' Class : positive

From all the confusion matrices above, we can see that the Random Forest model has the highest accuracy. Sadly, I’m not satisfied with the result: the best accuracy is only 66.77%. I’ll try tuning the Random Forest model in the hope of getting a better result.

Model Tuning for Random Forest

With Random Forest we can tune parameters such as trees and mtry. This time we’ll run a grid search over a given set of values for both: 4 values of trees times 4 values of mtry, i.e. 16 combinations, each evaluated with 3-fold cross-validation.

# specify the grid for both parameters
rf.grid <- expand.grid(trees = c(450,500,550,600), mtry = 3:6)
rf.setup <- rand_forest(trees = tune(), mtry = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")
# this tuning takes a long time. If you run it on your own PC, be patient
# and make sure your machine has at least 8 GB of RAM
rf.tune <- tune_grid(labelY ~ ., model = rf.setup, resamples = folds,
                     grid = rf.grid, metrics = metric_set(accuracy, sens, spec))
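Once the tuning finishes, you can inspect the results before committing to a parameter set; for example:

# summarise metrics across all 16 grid combinations
collect_metrics(rf.tune)
# show the top combinations ranked by accuracy
show_best(rf.tune, metric = "accuracy")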

The tuning takes a long time. From the results, the best parameters are mtry = 6 and trees = 550. We’ll rebuild the model with those parameters, predict on the test data, and create a confusion matrix for later evaluation.

# select the best parameters by accuracy
best.rf <- rf.tune %>% select_best(metric = "accuracy")
# rebuild the model with those parameters
mod.rf.2 <- rf.setup %>%
  finalize_model(parameters = best.rf) %>%
  fit(labelY ~ ., data = train_tune)
# predict the new model on test data
pred.rf.2 <- predict(mod.rf.2, test_tune, type = "class")
# build dataframe for prediction result
pred.rf.2.x <- as.data.frame(cbind(pred.rf.2, test_tune$labelY)) %>%
  setNames(c("pred","actual"))
# create confusion matrix
cf.rf.2 <- confusionMatrix(data = pred.rf.2.x$pred,
                           reference = pred.rf.2.x$actual,
                           positive = "positive")

We got a very small improvement: accuracy went from 66.77% to 66.9%.

Huft… not bad, I guess.

Model evaluation and conclusion

Let’s combine all the confusion matrices to make the evaluation easier.

df.nb <- data.frame(t(as.matrix(cf.nb, what = "classes")))
df.nb <- cbind(df.nb, data.frame(t(as.matrix(cf.nb,what = "overall"))))
df.dt <- data.frame(t(as.matrix(cf.dt, what = "classes")))
df.dt <- cbind(df.dt, data.frame(t(as.matrix(cf.dt,what = "overall"))))
df.mars <- data.frame(t(as.matrix(cf.mars, what = "classes")))
df.mars <- cbind(df.mars, data.frame(t(as.matrix(cf.mars,what = "overall"))))
df.rf <- data.frame(t(as.matrix(cf.rf, what = "classes")))
df.rf <- cbind(df.rf, data.frame(t(as.matrix(cf.rf,what = "overall"))))
df.rf.2 <- data.frame(t(as.matrix(cf.rf.2, what = "classes")))
df.rf.2 <- cbind(df.rf.2, data.frame(t(as.matrix(cf.rf.2,what = "overall"))))
all.eval <- rbind(Naive_Bayes = df.nb,
                  Decision_Tree = df.dt,
                  Mars = df.mars,
                  Random_Forest = df.rf,
                  Random_Forest_tuned = df.rf.2) %>%
  select("Accuracy","Sensitivity","Specificity","Precision","F1") %>%
  data.frame()

Here’s a screenshot of the result:

Image 1: Metric results from all models

Since there’s no urgency in this case, we’ll choose accuracy as our high-priority metric. Users can easily skip or remove recommended songs they don’t like, and a wrong recommendation won’t affect our operational cost. A positive song in a sad-song playlist won’t harm anyone, but it’s better if we try to avoid it.

As we can see from the table above, the tuned Random Forest model has the highest accuracy. It’s always possible to get higher accuracy (or better values on other metrics) by trying other classification models; we’ll do that in the future. In conclusion, we’ll use the Random Forest model to predict a song’s mood based on its lyrics.

Predicting new given lyric

What if I have a lyric that isn’t included in the dataset? Can I use this model to predict its mood?

Yes, you can.

We only cover approximately 45k songs; there are thousands if not millions of songs worldwide, and it would be a shame if we couldn’t predict the mood of any given lyric. So here we’ll build a function that fits a plain new lyric text into our model. The data will be cleaned automatically before we predict its mood.

Here I’ll use a song from the One Piece OST, opening 3, titled ‘Hikari e’ (To the Light) as a sample. The song is originally Japanese, but I use an English translation to match our existing model.

# new text lyric
text <- "I've just now begun to search, over the splashing waves
For the everlasting world
With this overflowing passion in my chest, I will go anywhere
Seeking the light yet unseen.
When the summer sun shakes my heart's sail
That's the signal to open the door to a new world
Swaying on the waves, surpassing my despair
Aiming for the other side of the horizon.
I've just now begun to search, over the splashing waves,
For the everlasting world
With this overflowing passion in my chest, I will go anywhere,
Seeking the light yet unseen.
A current of repetitious days and mundane clouds
I see reflected in you a future you can't possibly know
Even if I avoid pain by not changing
That leaves me without dreams or even hope -- so let's go!.
Why am I searching? What is it I want?
The answer is surely somewhere ahead
My heart will go on to the moving world
Hiding my yet unseen strength.
Why am I searching? What is it I want?
Where is the yet unseen treasure?
With this overflowing passion in my chest, how far can I go?
I don't know, but
I've just now begun to search, over the splashing waves,
For the everlasting world
With this overflowing passion in my chest, I will go anywhere,
Seeking the light yet unseen
To the other side"

First of all, we need to convert the lyric into a data frame shaped like what our model used before. I’ll use the tuned Random Forest model to predict this lyric since it’s our best model. Next we’ll build a function to automatically clean the lyrics and convert them to the required shape: it’s just all the cleaning steps combined into one function, with a new data frame as the output. It also matches the words (our predictor variables) against the column names required by the training data.

textcleaner <- function(x){
  x <- as.character(x)

  x <- x %>%
    str_to_lower() %>%
    replace_contraction() %>%
    replace_internet_slang() %>%
    replace_word_elongation() %>%
    replace_number(remove = T) %>%
    replace_date(replacement = "") %>%
    str_remove_all(pattern = "[[:punct:]]") %>%
    str_squish() %>%
    str_trim()

  # build a document-term matrix restricted to the words the model knows
  xdtm <- VCorpus(VectorSource(x)) %>%
    tm_map(removeWords, stopwords("en")) %>%
    tm_map(stemDocument) %>%
    DocumentTermMatrix(control = list(
      dictionary = names(train_tune)
    ))

  # return the result as a data frame
  as.data.frame(as.matrix(xdtm), stringsAsFactors = F)
}

After that, we’ll apply the function to the sample lyric and predict the song’s mood using the tuned Random Forest model.

# apply textcleaner function to sample text
samptext <- textcleaner(text)
predict(mod.rf.2, samptext)
>   .pred_class
>   <fctr>
> 1 negative

The Random Forest model predicts the lyric as a negative-mood song. If you listen to the actual song, it’s spirited, energetic, positive-mood music, but I never knew what the lyrics actually say.
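If you want to see how confident the model is rather than just the hard label, parsnip can also return class probabilities (the ranger engine fits probability forests by default for classification):

# class probabilities instead of a hard class label
predict(mod.rf.2, samptext, type = "prob")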

Thank you!

Please leave a comment if you want to discuss. I also welcome all criticism so I can keep learning.
