Blue Christmas Part 2: Using Tidytext to calculate the sentiment of Christmas Song Lyrics

Dan Larson
Dec 21, 2022

In this series, I am reproducing the 2017 blog post Blue Christmas by Caitlin Hudon. In the first part, I pulled and analyzed the Spotify features of 4,243 Christmas songs. Using the built-in metrics, I calculated a feature of music sadness by taking the distance of each song's Energy and Valence from the origin (0,0).
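As a rough illustration of that distance measure, here is a minimal sketch. It assumes Euclidean distance, and the track names and values are made up; only the energy and valence column names follow Spotify's audio features. Presumably, tracks closer to (0,0) combine low energy with low valence and read as sadder.

```r
# Hypothetical energy and valence values on Spotify's 0-1 scale
tracks = data.frame(
  track.name = c("Track A", "Track B", "Track C"),
  energy = c(0.2, 0.8, 0.5),
  valence = c(0.1, 0.9, 0.5)
)

# Euclidean distance of each (energy, valence) point from the origin (0,0)
tracks$distance = sqrt(tracks$energy^2 + tracks$valence^2)

# Sort so the tracks closest to (0,0) come first
tracks[order(tracks$distance), ]
```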

In this post, I will share how I extracted lyrics for 1,458 of the songs and used the Tidytext package to do sentiment analysis to come up with a measure of lyrical sadness. To start, I pulled lyrics from the site Genius.

Pulling Lyric Data with GeniusR

GeniusR is a package that allows you to interact with the website Genius, which has a large catalog of song lyrics. Using the GeniusR package, I sent a web request to get the lyrics for each song.

To do this, I first had to clean up the titles of the songs. Genius does not recognize a track when its title includes words like ‘remastered’, ‘remix’, or ‘version’. Luckily, Spotify always appends these words to the song title after a hyphen, so they are easy to remove. Using the stringr::str_extract function and some regex, I was able to clean up the track name.

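library(tidyverse)
## Keep everything before the last hyphen, which drops suffixes such as "- Remastered"
christmas_tracks = christmas_tracks %>%
  mutate(updated_name = str_extract(track.name, ".*(?=-)"),
         track.name = ifelse(is.na(updated_name), track.name, updated_name),
         track.name = trimws(track.name))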
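To see what the pattern does, here is a small sketch on made-up titles (they are not taken from the dataset). It also shows one limitation: because the pattern looks for any hyphen, a title with a hyphen inside a word gets cut short. Removing everything from the first " - " onward with str_remove is one possible alternative, shown here only as an option and not as the approach used in this post.

```r
library(stringr)

# Hypothetical track titles for illustration
titles = c("White Christmas - Remastered", "Jingle Bells", "Last Christmas - Single Version")

# Greedy match of everything before the last hyphen; NA when there is no hyphen
extracted = str_extract(titles, ".*(?=-)")

# Fall back to the original title when nothing was extracted
cleaned = trimws(ifelse(is.na(extracted), titles, extracted))
print(cleaned)

# Limitation: a hyphen inside a word also triggers the match
hyphenated = str_extract("Ding-Dong Bells", ".*(?=-)")

# Possible alternative: only strip a suffix that starts with " - "
stricter = str_remove(c(titles, "Ding-Dong Bells"), " - .*$")
```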

With the cleaned-up track title information, I then looped through the tracks and captured the lyrics.

## Function to run the API call and return NA if the request returns an error
try_lyrics = function(name, track){
  tryCatch(
    paste(get_lyrics_search(artist = name, song = track)$line, collapse = ' '),
    error = function(e) NA_character_
  )
}

## Quick check on a single track
try_lyrics(name = as.character(christmas_tracks[1,1]), track = as.character(christmas_tracks[1,2]))

tmp4 = tibble()
m = 0  ## number of tracks with lyrics found so far

for (i in 1:nrow(christmas_tracks)) {
  print(i)
  text = try_lyrics(name = as.character(christmas_tracks[i,1]), track = as.character(christmas_tracks[i,2]))
  #Sys.sleep(1)
  ## Count the track as a hit only when lyrics actually came back
  if (!is.na(text) && nzchar(text)) {m = m+1}
  tmp3 = tibble(track.id = as.character(christmas_tracks[i,4]), artist = as.character(christmas_tracks[i,1]), song = as.character(christmas_tracks[i,2]), lyrics = text)
  tmp4 = tmp4 %>% bind_rows(tmp3)
  ## Running hit rate
  print(paste(m, ' of ', i, ' (', round((m/i)*100, 1), '%)', sep=''))
}
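The skip-on-error idea in the loop above can be tried on its own, without calling Genius. In this sketch, fake_fetch is a hypothetical stand-in for the lyrics request that fails for one title; it is not part of any package. The wrapper returns NA instead of stopping, so the hit rate can be computed afterwards.

```r
# Hypothetical stand-in for the lyrics request: fails for one title
fake_fetch = function(track) {
  if (track == "Unknown Song") stop("no match found")
  paste("lyrics for", track)
}

# Return NA instead of stopping when the request fails
safe_fetch = function(track) {
  tryCatch(fake_fetch(track), error = function(e) NA_character_)
}

# Apply the wrapper to each title and compute the share of successful requests
results = vapply(c("Jingle Bells", "Unknown Song"), safe_fetch, character(1))
hit_rate = mean(!is.na(results))
print(hit_rate)
```

Returning NA keeps one row per track in the results, which makes it easy to see later which songs were missed.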

Unfortunately, the hit rate was low: I got lyrics for only 1,458 of the 4,243 tracks, or about 34%. With some more work on my process I could improve the match rate, but for now this will do.

Scoring Sentiment

Sentiment analysis is a process of identifying and extracting opinions from text, determining whether the opinion is positive, negative, or neutral. In addition, sentiment analysis can be used to measure the overall emotion of a text, such as anger, joy, sadness, and so on.

Tidytext is an R library for the tidyverse that allows users to process and analyze text using tidy data principles. It provides functions for tokenizing text, removing stop words, and scoring sentiment with built-in lexicons, and it works alongside other packages for tasks such as stemming and topic modeling.

The code below uses the ‘Bing’ lexicon, which labels individual words as positive or negative, to calculate sentiment for each of the songs. The score for a song is its number of positive words minus its number of negative words. To start, we can look at the frequency of all the words in all the songs.

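library(tidytext)

## Bing lexicon: one row per word, labeled 'positive' or 'negative'
bing = get_sentiments('bing')

## One row per word per track
christmas_with_sentiment = christmas_with_sentiment %>% unnest_tokens('word', lyrics)

## Count positive and negative words per track, then take the difference
christmas_with_sentiment = christmas_with_sentiment %>%
  inner_join(bing, by = 'word') %>%
  count(track.id, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)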
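To make the scoring concrete, here is the same pipeline run on two made-up lyric snippets. The text and track ids are hypothetical and not from the dataset. Assuming the Bing lexicon labels "love", "joy", and "happy" as positive and "sad" and "lonely" as negative, track a should come out above zero and track b below zero.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Made-up lyrics for illustration only
toy_lyrics = tibble(
  track.id = c("a", "b"),
  lyrics = c("love and joy, happy and sad", "sad and lonely, so sad")
)

toy_scores = toy_lyrics %>%
  unnest_tokens(word, lyrics) %>%                      # one row per word
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep only words in the lexicon
  count(track.id, sentiment) %>%                       # words per track and label
  spread(sentiment, n, fill = 0) %>%                   # one column per label
  mutate(sentiment = positive - negative)

print(toy_scores)
```

Words that are not in the lexicon, such as "and" or "so", drop out at the join, so the score only reflects words Bing knows about.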

Analyzing Sentiment

Now that I have calculated sentiment, I can analyze the sentiment of the songs. Below is a word cloud that shows the 100 most frequently used words in the songs. The color of each word indicates whether it is a positive or negative word. Not surprisingly, words like Love, Merry, and Happy are among the most commonly used. You can also see that some words are scored as negative even though they are not negative in the context of a Christmas song. For example, ‘Ding’ describes the sound of a bell in a Christmas song but is more commonly associated with negative sentiment.

The next step is to get the frequency of positive and negative words in each song. The histograms below show the distributions for both sets of words. I wasn’t surprised to see that there are more positive words used than negative words.

To calculate the sentiment of each song, I took the positive word count and subtracted the negative word count (sentiment = positive - negative). Below are the 15 most negative and the 15 most positive songs by this score. You may notice there are some non-Christmas songs on the lists. This is an outcome of how I pulled the Christmas tracks: since I selected songs based on their presence on a Christmas playlist, I inadvertently captured some non-Christmas songs.

It is interesting that only the Glee version of Last Christmas has a high negative sentiment score. This is likely because the song is sung a cappella and the first few measures of the track are just ‘Bum’ repeated.

Comparing Music Sadness to Negative Sentiment

Now that I have a feature for lyric sentiment, I can compare it to the music sadness feature. The plot below shows how the two features compare to one another. I like using the ggExtra package to add histograms along the margins, which makes it easy to show the distribution of each feature.
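For readers who have not used ggExtra, here is a minimal sketch of a scatter plot with marginal histograms. The data is simulated and the column names music_sadness and lyric_sentiment are hypothetical; this is not the code behind the plot in this post.

```r
library(ggplot2)
library(ggExtra)

# Simulated values for illustration only; column names are hypothetical
set.seed(1)
toy = data.frame(
  music_sadness = runif(200),
  lyric_sentiment = rnorm(200)
)

# Base scatter plot of the two features
p = ggplot(toy, aes(x = music_sadness, y = lyric_sentiment)) +
  geom_point(alpha = 0.5)

# Add marginal histograms along both axes
p_marginal = ggMarginal(p, type = "histogram")
print(p_marginal)
```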

In the next post, I will share a more complete analysis of the sadness of Christmas songs.

Dan Larson

Data Engineer, father, and Sixers fan in Philadelphia.