Still Blue Christmas — Still Data Driven: Using data to analyze the mood of Christmas music

Dan Larson
5 min read · Dec 4, 2022


In 2017, Caitlin Hudon used a data-driven method to find the most depressing Christmas songs. In the five years since, there has been an abundance of jolly (or not so jolly) jams released, including this instant classic from Lil Jon, which graced our playlists in 2018.

Since my family forces me to listen to Christmas music from Thanksgiving until Christmas Day, I decided to update the analysis and add a few extra pieces. The goal is to identify the saddest and happiest Christmas tracks on Spotify in the hopes of making some banger playlists. To update the analysis, I took an approach similar to the one Caitlin used back in 2017:

  • Identify Christmas songs on Spotify
  • Get Spotify audio features
  • Calculate sadness using a measure of distance
  • Get song lyrics using the GeniusR package
  • Sentiment-score the song lyrics
  • Analyze the data using unsupervised techniques

I am going to break this analysis up into three posts. First, I will explain how I extracted the songs and audio features from Spotify. I will then show how I calculated the musical sadness metric and share some exploratory analysis to see how we can improve the measure. In part two, I will share how I scraped the lyrics using the GeniusR package and do some NLP and sentiment analysis to calculate lyrical sadness. The final part will dive into the analysis of the data and create two Spotify playlists.

Identifying Songs

In the original post, the author pulled 60 songs from her own playlists. Instead of limiting the analysis to songs I could identify myself, I crowd-sourced by using the SpotifyR package to search for playlists containing the word “Christmas”. While this surfaced many more songs, it also introduced duplication and non-Christmas-specific songs. I discuss how to handle the duplication later in the analysis.

playlists = spotifyr::search_spotify("Christmas", type = "playlist", limit = 50)

Below is a sample of the 50 playlists.

Using SpotifyR functions and loops, I pulled all of the tracks in these playlists and deduplicated them. In all, there are 4,243 unique Christmas songs. Below you can see which artists have the most Christmas songs, as well as the most popular Christmas tracks.
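For those following along, the playlist-to-track pull looks roughly like the sketch below. I am assuming the playlists object returned by search_spotify() exposes the playlist ids in an id column; get_playlist_tracks() is a SpotifyR function, and deduplicating on track.id is my own choice of key.

library(spotifyr)
library(dplyr)
library(purrr)

## Pull the tracks from every playlist found above, then drop duplicate tracks
## (get_playlist_tracks() returns up to 100 tracks per playlist by default)
christmas_tracks = map_dfr(playlists$id, get_playlist_tracks) %>%
  distinct(track.id, .keep_all = TRUE)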

There is likely some more analysis I can do on the tracks themselves (e.g. what are the most common Christmas covers?), but I’ll look at that in a later post. Next, I need to capture the audio features of the identified tracks.

Getting Audio Features

Spotify provides 11 audio features for each song, which I can use to calculate the sadness of each track. For this measure of sadness, I am interested in valence and energy.

Valence is “A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).”

Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

To get the audio features I ran the loop below.

track_features = tibble()
track_count = nrow(christmas_tracks)

## Loop over the tracks and get the audio features for each track
for (i in 1:track_count) {
  tmp = get_track_audio_features(ids = christmas_tracks$track.id[i])
  track_features = track_features %>% bind_rows(tmp)
}

## Join the track and feature data together and drop unneeded columns
christmas_tracks = christmas_tracks %>%
  left_join(track_features, by = c('track.id' = 'id')) %>%
  select(track.id, track.name, artist.name, danceability, energy, loudness,
         speechiness, acousticness, instrumentalness, liveness, valence, key,
         track.popularity, duration_ms)
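As an aside, the loop above makes one API call per track. If I recall correctly, the audio-features endpoint accepts up to 100 ids per request, so a batched version along the lines of the sketch below should be noticeably faster; the chunk size and the track.id column name are assumptions on my part.

library(spotifyr)
library(purrr)

## Request audio features in batches of up to 100 track ids per call
id_chunks = split(christmas_tracks$track.id,
                  ceiling(seq_along(christmas_tracks$track.id) / 100))
track_features = map_dfr(id_chunks, get_track_audio_features)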

Before I calculate anything, I first want to look at the distributions of the audio features. The beehive plot below shows the distribution of each audio feature. Interestingly, valence and energy both skew toward the lower end of the scale for Christmas songs, and both instrumentalness and acousticness are bimodal. I will explore these distributions more later.
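If you want to recreate a plot like that, here is one way to do it, assuming the joined christmas_tracks table from above; I use ggbeeswarm's quasirandom layout as a stand-in for whatever geometry the original figure used.

library(dplyr)
library(tidyr)
library(ggplot2)
library(ggbeeswarm)

## Reshape the 0-1 audio features into long form and plot one swarm per feature
christmas_tracks %>%
  pivot_longer(c(danceability, energy, speechiness, acousticness,
                 instrumentalness, liveness, valence),
               names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = feature, y = value)) +
  geom_quasirandom(alpha = 0.3, size = 0.5) +
  labs(x = NULL, y = "Feature value (0 to 1)")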

Calculating Musical Sadness

Now that I have the audio features, I can develop a measure of sadness. In the original post, Caitlin calculated the distance between each point and (0,0), which I can reproduce by calculating the Euclidean distance in valence-energy space. The plot below shows energy vs. valence, with points colored by whether they fall below or above the mean distance.
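Concretely, the calculation looks something like this. It is a minimal sketch assuming the joined christmas_tracks table from above; a smaller distance from (0,0) means a sadder song, and sadness_distance is a column name I made up for illustration.

library(dplyr)

## Euclidean distance from the origin in (valence, energy) space
christmas_tracks = christmas_tracks %>%
  mutate(sadness_distance = sqrt(valence^2 + energy^2),
         sadder_than_avg = sadness_distance < mean(sadness_distance, na.rm = TRUE))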

Below are the top 25 saddest songs according to the measure of sadness.

For comparison, here are the least sad songs.
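Pulling the two ends of the ranking is then a one-liner each, again assuming the hypothetical sadness_distance column from the sketch above.

library(dplyr)

## Smallest distances = saddest; largest distances = least sad
saddest_songs = christmas_tracks %>% slice_min(sadness_distance, n = 25)
happiest_songs = christmas_tracks %>% slice_max(sadness_distance, n = 25)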

Before I wrap up, I compared the sadness of each song to its track popularity. As you can see from the plot below, there is no real relationship when it comes to Christmas songs. Either people enjoy listening to sad songs at Christmas, or my energy-and-valence measure does not tell the complete story.
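One quick way to check that numerically is a simple correlation, again assuming the hypothetical sadness_distance column and the track.popularity column kept earlier; I have not run this against the full data set.

## Correlation between the sadness measure and Spotify track popularity
cor(christmas_tracks$sadness_distance,
    christmas_tracks$track.popularity,
    use = "complete.obs")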

In the next post, I will do my best to capture the song lyrics from the Genius website. I will use the captured lyrics to measure sentiment and do some NLP to better define “sadness” in these songs.


Dan Larson

Data Engineer, father, and Sixers fan in Philadelphia.