General Data Analysis on Taylor Swift Songs and Albums in R

Ajaytomgeorge
6 min read · Feb 6, 2022


Introduction

Today we are going to analyze Taylor Swift's songs for their general characteristics. We are not going to delve into advanced analysis such as linear relationships or ANOVA models. Let's stick to the basics.

Initialization

We will be analyzing data from this repo. The instructions for loading it in R are given in the repo, and it takes just a one-line command. Next, we install and load the packages we need for our visualizations.

#install.packages("fmsb")
#install.packages("PerformanceAnalytics")
#install.packages("textdata")
#install.packages("tidytext")
#install.packages("devtools")
#install.packages('inspectdf')
library(fmsb)
library(ggplot2)
library(inspectdf)
library(dplyr)
library(PerformanceAnalytics)
library(tidytext)
library(tidyverse)
album_songs <- taylor::taylor_album_songs
all_songs <- as.data.frame(taylor::taylor_all_songs)
albums <- as.data.frame(taylor::taylor_albums)

Data Cleaning and Restructuring

We have different types of data in the package, so we first clean and restructure the data. Let's take each type one by one.

Numerical Data

Numerical data is fairly straightforward: you can use sapply and is.numeric to pick out the numeric columns. I am also separating out the features that are standardised, i.e. between 0 and 1, because when we plot standardised data together with regular numerical data, it looks like a blip in the ocean!

num<-which(sapply(album_songs, is.numeric))
num_names<-as.vector(names(num))

numerical_data<-album_songs[,sapply(album_songs, is.numeric)]
#Between 0 and 1
standard_data<-as.data.frame(do.call(cbind, lapply(numerical_data, summary)))[c(1,4,6),c(2,3,7,8,9,10,11)]
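# radarchart() expects the axis maximum in row 1 and the minimum in row 2, one value per column
standard_data <- rbind(rep(1, ncol(standard_data)), rep(0, ncol(standard_data)), standard_data)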
other_num_data<-numerical_data[, -which(names(numerical_data) %in% names(standard_data))]
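
The column positions above are hard-coded, so here is a minimal sketch of how the 0 to 1 features could be picked out by their values instead. The data frame `toy` and the helper `in_unit_range` are made up for illustration and are not part of the taylor package.

```r
# Toy data frame standing in for numerical_data (values are made up)
toy <- data.frame(
  danceability = c(0.55, 0.62, 0.48),
  energy       = c(0.70, 0.35, 0.81),
  tempo        = c(120, 96, 143),
  duration_ms  = c(210000, 245000, 198000)
)

# Hypothetical helper: TRUE when every non-missing value lies in [0, 1]
in_unit_range <- function(x) {
  all(x >= 0 & x <= 1, na.rm = TRUE)
}

unit_cols  <- names(toy)[sapply(toy, in_unit_range)]
other_cols <- setdiff(names(toy), unit_cols)

unit_cols   # "danceability" "energy"
other_cols  # "tempo" "duration_ms"
```

One caveat: a 0/1 flag such as mode would also pass this check, so on the real data you would probably still want to drop such columns by name.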

Categorical Data

For categorical data, we make sure we drop the NA values. We are also subsetting some columns for detailed analysis.

# Everything that is not numeric is treated as categorical
categorical_data <- album_songs[, -which(names(album_songs) %in% num_names)]
# Drop the NA values separately for each of these three columns
featuring <- categorical_data["featuring"] %>% drop_na(featuring)
promotional_release <- categorical_data["promotional_release"] %>% drop_na(promotional_release)
single_release <- categorical_data["single_release"] %>% drop_na(single_release)
cleaned_categorical_data <- subset(categorical_data, select = -c(featuring, promotional_release, single_release))
# Summarise each piece with inspect_cat() and stack the results
categorical_analysis <- rbind(featuring %>% inspect_cat(), promotional_release %>% inspect_cat(), single_release %>% inspect_cat(), cleaned_categorical_data %>% inspect_cat())
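
To get a feel for the kind of summary we are after (the most common level and its share), here is a rough base R sketch on a made-up vector. It is only an illustration of the idea; I am not claiming it reproduces exactly how inspect_cat() treats missing values.

```r
# Made-up key names with one missing value
key_name <- c("C", "G", "C", "D", NA, "C", "G")

# table() ignores NA by default, which mirrors dropping the NA values first
counts <- table(key_name)

most_common <- names(counts)[which.max(counts)]
common_pct  <- 100 * max(counts) / sum(counts)

most_common  # "C"
common_pct   # 50
```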

Lyrical Data

The lyrics of each song are stored as a nested tibble. I iterate through the songs and label every lyric line with its album and track, so that I can later perform text analysis and sentiment analysis.

lyrical_data <- album_songs[, 29]
# Stack the per-song lyric tibbles into one long table
expanded_lrics <- data.table::rbindlist(album_songs$lyrics)
expanded_lrics$Album <- c(NA)
expanded_lrics$Track <- c(NA)
bookmark <- 1
for (i in 1:nrow(album_songs)) {
  row <- album_songs[i, ]
  # seq_len() safely skips a song whose lyric tibble has zero rows
  for (j in seq_len(nrow(row$lyrics[[1]]))) {
    expanded_lrics$Album[bookmark] <- row$album_name
    expanded_lrics$Track[bookmark] <- row$track_name
    bookmark <- bookmark + 1
  }
}
tay <- expanded_lrics
# One row per word, then remove common stop words
tay_tok <- tay %>%
  unnest_tokens(word, lyric)
tidy_taylor <- tay_tok %>%
  anti_join(stop_words, by = "word")
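
The nested loop works, but the same labelling can be done without a counter by repeating each album and track name once per lyric line. Below is a small sketch of that idea in base R. The objects `songs` and `song_lyrics` are toy stand-ins I made up, not the real package data.

```r
# Toy stand-ins: three songs and their per-song lyric tables
songs <- data.frame(
  album_name = c("Album A", "Album A", "Album B"),
  track_name = c("Song 1", "Song 2", "Song 3"),
  stringsAsFactors = FALSE
)
song_lyrics <- list(
  data.frame(line = 1:2, lyric = c("first line", "second line")),
  data.frame(line = 1:3, lyric = c("la la", "oh oh", "hey")),
  data.frame(line = 1L,  lyric = "only line")
)

# Number of lyric lines in each song
lines_per_song <- vapply(song_lyrics, nrow, integer(1))

# Stack all lyric tables, then repeat each label once per line of its song
expanded <- do.call(rbind, song_lyrics)
expanded$Album <- rep(songs$album_name, times = lines_per_song)
expanded$Track <- rep(songs$track_name, times = lines_per_song)

expanded
```

Because rep() with a zero count simply contributes nothing, a song without lyric lines needs no special handling here.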

Data Analysis — Numerical Data

Radar Plot (Mean Values)

colors_border <- c(rgb(0.2,0.5,0.5,0.9), rgb(0.8,0.2,0.5,0.9), rgb(0.7,0.5,0.1,0.9))
colors_in <- c(rgb(0.2,0.5,0.5,0.4), rgb(0.8,0.2,0.5,0.4), rgb(0.7,0.5,0.1,0.4))
radarchart(standard_data, axistype = 1,
  # custom polygon
  pcol = colors_border, pfcol = colors_in, plwd = 4, plty = 1,
  # custom the grid; five labels (0, 0.25, ..., 1) for the default four axis segments
  cglcol = "grey", cglty = 1, axislabcol = "grey", caxislabels = seq(0, 1, 0.25), cglwd = 0.8,
  # custom labels
  vlcex = 0.8
)
# Rows 1 and 2 hold the axis max/min, so the legend only lists the remaining rows
legend(x = 1.2, y = 1, legend = rownames(standard_data[-c(1,2),]), bty = "n", pch = 20, col = colors_in, text.col = "grey", cex = 1.2, pt.cex = 3)
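  • On average, the songs score moderately on danceability (mean 0.59), energy (0.59) and valence (0.42), with acousticness somewhat lower (0.31)
  • Speechiness (0.05), instrumentalness (0.003) and liveness (0.14) are on the lower side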

Boxplot

sum_df<-do.call(cbind, lapply(numerical_data, summary))
summary<-rbind(sum_df,numerical_data %>% summarise_if(is.numeric, sd))[c(4,7),]
rownames(summary)<-c("Mean","SD")
summary
## track_number danceability energy key loudness mode
## Mean 9.877301 0.5867423 0.5908896 4.601227 -7.215466 0.8957055
## SD 5.844458 0.1163803 0.1852035 3.302711 2.638172 0.3065841
## speechiness acousticness instrumentalness liveness valence tempo
## Mean 0.05418957 0.3132342 0.00260274 0.14063742 0.4200933 124.93952
## SD 0.05333396 0.3294357 0.01921601 0.07629626 0.1904104 31.72381
## time_signature duration_ms
## Mean 3.9877301 237337.60
## SD 0.2937146 38653.41
par(cex.axis=.6)
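# Plot the per-song values of the 0 to 1 features; standard_data itself only holds summary rows
boxplot(numerical_data[, names(standard_data)])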
  • For danceability, energy, acousticness and valence, the boxes are tall, which indicates considerable variability among songs, and there do not seem to be any outliers
  • Liveness and instrumentalness are on the lower side, though there are outliers for both
  • Speechiness has a long upper whisker, which means speechiness varies most among the songs in the top quartile
  • Similarly, danceability has a long lower whisker, which means danceability varies most among the songs in the bottom quartile
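
If you want the numbers behind a box plot rather than reading them off the picture, base R's boxplot.stats() returns the whisker ends, the hinges and the points drawn as outliers. Here is a small sketch on made-up liveness values (not the real data):

```r
# Made-up liveness values with one unusually high song
liveness <- c(0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.60)

stats <- boxplot.stats(liveness)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
stats$stats

# Points lying more than 1.5 times the box height beyond the hinges
stats$out
```

In this toy vector only the value 0.60 is flagged, which is the same rule the dots on a box plot follow by default.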

Correlation chart — Histogram and Scatterplot Matrix

chart.Correlation(numerical_data, histogram = TRUE)

Histogram

  • Danceability and duration seem to be normally distributed
  • Energy and loudness seem skewed to the left

Scatterplot matrix- Correlation

  • Energy vs Loudness: high positive correlation (0.78)
  • Acousticness vs Loudness: high negative correlation (-0.76)
  • Energy vs Acousticness: high negative correlation (-0.69)
  • Energy vs Valence: positive correlation (0.50)

Other notable correlations

  • Energy vs Liveness: 0.24
  • Loudness vs Liveness: 0.28
  • Danceability vs Valence: 0.38
  • Loudness vs Valence: 0.33
  • Danceability vs Duration_ms: -0.28
  • Speechiness vs Duration_ms: -0.32
  • Valence vs Duration_ms: -0.44
  • Time_signature vs Duration_ms: -0.32
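
Reading the pairs off the chart by eye gets tedious, so here is a minimal sketch of how the strongest pairs could be listed from a correlation matrix in base R. The data frame `toy` holds made-up values and is only there to keep the example self-contained.

```r
# Made-up values for three features
toy <- data.frame(
  energy       = c(0.2, 0.4, 0.5, 0.7, 0.9),
  loudness     = c(-12, -9, -8, -6, -4),
  acousticness = c(0.9, 0.7, 0.5, 0.3, 0.1)
)

cors <- cor(toy)

# Keep every pair only once by looking at the upper triangle
idx <- which(upper.tri(cors), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = cors[idx]
)

# Strongest relationships first, regardless of sign
pairs <- pairs[order(-abs(pairs$r)), ]
pairs
```

On the real data you would pass numerical_data to cor() instead, possibly with use = "complete.obs" if some values are missing.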

Categorical Data

categorical_analysis
## # A tibble: 14 x 5
## col_name cnt common common_pcnt levels
## <chr> <int> <chr> <dbl> <named lis>
## 1 featuring 11 Bon Iver 16.7 <tibble [1~
## 2 promotional_release 13 2010-11-08 14.3 <tibble [1~
## 3 single_release 44 2006-06-19 2.27 <tibble [4~
## 4 album_name 9 Fearless (Taylor's Version) 16.0 <tibble [9~
## 5 album_release 9 2021-04-09 16.0 <tibble [9~
## 6 artist 1 Taylor Swift 100 <tibble [1~
## 7 bonus_track 2 FALSE 88.3 <tibble [2~
## 8 ep 1 FALSE 100 <tibble [1~
## 9 explicit 2 FALSE 93.3 <tibble [2~
## 10 key_mode 19 C major 15.3 <tibble [1~
## 11 key_name 12 C 17.2 <tibble [1~
## 12 mode_name 2 major 89.6 <tibble [2~
## 13 track_name 163 'tis the damn season 0.613 <tibble [1~
## 14 track_release 34 2021-04-09 14.1 <tibble [3~
categorical_analysis %>% show_plot()

Observations

  • Bon Iver is the most featured artist
  • The most common promotional release date is 2010-11-08
  • Fearless (Taylor's Version) contributes the most songs
  • The most common release date is 2021-04-09
  • The majority of songs are not bonus tracks
  • C major, G major, D major and F major seem to dominate her songs
  • The major mode dominates her music overall

Detailed Lyrical Data Analysis

tidy_taylor %>%
  count(word, sort = TRUE) %>%
  # filtering to get only the information we want on the plot
  filter(n > 70,
         word != "di",
         word != "ooh",
         word != "ey") %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = reorder(word, n)),
            hjust = 1.2, vjust = 0.3, color = "white",
            size = 5) +
  labs(y = "Number of times mentioned",
       x = NULL,
       title = "Most frequent words in Taylor Swift lyrics",
       caption = "@ajaytomgeorge") +
  coord_flip() +
  ylim(c(0, 300)) + # I didn't want to have the bars covering the whole plotting area
  theme_minimal() +
  # now making more visually appealing
  theme(plot.title = element_text(hjust = 0.5, vjust = 3, color = "blue3", size = 14, family = "Forte"),
        axis.text.y = element_blank(),
        axis.text.x = element_text(size = 8, color = "grey40"),
        axis.title.x = element_text(size = 10, color = "grey40"),
        plot.caption = element_text(size = 7.5, color = "grey40"),
        plot.margin = unit(c(2, 1, 1.5, 1.2), "cm"))

Word Frequency Plot

  • Love and time seem to dominate the other words by a large margin
  • The top 5 words are positive or neutral
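
To make the counting behind the plot less of a black box, here is a simplified base R sketch of the same steps: split lyrics into words, drop stop words, count. It is a stand-in for unnest_tokens() and anti_join(), not a reimplementation of them, and both the lyric lines and the stop word list are made up.

```r
# Made-up lyric lines and a tiny made-up stop word list
lyric_lines    <- c("I love the time we had", "Love is the time of our lives")
stop_words_toy <- c("i", "the", "we", "is", "of", "our", "had")

# Lower-case, split on anything that is not a letter or apostrophe
words <- unlist(strsplit(tolower(lyric_lines), "[^a-z']+"))

# Drop empty strings and stop words
words <- words[words != "" & !(words %in% stop_words_toy)]

# Count and sort, most frequent first
word_counts <- sort(table(words), decreasing = TRUE)
word_counts
```

Here "love" and "time" each appear twice and "lives" once, which is the same kind of table the bar chart is drawn from.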

Sentiment Analysis

# Net sentiment per track: number of positive words minus number of negative words
tay_sentiment <- tidy_taylor %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(Album, Track, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
tay_sentiment %>%
  ggplot(aes(reorder(Track, sentiment), sentiment, fill = Album)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Album, ncol = 3, scales = "free") +
  scale_fill_manual(values = c("skyblue1", "lightgoldenrod1", "mediumorchid3", "red2", "plum1", "slategray", "mediumorchid3", "red2", "plum1")) +
  labs(x = NULL,
       y = "Sentiment",
       title = "Taylor Swift's songs ranked by sentiment",
       caption = " ajaytomgeorge") +
  theme_minimal() +
  theme(plot.title = element_text(size = 13, hjust = 0.4, face = "bold"),
        axis.title.y = element_text(hjust = 0.05, size = 7, color = "grey40", angle = 0),
        axis.title.x = element_text(size = 8, color = "grey40"),
        axis.text.x = element_text(size = 6.5, color = "grey40"),
        axis.text.y = element_text(size = 6.5, color = "grey40"),
        strip.text = element_text(size = 9, color = "grey40", face = "bold"),
        plot.caption = element_text(size = 7.5, color = "grey40")) +
  coord_flip()
# Distribution of the net sentiment scores across all tracks
boxplot(tay_sentiment$sentiment, medcol = "red", boxlty = 0, whisklty = 1)
  1. Overall, the sentiment of her songs leans slightly negative, with many outliers among both very positive and very negative songs
  2. Evermore and Folklore seem to have mostly negative-sentiment songs
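
The score used above is simply the number of positive words minus the number of negative words per track. Here is a tiny base R sketch of that calculation. The lexicon and tokens are made up for illustration, so this is not the Bing lexicon, just the same idea in miniature.

```r
# Made-up mini lexicon (NOT the Bing lexicon)
lexicon <- data.frame(
  word      = c("love", "happy", "sad", "cry"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Made-up tokens, one row per word per track
tokens <- data.frame(
  Track = c("Song 1", "Song 1", "Song 1", "Song 2", "Song 2", "Song 2"),
  word  = c("love", "happy", "cry", "sad", "cry", "love"),
  stringsAsFactors = FALSE
)

# merge() keeps only the words found in the lexicon, like an inner join
scored <- merge(tokens, lexicon, by = "word")

# Count positive and negative words per track, then take the difference
counts <- table(scored$Track, scored$sentiment)
net <- counts[, "positive"] - counts[, "negative"]
net
```

"Song 1" ends up at 1 and "Song 2" at -1. Note that words missing from the lexicon contribute nothing, so a long song with few matched words can still score close to zero.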
