General Data Analysis on Taylor Swift Songs and Albums in R

Feb 6, 2022

Introduction

Today we are going to analyze Taylor Swift's songs for their general characteristics. We are not going to delve into advanced analysis such as linear relationships or ANOVA models. Let's stick to the basics.

Initialization

We will be analyzing data from this repo. The instructions for loading it into R are given in the repo, and it is just a one-line command. Next, we install and load the R packages we need for our visualizations.

#install.packages("fmsb")
#install.packages("PerformanceAnalytics")
#install.packages("textdata")
#install.packages("tidytext")
#install.packages("devtools")
#install.packages('inspectdf')
library(fmsb)
library(ggplot2)
library(inspectdf)
library(dplyr)
library(PerformanceAnalytics)
library(tidytext)
library(tidyverse)

album_songs <- taylor::taylor_album_songs
all_songs <- as.data.frame(taylor::taylor_all_songs)
albums <- as.data.frame(taylor::taylor_albums)

Data Cleaning and Restructuring

The package contains several types of data, so we first clean and restructure each type. Let's take them one by one.

Numerical Data

Numerical data is fairly straightforward: you can pick it out with sapply and is.numeric. I am also separating out the features that are standardised, i.e. lie between 0 and 1, because when we plot standardised data alongside regular numerical data, it looks like a blip in the ocean!

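# Indices and names of the numeric columns
num<-which(sapply(album_songs, is.numeric))
num_names<-as.vector(names(num))
numerical_data<-album_songs[,sapply(album_songs, is.numeric)]

# Between 0 and 1: keep the Min, Mean and Max rows of summary() for the 0-1 features
standard_data<-as.data.frame(do.call(cbind, lapply(numerical_data, summary)))[c(1,4,6),c(2,3,7,8,9,10,11)]
# Reference rows for the radar chart: 1 (axis maximum) and 0 (axis minimum), one value per column
standard_data <- rbind(rep(1,ncol(standard_data)) , rep(0,ncol(standard_data)) , standard_data)

# Everything that is not on the 0-1 scale
other_num_data<-numerical_data[, -which(names(numerical_data) %in% names(standard_data))]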
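
The hard-coded column positions above depend on the column order of the dataset. As a minimal sketch, assuming a toy data frame with made-up values (the names `toy`, `is_unit_range`, `unit_cols` and `other_cols` are illustrative and not part of the original analysis), the 0 to 1 features could also be picked out by checking the observed range of each column:

```r
# Toy stand-in for the numeric audio features (hypothetical values).
toy <- data.frame(
  danceability = c(0.55, 0.62, 0.71),
  energy       = c(0.40, 0.80, 0.65),
  tempo        = c(96, 140, 120),
  loudness     = c(-9.1, -5.3, -7.0)
)

# TRUE when every observed (non-missing) value lies in [0, 1].
is_unit_range <- function(x) {
  all(x >= 0 & x <= 1, na.rm = TRUE)
}

unit_cols  <- names(toy)[sapply(toy, is_unit_range)]
other_cols <- setdiff(names(toy), unit_cols)

print(unit_cols)
print(other_cols)
```

On the real data a 0/1 flag such as mode would pass this check too, so the result may still need a manual look.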

Categorical Data

For categorical data, we make sure to drop the NA values. We also subset a few columns for more detailed analysis.

# Everything that is not numeric
categorical_data<-album_songs[, -which(names(album_songs) %in% num_names)]

# These three columns are inspected on their own after dropping their NA values
featuring<-categorical_data["featuring"] %>% drop_na(featuring)
promotional_release<-categorical_data["promotional_release"] %>% drop_na(promotional_release)
single_release<-categorical_data["single_release"] %>% drop_na(single_release)
cleaned_categorical_data<-subset(categorical_data,select=-c(featuring,promotional_release,single_release))

# Summarise every categorical column and stack the results into one table
categorical_analysis <- rbind(featuring %>% inspect_cat(), promotional_release %>% inspect_cat(), single_release %>% inspect_cat(), cleaned_categorical_data %>% inspect_cat())
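
To make the summary columns easier to read later on, here is a rough base R sketch of the kind of figures they appear to report for a single column: the number of distinct levels, the most common level, and its percentage share after missing values are dropped. The vector values and the names `counts`, `n_levels`, `top_level` and `top_pct` are hypothetical, and this is not how inspectdf computes them internally.

```r
# Hypothetical values for a column with missing entries.
featuring <- c("Bon Iver", NA, "Artist B", "Bon Iver", NA, "Artist C")

# table() ignores NA by default, which mirrors dropping the NAs first.
counts <- sort(table(featuring), decreasing = TRUE)

n_levels  <- length(counts)                    # number of distinct values
top_level <- names(counts)[1]                  # most common value
top_pct   <- 100 * counts[[1]] / sum(counts)   # its share of non-missing rows

print(n_levels)
print(top_level)
print(top_pct)
```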

Lyrical Data

The lyrics of each song are stored as a nested tibble. I iterate through the songs and tag every lyric line with its album and track so that I can later perform text analysis and sentiment analysis.

lyrical_data<-album_songs[,29]

# Stack the per-song lyric tibbles into one table, one row per lyric line
expanded_lrics<-data.table::rbindlist(album_songs$lyrics)
expanded_lrics$Album<-c(NA)
expanded_lrics$Track<-c(NA)

# Walk through the songs in order and label each lyric line with its album and track
bookmark<-1
for(i in seq_len(nrow(album_songs))){
  row<-album_songs[i,]
  for (j in seq_len(nrow(row$lyrics[[1]])))
  {expanded_lrics$Album[bookmark]<-row$album_name
  expanded_lrics$Track[bookmark]<-row$track_name
  bookmark<-bookmark+1
  }
}
tay<-expanded_lrics

# One row per word
tay_tok <- tay%>%
  unnest_tokens(word, lyric)

# Drop common stop words
tidy_taylor <- tay_tok %>%
  anti_join(stop_words)
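
The nested loop above labels one row at a time. As a sketch of a vectorised alternative in base R, under the assumption that every song has a lyric table with at least one row, rep() can repeat each album and track name once per lyric line. The toy `songs` and `lyrics` objects below are made up for illustration.

```r
# Toy stand-ins: three songs and their per-song lyric tables (hypothetical).
songs <- data.frame(
  album_name = c("Album A", "Album A", "Album B"),
  track_name = c("Track 1", "Track 2", "Track 3"),
  stringsAsFactors = FALSE
)
lyrics <- list(
  data.frame(line = 1:2, lyric = c("first line", "second line")),
  data.frame(line = 1:3, lyric = c("a", "b", "c")),
  data.frame(line = 1L,  lyric = "only line")
)

# Number of lyric lines per song, in song order.
n_lines <- vapply(lyrics, nrow, integer(1))

# Stack the tables, then repeat each label once per lyric line.
expanded <- do.call(rbind, lyrics)
expanded$Album <- rep(songs$album_name, times = n_lines)
expanded$Track <- rep(songs$track_name, times = n_lines)

print(expanded)
```

This relies on the stacked rows staying in song order, which is also what the loop assumes.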

Data Analysis — Numerical Data

Radar Plot (Mean Values)

colors_border=c( rgb(0.2,0.5,0.5,0.9), rgb(0.8,0.2,0.5,0.9) , rgb(0.7,0.5,0.1,0.9) )
colors_in=c( rgb(0.2,0.5,0.5,0.4), rgb(0.8,0.2,0.5,0.4) , rgb(0.7,0.5,0.1,0.4) )
radarchart( standard_data  , axistype=1 , 
    #custom polygon
    pcol=colors_border , pfcol=colors_in , plwd=4 , plty=1,
    #custom the grid
    cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,1,0.25), cglwd=0.8,
    #custom labels
    vlcex=0.8 
    )
legend(x=1.2, y=1, legend = rownames(standard_data[-c(1,2),]), bty = "n", pch=20 , col=colors_in , text.col = "grey", cex=1.2, pt.cex=3)

On average, the songs score moderately to fairly high on danceability (mean 0.59), energy (0.59), valence (0.42) and acousticness (0.31)
Speechiness (0.05), instrumentalness (0.003) and liveness (0.14) are on the lower side
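
For readers new to fmsb, the data frame passed to radarchart() has a particular layout by default: the first row holds the maximum of each axis, the second row the minimum, and the rows after that are the series to draw. A minimal sketch with a toy data frame (made-up values; `toy` and `radar_df` are illustrative names):

```r
# Toy per-song features on the 0-1 scale (hypothetical values).
toy <- data.frame(
  danceability = c(0.5, 0.7),
  energy       = c(0.4, 0.8),
  valence      = c(0.3, 0.5)
)

# Row 1: axis maximum, row 2: axis minimum, row 3: the series to plot.
radar_df <- as.data.frame(rbind(
  max  = rep(1, ncol(toy)),
  min  = rep(0, ncol(toy)),
  Mean = colMeans(toy)
))
colnames(radar_df) <- names(toy)

print(radar_df)

# Draw the chart only if fmsb happens to be installed.
if (requireNamespace("fmsb", quietly = TRUE)) {
  fmsb::radarchart(radar_df, axistype = 1, caxislabels = seq(0, 1, 0.25))
}
```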

Boxplot

sum_df<-do.call(cbind, lapply(numerical_data, summary))
summary<-rbind(sum_df,numerical_data %>% summarise_if(is.numeric, sd))[c(4,7),]
rownames(summary)<-c("Mean","SD")
summary

##      track_number danceability    energy      key  loudness      mode
## Mean     9.877301    0.5867423 0.5908896 4.601227 -7.215466 0.8957055
## SD       5.844458    0.1163803 0.1852035 3.302711  2.638172 0.3065841
##      speechiness acousticness instrumentalness   liveness   valence     tempo
## Mean  0.05418957    0.3132342       0.00260274 0.14063742 0.4200933 124.93952
## SD    0.05333396    0.3294357       0.01921601 0.07629626 0.1904104  31.72381
##      time_signature duration_ms
## Mean      3.9877301   237337.60
## SD        0.2937146    38653.41

par(cex.axis=.6)
boxplot(standard_data)

For danceability, energy, acousticness and valence, the boxes are tall, which indicates considerable variability among songs, and there do not appear to be any outliers
Liveness and instrumentalness are on the lower side, though there are outliers for both
Speechiness has a long upper whisker, which means speechiness varies most within the highest quartile
Similarly, danceability has a long lower whisker, which means danceability varies most within the lowest quartile
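
One thing worth keeping in mind: standard_data, as built in the numerical data section, holds the two radar chart reference rows plus the Min, Mean and Max summary rows, so boxplot(standard_data) draws each box from five values rather than from one value per song. As a hedged sketch of a per-song boxplot, here is a toy data frame with made-up values (`toy`, `bp` and `outliers` are illustrative names) where the outliers are read from the statistics that boxplot() returns:

```r
# Toy per-song values (hypothetical): one row per song.
toy <- data.frame(
  danceability = c(0.45, 0.50, 0.55, 0.60, 0.62, 0.65, 0.70, 0.72),
  liveness     = c(0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.60)
)

# plot = FALSE returns the box statistics without drawing anything.
bp <- boxplot(toy, plot = FALSE)

# bp$out holds the outlying values and bp$group the column each belongs to.
outliers <- data.frame(feature = bp$names[bp$group], value = bp$out)

print(outliers)
```

With the real data, the corresponding call would presumably be boxplot(numerical_data[, names(standard_data)]).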

Correlation chart — Histogram and Scatterplot Matrix

chart.Correlation(numerical_data, histogram = TRUE)

Histogram

Danceability and duration appear to be approximately normally distributed
Energy and loudness appear skewed to the left

Scatterplot Matrix: Correlation

Energy vs Loudness: high positive correlation (0.78)
Acousticness vs Loudness: high negative correlation (-0.76)
Energy vs Acousticness: high negative correlation (-0.69)
Energy vs Valence: moderate positive correlation (0.50)
Other notable correlations:
Energy vs Liveness: 0.24
Loudness vs Liveness: 0.28
Danceability vs Valence: 0.38
Loudness vs Valence: 0.33
Danceability vs Duration_ms: -0.28
Speechiness vs Duration_ms: -0.32
Valence vs Duration_ms: -0.44
Time_signature vs Duration_ms: -0.32
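
The coefficients listed above were read off the chart. As a rough sketch, assuming the chart uses the default Pearson method, the same kind of list could be pulled out programmatically with base R's cor(). The toy data and the 0.5 cut-off are made up for illustration, and `strong_pairs` is a hypothetical name.

```r
# Toy features (hypothetical values), one row per song.
toy <- data.frame(
  energy       = c(0.2, 0.4, 0.6, 0.8, 0.9),
  loudness     = c(-12, -10, -7, -5, -4),
  acousticness = c(0.9, 0.7, 0.4, 0.2, 0.1)
)

# Pearson correlation matrix (the default method of cor()).
r <- cor(toy)

# Keep each pair once (upper triangle) when |r| reaches the cut-off.
idx <- which(upper.tri(r) & abs(r) >= 0.5, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)

print(strong_pairs)
```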

Categorical Data

categorical_analysis

## # A tibble: 14 x 5
##    col_name              cnt common                      common_pcnt levels     
##    <chr>               <int> <chr>                             <dbl> <named lis>
##  1 featuring              11 Bon Iver                         16.7   <tibble [1~
##  2 promotional_release    13 2010-11-08                       14.3   <tibble [1~
##  3 single_release         44 2006-06-19                        2.27  <tibble [4~
##  4 album_name              9 Fearless (Taylor's Version)      16.0   <tibble [9~
##  5 album_release           9 2021-04-09                       16.0   <tibble [9~
##  6 artist                  1 Taylor Swift                    100     <tibble [1~
##  7 bonus_track             2 FALSE                            88.3   <tibble [2~
##  8 ep                      1 FALSE                           100     <tibble [1~
##  9 explicit                2 FALSE                            93.3   <tibble [2~
## 10 key_mode               19 C major                          15.3   <tibble [1~
## 11 key_name               12 C                                17.2   <tibble [1~
## 12 mode_name               2 major                            89.6   <tibble [2~
## 13 track_name            163 'tis the damn season              0.613 <tibble [1~
## 14 track_release          34 2021-04-09                       14.1   <tibble [3~

categorical_analysis %>% show_plot()

Observations

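Bon Iver is the most featured artist (16.7% of the tracks with a featured artist)
The most common promotional release date is 2010-11-08 (14.3%)
The largest share of songs (16.0%) comes from the album Fearless (Taylor's Version)
The most common release date is 2021-04-09, for both albums and tracks
The majority of songs (88.3%) are not bonus tracks
C major, G major, D major and F major seem to dominate her songs
Overall, the major mode dominates her music (89.6% of songs)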

Detailed Lyrical Data Analysis

tidy_taylor %>%
  count(word, sort = TRUE) %>%
  #filtering to get only the information we want on the plot
  filter(n > 70,
         word != "di",
         word != "ooh",
         word != "ey")%>%
  ggplot(aes(x = reorder(word, n), y = n))+
  geom_bar(stat="identity")+
  geom_text(aes(label = reorder(word, n)), 
            hjust = 1.2,vjust = 0.3, color = "white", 
            size = 5)+
  labs(y = "Number  of times mentioned", 
       x = NULL,
       title = "Most frequent words in Taylor Swift lyrics",
       caption = "@ajaytomgeorge")+
  coord_flip()+
  ylim(c(0, 300))+ # I didn't want to have the bars covering the whole plotting area
  theme_minimal()+
  #now making more visually appealing
  theme(plot.title = element_text( hjust = 0.5,vjust = 3, color = "blue3", size = 14,  family="Forte"),
        axis.text.y = element_blank(),
        axis.text.x = element_text(size = 8, color = "grey40"),
        axis.title.x = element_text(size = 10, color = "grey40"),
        plot.caption = element_text(size = 7.5, color = "grey40"),
        plot.margin=unit(c(2,1,1.5,1.2),"cm"))

Word Frequency Plot

Love and time dominate the other words by a large margin
The top 5 words are positive or neutral
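
For anyone curious what the tokenise, filter and count steps behind this plot amount to, here is a minimal base R sketch on made-up lyric lines. The tiny stop word list and the names `lyrics`, `stop_list`, `kept` and `freq` are hypothetical. The analysis itself uses tidytext's unnest_tokens and stop_words, which handle many more cases.

```r
# Made-up lyric lines for illustration.
lyrics <- c("I love you", "Love is the time of my life", "Time after time")

# Lower-case, then split on anything that is not a letter or apostrophe.
words <- unlist(strsplit(tolower(lyrics), "[^a-z']+"))
words <- words[words != ""]

# A tiny hypothetical stop word list.
stop_list <- c("i", "you", "is", "the", "of", "my", "after")
kept <- words[!words %in% stop_list]

# Word counts, most frequent first.
freq <- sort(table(kept), decreasing = TRUE)

print(freq)
```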

Sentiment Analysis

tay_sentiment <- tidy_taylor%>%
  inner_join(get_sentiments("bing"))%>% 
  count(Album, Track, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

tay_sentiment%>%
  ggplot(aes(reorder(Track, sentiment), sentiment, fill = Album)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Album, ncol = 3, scales = "free")+
  scale_fill_manual(values = c("skyblue1", "lightgoldenrod1", "mediumorchid3", "red2", "plum1", "slategray","mediumorchid3", "red2", "plum1"))+
  labs(x = NULL,
       y = "Sentiment",
       title = "Taylor Swift's songs ranked by sentiment",
       caption = "                                                                                                                                    ajaytomgeorge")+
  theme_minimal()+
  theme(plot.title = element_text(size = 13, hjust = 0.4, face = "bold"),
        axis.title.y = element_text(hjust = 0.05, size = 7, color = "grey40", angle = 0),
        axis.title.x =  element_text(size = 8, color = "grey40"),
        axis.text.x = element_text(size = 6.5, color = "grey40"),
        axis.text.y = element_text(size = 6.5, color = "grey40"), 
        strip.text = element_text(size = 9, color = "grey40", face = "bold"),
        plot.caption = element_text(size = 7.5, color = "grey40"))+
  coord_flip()

boxplot(tay_sentiment$sentiment,medcol = "red", boxlty = 0, whisklty = 1)

Overall, the sentiment of her songs leans slightly negative, with many outliers among both very positive and very negative songs
Evermore and Folklore seem to have mostly negative-sentiment songs
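
The per-track score used above is the number of positive words minus the number of negative words. A small base R sketch of that calculation follows, using a made-up token table and a made-up four-word lexicon in place of the bing lexicon (`tokens`, `lexicon`, `scored`, `counts` and `net` are illustrative names):

```r
# Made-up tokens: one row per word per track.
tokens <- data.frame(
  Track = c("A", "A", "A", "B", "B"),
  word  = c("love", "happy", "cry", "sad", "tears"),
  stringsAsFactors = FALSE
)

# Hypothetical mini-lexicon standing in for get_sentiments("bing").
lexicon <- data.frame(
  word      = c("love", "happy", "cry", "sad"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Inner join: words missing from the lexicon ("tears") are dropped.
scored <- merge(tokens, lexicon, by = "word")

# Count positive and negative words per track, then take the difference.
counts <- table(scored$Track, scored$sentiment)
net <- counts[, "positive"] - counts[, "negative"]

print(net)
```

Because words outside the lexicon are dropped by the join, a score built this way reflects only the words the lexicon happens to cover.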