What do 28 seasons of data tell us about Simpsons characters?
28 Seasons of Simpsons Data Analysis in R & MS Excel: Character Scripts, IMDB Episode Ratings, and Popularity
Main Goals
-Find trends and patterns over time and across seasons
-Determine how different variables influence IMDB ratings across episodes and seasons
-Run sentiment analysis on the character scripts (1989–2016)
As we can see from here —
-S06 has the most total views
-June, July, and August have fewer views
-Season break: new seasons start in late August or early September
As we can see from here —
-S08 has the highest-rated episode
-S23 has the lowest-rated episode
-Earlier seasons have higher ratings and viewership
REGRESSION
-Multiple Regression for IMDB Rating
-Dependent Variable: IMDB Rating
-Independent Variables:
-Season, Number in Season, Number in Series, US Viewers, and IMDB Votes
-Explore Correlation
-Adjusted R-Square: 0.6631 (Good)
-This model accounts for 66.31% of the variance in the data.
-P-Values: All < 0.0001 (Good)
-Outlier Detection: Z-Score Approach, IQR Approach
Correlation Matrix to identify which factors affect the popularity of a Simpson Episode
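As a sketch of these two steps in R: the snippet below builds a small simulated stand-in for the episode data (the data frame and column names are assumptions, not the project's actual file), then computes the correlation matrix and fits the multiple regression.

```r
# Toy stand-in for the episode data; in the project these columns would be
# loaded from the IMDB episode file (column names here are hypothetical)
set.seed(1)
n <- 300
episodes <- data.frame(
  season           = sample(1:28, n, replace = TRUE),
  number_in_season = sample(1:25, n, replace = TRUE),
  number_in_series = sample(1:600, n),
  us_viewers       = runif(n, 3, 30),      # in millions
  imdb_votes       = sample(200:3000, n)
)
episodes$imdb_rating <- 6.8 + 0.0009 * episodes$imdb_votes -
  0.03 * episodes$season + rnorm(n, sd = 0.3)

# Correlation matrix: which variables move together with the rating?
round(cor(episodes), 2)

# Multiple regression of IMDB rating on the other variables
fit <- lm(imdb_rating ~ season + number_in_season + number_in_series +
            us_viewers + imdb_votes, data = episodes)
summary(fit)   # coefficients, p-values, adjusted R-squared
```

On the real data, `summary(fit)` is where the adjusted R-squared of 0.6631 and the p-values reported above would be read off.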
Multiple Regression
Interpretations of Coefficients
-Constant (6.8448): When all predictors equal 0, the predicted IMDB rating is 6.8448. The intercept has no practical interpretation here, since an episode with every predictor equal to 0 does not exist.
-Season (0.6258): Holding all other variables constant, the IMDB rating of an episode is predicted to increase by 0.6258 for each later season (e.g., a 0.6258 increase from Season 6 to Season 7).
-Number in Season (0.0297): Holding all other variables constant, the IMDB rating of an episode is predicted to increase by 0.0297 for each later episode within a season.
-Number in Series (-0.0302): Holding all other variables constant, the IMDB rating of an episode is predicted to decrease by 0.0302 for each later episode in the series.
-US Viewers (-0.0276): Holding all other variables constant, the IMDB rating of an episode is predicted to decrease by 0.0276 for every additional million US viewers.
-IMDB Votes (0.0009): Holding all other variables constant, the IMDB rating of an episode is predicted to increase by 0.0009 for every additional IMDB vote.
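To make the coefficient table concrete, the fitted equation can be evaluated by hand. The episode values below are made up purely for illustration:

```r
# Reported coefficients from the regression above
b <- c(intercept        = 6.8448,
       season           = 0.6258,
       number_in_season = 0.0297,
       number_in_series = -0.0302,
       us_viewers       = -0.0276,
       imdb_votes       = 0.0009)

# A hypothetical episode: season 8, 10th of its season, 163rd overall,
# 9.5 million US viewers, 1500 IMDB votes
x <- c(1, 8, 10, 163, 9.5, 1500)

pred <- sum(b * x)
pred   # about 8.31
```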
Outlier Detection
Z-Score Approach
-Possible Outliers: 25
-Definite Outliers: 2
IQR Approach
-Outliers: 2
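A minimal sketch of both approaches on a toy numeric vector (in the project, the same checks would run on a column such as the IMDB ratings):

```r
# Toy data; substitute the real ratings column here
x <- c(6.8, 7.1, 7.4, 7.0, 6.9, 7.2, 4.0, 7.3, 9.9, 7.1)

# Z-score approach: |z| > 2 flags possible outliers, |z| > 3 definite ones
z <- (x - mean(x)) / sd(x)
possible <- x[abs(z) > 2]
definite <- x[abs(z) > 3]

# IQR approach: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
iqr_outliers <- x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]
```

On this toy vector, both methods flag 4.0 and 9.9; the two methods can disagree on borderline values, which is why the counts above (25 possible vs. 2 by IQR) differ.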
TEXT MINING: Simpsons Character Scripts (28 Seasons)
The next step is sentiment analysis of the character scripts for each episode: find which words each character spoke most often, then see which characters are the most negative and the most positive in their speech.
Sentiment Analysis
MOST NEGATIVE CHARACTER: BART (4.3%)
MOST POSITIVE CHARACTER: MARGE (4%)
# Script for sentiment analysis of the character scripts
library(stringr)
library(readr)
library(wordcloud)
library(tm)
library(SnowballC)
library(RWeka)
library(RSentiment)
library(data.table)
library(DT)

simpsons <- read.csv("simpsons_script_lines.csv", stringsAsFactors = FALSE)
simpsons <- simpsons[, c("raw_character_text", "raw_location_text",
                         "normalized_text", "word_count")]
simpsons$word_count <- as.numeric(simpsons$word_count)
simpsons <- simpsons[!is.na(simpsons$word_count), ]
simpsons$raw_location_text  <- as.factor(simpsons$raw_location_text)
simpsons$raw_character_text <- as.factor(simpsons$raw_character_text)

# Keep only the five members of the Simpson family
family <- simpsons[simpsons$raw_character_text %in%
                     c("Lisa Simpson", "Bart Simpson", "Homer Simpson",
                       "Marge Simpson", "Maggie Simpson"), ]
simpsons <- as.data.table(simpsons)
family   <- as.data.table(family)

# Clean one character's lines, plot wordclouds, and return a word-level
# sentiment table. The original script repeated these steps verbatim for
# Homer, Bart, and Marge, so they are factored into a single function here.
character_sentiment <- function(char_name, cloud_col, min_freq = 10) {
  lines  <- family$normalized_text[family$raw_character_text == char_name]
  corpus <- Corpus(VectorSource(list(lines)))

  # Remove punctuation, case, numbers, extra whitespace, and stop words
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))

  # Wordcloud of all words spoken by this character
  wordcloud::wordcloud(corpus, max.words = 100, random.order = FALSE,
                       col = cloud_col)

  # Frequency of each word spoken
  dtm  <- DocumentTermMatrix(VCorpus(VectorSource(corpus[[1]]$content)))
  freq <- colSums(as.matrix(dtm))

  # Sentiment of each distinct word
  sent <- calculate_sentiment(names(freq))
  sent <- cbind(sent, freq = as.vector(freq))

  sent_pos <- sent[sent$sentiment == "Positive", ]
  sent_neg <- sent[sent$sentiment == "Negative", ]
  cat(char_name, "- Negative sentiments:", sum(sent_neg$freq),
      " Positive sentiments:", sum(sent_pos$freq), "\n")

  # Positive and negative wordclouds (brewer.pal supports at most 11 colours)
  wordcloud(sent_pos$text, sent_pos$freq, min.freq = min_freq,
            colors = brewer.pal(11, "PiYG"))
  wordcloud(sent_neg$text, sent_neg$freq, min.freq = min_freq,
            colors = brewer.pal(11, "RdYlBu"))

  invisible(sent)
}

sent_homer <- character_sentiment("Homer Simpson", "orange", min_freq = 15)
sent_bart  <- character_sentiment("Bart Simpson",  "turquoise")
sent_marge <- character_sentiment("Marge Simpson", "purple")

# Homer's sentiment words, sorted by descending frequency
head(sent_homer[order(-sent_homer$freq), ])