What do 28 seasons of data tell us about Simpsons characters?

Mohit Singh · Published in datascape · May 9, 2017 · 5 min read

28 Seasons of Simpsons Data Analysis in R & MS Excel: Character Scripts, IMDB Episode Ratings & Popularity

Main Goals -

Find trends and patterns over time and across seasons

Find how different variables influence IMDB ratings across episodes and seasons

Sentiment analysis of the character scripts (1989–2016)

Descriptive Analysis of Views over seasons

As we can see from here —

-S06 has the most total views

-June, July, and August have fewer views

-There is a season break; a new season starts in late August or early September

Descriptive Analysis of Rating over seasons

As we can see from here —

-S08 has the highest rating episode

-S23 has the lowest rating episode

-Earlier seasons have higher ratings and views

REGRESSION

-Multiple Regression for IMDB Rating

-Dependent Variable: IMDB Rating

-Independent Variables:

-Season, Number in Season, Number in Series, US Viewers, and IMDB Votes

-Explore Correlation

-Adjusted R-Square: 0.6631 (Good)

-This model accounts for 66.31% of the variance in the data.

-P-Values: All < 0.0001 (Good)

-Outlier Detection: Z-Score Approach, IQR Approach
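A model of this shape can be reproduced with base R's lm(). The data frame and column names below are illustrative assumptions; the article's real episode table would be substituted for the simulated one:

```r
# Sketch of the multiple regression described above, on simulated data.
set.seed(42)
n <- 100
episodes <- data.frame(
  season        = sample(1:28, n, replace = TRUE),
  num_in_season = sample(1:25, n, replace = TRUE),
  num_in_series = sample(1:600, n, replace = TRUE),
  us_viewers    = runif(n, 3, 30),                    # millions
  imdb_votes    = sample(500:3000, n, replace = TRUE)
)
episodes$imdb_rating <- 8 - 0.05 * episodes$season + rnorm(n, sd = 0.3)

# IMDB rating regressed on the five predictors named above
fit <- lm(imdb_rating ~ season + num_in_season + num_in_series +
            us_viewers + imdb_votes, data = episodes)
summary(fit)$adj.r.squared  # the article's model reports 0.6631
```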

Correlation: Scatter Plots

Correlation Matrix to identify which factors affect the popularity of a Simpson Episode

Correlation Matrix
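A correlation matrix like this can be built with base R's cor(); the data frame and column names here are assumptions standing in for the real episode data:

```r
# Pairwise Pearson correlations between episode-level variables.
set.seed(1)
episodes <- data.frame(
  imdb_rating = runif(20, 5, 9),
  us_viewers  = runif(20, 3, 30),
  imdb_votes  = sample(500:3000, 20)
)
round(cor(episodes), 2)  # symmetric matrix with 1s on the diagonal
```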

Multiple Regression

Interpretations of Coefficients

-Constant (6.8448): When all predictors equal 0, the IMDB rating of an episode is predicted to be 6.8448. This value is not meaningful on its own, since no real episode has all predictors equal to 0.

-Season (0.6258): Holding all other variables constant, the IMDB rating of an episode is predicted to increase by 0.6258 for every progressing season (e.g. a 0.6258 increase from Season 6 to Season 7).

-Number in Season (0.0297): Holding all other variables constant, the IMDB rating of an episode is predicted to increase by 0.0297 for every progressing episode within a season.

-Number in Series (-0.0302): Holding all other variables constant, the IMDB rating of an episode is predicted to decrease by 0.0302 for every additional episode in the series.

-US Viewers (-0.0276): Holding all other variables constant, the IMDB rating of an episode is predicted to decrease by 0.0276 for every additional million viewers who watched the episode.

-IMDB Votes (0.0009): Holding all other variables constant, the IMDB rating of an episode is predicted to increase by 0.0009 for every additional IMDB vote.
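Plugging the reported coefficients into the fitted equation gives a predicted rating; the episode values below are made up for illustration:

```r
# Predicted rating for a hypothetical episode: season 5, 10th in its
# season, 91st in the series, 15 million US viewers, 1500 IMDB votes.
predicted <- 6.8448 +
  0.6258 * 5 +      # season
  0.0297 * 10 +     # number in season
 -0.0302 * 91 +     # number in series
 -0.0276 * 15 +     # US viewers (millions)
  0.0009 * 1500     # IMDB votes
predicted  # 8.4586
```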

Outlier Detection

Z-Score Approach

-Possible Outliers: 25

-Definite Outliers: 2

IQR Approach

-Outliers: 2
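Both rules can be sketched in a few lines of base R; the vector below is simulated stand-in data, not the article's ratings:

```r
# Simulated ratings with two planted outliers at 2.1 and 9.9.
set.seed(7)
x <- c(rnorm(100, mean = 7, sd = 0.5), 2.1, 9.9)

# Z-score approach: |z| > 2 flags possible outliers, |z| > 3 definite ones
z <- (x - mean(x)) / sd(x)
possible <- x[abs(z) > 2]
definite <- x[abs(z) > 3]

# IQR approach: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q   <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
iqr_outliers <- x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]
```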

TEXT MINING — Simpsons Character Scripts (28 Seasons)

The next step is a sentiment analysis of the character scripts: find which words each character spoke most often, then see which characters are the most negative and most positive in their speech.

Word Clouds — Words Spoken by MARGE, BART and HOMER respectively
Data Summary — Character Scripts Data

Sentiment Analysis

Most NEGATIVE & POSITIVE Words spoken by HOMER in 28 seasons
Most NEGATIVE & POSITIVE Words spoken by BART in 28 seasons
Most NEGATIVE & POSITIVE Words spoken by MARGE in 28 seasons

MOST NEGATIVE CHARACTER: BART (4.3%)

MOST POSITIVE CHARACTER: MARGE (4%)

#Script for sentiment analysis of the character scripts
library(stringr)
library(readr)
library(wordcloud)
library(tm)
library(SnowballC)
library(RWeka)
library(RSentiment)
library(data.table)
library(DT)

simpsons <- read.csv("simpsons_script_lines.csv", stringsAsFactors = FALSE)
simpsons <- simpsons[, c("raw_character_text", "raw_location_text",
                         "normalized_text", "word_count")]
simpsons$word_count <- as.numeric(simpsons$word_count)
simpsons <- simpsons[!is.na(simpsons$word_count), ]
simpsons$raw_location_text  <- as.factor(simpsons$raw_location_text)
simpsons$raw_character_text <- as.factor(simpsons$raw_character_text)

#Keep only the Simpson family
family <- simpsons[simpsons$raw_character_text %in%
                     c("Lisa Simpson", "Bart Simpson", "Homer Simpson",
                       "Marge Simpson", "Maggie Simpson"), ]

#Clean one character's lines, draw word clouds, and report sentiment counts
analyse_character <- function(character, cloud_col) {
  lines  <- family$normalized_text[family$raw_character_text == character]
  corpus <- Corpus(VectorSource(list(lines)))
  #Remove punctuation, numbers, case, extra whitespace and stop words
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))

  #Word cloud of the 100 most frequent words spoken by this character
  wordcloud(corpus, max.words = 100, random.order = FALSE, col = cloud_col)

  #Frequency of each word spoken
  dtm  <- DocumentTermMatrix(VCorpus(VectorSource(corpus[[1]]$content)))
  freq <- colSums(as.matrix(dtm))

  #Sentiment of each distinct word, weighted by how often it is spoken
  sent      <- calculate_sentiment(names(freq))
  sent$freq <- freq
  sent_pos  <- sent[sent$sentiment == "Positive", ]
  sent_neg  <- sent[sent$sentiment == "Negative", ]
  cat(character, "- negative sentiments:", sum(sent_neg$freq),
      " positive sentiments:", sum(sent_pos$freq), "\n")

  #Positive and negative word clouds
  wordcloud(sent_pos$text, sent_pos$freq, min.freq = 10,
            colors = brewer.pal(11, "PiYG"))
  wordcloud(sent_neg$text, sent_neg$freq, min.freq = 10,
            colors = brewer.pal(11, "RdYlBu"))
}

analyse_character("Homer Simpson", "orange")
analyse_character("Bart Simpson",  "turquoise")
analyse_character("Marge Simpson", "purple")
