Game of Thrones: Exploratory and Sentiment Analysis

How did Tyrion Lannister ‘Dominate’ the whole series?

Alben Tumanggor
Analytics Vidhya
13 min read · Nov 22, 2019


Photo by Kylo on Unsplash

Who doesn’t know “Game of Thrones”?

Well, actually, there are still many people who have no idea what ‘Game of Thrones’ is. For them, here is a very short explanation.

Game of Thrones is a famous TV series from HBO. It aired from 17 April 2011 to 19 May 2019 and is, in my opinion, one of the best TV series of the last several years.

I wrote this article to break down the entire script of the series, with one goal: to extract information that we could not get just by watching it on television.

Who really are the important characters?

Which words do those characters use most frequently?

And how do their feelings evolve throughout the long winter?

We will answer those three questions using analysis. Step by step, I will explain what I did to extract the information implicitly stored in this series’ script.

Of course, this analysis focuses on NLP (Natural Language Processing). The whole process was done using several excellent open-source packages built for data analysis. Some fundamental knowledge of NLP will help you enjoy the content.

Let’s agree that this is going to be a long journey, so you might want to grab a cup of your favorite coffee before reading the whole article.

Image from https://gameofthrones.fandom.com/

Where is The Data?

Before I jump straight into the analysis, let me explain how I generated the dataset that I use throughout this article.

When we talk about NLP (Natural Language Processing), there is a large quantity of data available, both structured and unstructured, distributed across a wide range of sources such as IMDb, KDnuggets, Kaggle Datasets, etc. We can also do web scraping to collect and compile data scattered all over the internet into our desired dataset. For this analysis, I generated my dataset by scraping the “Game of Thrones” script provided on Genius.com. However, the scraping and cleaning process is not covered here, since it is long enough to deserve its own dedicated article.

Fortunately, I have already published a kernel on Kaggle.com showing how I did the web scraping and data cleaning to produce the “Game of Thrones Script” dataset. If you are curious about the whole process of collecting and compiling the scattered, dirty data, you can access the kernel through my Kaggle profile at https://www.kaggle.com/albenft. You can also download the dataset there for your own analysis.
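In case you only want the general shape of that scraping step, here is a minimal, hypothetical sketch using requests and BeautifulSoup. The URL and the “NAME: sentence” markup assumed below are placeholders for illustration only, not the actual structure of Genius.com; see the Kaggle kernel for the real process.

import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_episode(url):
    # fetch one episode's script page (hypothetical URL and markup)
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    rows = []
    # assume each line of dialogue is a paragraph shaped like "NAME: sentence"
    for p in soup.find_all('p'):
        text = p.get_text(strip=True)
        if ':' in text:
            name, sentence = text.split(':', 1)
            rows.append({'Name': name.strip().lower(), 'Sentence': sentence.strip()})
    return pd.DataFrame(rows)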

Now, let’s take a look at the dataset.

import pandas as pd
script = pd.read_csv('Game_of_Thrones_Script.csv')
script.head()
First 5 rows of the dataset

From the sneak peek above, you can see that our dataset contains 6 different columns; each is described below.

script.info()
Dataset Information
  • Release Date: String, formatted as ‘yyyy-MM-dd’; the episode’s original air date.
  • Season: String; the season of the series.
  • Episode: String; the serial number of the episode.
  • Episode Title: String; the title of the episode.
  • Name: String; the name of the character speaking.
  • Sentence: String; the full text of the character’s line.

To summarize, our dataset has 6 columns serving various purposes and 23,908 rows.
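One optional convenience step, not required for the rest of this article: since Release Date is stored as a plain string, we can parse it into a proper datetime, which makes filtering by air date much easier.

# parse the air-date string into a datetime column
script['Release Date'] = pd.to_datetime(script['Release Date'], format='%Y-%m-%d')

# example: all lines from episodes that aired in 2019 (the final season)
final_season_lines = script[script['Release Date'].dt.year == 2019]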

Top Characters

Photo by Michael Mazzone on Unsplash

There is a wide range of considerations and perspectives for defining how important a character is to a story. For example, we can examine their impact on the story, how the story revolves around them, the importance of the actions they take, and so on. Unfortunately, in this analysis we do not have that kind of rich information to produce an accurate result; all we have is a complete collection of the sentences spoken by each character. Therefore, I measure a character’s importance here only by counting the number of words they speak across the entire series. There are certainly biases in this method, but it is the most sensible measure our dataset allows.

To find the top characters, I used an NLP package called NLTK. NLTK (Natural Language Toolkit) is a powerful package that can handle a great deal of text processing. The feature I used in this step is the word_tokenize function, which splits a sentence into individual words that we can then count.

from nltk import word_tokenize

# split each sentence into tokens, keep alphanumeric tokens only, then count them
script['Tokenize Words'] = script['Sentence'].apply(word_tokenize)
script['Tokenize Words Alphanumeric Only'] = script['Tokenize Words'].apply(lambda x: [item for item in x if item.isalnum()])
script['Sentence Word Count'] = script['Tokenize Words Alphanumeric Only'].apply(len)
script.head()
Word Count Generated

Once we have the word count of each sentence stored in its own column, we can find the top characters simply by summing those word counts per character. Technically, we group the data by the Name column and sum Sentence Word Count to get each character’s total. We can do it as follows.

import matplotlib
import matplotlib.pyplot as plt
import numpy as np

PLOT_STYLE = 'seaborn-darkgrid'

# get total word count for each character
characters_words = script.groupby(['Name'])['Sentence Word Count'].sum().reset_index().sort_values(by=['Sentence Word Count'], ascending=[0])

# limit the data to only the top 20 characters with the most words
names = np.array(characters_words.head(20)['Name'].tolist())
word_count = np.array(characters_words.head(20)['Sentence Word Count'].tolist())

plt.style.use(PLOT_STYLE)
fig, ax = plt.subplots(figsize=(15, 12))
y_pos = np.arange(len(names))

ax.barh(y_pos, word_count, align='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(names, size=14)
ax.tick_params(axis='x', labelsize=12)
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Word Count', size=14)
ax.set_title('Which Character Has The Longest Line?', size=24, weight=500, ha='center')
plt.show()
Top 20 Characters by Word Count

From the graph above, we get a clear idea of who the most important “Game of Thrones” characters are, measured by the number of words they speak across the entire series. The top spots are dominated by the Lannister and Stark families, with Tyrion Lannister and Cersei Lannister as the top 2 characters with the longest lines.

Besides knowing which characters have the longest lines in the entire series, it would be great to see how each character grows throughout the seasons. For this purpose, I went the extra mile and built a bar chart race animation using the animation module from the matplotlib package.

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.animation as animation
from IPython.display import HTML

script['Row'] = script.index
max_row = script['Row'].max()

# I only give a unique color to families that appear often in the top chart
family_names = ['lannister','targaryen','snow','stark','baelish','greyjoy','mormont','baratheon','tyrell','clegane','other']
color_selections = ['#cf9e23','#cfcfcf','#cfcfcf','#a3cf5a','#b17d5e','#449a5e','#f57b5f','#6ab0de','#7081ec','#b213f5','#60b4b5']

def get_family_name(x):
    name_split = x.split(' ')
    if len(name_split) > 1 and name_split[1] in family_names[:-1]:
        return name_split[1]
    else:
        return family_names[-1]

script['Family Name'] = script['Name'].apply(get_family_name)

# create dictionary mapping each family to its color
colors = dict(zip(family_names, color_selections))
family_group = script.set_index('Name')['Family Name'].to_dict()
# function to draw one frame of the chart
def draw_chart(offset):
    temp_script = script.iloc[:offset]
    # cumulative word count per character up to this row, top 10 only
    dff = temp_script.groupby('Name').agg({'Season': 'max', 'Sentence Word Count': 'sum'}).reset_index().sort_values(by=['Sentence Word Count'], ascending=[0]).head(10)
    season = dff['Season'].max()
    episode = temp_script[temp_script['Season'] == season]['Episode'].max()
    ax.clear()
    names = dff['Name'].values
    word_count = dff['Sentence Word Count'].values
    y_pos = np.arange(len(names))
    ax.barh(y_pos, word_count, color=[colors[family_group[name]] for name in names])
    ax.set_yticks(y_pos)
    ax.invert_yaxis()  # labels read top-to-bottom
    dx = dff['Sentence Word Count'].max() / 400
    # iterate over the values to plot labels and values
    for i, (value, name) in enumerate(zip(dff['Sentence Word Count'], dff['Name'])):
        if family_group[name] != 'other':
            ax.text(value-dx, i, name.split(' ')[0], size=15, weight=800, ha='right', va='bottom')
            ax.text(value-dx, i+.25, family_group[name], size=12, color='#444444', ha='right', va='baseline')
        else:
            ax.text(value-dx, i, name, size=15, weight=700, ha='right', va='bottom')
            ax.text(value-dx, i+.25, '', size=12, color='#444444', ha='right', va='baseline')
        ax.text(value+dx, i, f'{value:,.0f}', size=15, ha='left', va='center', weight=200, color='#777777')
    # styling chart
    ax.text(1, 0.2, season, transform=ax.transAxes, color='#777777', size=46, ha='right', weight=600)
    ax.text(1, 0.1, episode, transform=ax.transAxes, color='#777777', size=30, ha='right', weight=600)
    ax.text(0, 1.06, 'Cumulative Word Count', transform=ax.transAxes, size=14, color='#777777')
    ax.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
    ax.xaxis.set_ticks_position('top')
    ax.tick_params(axis='x', colors='#777777', labelsize=11)
    ax.set_yticks([])
    ax.margins(0, 0.01)
    ax.grid(which='both', axis='x', linestyle=':', linewidth=0.5, c='grey')
    ax.set_axisbelow(True)
    ax.text(0, 1.15, '-', transform=ax.transAxes, size=24, weight=600, ha='left', color='w')
    ax.text(0, 1.12, 'Character\'s Total Word Count Across All Seasons', transform=ax.transAxes, size=24, weight=600, ha='left')
    ax.text(1, 0, '@albenft', size=12, transform=ax.transAxes, ha='left', color='#777777', bbox=dict(facecolor='white', alpha=0.8, edgecolor='white'))
    plt.box(False)

# create a new figure and generate the animation
fig, ax = plt.subplots(figsize=(11, 10))
animator = animation.FuncAnimation(fig, draw_chart, frames=range(100, max_row, 30))
HTML(animator.to_jshtml())
Word Count Bar Chart Race

The bar chart race above gives us the full journey of each character’s prominence across the entire series. We can see both which characters gradually build their dominance as the series goes on and which have dominated since the beginning. Overall, the Lannister family dominated ‘Game of Thrones’ in terms of the number of words spoken.

WordCloud

Hodor’s WordCloud

WordCloud is another powerful package for text analysis. It simply displays a set of words in the form of a cloud, sized in proportion to how often each word appears in a collection of words. For this analysis, I made a set of WordClouds from the lines of our Top 20 Characters. The goal is to get an idea of the topics each character talks about and the vocabulary they use most frequently.

One thing to seriously consider when creating a WordCloud is eliminating stopwords. Despite being the most frequent words in our dataset, stopwords do not provide any useful information or context about a topic. Therefore, we should remove them so that the information and context in each character’s speech come through clearly.

Many open-source packages provide sets of stopwords for text analysis. I used the collection provided by scikit-learn, specifically ENGLISH_STOP_WORDS, and extended it with some extra stopwords I discovered in our dataset. Eliminating stopwords makes a big difference in the resulting WordClouds.

from wordcloud import WordCloud
import nltk
# in recent scikit-learn versions, ENGLISH_STOP_WORDS lives in sklearn.feature_extraction.text
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# get list of top 20 characters
top_20 = script.groupby(['Name'])['Sentence Word Count'].sum().reset_index().sort_values(by=['Sentence Word Count'], ascending=[0]).head(20)['Name'].tolist()

# create my own set of stopwords
my_stop_words = ENGLISH_STOP_WORDS.union(['did','does','ca','don','wo','men','man','ll',
                                          'want','oh','yes','doing','going','like','ser',
                                          'eh','thing','aye','ve','just'])

# function to generate and show the wordclouds
def generate_word_cloud(character):
    character_name = character
    list_words = script[script['Name'] == character_name]
    list_words = list_words['Tokenize Words Alphanumeric Only'].tolist()
    words = []
    for i in list_words:
        words.extend(i)
    # join all of the character's tokens into one lowercase string
    words = ' '.join(i.lower() for i in words)
    my_word_cloud = WordCloud(background_color='white', stopwords=my_stop_words).generate(words)
    fig_title = ' '.join(i.capitalize() for i in character_name.split(' ')) + ' Wordcloud'
    fig, ax = plt.subplots(figsize=(11, 11))
    plt.imshow(my_word_cloud, interpolation='bilinear')
    ax.text(0, 1.12, fig_title, transform=ax.transAxes, size=24, weight=600)
    plt.axis('off')
    plt.show()

# iterate over top 20 characters to generate wordclouds
for c in top_20:
    generate_word_cloud(c)
WordClouds of Top 20 Characters

The varied WordClouds reflect the different topics each character talks about. Some characters indeed have their own distinctive vocabulary; Jorah Mormont, for example, is clearly loyal to Daenerys Targaryen judging by how often he uses the word Khaleesi. Despite this uniqueness, some words are still common to most characters, such as know, lord, father, etc.
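If you want to verify that last claim with numbers instead of eyeballing clouds, here is a small sketch (reusing the token columns and stopword set we built earlier) that counts how many of the Top 20 Characters share each frequent word.

from collections import Counter

def top_words(character, n=20):
    # the character's most frequent non-stopword tokens
    tokens = [w.lower()
              for row in script[script['Name'] == character]['Tokenize Words Alphanumeric Only']
              for w in row
              if w.lower() not in my_stop_words]
    return {word for word, _ in Counter(tokens).most_common(n)}

# for each word, count how many of the top 20 characters have it among their top words
shared = Counter()
for c in top_20:
    shared.update(top_words(c))
print(shared.most_common(10))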

As a bonus: the first figure in this section, just below the title, is indeed the WordCloud of a character named Hodor.

Sentiment Analysis

Photo by Lidya Nada on Unsplash

Text analysis would hardly be complete without sentiment analysis, the process of mining meaningful emotional patterns from text data. Sentiment analysis can be performed over an entire document to extract polarity and subjectivity on a topic, with the aim of classifying the attitude as positive, negative, or neutral.

Fairly speaking, there are many different ways to decide whether the sentiment of a sentence, or a collection of sentences, is positive, negative, or neutral. For this analysis, I used a package named TextBlob to score every sentence spoken by each unique character in our dataset. TextBlob provides two scores: polarity and subjectivity. Polarity ranges from -1 to 1 and captures whether the attitude in a statement is positive, negative, or neutral, while subjectivity ranges from 0 to 1 and reflects how much personal opinion, emotion, or judgement a statement contains. We will only use the polarity score in this analysis.
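As a quick illustration of what TextBlob returns (the exact numbers depend on your TextBlob version and its lexicon, so treat this as a sketch):

from textblob import TextBlob

# a clearly positive and a clearly negative line
print(TextBlob('You are a wonderful, kind person.').sentiment)
print(TextBlob('This is a terrible, cruel idea.').sentiment)
# each call prints a Sentiment(polarity=..., subjectivity=...) namedtuple;
# polarity should land well above 0 for the first line and well below 0 for the second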

As for the labeling, here are the rules that I used for determining positive, negative, and neutral sentiment.

positive sentiment : polarity ≥ +0.5
negative sentiment : polarity ≤ -0.5
neutral sentiment : -0.5 < polarity < +0.5

Let’s set those rules aside for now. Using a simple boxen plot, we can explore the distribution of polarity across every sentence spoken by our Top 20 Characters.

from textblob import TextBlob
import matplotlib.pyplot as plt
import seaborn as sns

# score every sentence and unpack polarity and subjectivity
script['Sentiment Scores'] = script['Sentence'].apply(lambda x: TextBlob(x).sentiment)
script['Polarity'] = script['Sentiment Scores'].apply(lambda x: x.polarity)
script['Subjectivity'] = script['Sentiment Scores'].apply(lambda x: x.subjectivity)

# boxen plot for the first 10 of the top 20 characters
fig, ax = plt.subplots(figsize=(25, 11))
ax.tick_params(axis='x', labelsize=16, color='#777777')
ax.tick_params(axis='y', labelsize=16, color='#777777')
sns.boxenplot(x='Name', y='Polarity', data=script[script['Name'].isin(top_20[:10])])
plt.show()

# and for the remaining 10
fig, ax = plt.subplots(figsize=(25, 11))
ax.tick_params(axis='x', labelsize=16, color='#777777')
ax.tick_params(axis='y', labelsize=16, color='#777777')
sns.boxenplot(x='Name', y='Polarity', data=script[script['Name'].isin(top_20[10:])])
plt.show()
Top 20 Characters’ Sentence Polarities (1)
Top 20 Characters’ Sentence Polarities (2)

We can clearly see that the polarity of each character’s sentences is mostly distributed near zero. However, there is also slight skewness in those distributions, and most of them are positively skewed. Although the effect is minor, we can say that the sentences spoken by the Top 20 Characters carry, on balance, a slightly positive sentiment.
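To put a number on that skewness claim, one quick check is pandas’ built-in sample skewness per character; positive values support the “positively skewed” reading above.

# sample skewness of sentence polarity for each of the top 20 characters
polarity_skew = (script[script['Name'].isin(top_20)]
                 .groupby('Name')['Polarity']
                 .skew()
                 .sort_values(ascending=False))
print(polarity_skew)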

Back to the polarity rules: I used them to label and count each character’s sentences by sentiment. From those counts, I then computed the proportion of positive and negative labels in each character’s overall dialogue. Finally, I used these proportions to rank our Top 20 Characters by positive and negative sentiment.

# label each sentence according to the rules above
script['Positive Polarity'] = script['Polarity'].apply(lambda x: 1 if x >= 0.5 else 0)
script['Negative Polarity'] = script['Polarity'].apply(lambda x: 1 if x <= -0.5 else 0)
script['Neutral Polarity'] = script['Polarity'].apply(lambda x: 1 if -0.5 < x < 0.5 else 0)

# count each polarity label per character
char_polarities = script.groupby(['Name'])[['Positive Polarity','Negative Polarity','Neutral Polarity']].sum().reset_index()

# get the proportion of each label
total_labels = char_polarities['Positive Polarity'] + char_polarities['Negative Polarity'] + char_polarities['Neutral Polarity']
char_polarities['Positive Polarity Rate'] = char_polarities['Positive Polarity'] / total_labels
char_polarities['Negative Polarity Rate'] = char_polarities['Negative Polarity'] / total_labels

top_5_positive = char_polarities[char_polarities['Name'].isin(top_20)].sort_values(by=['Positive Polarity Rate'], ascending=[0]).head(5)
top_5_negative = char_polarities[char_polarities['Name'].isin(top_20)].sort_values(by=['Negative Polarity Rate'], ascending=[0]).head(5)
Top 5 Positive Polarity Proportion from Top 20 Characters
Top 5 Negative Polarity Proportion from Top 20 Characters

Notice that one character, Olenna Tyrell, appears in both rankings. Still, we managed to rank the most positive and most negative characters among our Top 20.

In addition, here are some of the highest-scoring positive and negative sentences spoken by the characters above.

Positive polarities:

Cersei Lannister: “Your daughter is a beauty too. Brown eyes, those lips. A perfect Dornish beauty.”

Olenna Tyrell: “You’ve always been rather impressed with yourself, haven’t you?”

Robb Stark: “Edmure is the best match a Frey has had in the history of their house. We should all get some sleep.”

Tyrion Lannister: “Find the best builders and set them to the task.”

Varys: “For now is the best we get in our profession.”

Negative polarities:

Brienne: “Do you take me for an idiot? In!”

Bronn: “That was something stupid.”

Olenna Tyrell: “Idiots, help your king.”

Sansa Stark: “Well, shouldn’t they be once the real cold comes?”

Theon Greyjoy: “Wait. Wait, wait, wait. I took it because I hated the Starks. I hated them for holding me prisoner. I wanted to hurt them.”

Text analysis can take a lot of time and effort, most of it spent on pre-processing the data. In fact, this rule applies to every kind of data mining activity. Fortunately, there are groups of good people who kindly contribute to open-source libraries and make our job easier.

Many kinds of knowledge and information can be produced through text analysis. This article exploited only a tiny fraction of what can be done with NLP. The only limits on great, deep analysis are our perspective and creativity.

Finally, to answer the question in the subtitle: YES! Tyrion Lannister dominated the whole series as the character with the longest lines. He really did talk a lot.

Valar Morghulis
