How to use Trifacta, textacy, and spaCy to complete Natural Language Processing tasks on the script from Game of Thrones
As one of the seemingly few people who hadn’t started watching Game of Thrones as the series neared its finale, I became curious about what I could quickly learn about the series using NLP tools.
I came across a data set of all the episodes through season seven on Kaggle, in JSON format. I converted it to CSV using the free version of Trifacta, a GUI-based data wrangling application that suggests transformations based on the structure of the data and lets the user build “recipes”: lists of transformations that, once complete, can be quickly applied to other similar data sets.
As you might expect, there are plenty of YouTube videos about Trifacta.
I then wrote a function to navigate the CSV and select parts of the script by season and episode. Everything below uses only the portion of the script from the first episode of the series.
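A selection helper of that kind might be sketched roughly as follows. This is only an illustration using the standard library: the column names `season`, `episode`, and `line` are assumptions (the real ones depend on the Trifacta recipe), and the rows marked as placeholders are not from the data set.

```python
import csv
import io

# Hypothetical sample in the assumed layout: one subtitle line per row.
SAMPLE_CSV = """season,episode,line
1,1,"Easy, boy."
1,1,"What do you expect? They're savages."
1,2,placeholder line from another episode
2,1,placeholder line from another season
"""

def select_lines(csv_file, season, episode):
    """Return the script lines for one season/episode, in file order."""
    reader = csv.DictReader(csv_file)
    return [
        row["line"]
        for row in reader
        if int(row["season"]) == season and int(row["episode"]) == episode
    ]

# Select season 1, episode 1 and show its first five lines.
episode_lines = select_lines(io.StringIO(SAMPLE_CSV), season=1, episode=1)
print(episode_lines[:5])
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an open file handle.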
S01E01 Winter Is Coming.srt
# first five lines...
['Easy, boy.',
 "What do you expect? They're savages.",
 'One lot steals a goat from another lot,',
 "before you know it they're ripping each other to pieces.",
 "I've never seen wildlings do a thing like this."]
Using scikit-learn’s CountVectorizer, these were the most frequent terms (“don” is “don’t” after processing):
king 27
know 20
don 19
father 17
lord 14
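The “don” entry is a tokenization effect: CountVectorizer’s default token pattern keeps only runs of two or more word characters, so “don't” splits into “don” and a one-letter “t” that gets dropped. The sketch below reproduces that behaviour with the standard library only. The tiny stop-word list and the sample lines are made up for illustration; they are not the list or the data behind the counts above.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Tiny illustrative stop-word list (an assumption, not scikit-learn's built-in list).
STOP_WORDS = {"the", "is", "to", "it", "you", "and", "of"}

def top_terms(lines, n=5):
    """Count lowercased tokens across all lines, skipping stop words."""
    counts = Counter()
    for line in lines:
        for token in TOKEN_PATTERN.findall(line.lower()):
            if token not in STOP_WORDS:
                counts[token] += 1
    return counts.most_common(n)

# Made-up sample lines, just to show how "don't" is split.
sample = ["Don't look.", "I don't want to know.", "The king is coming."]
print(top_terms(sample, n=3))
```

Passing a custom `token_pattern` to CountVectorizer, or normalizing contractions before vectorizing, would be two ways to keep “don't” in one piece.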
Using the SpaCy library…
for entity in SpacyTextObject.ents:
    if str(entity).lower() in topWord:
        print(entity.text, entity.label_)
These are the entities that were found with the respective labels in the first episode of the first season:
Stark PERSON
Don PERSON
Stark PERSON
Tell PERSON
Tell ORG
Ned PERSON
Ned PERSON
Don PERSON
Don PERSON
Ned PERSON
Ned PERSON
Grace PERSON
Ned PERSON
Stark PERSON
These are the key terms from episode one using textacy’s SGRank algorithm.
key = textacy.keyterms.sgrank(SpacyTextObject, ngrams=(1, 2, 3, 4, 5, 6), normalize='lemma', window_width=1500, n_keyterms=10, idf=None)
[('white walker', 0.11248712594985487),
('Lord Stark', 0.07366235856737678),
('Don t', 0.07046294955086814),
('Jon Arryn', 0.05019068759008505),
('Seven Kingdoms', 0.042246734016313875),
('man', 0.031666117105154046),
('king', 0.027455050532761947),
('boy', 0.02736735107802635),
('Wall', 0.022010328955189125),
('good', 0.020988486072719355)]
Textacy extracting named entities…
textacy.extract.named_entities(SpacyTextObject, include_types=None, exclude_types=None, drop_determiners=True, min_freq=4)
Jon Arryn Jon Arryn Jon Arryn Ned Ned first Dothraki Dothraki Khal Drogo Khal Drogo Khal Drogo Ned first Ned first Jon Arryn Ned Khal Drogo Dothraki Dothraki Dothraki first
I was excited to see the phrase “Winter is coming” show up among the three-word n-grams!
textacy.extract.ngrams(SpacyTextObject, 3, filter_stops=True, filter_punct=True, filter_nums=False, include_pos=None, exclude_pos=None, min_freq=3)
winter is coming
saw the white
saw the white
Don t look
saw the white
King s Landing
Don t look
King s Landing
don t want
don t want
Winter is coming
Winter is coming
King s Landing
don t want
don t look
s all right
s all right
s all right
Using the following text about Ned Stark…
text = '''Ned is a fictional character. Ned is the lord of Winterfell, an ancient fortress in the North of the fictional continent of Westeros. Though the character is established as the main character in the novel and the first season of the TV adaptation, Martin's plot twist involving Ned shocked both readers of the book and viewers of the TV series.
Ned is the leader of the Stark Family. Ned is a father of six children.'''
Calling the “nlp” object on a string creates a spaCy Doc object, which is the required input to many spaCy functions.
def summary(sentence, matchWord):
    summary = ''
    sentobj = nlp(sentence)
    sentences = textacy.extract.semistructured_statements(sentobj, matchWord, cue = 'be')
    for i, x in enumerate(sentences):
        subject, verb, fact = x
        summary += 'Fact '+str(i+1) +': '+(str(fact))+" "
    return summary

summary(text, 'Ned')
The following statements were parsed from the document:
Fact 1: a fictional character
Fact 2: the lord of Winterfell, an ancient fortress in the North of the fictional continent of Westeros.
Fact 3: the leader of the Stark Family
Fact 4: a father of six children
The Verb Triples method from textacy…
doc = nlp(text)
verbTriples = textacy.extract.subject_verb_object_triples(doc)
for x in verbTriples:
    print(x)
(Ned, is, character)
(Ned, is, lord)
(Ned, is, leader)
(Ned, is, father)
The code below selects popular phrases given a set of episodes and phrase length.
# Season 1 - three word phrases
popularSeriesphrases(1, 10, 3, 3, df=data)
["king's landing",
 'thing you need',
 'going to die',
 "night's watch",
 'want to know',
 'tried to kill',
 "butcher's boy",
 "won't hurt",
 "king's hand",
 'live and die',
 "won't let",
 'lord of winterfell',
 'lannister always pays',
 'know what happened',
 'stallion who mounts',
 'sun and stars']

# Season 1 - four word phrases
popularSeriesphrases(1, 10, 4, 3, df=data)
['men of the night',
 'hand of the king',
 'hand of the king.,-',
 'north of the wall',
 'man of the night',
 'king in the north!,the',
 'king in the north']

# Season 2 - four word phrases
popularSeriesphrases(11, 20, 4, 3, df=data)
['hand of the king',
 'commander of the city',
 'sit on the iron',
 "won't be able",
 'day until the end']

# Season 2 - five word phrases
popularSeriesphrases(11, 20, 5, 2, df=data)
['swear it by the old',
 'said no harm would come',
 "know what it's like",
 'think that about their fathers',
 "rains weep o'er his halls",
 "rains weep o'er his hall",
 'serve the starks.,i serve lady']
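The body of `popularSeriesphrases` is not shown here. As a rough sketch of the general idea only, the function below counts word n-grams within each line and keeps the phrases that occur at least a minimum number of times. The name `popular_phrases`, its parameters, and the sample lines are hypothetical, and unlike the results above it does no stop-word filtering.

```python
import re
from collections import Counter

def popular_phrases(lines, n, min_freq):
    """Return n-word phrases that occur at least min_freq times, most frequent first."""
    counts = Counter()
    for line in lines:
        # Keep apostrophes inside words so "king's" stays one token.
        words = re.findall(r"[a-z]+(?:'[a-z]+)*", line.lower())
        # Slide a window of n words across the line; phrases never span two lines.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return [phrase for phrase, count in counts.most_common() if count >= min_freq]

# Made-up sample lines for illustration.
sample = [
    "Winter is coming.",
    "They say winter is coming.",
    "Ride for King's Landing.",
    "He rides to King's Landing tomorrow.",
]
print(popular_phrases(sample, n=3, min_freq=2))
```

Because subtitle lines are short, joining the lines of an episode before counting would be one way to let phrases cross line breaks.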
These are just a few of the things that can be done out of the box with these powerful libraries. The dataset, dependency file, and complete code can all be found on GitHub.