Introducing Game of Thrones Script Search
Check out “The Ultimate Game of Thrones Dataset” if you want to learn about other datasets in this series, and have a look at “32 Game of Thrones Data Visualizations” and “19 More Game of Thrones Data Visualizations” for a bunch of visualizations using those datasets.
Like many people, I’m always on the lookout for new data to work/play with. For the Game of Thrones datasets I’ve been producing (on github), I realized I didn’t yet have a textual dataset of the words characters speak throughout the show. After unsuccessfully looking to see if anyone had already made one, I decided to make my own.
My natural starting point was closed-captioning .srt files, but .srt files aren't always accurate: there are loads online in all sorts of languages, with no canonical version for a given episode. Additionally, .srt files don't say who speaks a given line unless that character is off-screen, and even those speaker labels aren't consistent. Closed-captioning .srt files also break a character's lines into screen-sized chunks; keeping text in that format would make it difficult to search for extended lines, phrases, context, etc. The closest thing I could find to scripts in a workable format was at genius.com, but their data wasn't complete (they'd had a server malfunction?), included spelling errors (and US/UK inconsistencies), and wasn't completely clean.
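To make the chunking problem concrete, here's a minimal sketch of reading an .srt caption block into timed entries; the helper name and output shape are my own illustration here, not the pipeline I actually used:

```python
import re

def parse_srt(srt_text):
    """Parse .srt text into a list of {"s", "e", "text"} dicts,
    rejoining the screen-sized caption chunks into one string."""
    entries = []
    # .srt blocks are separated by blank lines:
    #   index, "start --> end" timestamps, then 1+ lines of caption text
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        start, end = (t.strip() for t in lines[1].split("-->"))
        entries.append({"s": start, "e": end, "text": " ".join(lines[2:])})
    return entries

sample = """1
00:01:40,000 --> 00:01:43,500
Brown eyes... green eyes...
and blue eyes."""

print(parse_srt(sample)[0]["text"])
# → Brown eyes... green eyes... and blue eyes.
```

Rejoining the chunks like this is what makes searching for extended lines and phrases possible later on.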
So I compiled the best .srt files I could find along with some scrapes of what was good at genius.com, reformatted them into JSON, rewatched the entire show (for at least the third or fourth time), and built a new dataset (you know, as one does). This dataset of the lines spoken throughout Game of Thrones has character attribution, a language property and translation value if I could find it (thank you wiki.dothraki.org which has Dothraki, Valyrian dialects, and more), timestamps, and more sprinkled in.
I want to share the raw data, I really do, but I’m not quite ready yet. While I was making the dataset, I had a few project ideas that could be revenue-generating, so I decided to hold off on making the dataset public. But like I said, I still want to share it: so I’m announcing a search interface for the entire Game of Thrones script!
First I’ll show the search interface and then I’ll describe how I got there.
Game of Thrones Script Search
Feel free to explore Game of Thrones Script Search.
The site has a simple Google-like single search bar. You can search for text, phrases (in quotes), character names, languages (e.g. Dothraki, “High Valyrian”, etc.), or whatever you’d like.
Unless your search is made up entirely of stop words (the list is generous), you ought to get some results back. Here’s a search for “Dothraki”, which pulls in both lines spoken in Dothraki and any mention of the Dothraki:
The colored bar on the left indicates the language spoken on the show; if it’s not “Common Tongue” (i.e. English), the search result also names the language.
You can click on any character name to do a search for the lines spoken by that character, too. Here are the results for “Rakharo”:
Like I said, it’s not super fancy for now, but it is up to date through the current episode (Season 8, Episode 4) as of the time of posting. I tend to add new data within a day or two of an episode airing, and with only two episodes left, I should be able to keep to that schedule.
In the future, I’m going to add a simple visual cue for when the line is said in that particular episode, and maybe one day I’ll integrate that as a link to an online timestamped version of the episode.
Building the Script Search
I had originally planned to use Firebase as the datastore and Algolia for search indexing (and general fanciness), based on my experience with Firebase and an article I had seen. But after building this out on a trial Algolia account, I ran into the limitations of Algolia’s community tier: as of today, the database has 23,048 entries, well over the community limit of 10,000. I didn’t hear back from Algolia’s customer support about increasing my limit (since this is a non-commercial project), so I moved on…
Around the same time, I was doing work with some friends at Upperline Code (and on Medium). They’re a great organization if you’re looking to get young people interested in, and educated in, computer science. I was helping Upperline build out some curricula using MongoDB as the database underpinning an application built in Flask. Since MongoDB didn’t have the same size limitations as Algolia, I ported over what I had done for Upperline into the search interface above.
The short version of the build is:
- There are some great videos from Pretty Printed on integrating Flask and PyMongo for reading data from and writing data to a MongoDB database.
- I used MongoDB’s free Atlas database and Compass community version (also free) to host the data and easily upload new data, respectively.
- I used MongoDB’s built-in text indexing for text search.
If you’re interested in hearing more about the build or if you want an explanation of the code, feel free to leave a comment and I’m happy to write a follow-up.
Lastly, a word on the data and data structure. Currently, the data looks like this:

```json
{
  "name": "…",
  "text": "Brown eyes... green eyes... and blue eyes.",
  "s": "…",
  "e": "…"
}

{
  "name": "…",
  "text": "What do we say to the God of Death?",
  "s": "…",
  "e": "…"
}
```

Here, `name` is a character’s full name, `text` is pretty self-explanatory, and `s` and `e` are the start and end times of the spoken words based on the .srt data. At some point I’ll need to include an `offset` parameter to align the spoken words with the `episodes.json` data I’ve written about elsewhere (since some of the .srt data comes from the television broadcast and some from the DVD/Blu-ray versions).
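The offset itself would just shift each entry's timestamps. Here's a sketch of what that alignment could look like, assuming standard .srt-style "HH:MM:SS,mmm" timestamps (the real field names and format may differ):

```python
def shift_timestamp(ts, offset_seconds):
    """Shift an .srt-style timestamp ("HH:MM:SS,mmm") by offset_seconds,
    clamping at zero; a sketch of the planned offset alignment."""
    hms, millis = ts.split(",")
    h, m, s = (int(x) for x in hms.split(":"))
    total_ms = (h * 3600 + m * 60 + s) * 1000 + int(millis)
    total_ms = max(total_ms + int(offset_seconds * 1000), 0)
    # Re-decompose milliseconds back into hours/minutes/seconds
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(shift_timestamp("00:01:40,000", 2.5))
# → 00:01:42,500
```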
For any lines spoken in another language, the data looks like this:

```json
{
  "lang": "High Valyrian",
  "translation": "Lord of Light, cast your light upon us! Lord of Light, defend us! For the night is dark and full of terrors!",
  "text": "Āeksios Ōño, aōhos ōñoso īlōn jehikās! Āeksios Ōño, īlōn mīsās! Kesrio syt bantis zōbrie issa se ossȳngnoti lēdys!"
}

{
  "lang": "High Valyrian",
  "translation": "All men must die.",
  "text": "Valar morghulis."
}
```
These entries include a `lang` property and a `translation` property as well. Finally, some entries have `"type": "song"` if the words were sung.
For now, although I’m not sharing the full raw data (with the words in the right order), I have shared a “bag of words” version of the script on github. That file keeps the entries in the correct order, but the words within each entry are shuffled and the timestamps are removed. For some applications, that might be enough.
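For reference, deriving that kind of file from the full data could look like the sketch below; the field names follow the structure described above, but the exact shuffling used for the published file may differ:

```python
import random

def bag_of_words(entries, seed=42):
    """Keep the entries in order, but shuffle the words within each
    entry's text and drop the s/e timestamps."""
    rng = random.Random(seed)
    out = []
    for entry in entries:
        words = entry["text"].split()
        rng.shuffle(words)  # scramble word order within the line
        stripped = {k: v for k, v in entry.items() if k not in ("s", "e")}
        stripped["text"] = " ".join(words)
        out.append(stripped)
    return out
```

Because entry order and per-entry vocabulary survive, word counts, character attribution, and frequency analyses all still work on the shared file.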
If you’d like to work with the full JSON file, please do get in touch and I’m happy to chat.
Thanks for reading, and I look forward to hearing what you think!