The Ultimate Game of Thrones Dataset
Notes: (1) the dataset contains spoilers, (2) this is distinctly about the television series, and (3) forgive any tangents: I’ve tried to collect my thinking below so others can benefit from what I’ve done.
xkcd and Sankey Diagrams
I was a bit late to Game of Thrones, I’ll admit it. For so long I figured there was just too much to catch up on, and although I didn’t experience the week-to-week agony felt by many fans of the HBO series, I thoroughly enjoyed binging it when I finally found 60+ hours of free time.
Like many fans, I also recognized the beauty of the narrative threads in which characters’ paths crisscross each other throughout the Lands of Ice and Fire. As a lover of maps (especially subway maps), I wondered whether anyone had visualized this crisscrossing, and so I stumbled into the world of narrative charts most recently popularized by Randall Munroe’s xkcd charts of movies like Star Wars and The Lord of the Rings (above). These charts are derivatives of Sankey diagrams, which are typically used to illustrate the flow of materials/money/other data from one state/location/category to another. Many fans of data visualization will be familiar with the diagram of Napoleon’s March by Charles Joseph Minard, perhaps the most well-known example of a Sankey-like diagram showing the movement of people through a defined geography, even though it was created about 30 years before Sankey’s diagram.
The Web did not disappoint me in my search for an xkcd-style map for Game of Thrones. Seasons 1–3 had already been charted by a graphic designer (full map). However, it seemed no one had yet updated the chart through the end of season 6 (where I was picking things up).
A Path Forward
After some additional searching and not having found someone who’d already done this, I figured I had a choice about how to proceed building a data-driven narrative chart for Game of Thrones:
- I could use a force-directed approach, which other folks have done and are doing (see above). It’s certainly a more generalizable approach, especially when there’s no set geography or interest in encoding meaning in a vertical position. OR:
- I could use the specific geography of Westeros and Essos to create a location-specific chart where vertical position actually does have meaning. This is the same approach used in the Season 1–3 chart, where the vertical position follows a line drawn roughly from The Lands of Always Winter in the north to Dorne in the south of Westeros, then east from Pentos to Qarth in Essos.
I had another look to try to find a dataset from which I could build this visualization, and failing, I realized the project would really be two-fold:
- Create datasets for Game of Thrones, and try to make them as complete as possible, then
- Using that data, create a data-driven narrative chart so it will update with new data when seasons 7 and 8 come out.
To collect as much information as possible, I started by scraping the various Game of Thrones pages on IMDb using import.io. That quickly generated a tremendous amount of useful data, which was split into two datasets: episode information (including episodeAirDate) and cast information, the latter in characters.json. Around that same time, HBO released the Tower of Joy infographic summarizing relationships between various characters, so building on that I captured various relational information in characters.json, including: if they’re royal, who their parents are, who they are the parentsOf, if they are a guardianOf someone, if they are guardedBy someone, their siblings, to whom they are allies, if someone abducted them, who they’ve killed or were killedBy, and who the character serves or is servedBy, to augment data on characterLink (to IMDb), nickname, and if the character is in the […].
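To make the relational fields a little more concrete, here is a minimal sketch of what a few records might look like and how paired fields (killed/killedBy, parents/parentsOf, and so on) could be cross-checked. The field names royal, parents, parentsOf, killed, and killedBy come from the description above; the characterName key, the array-of-objects shape, the placeholder names, and the findMissingInverse helper are all assumptions for illustration and may not match the real characters.json.

```javascript
// Hypothetical records: field names follow the text, but the shape,
// the `characterName` key, and the placeholder values are assumptions.
const characters = [
  { characterName: "Character A", royal: true, parents: ["Character B"], killed: ["Character C"] },
  { characterName: "Character B", parentsOf: ["Character A"] },
  { characterName: "Character C", killedBy: ["Character A"] },
];

// For every entry in `field`, check that the target record lists the
// source character in `inverseField`. Returns the pairs that do not match.
function findMissingInverse(records, field, inverseField) {
  const byName = new Map(records.map((r) => [r.characterName, r]));
  const problems = [];
  for (const record of records) {
    for (const target of record[field] || []) {
      const other = byName.get(target);
      const back = other ? other[inverseField] || [] : [];
      if (!back.includes(record.characterName)) {
        problems.push(`${record.characterName} -> ${target}`);
      }
    }
  }
  return problems;
}

// An empty array means every `killed` entry has a matching `killedBy`.
console.log(findMissingInverse(characters, "killed", "killedBy"));
```

A check like this is handy for hand-entered data, where it is easy to record one side of a relationship and forget the other.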
But what I really wanted was scene-by-scene information: which characters are together in each scene (co-occurrence/co-appearance). So I rewatched all six seasons and typed out some JSON by hand (because why not?). For each character in a scene, I also included their name, if they were alive (or really dead), had a title (limited to King), and if they were born (or really not-yet born).
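As an illustration of what scene-by-scene data makes possible, the sketch below counts how often each pair of characters shares a scene. The name, alive, and title keys come from the description above; the surrounding shape (an array of scenes, each with a characters array) and the placeholder names are assumptions, not the actual layout of the hand-typed JSON.

```javascript
// Hypothetical scene records; only `name`, `alive`, and `title` are
// taken from the text, everything else is an assumed shape.
const sampleScenes = [
  { characters: [{ name: "A" }, { name: "B" }, { name: "C", alive: false }] },
  { characters: [{ name: "A" }, { name: "B", title: "King" }] },
];

// Count how many scenes each unordered pair of characters shares.
// Keys look like "A|B", with the two names in sorted order.
function countCoAppearances(sceneList) {
  const counts = new Map();
  for (const scene of sceneList) {
    // De-duplicate and sort so each pair is produced once per scene.
    const names = [...new Set(scene.characters.map((c) => c.name))].sort();
    for (let i = 0; i < names.length; i++) {
      for (let j = i + 1; j < names.length; j++) {
        const key = `${names[i]}|${names[j]}`;
        counts.set(key, (counts.get(key) || 0) + 1);
      }
    }
  }
  return counts;
}

const pairCounts = countCoAppearances(sampleScenes);
console.log(pairCounts.get("A|B")); // 2
```

The resulting pair counts are the kind of edge weights a co-appearance network would need.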
A Wiki of Ice and Fire and the Game of Thrones Wiki were fantastic resources to figure out who characters are and where scenes were taking place when it wasn’t clear from the context clues in the show. I also used a variety of maps (Map of Essos, Ice and Fire World Map, Regions of Westeros, Season 1 Locations Map, Season 2 Locations Map) to locate each scene with a location and a […].
The […] characters.json datasets are available on GitHub along with a few others, including:
- locations.json (including a north-to-south arrangement of […]),
- characters-houses.json (grouped for styling in the visualization), and
- characters-include.json (in case I wanted to only focus on main characters).
Aside: Some Simple Counting
Well, what to do with all that data? In addition to the visualization, here are some counts for seasons 1–6:
Top 10 Characters (by screen time in Seasons 1–6)
- Tyrion Lannister: 27,107 sec = 7 hr 31 min 47 sec
- Jon Snow: 24,781 sec = 6 hr 53 min 1 sec
- Cersei Lannister: 20,545 sec = 5 hr 42 min 25 sec
- Daenerys Targaryen: 19,427 sec = 5 hr 23 min 47 sec
- Sansa Stark: 18,329 sec = 5 hr 5 min 29 sec
- Arya Stark: 17,214 sec = 4 hr 46 min 54 sec
- Jaime Lannister: 15,657 sec = 4 hr 20 min 57 sec
- Jorah Mormont: 13,364 sec = 3 hr 42 min 44 sec
- Theon Greyjoy: 12,234 sec = 3 hr 23 min 54 sec
- Samwell Tarly: 11,500 sec = 3 hr 11 min 40 sec
Bottom 10-ish Characters (by screen time in Seasons 1–6)
- Ironborn in Skiff: 11 sec
- Guymon: 12 sec
- Olly’s Mother: 14 sec
- Tyrell Guard: 14 sec
- Vayon Poole: 15 sec
- Rickard Stark: 15 sec
- Stark Messenger: 18 sec
- Sorcerer: 20 sec
- Fruit Vendor: 20 sec
- Simpson: 21 sec
- Gordy: 21 sec
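For anyone wanting to reproduce counts like these, here is a minimal sketch of the arithmetic: summing scene durations per character, then converting a total in seconds to hours, minutes, and seconds. The startSec, endSec, and characters keys and the placeholder names are hypothetical; the real dataset may store scene times differently.

```javascript
// Convert a whole number of seconds to "H hr M min S sec".
function formatDuration(totalSeconds) {
  const hours = Math.floor(totalSeconds / 3600);
  const minutes = Math.floor((totalSeconds % 3600) / 60);
  const seconds = totalSeconds % 60;
  return `${hours} hr ${minutes} min ${seconds} sec`;
}

// Sum scene durations per character. `startSec`, `endSec`, and
// `characters` are assumed field names, not necessarily the dataset's.
function tallyScreenTime(sceneList) {
  const totals = new Map();
  for (const scene of sceneList) {
    const duration = scene.endSec - scene.startSec;
    // A Set guards against a name being listed twice in one scene.
    for (const name of new Set(scene.characters)) {
      totals.set(name, (totals.get(name) || 0) + duration);
    }
  }
  return totals;
}

// Hypothetical scenes with placeholder names.
const timedScenes = [
  { startSec: 0, endSec: 90, characters: ["A", "B"] },
  { startSec: 90, endSec: 150, characters: ["A"] },
];

console.log(tallyScreenTime(timedScenes).get("A")); // 150
console.log(formatDuration(27107)); // "7 hr 31 min 47 sec"
```

Unlike the shortest entries listed above, this formatter always prints the hour and minute parts, so 11 seconds comes out as "0 hr 0 min 11 sec".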
The data in episodes.json wasn’t the right structure to readily visualize, so I needed to rearrange and augment it. I’ve included my very-non-force-directed-physics-based strategy below in the hopes it might be useful.
I imagined a 3-dimensional array where the axes are characters, scenes, and locations. In that array, the code in process.js does the following:
- Fill all characters with 0 in the first scene in all locations.
- For each scene, if a character is present, enter a 1 in the location (for that character in that scene), then fill all other locations (for that character in that scene) with 0s, because a character cannot be in two places at once.
- If a character is dead in a scene ("alive":false), then enter 0 for that character in the following scene (for that character in that same location).
- Fill forward: if a character is not in a location (is a 0), they won’t be in that location until they arrive there. Beginning in the first scene, if a character in a location is 0, look at the next scene; if that value is empty, make it 0.
- Fill backward: if a character is not in a location (is a 0), check to see if they were in that location previously. Beginning in the last scene, if a character in a location is 0, look at the previous scene; if that value is empty, make it 1.
- Any remaining empty values must be 1.
- Count the maximum number of characters in a scene at each particular location. This number will be used to determine the y-position for each character and the y-range of each geographical region.
- Calculate the middle of each geographical region and the y-axis offset for each character in the region (characters in a location during a scene are sorted alphabetically), then assign each character a y-value for that scene based on all of the characters in that location at that time.
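The fill steps above can be sketched roughly as follows. This is not the code in process.js, just one possible reading of the steps: it assumes empty cells are represented by null, that the fill-forward step writes 0, and that the fill-backward and remaining-empty steps write 1 (so a character stays at their last known location until they turn up somewhere else). The scene shape (location, characters, name, alive) and the buildPresence name are hypothetical.

```javascript
// Build presence[name][location][sceneIndex] with values 0 or 1.
// `null` marks a cell that has not been decided yet (an assumption).
function buildPresence(sceneList, locationNames) {
  const names = [...new Set(sceneList.flatMap((s) => s.characters.map((c) => c.name)))];
  const presence = {};
  for (const name of names) {
    presence[name] = {};
    for (const loc of locationNames) {
      const row = new Array(sceneList.length).fill(null);
      row[0] = 0; // first scene: 0 everywhere
      presence[name][loc] = row;
    }
  }

  sceneList.forEach((scene, s) => {
    for (const c of scene.characters) {
      // Present: 1 at the scene's location, 0 at every other location.
      for (const loc of locationNames) {
        presence[c.name][loc][s] = loc === scene.location ? 1 : 0;
      }
      // Dead in this scene: 0 in the following scene at the same location.
      if (c.alive === false && s + 1 < sceneList.length) {
        presence[c.name][scene.location][s + 1] = 0;
      }
    }
  });

  for (const name of names) {
    for (const loc of locationNames) {
      const row = presence[name][loc];
      // Fill forward: a 0 carries into the next scene if that cell is empty.
      for (let s = 0; s < row.length - 1; s++) {
        if (row[s] === 0 && row[s + 1] === null) row[s + 1] = 0;
      }
      // Fill backward: an empty cell just before a 0 becomes 1 (assumed value).
      for (let s = row.length - 1; s > 0; s--) {
        if (row[s] === 0 && row[s - 1] === null) row[s - 1] = 1;
      }
      // Anything still empty is treated as present (assumed value).
      for (let s = 0; s < row.length; s++) {
        if (row[s] === null) row[s] = 1;
      }
    }
  }
  return presence;
}
```

Under these assumptions, once the forward fill has run, the only cells still empty are the ones that follow a 1, which is why treating them as 1 keeps a character in place between appearances.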
This strategy generates the y-coordinate for a character in a scene, and the x-coordinate is simply related to the overall timestamp of the scene relative to the beginning of the first episode of season 1.
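One way the positioning could work is sketched below. Each location gets a horizontal band sized by the maximum number of characters ever present there, stacked in north-to-south order, and the characters in a scene are spread around the band's middle in alphabetical order. The function names, the spacing and pixelsPerSecond parameters, and the choice to center characters evenly around the middle are all assumptions; the actual visualization may compute offsets differently.

```javascript
// Stack one band per location, top to bottom, in the given order.
// `maxCount[loc]` is the largest number of characters seen there at once.
function computeBands(orderedLocations, maxCount, spacing) {
  const bands = {};
  let top = 0;
  for (const loc of orderedLocations) {
    const height = Math.max(maxCount[loc] || 0, 1) * spacing;
    bands[loc] = { top, middle: top + height / 2, bottom: top + height };
    top += height;
  }
  return bands;
}

// Assign a y-value to each character in one scene: sort the names
// alphabetically, then spread them evenly around the band's middle.
function layoutScene(charactersByLocation, bands, spacing) {
  const y = {};
  for (const [loc, names] of Object.entries(charactersByLocation)) {
    const sorted = [...names].sort((a, b) => a.localeCompare(b));
    sorted.forEach((name, i) => {
      const offset = (i - (sorted.length - 1) / 2) * spacing;
      y[name] = bands[loc].middle + offset;
    });
  }
  return y;
}

// x grows with time elapsed since the start of the first episode.
function sceneX(secondsFromSeriesStart, pixelsPerSecond) {
  return secondsFromSeriesStart * pixelsPerSecond;
}

// Hypothetical locations, counts, and placeholder character names.
const bands = computeBands(["North", "South"], { North: 3, South: 1 }, 10);
console.log(layoutScene({ North: ["B", "A"], South: ["C"] }, bands, 10));
// { A: 10, B: 20, C: 35 }
```

Sizing each band by its busiest scene means characters never spill into a neighboring region, at the cost of some empty vertical space in quieter scenes.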
I then rendered the data using d3.js (h/t Mike Bostock, et al.) in the visualization below. In true Minard-ian (Minard-esque?) fashion, it includes information about: which characters are in which scene, their location, their family/affiliation, their title (Hand, Khaleesi, etc.), and their death along with information about the season, episode, and episode title. The full interactive visualization is here: https://jeffreylancaster.github.io/game-of-thrones/.
What’s Next & Other Stuff
I plan to continue to collect and visualize the data for seasons 7 and 8 as they are released and the visualization will update accordingly. Although the dataset is already quite rich, there’s still more I could add: the location and ownership of the various important weapons (I’m thinking Valyrian steel and dragonglass), who’s slept with whom (that’s just through season 3) and how many times, or even building in subLocation to the y-position algorithm described above, for example.
The visualization itself will also continue to develop. I plan to add information for each death (from the killedBy field in characters.json), some fancier animations, and maybe even additional visuals pulled from IMDb (via […]).
I’m perhaps most excited to see how other people take, reuse, and add to the data. The data could be used to recreate (and augment) HBO’s Tower of Joy infographic or any one of the many other Game of Thrones infographics, or it could be used to make a supercut of your favorite character’s story arc; the Sansa Stark movie is currently just over 5 hours through the end of season 6! Maybe you could even begin to do predictive analytics on who will die next (and when)? ⚔️
What do you think you could do with this data?
Update (Apr. 2019): Now that the final season has begun, I’m posting a weekly data-driven recap of each episode:
- “Winterfell” (Season 8, Episode 1) Data Visualization Recap
- “A Knight of the Seven Kingdoms” (Season 8, Episode 2) Data Visualization Recap
- “The Long Night” (Season 8, Episode 3) Data Visualization Recap
- “The Last of the Starks” (Season 8, Episode 4) Data Visualization Recap
- “The Bells” (Season 8, Episode 5) Data Visualization Recap
- “The Iron Throne” (Season 8, Episode 6) Data Visualization Recap