The Ultimate Game of Thrones Dataset
and an Interactive Game of Thrones Narrative Chart
Notes: (1) dataset contains spoilers, (2) this is distinctly about the television series, and (3) forgive any tangents below — I’ve tried to collect my thinking below so others can benefit from what I’ve done.
xkcd and Sankey Diagrams
I was a bit late to Game of Thrones, I’ll admit it. For so long I figured there was just too much to catch up on, and although I didn’t experience the week-to-week agony felt by many fans of the HBO series, I thoroughly enjoyed binging it when I finally found 60+ hours of free time.
Like many fans, I also recognized the beauty of the narrative threads in which characters’ paths crisscross each other throughout the Lands of Ice and Fire. As a lover of maps (especially subway maps), I wondered whether anyone had visualized this crisscrossing, and so I stumbled into the world of narrative charts most recently popularized by Randall Munroe’s xkcd charts of movies like Star Wars and The Lord of the Rings (above). These charts are derivatives of Sankey diagrams, which are typically used to illustrate the flow of materials/money/other data from one state/location/category to another. Although it was created 30 years before Sankey, many fans of data visualization will be familiar with the diagram of Napoleon’s March by Charles Joseph Minard, perhaps the most well-known example of a Sankey-like diagram showing the movement of people through a defined geography.
The Web did not disappoint me in my search for an xkcd-style map for Game of Thrones. Seasons 1–3 had already been charted by a graphic designer (full map). However, it seemed no one had yet updated the chart through the end of season 6 (where I was picking things up).
Still others sought to automate the layout/creation of these sorts of narrative charts, and those projects have been shared here, here, here, and here.
A Path Forward
After some additional searching and not having found someone who’d already done this, I figured I had a choice about how to proceed building a data-driven narrative chart for Game of Thrones:
- I could use a force-directed approach, which other folks have done and are doing (see above). It’s certainly a more generalizable approach, especially when there’s no set geography or interest in encoding meaning in a vertical position. OR:
- I could use the specific geography of Westeros and Essos to create a location-specific chart where vertical position actually does have meaning. This is the same approach used in the Season 1–3 chart, where the vertical position follows a line drawn roughly from The Lands of Always Winter in the north to Dorne in the south of Westeros, then east from Pentos to Qarth in Essos.
I had another look to try to find a dataset from which I could build this visualization, and failing, I realized the project would really be two-fold:
- Create datasets for Game of Thrones, and try to make them as complete as possible, then
- Using that data, create a data-driven narrative chart so it will update with new data when seasons 7 and 8 come out.
The Data
To collect as much information as possible, I started by scraping the various Game of Thrones pages on IMDB using import.io. That quickly generated a tremendous amount of useful data which was split into two datasets: `episodes.json` (including `episodeTitle`, `episodeDescription`, `episodeLink`, `episodeAirDate`) and cast information, `characters.json`. Around that same time, HBO released the Tower of Joy infographic summarizing relationships between various characters, so building on that I captured various relational information in `characters.json` including: if they’re `royal`, who their `parents` are, who they are the `parentsOf`, if they are a `guardianOf` someone, if they are `guardedBy` someone, their `siblings`, to whom they are `marriedEngaged`, their `allies`, if someone `abducted` them, who they’ve `killed` or were `killedBy`, and who the character `serves` or is `servedBy` to augment data on `characterName`, `characterLink` (to IMDB), `actorName`, `actors` (including `seasonsActive`), `houseName`, `nickname`, and if the character is in the `kingsguard`.
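Paired fields such as `killed` and `killedBy` describe the same relationship from both sides, so they can be cross-checked. Here is a minimal sketch of such a check. It assumes each record holds these fields as arrays of character names, which the text above does not confirm, and the function name `missingKillLinks` is made up for illustration.

```javascript
// Assumption: `killed` and `killedBy` are arrays of character names.
// Returns "killer -> victim" strings for every killedBy entry that is
// not mirrored by a matching entry in the killer's `killed` list.
function missingKillLinks(characters) {
  const byName = new Map(characters.map((c) => [c.characterName, c]));
  const problems = [];
  for (const victim of characters) {
    for (const killerName of victim.killedBy || []) {
      const killer = byName.get(killerName);
      const mirrored =
        killer && (killer.killed || []).includes(victim.characterName);
      if (!mirrored) {
        problems.push(`${killerName} -> ${victim.characterName}`);
      }
    }
  }
  return problems;
}
```

The same pattern would apply to the other paired fields (`serves`/`servedBy`, `guardianOf`/`guardedBy`), again assuming they are stored as name arrays.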
But what I really wanted was scene-by-scene information: which characters are together in each scene (co-occurrence/co-appearance). So I rewatched all six seasons and typed out some JSON by hand (because why not?). For each `character` in a scene, I also included their `name`, if they were `alive` (or really dead), had a `title` (limited to `Hand`, `Khal`, `Khaleesi`, and `King`), and if they were `born` (or really not-yet born).
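To make the co-occurrence idea concrete, here is a minimal sketch that counts how many scenes each pair of characters shares. It assumes every scene object carries a `characters` array whose entries have a `name`; the function name `coOccurrences` and the `"A & B"` key format are illustrative choices, not part of the dataset.

```javascript
// Count shared scenes for every pair of characters.
// Assumption: scene.characters is an array of objects with a `name` field.
function coOccurrences(scenes) {
  const counts = new Map();
  for (const scene of scenes) {
    // De-duplicate and sort so each pair always produces the same key.
    const names = [...new Set(scene.characters.map((c) => c.name))].sort();
    for (let i = 0; i < names.length; i++) {
      for (let j = i + 1; j < names.length; j++) {
        const key = `${names[i]} & ${names[j]}`;
        counts.set(key, (counts.get(key) || 0) + 1);
      }
    }
  }
  return counts;
}
```

Sorting the names before pairing means "Jon Snow & Samwell Tarly" and "Samwell Tarly & Jon Snow" are counted as the same pair.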
A Wiki of Ice and Fire and the Game of Thrones Wiki were fantastic resources to figure out who characters are and where scenes were taking place when it wasn’t clear from the context clues in the show. I also used a variety of maps (Map of Essos, Ice and Fire World Map, Regions of Westeros, Season 1 Locations Map, Season 2 Locations Map) to locate each scene with a `location` and a `subLocation`.
The resulting `episodes.json` and `characters.json` datasets are available on github along with a few others, including: `locations.json` (including a north-to-south arrangement of `location` and `sublocation`), `characters-houses.json` (grouped for styling in the visualization), and `characters-include.json` (in case I wanted to only focus on main characters).
Aside: Some Simple Counting
Well, what to do with all that data? In addition to the visualization, here are some counts for seasons 1–6:
Top 10 Characters (by screen time in Seasons 1–6)
- Tyrion Lannister: 27,107 sec = 7 hr 31 min 47 sec
- Jon Snow: 24,781 sec = 6 hr 53 min 1 sec
- Cersei Lannister: 20,545 sec = 5 hr 42 min 25 sec
- Daenerys Targaryen: 19,427 sec = 5 hr 23 min 47 sec
- Sansa Stark: 18,329 sec = 5 hr 5 min 29 sec
- Arya Stark: 17,214 sec = 4 hr 46 min 54 sec
- Jaime Lannister: 15,657 sec = 4 hr 20 min 57 sec
- Jorah Mormont: 13,364 sec = 3 hr 42 min 44 sec
- Theon Greyjoy: 12,234 sec = 3 hr 23 min 54 sec
- Samwell Tarly: 11,500 sec = 3 hr 11 min 40 sec
Bottom 10-ish Characters (by screen time in Seasons 1–6)
- Ironborn in Skiff: 11 sec
- Guymon: 12 sec
- Olly’s Mother: 14 sec
- Tyrell Guard: 14 sec
- Vayon Poole: 15 sec
- Rickard Stark: 15 sec
- Stark Messenger: 18 sec
- Sorcerer: 20 sec
- Fruit Vendor: 20 sec
- Simpson: 21 sec
- Gordy: 21 sec
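Counts like these come from summing scene lengths per character. Here is a minimal sketch of that arithmetic. It assumes each scene has `sceneStart` and `sceneEnd` timestamps as "H:MM:SS" strings plus a `characters` array with `name` fields; those field names and formats are assumptions, and the helper names are made up for illustration.

```javascript
// Convert an "H:MM:SS" string to a number of seconds.
function toSeconds(hms) {
  return hms
    .split(":")
    .map(Number)
    .reduce((total, part) => total * 60 + part, 0);
}

// Format a number of seconds in the style used in the lists above.
function formatDuration(totalSeconds) {
  const hours = Math.floor(totalSeconds / 3600);
  const minutes = Math.floor((totalSeconds % 3600) / 60);
  const seconds = totalSeconds % 60;
  return `${hours} hr ${minutes} min ${seconds} sec`;
}

// Sum scene lengths for every character that appears in each scene.
// Assumption: sceneStart/sceneEnd are relative to the same episode start.
function screenTime(scenes) {
  const totals = new Map();
  for (const scene of scenes) {
    const length = toSeconds(scene.sceneEnd) - toSeconds(scene.sceneStart);
    for (const { name } of scene.characters) {
      totals.set(name, (totals.get(name) || 0) + length);
    }
  }
  return totals;
}

console.log(formatDuration(27107)); // prints: 7 hr 31 min 47 sec
```

Note that this credits a character with the full length of every scene they are listed in, so "screen time" here means time in scenes rather than time literally on camera.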
The Visualization
The data in `episodes.json` wasn’t the right structure to readily visualize, so I needed to rearrange and augment it. I’ve included my very-non-force-directed-physics-based strategy below in the hopes it might be useful.
I imagined a 3-dimensional array where the axes are `scenes`, `uniqueCharacters`, and `locations`. In that array, the code in `process.js` does the following:
- Fill all characters with `0` in the first scene in all locations.
- For each scene, if a character is present, enter a `1` in the location (for that character in that scene), then fill all other locations (for that character in that scene) with `0`’s because a character cannot be in two places at once.
- If a character is dead in a scene (`"alive":false`), then enter `0` for that character in the following scene (for that character in that same location).
- Fill forward: if a character is not in a location (is a `0`), they won’t be in that location until they arrive there. Beginning in the first scene, if a character in a location is `0`, look at the next scene; if that value is empty, make it `0`.
- Fill backward: if a character is not in a location (is a `0`), check to see if they were in that location previously. Beginning in the last scene, if a character in a location is `0`, look at the previous scene; if that value is empty, make it `0`.
- Any remaining empty values must be `1`’s.
- Count the maximum number of characters in a scene at each particular location. This number will be used to determine the y-position for each character and the y-range of each geographical region.
- Calculate the middle of each geographical region and the y-axis offset for each character in the region (characters in a location during a scene are sorted alphabetically), then assign each character a y-value for that scene based on all of the characters in that location at that time.
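The fill steps in the list above can be sketched for a single character as follows. This is not the code in `process.js`; it is a simplified reading of the steps, and the function name `fillPresence`, its input shape (one location name or `null` per scene), and the `deathScene` parameter are all assumptions made for illustration.

```javascript
// appearances[s] is the location where the character appears in scene s,
// or null if they are not in that scene. deathScene is the index of the
// scene where the character is marked "alive": false, or -1 if none.
// Returns { location: [0 or 1 per scene] }.
function fillPresence(appearances, locations, deathScene = -1) {
  const n = appearances.length;
  const grid = {};
  for (const loc of locations) {
    const row = new Array(n).fill(null); // null means "empty"
    row[0] = 0; // start with 0 in the first scene
    for (let s = 0; s < n; s++) {
      // Present somewhere: 1 at that location, 0 at all others.
      if (appearances[s] !== null) row[s] = appearances[s] === loc ? 1 : 0;
    }
    // Dead in a scene: 0 in the following scene. (Assumption: an explicit
    // appearance in that following scene is not overwritten.)
    if (deathScene >= 0 && deathScene + 1 < n && row[deathScene + 1] === null) {
      row[deathScene + 1] = 0;
    }
    // Fill forward: a 0 spreads into empty scenes that follow it.
    for (let s = 0; s < n - 1; s++) {
      if (row[s] === 0 && row[s + 1] === null) row[s + 1] = 0;
    }
    // Fill backward: a 0 spreads into empty scenes that precede it.
    for (let s = n - 1; s > 0; s--) {
      if (row[s] === 0 && row[s - 1] === null) row[s - 1] = 0;
    }
    // Anything still empty sits between (or after) appearances here: 1.
    grid[loc] = row.map((v) => (v === null ? 1 : v));
  }
  return grid;
}
```

With this reading, a character who appears at the same location in two non-adjacent scenes stays drawn there in between, while a character who moves from one location to another is drawn at neither during the scenes in between.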
This strategy generates the y-coordinate for a character in a scene, and the x-coordinate is simply related to the overall timestamp of the scene relative to the beginning of the first episode of season 1.
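As a rough illustration of how those coordinates could be computed, here is a minimal sketch. The line spacing, the choice to center characters around the middle of a region, the per-episode lengths used for the x-offset, and all function names are assumptions, not details taken from the project.

```javascript
const SPACING = 10; // hypothetical distance between adjacent character lines

// Stack regions north to south; each region's height comes from the
// maximum number of characters ever present there in one scene.
function regionBands(locationsNorthToSouth, maxCounts) {
  const bands = {};
  let top = 0;
  for (const loc of locationsNorthToSouth) {
    const height = maxCounts[loc] * SPACING;
    bands[loc] = { top, middle: top + height / 2, height };
    top += height;
  }
  return bands;
}

// Sort the characters in one location alphabetically and spread them
// evenly around the middle of that region's band.
function yPositions(charactersHere, band) {
  const sorted = [...charactersHere].sort((a, b) => a.localeCompare(b));
  const ys = {};
  sorted.forEach((name, i) => {
    ys[name] = band.middle + (i - (sorted.length - 1) / 2) * SPACING;
  });
  return ys;
}

// x grows with time elapsed since the start of the first episode.
// episodeLengths holds each episode's length in seconds, in order.
function xPosition(episodeLengths, episodeIndex, sceneStart, pxPerSecond = 1) {
  const before = episodeLengths
    .slice(0, episodeIndex)
    .reduce((sum, len) => sum + len, 0);
  return (before + sceneStart) * pxPerSecond;
}
```

Because each band is sized by the largest group it ever holds, characters centered this way never spill into a neighboring region.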
I then rendered the data using d3.js (h/t Mike Bostock, et al.) in the visualization below. In true Minard-ian (Minard-esque?) fashion, it includes information about: which characters are in which scene, their location, their family/affiliation, their title (Hand, Khaleesi, etc.), and their death, along with information about the season, episode, and episode title. The full interactive visualization is here: https://jeffreylancaster.github.io/game-of-thrones/.
What’s Next & Other Stuff
I plan to continue to collect and visualize the data for seasons 7 and 8 as they are released and the visualization will update accordingly. Although the dataset is already quite rich, there’s still more I could add: the location and ownership of the various important weapons (I’m thinking Valyrian steel and dragonglass), who’s slept with whom (that’s just through season 3) and how many times, or even building in `subLocation` to the y-position algorithm described above, for example.
The visualization itself will also continue to develop. I plan to add information for each death (from the `killedBy` field in `characters.json`), some fancier animations, and maybe even additional visuals pulled from IMDB (via `actorLink` in `characters.json`).
I’m perhaps most excited to see how other people take, reuse, and add to the data. The data could be used to recreate (and augment) HBO’s Tower of Joy infographic or any one of the many other Game of Thrones infographics, or it could be used to make a supercut of your favorite character’s story arc; the Sansa Stark movie is currently just over 5 hours through the end of season 6! Maybe you could even begin to do predictive analytics on who will die next (and when)? ⚔️
What do you think you could do with this data?
The full project code is on github, and I’m happy to answer any questions about it here or there.
p.s. I like these, too: Westeros Transit Map and Known World Transit Map.
Update (Feb. 2019): I’ve now written a follow-up to this: “32 Game of Thrones Data Visualizations”. Enjoy!
Update (Apr. 2019): And here’s another follow-up: “19 More Game of Thrones Data Visualizations”.
Update (May 2019): “Introducing Game of Thrones Script Search”
Update (Apr. 2019): Now that the final season has begun, I’m posting a weekly data-driven recap of each episode:
- “Winterfell” (Season 8, Episode 1) Data Visualization Recap
- “A Knight of the Seven Kingdoms” (Season 8, Episode 2) Data Visualization Recap
- “The Long Night” (Season 8, Episode 3) Data Visualization Recap
- “The Last of the Starks” (Season 8, Episode 4) Data Visualization Recap
- “The Bells” (Season 8, Episode 5) Data Visualization Recap
- “The Iron Throne” (Season 8, Episode 6) Data Visualization Recap