Valar Morghulis, a Game of Thrones Death Predictor
There is a Spanish version of this post here
Maybe it’s old news, because some people have already tried to create prediction models to find out who dies in Game of Thrones (like here, here and here), but I wanted to try it myself, because two things are certain: no two models are alike, and… I love Game of Thrones!
Game of Data
The first book was published in 1996 and, in 2011, HBO released the first season, based on that first book. The television series was so successful that in 2015, when the last episode of season 5 (based on book 5) aired, the author hadn’t finished writing the 6th book! To avoid keeping the fans waiting, HBO partnered with GRRM to write the script directly for the show. This way, seasons 6 and 7 were released, and half of the planet is waiting for the 8th season this April 14th (2019).
Since the series is so famous for killing principal characters (It’s true! You can’t have a favourite character because he/she will die and, slowly, other characters take the lead… and will probably die too), I decided to build a Classification Model in Python to try to find any rule or pattern and discover: Who will die in this last season?
To do this, I used the following datasets:
- Information on the 5 books written so far (“Song of Ice and Fire” Wiki): Contains name, title, gender, age, is noble, house, books appeared in and if he/she’s dead in the books.
- Information on the first 6 seasons (Kaggle): Name, actor, gender, age, house, seasons appeared in, time on screen in each one and if he/she’s dead on the show.
- Technical information on the television series (Mark Needham’s really cool GitHub): Chapter, title, rating, viewers, director and writers.
With this information (5 books and the first 6 seasons = 72 features!), I plan to train a Classification Model. Later, using the data from the 7th season, I’ll try to predict the variable “Is dead on the show?” (isDead_shw).
You Know Nothing, Data Scientist
Here are some interesting things I faced during this stage:
- Since the datasets come from different sources, joining them by name was not easy (I had “Ned Stark” in one dataset and “Eddard Stark” in the other, for example). In these cases I recommend a library called “FuzzyWuzzy”, which helps you find the closest match to a given string in a Pandas Series.
- Each character has the variables “house_shw”, the house where his loyalty stands, and “culture_shw”, the house where he was born. For many, these two houses are not the same. Something similar happens with the variables from the books: “house_bk” and “culture_bk”.
- There are more than 40 houses, according to the books! To simplify the model, I only worked with the 10 houses with the most characters; the rest were assigned the value “Other”. This applies to both house-loyalty and house-family.
- In this dataset, characters don’t come back to life. I have their state at the end of book 5 and season 6 (S6). If a character comes back to life in season 7 (S7), a new “alive” record will be used. isDead_shw only has values 1–0.
- This dataset only includes human characters: no dragons, direwolves, White Walkers or Children of the Forest.
- Alive characters that suffered no change from S6 to S7 were removed from the training set. Otherwise, the model would “memorize” the alive record in S6 and, if it’s the same as the one in S7, the prediction would still be “alive”, without any kind of analysis.
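The name-matching and house-grouping steps above can be sketched roughly as follows. This is only an illustration, not the code used for the analysis: it uses `difflib` from Python's standard library as a stand-in for FuzzyWuzzy so that it runs without extra installs, and the helper names `closest_name` and `bucket_rare`, the 0.6 cutoff and the sample lists are all made up for the example.

```python
from collections import Counter
from difflib import get_close_matches


def closest_name(name, candidates, cutoff=0.6):
    """Return the candidate most similar to `name`, or None if none clears `cutoff`."""
    # Compare in lowercase, then map the winner back to its original spelling.
    lowered = {c.lower(): c for c in candidates}
    hits = get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


def bucket_rare(values, top_n=10, other="Other"):
    """Keep the `top_n` most frequent values and replace the rest with `other`."""
    keep = {value for value, _ in Counter(values).most_common(top_n)}
    return [v if v in keep else other for v in values]


# Made-up sample data, only to show the calls.
show_names = ["Eddard Stark", "Arya Stark", "Sansa Stark", "Jon Snow"]
print(closest_name("Ned Stark", show_names))

houses = ["Stark", "Stark", "Stark", "Lannister", "Lannister", "Frey"]
print(bucket_rare(houses, top_n=2))
```

A cutoff is worth keeping even when only the best match is wanted: without it, every name gets paired with something, and characters that exist in only one dataset would be joined to the wrong record.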
To generate data from S7, the following assumptions were made (Spoiler Alert!):
- Alive characters in S6 that died in S7: Leona Tyrell, Taena of Myrr, Petyr Baelish and Kevan Lannister.
- Characters that changed their loyalty house: Brienne of Tarth (Bolton to Stark), Bronn (Other to Lannister)
- Characters that changed their family house: Jon Snow
- Characters that are assumed alive, since we know nothing about them in S7: Ellaria Sand, Meera Reed
- Characters that are assumed dead, since we know nothing about them in S7: Benjen Stark.
The Things we do for Data
Although the show is famous for “killing a lot of characters”, the truth in the books is different. In the books, there are more details and more characters (more than 2000!), so there are a lot of alive characters. In the show, we only meet the main characters (still, that’s over 100!) and these are the ones who die, as you can see below in the chart on the left (percent of dead characters).
Analyzing by season/book (right graphic), we see that, in the beginning (S1), the show seems to follow the book’s story.
In S3, however, we see a great difference in the number of deceased. In this season, there’s an episode (“The Rains of Castamere”) in which, without giving many spoilers, a lot of characters are killed in one scene. In the show, however, you only see the main characters die, the ones you have followed throughout the show. In the books, this scene has much more detail and the number of known dead characters is greater.
In S4 it happens the other way around: more deaths in the show than in the books. There are two particular scenes in which we think this happens. This is when Shae dies (book 3, season 4), but in the show there are other moments when other prostitutes, who aren’t in the books, die. Also, this is when Ygritte dies (book 3, season 4), along with many of her wildling friends that are presented in the show, but not in the books.
Here’s where the popularity of the show and of each character starts to seem important. How many liberties are the writers taking in the episodes, moving away from the books? The graphic above also shows, for example, that S6, when there is no book to rely on, is one of the seasons with the highest number of deaths. Maybe it’s for the ratings?
As a completely personal opinion, this Data Scientist believes that the resurrection of a certain character was more a product of ratings, public pressure and social media petitions from fans than the writer’s own idea, but… we’ll never know… it’s a great idea for a new analysis though.
I model and I know things
I addressed the problem using two datasets:
- Characters that appear in the books AND in the show (115 characters, 65 dead)
- Characters that appear in the books, including their data from the show where applicable (2011 characters, 65 dead)
The number of characters marked isDead_shw=1 (Deceased) is the same, because in both cases we are looking for deaths in the show. Another interesting analysis would be to predict deaths in the books, but we still have time for that.
The following models were run and these are the results:
The models seem to work better with the complete dataset from the books, because they have more rows to characterize living characters and separate them from the dead. I did not base this decision on accuracy, because the dataset is highly unbalanced. Instead I used recall (of all the characters that really died, how many the model predicted as dead), because I don’t want to miss any death, and precision (of all the times the model said “dead”, how many it got right), because if the model says a character died when he didn’t, it would lose credibility.
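To see why accuracy is a poor guide here, the short sketch below computes the three metrics from confusion-matrix counts, with “dead” as the positive class. The function name is made up for the example, and returning 0 when a denominator is 0 is a convention I’m assuming. It then scores a trivial model that always says “alive” on the books dataset (2011 characters, 65 dead).

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Metrics from confusion-matrix counts; 'dead' is the positive class."""
    total = tp + fp + fn + tn
    # Assumed convention: a metric with an empty denominator is reported as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


# Baseline that predicts "alive" for everyone in the books dataset.
n_total, n_dead = 2011, 65
precision, recall, accuracy = precision_recall_accuracy(
    tp=0, fp=0, fn=n_dead, tn=n_total - n_dead
)
print(f"accuracy={accuracy:.2%} recall={recall:.0%} precision={precision:.0%}")
```

This do-nothing baseline is right about 97% of the time (1946 out of 2011) and still never finds a single death, which is why recall and precision on the “dead” class drive the model choice.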
With these criteria, four models were chosen for a deeper analysis: Logistic Regression (tv show), Logistic Regression (books), SVM (tv show) and Random Forest (tv show). Now let’s check the importance that each of the models gives to each of the 72 features.
- Logistic Regression (tv show): The model gives more importance to the time on screen across all seasons, which seems obvious: if a character stops appearing at all, it is probably because they died. It also gives a considerably high importance to time on screen in S2.
- SVM and Random Forest (tv show): They give the same importance to almost all variables (very high and very low respectively). They also seem to give a high probability of death to characters appearing in S2. Looking deeper into S2, it turns out that no particularly important character died in this season; on the contrary, several characters are introduced that will die at “the Red Wedding” (S3).
- Logistic Regression (books): Gives different importance to each feature and doesn’t concentrate importance on any season in particular. Instead, the highest probability of death goes to the characters belonging to the houses Bolton, Baratheon, Stark and Targaryen… which makes sense. This is the model that will be used to make the prediction.
Predictions are coming…
The chosen model was Logistic Regression based on the book characters. The probability threshold to say “dead” or “alive” is not 50%, due to the class imbalance of the characters in the books/show. Instead, the threshold is set at 81% (that is, characters with a probability lower than 81% are most likely to live).
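Applying the threshold is a one-line rule, sketched below. The function name is hypothetical, and treating a probability of exactly 81% as “dead” is my assumption, since only “lower than 81%” is defined above. The two sample probabilities are the ones quoted in this post for Bran and Ellaria Sand; with a scikit-learn classifier, values like these would typically come from `predict_proba`.

```python
DEATH_THRESHOLD = 0.81  # cut-off quoted in the post, instead of the usual 0.5


def label(prob_dead, threshold=DEATH_THRESHOLD):
    """Turn a death probability into a class label."""
    # Assumption: a probability exactly at the threshold counts as "dead".
    return "dead" if prob_dead >= threshold else "alive"


# Probabilities mentioned in this post.
predicted = {"Bran Stark": 0.71, "Ellaria Sand": 0.52}
for name, prob in predicted.items():
    print(f"{name}: {prob:.0%} -> {label(prob)}")
```

Both characters land on “alive”, even though their probabilities are above 50%; that is the practical effect of moving the threshold up to 81%.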
Below there’s an image of the known characters from the show in death probability order…
This could be a spoiler…
Don’t say I didn’t warn you…
No, I don’t like the results of the model either, but these are some of the reasons I believe the model is predicting this way, and they must be taken into account:
- The model doesn’t include time as a feature. The list above is ordered by probability of dying during S8; it is not the order in which the deaths would happen.
- Last season, Tyrion Lannister, one of my favourite characters, gave his loyalty to Daenerys Targaryen (a house with high importance in the model). Also, I think his popularity is lower than in previous seasons, because he used to be more clever.
- Last season we found out that Jon Snow is really Aegon Targaryen, which also means his family-house changed and, most certainly, his death probability increased.
- The image above doesn’t show all characters, just the ones with the highest death probability (>34%). If the character you’re looking for is not in the image, it is because he/she has a lower probability and he/she will live!
- According to the model, the characters with royal blood that can still claim the Iron Throne and that have a lower probability of dying are: Bran (who I think is no longer interested in the position), Gendry (he’s really a Baratheon), HotPie (I even saw a theory on the web defending him), or… Cersei Lannister.
So this is the main reason I don’t like my model but “That’s the data and it has to be shared”. I really hope I’m wrong.
Update [28–04–2019]: State of the model predictions after episode 3: Right guess average probability of real deaths until now is 75%.
Update [05–05–2019]: State of the model predictions after episode 4: Right guess average probability of real deaths until now is 77%.
Update [12–05–2019]: State of the model predictions after episode 5, the one with the highest number of deceased known characters in this season so far: Right guess average probability of real deaths until now is 73%.
One observation: for its training, the model assumed that Ellaria Sand, locked in the dungeons of King’s Landing, was alive. It predicted her death with 52% probability (i.e. alive at the end of the season), but I’m assuming she died in the castle’s collapse, even though it wasn’t shown directly in the episode.
Update [19–05–2019]: State of the model predictions after episode 6, the final episode: Right guess average probability of real deaths is 74%.
But since the show is over, we can calculate the final metrics for the model predicting dead and alive characters at the end of the season (that was the target of the model: dead/not dead at the end, regardless of timing).
Below you can see the final confusion matrix of the model. The model was very accurate (99.38%) thanks to the high number of alive characters it got right… which is not surprising given the number of alive characters in the set: over 1900!
Precision on catching dead characters (64%) was fair enough, given that the model predicted the death of most of the main characters and some of them had to stay alive to the end of the show. Recall on dead characters (56%) indicates that some of the deaths were surprising, even for the model!
What do we say to the god of Unbeatable Models?…. Not Today
This model can be optimized, of course it can, for starters:
- It was trained using a super-unbalanced dataset: only 65 records marked as target=1 (deceased), including Edmure Tully… and if you watched the episode closely, he was alive all along! So that could be one of the reasons the model got more False Positives than expected: it was trained to predict a death based on features of a character that was really alive! With such unbalanced classes, 1 out of 65 is a lot!
- New characters appeared in this last episode, in the final council of people representing the most powerful houses: five characters to be exact, including the new prince of Dorne. Sure, they are alive, but knowing their names and their features during training would have helped the model avoid some False Negatives as well.
- It could take time into account to predict the order of deaths, which is something I always wanted to do and could be a great idea to work on, maybe for the upcoming books?
King Bran, the Predicted One
No kidding! If we check the model’s original predictions, Bran had a 71% chance of death, which is a lot lower than 81%, our threshold. Of the characters predicted alive (<81%), he’s the first one with royal blood to appear among the lower predictions.
Personally, I didn’t think it was possible, because he said many times he didn’t want the job, so the next possible options were Gendry or Cersei, and that’s why I didn’t like my model (So you see, Data Scientists can be wrong even when interpreting their own models, and modeling can help you discover things you didn’t even know!)
I changed my mind: I loved this model, I love the show and I love Data Science!
If you want to take a look at the full analysis, data and code, check out my GitHub