Game of Thrones Dead/Alive Character Classification Using Logistic Regression

Picture source : Indiewire
Valar Morghulis.. Valar Dohaeris..

This is High Valyrian for “all men must die” and “all men must serve”. George R.R. Martin is known for shaking up the Game of Thrones story by killing important characters. So here I am, working on Game of Thrones character death data, driven by curiosity, though honestly I just wanted to fill my free time in a good way 😆.

Data Pre-processing

As usual, I imported the .csv file into a pandas data frame to make data processing and visualization simpler in Python.
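A minimal sketch of the import step. The inline rows are made up so that the snippet runs on its own, and the `Name` and `Allegiances` column names are my assumption about the file layout; with the real file, `pd.read_csv` would simply be given the path to the .csv.

```python
import io

import pandas as pd

# Tiny inline stand-in for the real .csv file (rows are made up for illustration).
sample_csv = io.StringIO(
    "Name,Allegiances,Death Year,Book of Death,Death Chapter,Book Intro Chapter\n"
    "Character A,Stark,,,,12\n"
    "Character B,Night's Watch,299,3,51,\n"
    "Character C,Lannister,298,1,65,4\n"
)

# With the real data, pass the path of the .csv file instead of sample_csv.
df = pd.read_csv(sample_csv)

print(df.head())        # first rows, empty cells show up as NaN
df.info()               # non-null count per column
print(df.isna().sum())  # number of missing values per column
```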

The pandas data frame above contains a lot of NaN values. Looking at the data frame info below, there are 917 rows in total. Some columns have missing (null) values; Book Intro Chapter, for example, is missing for 12 rows. For the Death Year, Book of Death and Death Chapter columns, however, a null value means the character is still alive.

Therefore, I only process further the 905 characters that have a non-null value in the Book Intro Chapter column, in other words the characters that are introduced in a known book chapter.
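The filtering step could look roughly like this. The three-row data frame is a made-up stand-in for the real one; only the column names Book Intro Chapter and Death Year come from the data described above.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data frame.
df = pd.DataFrame({
    "Name": ["Character A", "Character B", "Character C"],
    "Book Intro Chapter": [12.0, np.nan, 4.0],
    "Death Year": [np.nan, 299.0, 298.0],
})

# Drop only the rows whose introduction chapter is missing.
# Nulls in the death columns are kept, because they mean "still alive".
df_clean = df.dropna(subset=["Book Intro Chapter"]).reset_index(drop=True)

print(len(df), "->", len(df_clean))
```

Passing `subset` matters here: a plain `dropna()` would also throw away every living character, since their death columns are null.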

Data Overview

The data shows whether characters who pledged to a certain House die or not, when they are introduced, which books they appear in, and when they die. There are five books in the data: A Game of Thrones (GoT), followed by A Clash of Kings (CoK), A Storm of Swords (SoS), A Feast for Crows (FfC), and A Dance with Dragons (DwD). The appearance columns are binary: 1 means the character appears in that book, and 0 means he/she doesn’t.

The first thing I do is determine whether each character dies by converting Death Year, Book of Death and Death Chapter into a binary value: 0 if those columns are null, and 1 if they are not. With that, I can group the characters by allegiance and visualize in a histogram how many characters have died for each allegiance. The resulting histogram is shown in the figure below.
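Here is a sketch of the binary conversion and the grouping, on made-up rows. Two things are assumptions on my side: the allegiance column is called `Allegiances`, and a character counts as dead when at least one of the three death columns is filled in.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the three death column names come from the data.
df = pd.DataFrame({
    "Allegiances": ["Stark", "Stark", "Lannister", "Night's Watch", "Night's Watch"],
    "Death Year": [298.0, np.nan, np.nan, 299.0, 300.0],
    "Book of Death": [1.0, np.nan, np.nan, 3.0, 5.0],
    "Death Chapter": [65.0, np.nan, np.nan, 51.0, np.nan],
})

death_cols = ["Death Year", "Book of Death", "Death Chapter"]

# 1 if any death column is filled in, 0 if all of them are null (assumed rule).
df["Dead"] = df[death_cols].notna().any(axis=1).astype(int)

# Number of dead characters per allegiance, largest first.
deaths_per_allegiance = (
    df.groupby("Allegiances")["Dead"].sum().sort_values(ascending=False)
)
print(deaths_per_allegiance)
```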

As the histogram above shows, the majority of GoT characters are not pledged to a noble House and have not joined the Night’s Watch or the Wildlings. Next come the Night’s Watch brothers. As we know, the Night’s Watch consists of men from various Houses, whether volunteers or castaways seeking to regain their honor, plus bastards and thieves, so it is quite reasonable that the Night’s Watch takes second place. Then come Stark and Lannister, obviously: the House that rules the North and the House that rules the Seven Kingdoms must both have a lot of people pledged to them. And so on.

In terms of deaths, the largest count belongs to people pledged to no one. That is not surprising, since most of the population is pledged to no one, so many of the dead fall in that group too. To be more specific about how likely people pledged to a certain House or organization are to die, we have to look at the death/population ratio. Its histogram is shown in the picture below.
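The death/population ratio can be sketched like this. The rows are made up, and `Allegiances` and `Dead` are assumed column names (`Dead` being the binary flag described earlier).

```python
import pandas as pd

# Made-up rows: one row per character, Dead is 1 for dead and 0 for alive.
df = pd.DataFrame({
    "Allegiances": ["Stark", "Stark", "Lannister", "Lannister", "Wildling", "Wildling"],
    "Dead": [1, 0, 0, 0, 1, 1],
})

# Deaths and head count per allegiance, then their ratio.
summary = df.groupby("Allegiances")["Dead"].agg(deaths="sum", population="count")
summary["death_ratio"] = summary["deaths"] / summary["population"]

print(summary.sort_values("death_ratio", ascending=False))
```

Because `Dead` is binary, the ratio is the same as the group mean, so `df.groupby("Allegiances")["Dead"].mean()` would give the same numbers in one step.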

It shows that the Wildlings (also known as the Free Folk) take first place. That makes sense, because they live north of the Wall, which is the least hospitable region in Westeros: it is very cold, and it is home to the White Walkers, an ancient race of humanoid ice creatures that aims to banish the human race. Next is the Night’s Watch, a military order that holds and guards the Wall. Its members mostly die in battles with the Wildlings, while patrolling beyond the Wall, or at the hands of the White Walkers. Then come Stark, Baratheon and Lannister. Those three Houses are in the middle of a war fought for vengeance and for the Iron Throne, so it is no surprise that they sit in third, fourth and fifth place in death ratio.

Features Dependence

To begin the classification with logistic regression, the first thing I do is define which features will be used as input. The inputs are the character’s appearance in each book, nobility and gender. The output is whether the character is dead or alive, obviously.

From the figure above, we can conclude that the features are only weakly correlated with each other, since the heat map colors fall in a range of approximately 0.0 to 0.3.
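The numbers behind such a heat map are a pairwise correlation matrix of the feature columns. Here is a small sketch with made-up binary values; the column names `Gender` and `Nobility` are assumptions, and only four features are shown to keep it short.

```python
import pandas as pd

# Made-up binary feature values for six characters.
features = pd.DataFrame({
    "Gender":   [1, 1, 0, 1, 0, 1],
    "Nobility": [0, 1, 1, 0, 0, 1],
    "GoT":      [1, 0, 1, 1, 0, 0],
    "CoK":      [0, 1, 1, 0, 1, 0],
})

# Pearson correlation between every pair of features.
# The diagonal is always 1; the off-diagonal values are what the heat map colors.
corr = features.corr()
print(corr.round(2))
```

The matrix can then be handed to any plotting library that draws heat maps.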

Classification using Logistic Regression

I split the data and used only 70% of it to train the regression (no reason behind that number, just a value I picked myself). The logistic regression model is imported from the scikit-learn Python module, so it isn’t hard to apply at all. And let the classification begin!
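A rough sketch of the split and the fit with scikit-learn. The data here is random and only stands in for the real features, so the score it prints means nothing; the column names, the `random_state` values and the default model settings are my assumptions, not necessarily what the original notebook used.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Random stand-in data: 200 characters with binary features and a binary label.
rng = np.random.default_rng(0)
feature_cols = ["Gender", "Nobility", "GoT", "CoK", "SoS", "FfC", "DwD"]
X = pd.DataFrame(
    rng.integers(0, 2, size=(200, len(feature_cols))), columns=feature_cols
)
y = pd.Series(rng.integers(0, 2, size=200), name="Dead")

# 70% of the rows go to training, the remaining 30% are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)

# Fit logistic regression with scikit-learn's default settings.
model = LogisticRegression()
model.fit(X_train, y_train)

# Predicted class (0 = alive, 1 = dead) for the held-out characters.
y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(X_train.shape, X_test.shape, accuracy)
```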

If you want to know the theory behind logistic regression, you can find it here.

Performance Measurements

The performance of logistic regression on the data is calculated using several metrics: accuracy, precision, recall and F1 score. Before digging into those, we have to know the meaning of true positive (TP), true negative (TN), false positive (FP), and false negative (FN).

Since I do a binary classification/prediction, there are 4 types of result (TP, TN, FP, FN). TP and TN are the cases where the prediction and the actual data show the same value: TP if both are the yes class, and TN if both are the no class. In contrast, FN and FP are the cases where the prediction and the actual data differ: if the actual class is yes but the predicted class is no, it is called FN, and if the actual class is no but the predicted class is yes, it is known as FP.
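To make the four outcomes concrete, here is a small counting sketch in plain Python. The `confusion_counts` helper and the two label lists are made up for illustration, and treating 1 (dead) as the yes class is my assumption.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for two equally long lists of class labels."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # actual yes, predicted yes
        elif a != positive and p != positive:
            tn += 1  # actual no, predicted no
        elif a != positive and p == positive:
            fp += 1  # actual no, predicted yes
        else:
            fn += 1  # actual yes, predicted no
    return tp, tn, fp, fn


# Made-up labels: 1 = dead (yes class), 0 = alive (no class).
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)  # 3 3 1 1
```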

The first performance metric is accuracy, which is simply the ratio of correctly predicted observations to the total observations.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

The second one is precision, defined as the ratio of correctly predicted positive observations to the total predicted positive observations.

Precision = TP / (TP + FP)

The third one is recall, the ratio of correctly predicted positive observations to all observations whose actual class is yes.

Recall = TP / (TP + FN)

The last one is the F1 score, which is the harmonic mean of precision and recall.

F1 Score = 2*(Recall * Precision) / (Recall + Precision)
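The four formulas can be checked with a few lines of Python. The `classification_metrics` helper and the counts passed to it are made up for illustration; they are not the results of my model.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from the four outcome counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * (recall * precision) / (recall + precision)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts, only to show the arithmetic.
metrics = classification_metrics(tp=30, tn=50, fp=10, fn=10)
print(metrics)
```

With these counts, accuracy is 80/100 = 0.8, while precision and recall are both 30/40 = 0.75, so F1 is 0.75 as well. The guards against division by zero are my addition for the case where the model never predicts the yes class.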

The performance metrics of the logistic regression result are shown in the figure below.

Dead/Alive Classification Result

From a different perspective, I can compare the classified/predicted data with the actual data for dead and alive characters pledged to each House/organization in a histogram, which looks like the figure below.

The left one is the comparison histogram for dead characters. It shows a slight difference between the prediction and the actual data, but the prediction is good for several allegiances such as Lannister, Greyjoy and Tully, where the predicted and actual bars nearly overlap.

The right one is the comparison histogram for alive characters. For almost all Houses, the alive prediction overlaps the actual alive data, except for Wildling, Baratheon, Night’s Watch and None. It can be concluded that the classification/prediction is better for alive characters than for dead characters.
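The counts behind the two comparison histograms can be sketched as follows. The rows and the column names `Allegiances`, `actual` and `predicted` are made up for illustration; 1 stands for dead and 0 for alive.

```python
import pandas as pd

# Made-up test-set results: actual and predicted class per character.
results = pd.DataFrame({
    "Allegiances": ["Stark", "Stark", "Lannister", "Lannister", "Wildling", "Wildling"],
    "actual":    [1, 0, 0, 1, 1, 1],
    "predicted": [1, 0, 0, 1, 0, 1],
})

# Dead characters per allegiance: sum of the 1s, actual next to predicted.
dead_counts = results.groupby("Allegiances")[["actual", "predicted"]].sum()

# Alive characters per allegiance: flip 0/1 first, then sum the same way.
alive_counts = (
    (1 - results[["actual", "predicted"]]).groupby(results["Allegiances"]).sum()
)

print(dead_counts)
print(alive_counts)
```

Each table has one row per allegiance and one column each for actual and predicted, which is the shape needed for a side-by-side bar plot.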

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

You can find the Python code I used to apply logistic regression to the data here, and the data itself here.