Hogwarts Housing with Logistic Regression
--
This is one of the most famous scenes in the Harry Potter movies: the Sorting Hat places Harry Potter in one of the Hogwarts houses based on his specific features.
Have you ever considered how it actually works? Is it real magic or a deterministic algorithm? To my mind, science always lies behind the magic, so the Sorting Hat is more of a classification algorithm than magic.
So, in this tutorial, I try to implement a Hogwarts Sorting Hat with logistic regression in PyTorch for Harry Potter characters, based on their features.
Are you Gryffindor? Are you Hufflepuff? Are you Slytherin? Are you Ravenclaw? How is it possible to predict that? In this post, I'll try to cover all of these things. So, don't just talk the talk, walk the walk :)
Let’s start!
Data loading
The Sorting Hat uses some prior knowledge to make its decision. From a machine learning perspective, we can consider this process a classification task: given some information about Hogwarts students (the data, the input), predict which house they belonged to (the label, the target). Let's load the data and investigate it.
import pandas as pd
df_train = pd.read_csv('dataset_train.csv')
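A quick first look at the data with standard pandas inspection (the exact cells in the notebook may differ):
# Shape, column types, and missing-value counts
df_train.info()
# First few rows
df_train.head()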
As we can see, this dataset needs some preprocessing: there are features with missing values, and there is a mix of categorical and numerical features.
Assume that our input features will be Birthday, Best Hand, Arithmancy, Astronomy, Herbology, Defense Against the Dark Arts, Divination, Muggle Studies, Ancient Runes, History of Magic, Transfiguration, Potions, Care of Magical Creatures, Charms, and Flying, and the target feature, of course, will be Hogwarts House.
We can get down to classification, but let’s look at the data more thoughtfully.
“Curiosity is not a sin,” he said. “But we should exercise caution with our curiosity… yes, indeed….” — Albus Dumbledore
Looking back at df_train.info(), we can notice that there are no unlabeled items in the target feature. Good news! But we have some missing values in the numerical features: Arithmancy, Astronomy, Herbology, Defense Against the Dark Arts, Divination, Muggle Studies, Ancient Runes, History of Magic, Transfiguration, Potions, and Care of Magical Creatures. We can fill the missing values with the mean value of the corresponding columns. Categorical features such as Best Hand and Hogwarts House should be converted into numerical ones.
But first, let's look at the relations between features and their distributions.
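One way to draw such a pair plot (seaborn is my assumption here; the notebook may use a different plotting setup):
import seaborn as sns
import matplotlib.pyplot as plt
# Color each student by house to see which subjects separate the four groups
sns.pairplot(df_train, hue='Hogwarts House')
plt.show()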
Here we look at the pair plot to see which variables are useful for grouping people. There is no point in using all the features, since some of them don't provide an obvious division into groups and can only bring in noise. Since we are looking for four clearly distinct groups, the useful features are Astronomy, Herbology, Defense Against the Dark Arts, Ancient Runes, Charms, and Flying.
Two features, Defense Against the Dark Arts and Astronomy, are interesting: at first glance they look quite similar. Let's take a look at them. Here we see that these features really are similar; moreover, one of them is the result of multiplying the other by a constant factor:
df_train['Astronomy'] / df_train['Defense Against the Dark Arts']
Out[14]:
0 -100.0
1 -100.0
2 -100.0
3 -100.0
4 NaN
...
1595 -100.0
1596 -100.0
1597 -100.0
1598 -100.0
1599 -100.0
Length: 1600, dtype: float64
We can drop one of them; let it be Astronomy.
Now we are ready to apply some preprocessing and clean-up to our dataset. First, we need to drop the useless columns, fill missing values in features with mean values, convert categorical features to numerical ones, and convert the string date into separate day, month, and year features. After that, we need to split the data into train, validation, and test sets and standardize the numerical values. I provide all the preprocessing steps in my Jupyter notebook on Jovian.ml, where the resulting dataframe can be seen.
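A condensed sketch of what these steps can look like (the drop list, the 'Left'/'Right' encoding, and the house-to-index mapping are my assumptions; the notebook's exact code may differ):
# Drop the redundant and useless columns ('Index'/'First Name'/'Last Name' are assumed names)
df = df_train.drop(columns=['Index', 'First Name', 'Last Name', 'Astronomy'])
# Fill missing numerical values with the mean of the corresponding column
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
# Convert categorical features into numerical ones
df['Best Hand'] = df['Best Hand'].map({'Left': 0, 'Right': 1})
houses = {'Gryffindor': 0, 'Hufflepuff': 1, 'Ravenclaw': 2, 'Slytherin': 3}
df['Hogwarts House'] = df['Hogwarts House'].map(houses)
# Split the string date into separate day, month, and year features
bday = pd.to_datetime(df['Birthday'])
df['Day'], df['Month'], df['Year'] = bday.dt.day, bday.dt.month, bday.dt.year
df = df.drop(columns=['Birthday'])
# Standardize the numerical features: zero mean, unit variance
feature_cols = [c for c in df.columns if c != 'Hogwarts House']
df[feature_cols] = (df[feature_cols] - df[feature_cols].mean()) / df[feature_cols].std()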
Finally, let's look at the correlation between features via a correlation matrix, which tells us which variables play a more important role in predicting the Hogwarts House:
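One way to plot it (again assuming seaborn, and reusing the prepared dataframe from the sketch above):
# Correlation matrix of the prepared features and the target
corr = df.corr()
sns.heatmap(corr, cmap='coolwarm')
plt.show()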
So we have an almost fully prepared dataframe for the training process. The last steps are to convert the dataframe into NumPy arrays and then convert those into PyTorch tensors. You can find further details in the Jupyter notebook.
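A minimal sketch of this conversion plus the split into train, validation, and test sets (the 70/15/15 proportions here are my assumption; the notebook's exact split may differ):
import torch
# Convert the prepared dataframe into NumPy arrays
inputs_np = df[feature_cols].to_numpy(dtype='float32')
targets_np = df['Hogwarts House'].to_numpy(dtype='int64')
# ...and into PyTorch tensors
inputs = torch.from_numpy(inputs_np)
targets = torch.from_numpy(targets_np)
# Shuffle, then split: 70% train, 15% validation, 15% test (assumed proportions)
n = len(inputs)
idx = torch.randperm(n)
train_idx = idx[:int(0.7 * n)]
val_idx = idx[int(0.7 * n):int(0.85 * n)]
test_idx = idx[int(0.85 * n):]
train_inputs, train_targets = inputs[train_idx], targets[train_idx]
val_inputs, val_targets = inputs[val_idx], targets[val_idx]
test_inputs, test_targets = inputs[test_idx], targets[test_idx]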
Next, we need to create PyTorch datasets and data loaders for training, validation, and testing.
from torch.utils.data import TensorDataset, DataLoader
# Datasets
train_dataset = TensorDataset(train_inputs, train_targets)
val_dataset = TensorDataset(val_inputs, val_targets)
test_dataset = TensorDataset(test_inputs, test_targets)
# Dataloaders (batch_size is defined under "Training parameters" below;
# validation/test batches can be larger since no gradients are computed)
train_loader = DataLoader(train_dataset, batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size*2)
test_loader = DataLoader(test_dataset, batch_size*2)
Inside, the resulting data loader looks like this:
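We can peek at a single batch as a sanity check; with batch_size = 64 and 16 input features the shapes come out as:
for xb, yb in train_loader:
    print('inputs batch:', xb.shape)   # torch.Size([64, 16])
    print('targets batch:', yb.shape)  # torch.Size([64])
    break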
And we are ready to implement a model.
Model
Create fit and evaluate methods for training and validation:
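The full code lives in the notebook; below is a minimal sketch of what HousingModel, fit, and evaluate can look like, following the style of the "Deep Learning with PyTorch: Zero to GANs" course. The names match the calls used later in this post, but the bodies here are my reconstruction, not the notebook's verbatim code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def accuracy(outputs, labels):
    # Fraction of predictions that match the labels
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))

class HousingModel(nn.Module):
    def __init__(self):
        super().__init__()
        # A single linear layer: multiclass logistic (softmax) regression;
        # input_size and num_classes are defined under "Training parameters" below
        self.linear = nn.Linear(input_size, num_classes)

    def forward(self, xb):
        return self.linear(xb)

    def training_step(self, batch):
        inputs, targets = batch
        out = self(inputs)
        return F.cross_entropy(out, targets)

    def validation_step(self, batch):
        inputs, targets = batch
        out = self(inputs)
        return {'val_loss': F.cross_entropy(out, targets).detach(),
                'val_acc': accuracy(out, targets)}

def evaluate(model, loader):
    # Average the loss and accuracy over all batches of a loader
    outputs = [model.validation_step(batch) for batch in loader]
    return {'val_loss': torch.stack([o['val_loss'] for o in outputs]).mean().item(),
            'val_acc': torch.stack([o['val_acc'] for o in outputs]).mean().item()}

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    optimizer = opt_func(model.parameters(), lr)
    history = []
    for epoch in range(epochs):
        # Training phase
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Validation phase
        history.append(evaluate(model, val_loader))
    return history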
Training parameters:
# Hyperparameters
batch_size = 64
learning_rate = 1e-3
# Other constants
input_size = 16
num_classes = 4
Here we go! Define the model and fit it. Here comes the magic.
model = HousingModel()
history = fit(150, learning_rate, model, train_loader, val_loader)
After 150 epochs, the model produced the results shown in the plots below.
In this training phase, the model was trained for 150 epochs with a learning rate of 1e-3. Because the model parameters are randomly initialized, it is expected that at the beginning of training the accuracy is very low and the validation loss is high. As expected, during training the loss decreased and the accuracy increased, reaching ~99.479%, which is quite awesome.
Evaluation
Now that the model has completed the training phase, it can be used for making predictions. Let's evaluate it on the validation and test sets.
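With the evaluate helper sketched earlier, this amounts to:
# Same helper on both hold-out sets (keys are named val_* either way)
val_result = evaluate(model, val_loader)
test_result = evaluate(model, test_loader)
print(val_result)
print(test_result)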
As we can see, the accuracies achieved on the validation and test sets are pretty close and good enough: 'val_acc' ~ 99.479 % and 'test_acc' ~ 98.958 %.
Also, let's look at the predictions for 10 random samples from the test set:
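A small helper for single-sample predictions can look like the sketch below (predict_single and house_names are my naming, and the house-to-index order matches the mapping assumed in the preprocessing sketch):
house_names = ['Gryffindor', 'Hufflepuff', 'Ravenclaw', 'Slytherin']

def predict_single(x, model):
    # Add a batch dimension, run the model, take the most likely class
    out = model(x.unsqueeze(0))
    return torch.argmax(out, dim=1).item()

# Predict houses for 10 random samples from the test set
for i in torch.randperm(len(test_dataset))[:10].tolist():
    x, y = test_dataset[i]
    pred = predict_single(x, model)
    print('target:', house_names[y.item()], '| predicted:', house_names[pred])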
Voilà, the magic happened!
You can find the entire code of this article here.
Conclusion
The aim of this project was to build a machine-learning-powered Harry Potter Sorting Hat that can tell which Hogwarts House you belong to based on given features.
In this experiment, I've implemented a simple classification algorithm, Logistic Regression, using PyTorch. It performs the function of the Sorting Hat. This investigation has shown that we can train algorithms to sort new data (inputs) into predefined categories/classes (outputs).
If you would like a deeper understanding of my experiment, I suggest looking at the source code in my notebook and following the video on the freeCodeCamp channel, "Deep Learning with PyTorch — Free Six Week Course [Part 2]".
Cheers!
P.S. I'm so glad to have been mentioned in the third lecture of "Deep Learning with PyTorch: Zero to GANs", offered by Jovian.ml in collaboration with freeCodeCamp. Thanks! I'll do my best for the community; there are still interesting challenges ahead. Stay tuned!
References
- My Jupyter notebook: https://jovian.ml/ederev/hogwarts-logistic-regression
- Course: zerotogans.com