This is the second blog post of my three-part contribution to the 2017 F# Advent Calendar.
In this article, I make use of character data from the Lord of the Rings mythos to first do some exploratory data analysis and then to construct a model that predicts the Race of a character based on the First Name.
J. R. R. Tolkien fastidiously developed the phonology of the languages associated with each race. For example, Quenya, used by the Elves, lacks vowel harmony and consonant gradation, while Khuzdul, used by the Dwarves, extends each vowel in its speech and writing.
Inspired by the extensive information in the Lord of the Rings Appendices, I wanted to prove that:
Characters of the same race, will have similar names.
Disclaimer: We don’t have enough data for each of the races to create a proper model, but for the sake of gaining some insight from the data, we’ll proceed anyway and check whether we reach that same conclusion.
The races that we’ll be exploring data for are the following: Men (Humans), Dwarves, Elves, Hobbits, and the Maiar.
As for the Maiar: in the Third Age, five of them incarnated as the 5 Wizards; Saruman, Gandalf, and Radagast are probably the most important of these, having played a major role in shaping the future of Middle-earth.
As before, the Data Acquisition process is given at the very end, but the source of the data was the Characters page from the Lord of the Rings wiki, which can be found here.
The output of the data acquisition process is a CSV file that I load into a Deedle Data Frame.
Next, I drop the unnecessary URL column and index the frame by the Name of the character.
Then, to visualize this cleaned data frame, we create a Table via F#’s Charting library.
The entire table is available here.
Distribution of Characters by Race
Let’s start off by examining the distribution of the races. We’ll need a new Data Frame with each race and its number of members: we first segregate the races, get a count of characters per race, and then construct a data frame from those counts.
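Under the hood, the per-race tally is just a count-by over the Race column. A minimal sketch, with a hand-rolled list standing in for the Deedle frame (the names below are illustrative, not the full data set):

```fsharp
// Illustrative (name, race) pairs standing in for the full character data.
let characters =
    [ "Aragorn", "Men"
      "Frodo",   "Hobbits"
      "Gimli",   "Dwarves"
      "Legolas", "Elves"
      "Sam",     "Hobbits" ]

// Count characters per race, most numerous race first.
let raceCounts =
    characters
    |> List.countBy snd
    |> List.sortByDescending snd
```

In the real pipeline the pairs would come from the data frame's Race column, but the counting step is the same.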
Once we have the Data Frame based on the race counts, let’s first peer in and observe it as a table.
Next, let’s produce a chart based on this data.
The interactive chart can be found here.
Now that we’ve observed that there is an unequal number of characters per race, let’s extract exactly the information we’ll need to start the Race Prediction process. For the sake of simplicity, we’ll use just the First Name of each character, and hence we create a new data frame from just this data and drop all the other columns.
We’ll now be creating the model but first, let’s review the algorithm, distance function and all the other accoutrements associated with this step.
The classification algorithm that we’ll be using is K-Nearest Neighbors, a supervised learning algorithm typically used for classification that can also be used for regression.
I’ll go into more detail about the algorithm in a later section, but I chose this algorithm over others because of its simplicity: it has only two parameters to tune, namely the distance metric and the number of neighbors.
Before we move on, I’d like to clean the data even further. I don’t believe we’ll get any valuable information from the 11 records of the Maiar race, so for the sake of doing more with less, we won’t use those records in our model-creation process. It’s sad that this had to be done, as Gandalf is probably one of my favorite characters, but that’s just the nature of model creation: you gotta do what you gotta do.
Once the data is cleaned and ready, the first step is to partition the data into Training and Testing data. We’ll train our model on the Training Data and use the Testing Data to conduct cross validation. Cross validation is the technique of evaluating the efficacy of the models we have created.
Data Partitioning ( Data ) = Training Data + Testing Data
Training Data + Algorithm = Model
Model + Testing Data = Cross Validation
It’s important to randomly partition our data into the Training and Testing data to ensure that we aren’t at the mercy of the distribution of the records. We’ll be splitting our data into the Training and Testing data in a 70–30 ratio.
To partition our data, we’ll be using one of FsLab’s coolest features: The R Provider which allows us to interoperate code and data between R and F#. Using the R Provider, we’ll be installing R’s caret package that contains functions to aid us in the stage of randomly splitting our data into the Training and Test data.
The code given below first installs the caret package and then splits the filtered character data frame into the training and testing data based on randomly assigned but deterministically proportioned indices.
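The original splitting goes through caret via the R Provider; as a rough, plain-F# illustration of the same idea (a seeded shuffle followed by a 70–30 cut; the function name and seed are mine), one could write:

```fsharp
// Shuffle the rows with a seeded RNG so the split is reproducible,
// then take the first 70% for training and the rest for testing.
let partition (seed: int) (ratio: float) (rows: 'a list) =
    let rng = System.Random(seed)
    let shuffled = rows |> List.sortBy (fun _ -> rng.Next())
    let cut = int (ratio * float (List.length rows))
    List.splitAt cut shuffled

// Illustrative 70-30 split over dummy row indices.
let trainingRows, testingRows = partition 42 0.7 [ 1 .. 100 ]
```

caret's createDataPartition additionally stratifies by class, which this naive sketch does not attempt.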
Levenshtein Distance gives us a measure of the similarity between two strings: the distance is the number of edits, namely deletions, insertions, or substitutions, required to transform one string into the other.
For example, the Levenshtein Distance between Fili and Kili is 1 since their names differ by a single character. And for Sméagol and Déagol, the Levenshtein Distance is 2.
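For concreteness, here is a standard dynamic-programming implementation of the Levenshtein distance in F# (my own sketch, not the post's library code):

```fsharp
// d.[i, j] = edits needed to turn the first i chars of a
// into the first j chars of b.
let levenshtein (a: string) (b: string) =
    let m, n = a.Length, b.Length
    let d = Array2D.create (m + 1) (n + 1) 0
    for i in 0 .. m do d.[i, 0] <- i   // i deletions
    for j in 0 .. n do d.[0, j] <- j   // j insertions
    for i in 1 .. m do
        for j in 1 .. n do
            let cost = if a.[i - 1] = b.[j - 1] then 0 else 1
            d.[i, j] <- List.min [ d.[i - 1, j] + 1          // deletion
                                   d.[i, j - 1] + 1          // insertion
                                   d.[i - 1, j - 1] + cost ] // substitution
    d.[m, n]

let dwarves = levenshtein "Fili" "Kili"       // 1: one substitution
let cousins = levenshtein "Sméagol" "Déagol"  // 2: one substitution, one deletion
```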
Knowing about the Levenshtein distance is essential for our analysis: our assumption from the very beginning has been that the races name their characters differently, and therefore names of characters from different races should be a higher Levenshtein distance apart than names from the same race.
The K-Nearest Neighbors algorithm classifies a data point on the basis of the majority class among its ‘K’ closest data points, as measured by some defined distance function.
The algorithm loads all the training points in memory and then, based on the distance function [ in this case the Levenshtein Distance ] and the value of ‘K’, acquires the neighboring points. From that collection of K neighbors, the algorithm determines which class is in the majority, and that becomes the label of the new data point.
This explanation is best supplemented with a diagram I found somewhere on the internets.
For a) Since k = 1, the neighborhood consists of just one negative point, implying that the label of x is negative.
For b) Since k = 2, the neighbor points consist of one positive and one negative point implying that the label of x is unknown.
For c) Since k = 3, the neighbor points consist of two positive and one negative point implying that the label of x is positive.
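Putting the two ideas together, the classification step can be sketched in plain F#, using the Levenshtein distance as the metric. The toy training set below is illustrative, not the post's real data:

```fsharp
// Edit distance used as the k-NN metric.
let levenshtein (a: string) (b: string) =
    let m, n = a.Length, b.Length
    let d = Array2D.create (m + 1) (n + 1) 0
    for i in 0 .. m do d.[i, 0] <- i
    for j in 0 .. n do d.[0, j] <- j
    for i in 1 .. m do
        for j in 1 .. n do
            let cost = if a.[i - 1] = b.[j - 1] then 0 else 1
            d.[i, j] <- List.min [ d.[i - 1, j] + 1; d.[i, j - 1] + 1; d.[i - 1, j - 1] + cost ]
    d.[m, n]

// Classify a name by majority vote among its k nearest training names.
let classify k (training: (string * string) list) (name: string) =
    training
    |> List.sortBy (fun (trainName, _) -> levenshtein name trainName)
    |> List.truncate k
    |> List.countBy snd    // votes per race
    |> List.maxBy snd      // majority race (ties broken arbitrarily)
    |> fst

// Toy training set of (first name, race) pairs.
let training =
    [ "Fili", "Dwarves"; "Kili", "Dwarves"; "Gloin", "Dwarves"
      "Frodo", "Hobbits"; "Drogo", "Hobbits"; "Dudo", "Hobbits" ]

let predicted = classify 3 training "Bili"  // Fili and Kili are nearest → "Dwarves"
```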
Now that we have the training and testing data separated, it’s time to finally create the model. We’ll be using Accord.NET’s implementation of the K-Nearest Neighbors algorithm the source code of which can be found here.
The inputs to the K-Nearest Neighbors object are an array of strings as the features and an array of integers as the labels. This classification task falls under the branch of Multiclass Classification since we are dealing with 3 or more races.
Conversely, when the classification involves determining which of 2 classes a data point belongs to, such as trying to discern whether a character is good or evil, it is a Binary Classification and the output values will be either 0 or 1.
Now, it’s time to generalize the training step based on parameter ‘K’ i.e. the number of nearest neighbors to consider while classifying a new data point.
Let’s take a model with k = 5 on a test drive by first creating a function, like R’s predict, that takes a trained model and the parameters to test on.
I am proud to have a Hobbit name! Exactly in line with my definition of a good life: Good ale, A cornucopia of good food and great company!
Once we have trained our model, we want to cross-validate its accuracy on the test data, both to figure out the optimal value of K that gets us the best results and, more generally, to see how our model has performed.
We make use of the Generalized Confusion Matrix which, given the model and the testing inputs and outputs, gives us the accuracy among other properties of the trained model, specifically for multiclass classifications. Like the previous step, we generalize the creation of the confusion matrix over some ‘K’.
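The essence of that evaluation, stripped of the Accord.NET types, is counting (actual, predicted) pairs and reading the accuracy off the matching ones. A hand-rolled sketch (function names and labels are mine):

```fsharp
// Confusion matrix as counts of (actual, predicted) label pairs.
let confusionCounts (actual: 'l list) (predicted: 'l list) =
    List.zip actual predicted |> List.countBy id

// Overall accuracy: fraction of predictions that match the actual label.
let accuracy (actual: 'l list) (predicted: 'l list) =
    let correct =
        List.zip actual predicted
        |> List.filter (fun (a, p) -> a = p)
        |> List.length
    float correct / float (List.length actual)

let actual    = [ "Men"; "Elves"; "Men"; "Hobbits" ]
let predicted = [ "Men"; "Elves"; "Elves"; "Hobbits" ]
let acc = accuracy actual predicted  // 0.75: 3 of 4 correct
```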
Now, let’s create a range of values of K and plot the accuracies and errors with respect to these values of k.
Both the Accuracy-to-K and Error-to-K charts show that we get the highest accuracy at low values of K. Sure, we get ~98.92% accuracy at K = 2, but that metric isn’t enough; this isn’t the best place for our model to be, because it means the model fits best to only 1 or 2 nearest points. This implies that the model hews extremely close to the training data and has high variance and low bias; in other words, this model suffers from the problem of overfitting.
This can be attributed to the fact that we don’t have many, equally distributed data points per class [ ~756 in total aren’t enough ]. Unfortunately, we can’t do much about the lack of characters. As mentioned before, this result was somewhat expected.
The process of acquiring the character data required a lot more time than the other cases, simply because of the sheer number of characters I wanted information for. The source of the data, as mentioned before, was the Lord of the Rings Wiki which, unfortunately, didn’t have that reliable of an API.
I started off by defining the schema for the data; the corresponding domain model I’d like the data to conform to consisted of the following:
- Character Name
- Url of Character Page
- Race of the Character
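That schema maps naturally onto an F# record. The field names below are my guesses at the post's CharacterInfo type, with Race made optional since it is only discovered later from each character's own page (the URL is illustrative):

```fsharp
// A guess at the domain model; Race starts out unknown while scraping.
type CharacterInfo =
    { Name: string
      Url: string
      Race: string option }

let aragorn = { Name = "Aragorn"; Url = "/wiki/Aragorn"; Race = Some "Men" }
```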
Getting the data involved the following 2 steps:
- Getting all the Character URLs from the 5 Character Pages
- Discerning the Race of the Character from the HTML
Getting All the Character URLs
The URLs of all the characters were available on the 5 characters pages each of which I created an HTML Type Provider for using FSharp.Data’s HtmlProvider type.
After creating a HtmlProvider for each of the 5 character pages, I wrote out all the lists associated with the HTML data, one per letter of the alphabet. I then developed a function to flatten the resulting sequence of sequences of incomplete CharacterInfo records, since there were multiple alphabetically grouped lists per page.
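The flattening function itself is a one-liner over sequences; something along these lines (shown here with plain strings instead of CharacterInfo records):

```fsharp
// The per-letter lists arrive as a sequence of sequences;
// flattening them is a single Seq.concat.
let flatten groups = Seq.concat groups

let names = flatten [ [ "Aragorn"; "Arwen" ]; [ "Boromir" ] ] |> List.ofSeq
```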
At this point, I was pretty much set up to start figuring out how to determine the Race of a character from the markup behind the URL.
Discerning the Race of a Character from the HTML
Figuring this part out was the most challenging and required quite a bit of trial and error. The way I ended up determining the race of a character was via a CSS class on the aside tag of a character’s introductory table. This will make a lot more sense with an example: consider Aragorn’s LOTR Wikia Page, which can be found here.
In the aside tag, Aragorn’s theme is theme-Men-Dunedain, the CSS class for all Dúnedain Men descended from the Númenóreans. I found all the possible values of this class for our pre-decided races before jumping in.
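A sketch of pulling the race out of such a class string; the assumption (mine) is that the class attribute is a space-separated token list and the race is the segment immediately after theme-:

```fsharp
// Find the "theme-..." token in a class attribute and take the race
// segment right after "theme-" (e.g. "theme-Men-Dunedain" → "Men").
let raceFromClasses (classAttr: string) =
    classAttr.Split(' ')
    |> Array.tryFind (fun token -> token.StartsWith "theme-")
    |> Option.map (fun theme -> theme.Split('-').[1])

let aragornRace = raceFromClasses "portable-infobox theme-Men-Dunedain"  // Some "Men"
```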
Similarly, for the case of Frodo who is a Hobbit:
Once I had a list of all the themes for Humans, Dwarves, Elves, Maiar and Hobbits, I enumerated the URLs from step 1, completed the once-incomplete Record types, and persisted them as a CSV file.
Whew! That was quite the adventure. The code for the data acquisition can be found here. The first few rows of the finalized data set looks like:
The code for this blogpost can be found here. As always, please let me know if you have any questions. Despite the model not working out as expected, learning how to use the libraries and tools used while creating this blogpost was a great experience and I’ll be more than happy to spread that knowledge even further.
Feel free to make full use of the data associated with the blogpost that can be found here.
I’d like to thank my professor, Ernst Henle for helping me solidify some of the Machine Learning concepts in this blog post.