# Introduction

This blogpost is the last one in a series of three blogposts as my contribution to 2017’s FSharp Advent Calendar.

In this article, I’ll be exploring data related to the Lord of the Rings Books and Scene-by-Scene Character Interactions from the Movies, and then finally try to answer the following question:

Which relationship among the members of the Fellowship in The Lord of the Rings was the best one?

The relationships we’ll be considering are:

1. Frodo and Sam
2. Merry and Pippin
3. Gimli and Legolas

We’ll decide which relationship is the best one by picking the relationship with the highest score, defined as 100 times the sum of:

1. The number of chapters in which all members of a relationship are mentioned together, divided by the Total Number of Chapters.
2. The number of Movie Scenes in which all members of a relationship are present, divided by the Total Number of Movie Scenes.

Therefore, the Relationship Score of relationship ‘i’ is given by:

$$\text{Score}_i = 100 \times \left( \frac{b_i}{n} + \frac{m_i}{p} \right)$$

where:

$b_i$ is the number of chapters in which all members of relationship ‘i’ are mentioned together.

$m_i$ is the number of movie scenes in which all members of relationship ‘i’ are present.

$n$ is the total number of chapters.

$p$ is the total number of movie scenes.
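To make the arithmetic concrete, here is a minimal sketch of the score as an F# function. The function name `relationshipScore` and the example counts are my own assumptions for illustration and are not taken from the data set.

```fsharp
// Relationship score: 100 * (b / n + m / p), following the definitions above.
// The parameter names mirror the symbols used in the formula.
let relationshipScore (b: int) (n: int) (m: int) (p: int) : float =
    if n <= 0 || p <= 0 then invalidArg "n/p" "Totals must be positive."
    100.0 * (float b / float n + float m / float p)

// Illustrative numbers only: both characters appear together in
// 5 of 20 chapters and in 3 of 12 scenes.
let exampleScore = relationshipScore 5 20 3 12
printfn "%.1f" exampleScore   // 100 * (0.25 + 0.25) = 50.0
```

Because both terms are fractions between 0 and 1, the score always falls between 0 and 200.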

Along the way, I discovered it was pretty much irresistible to analyze the data some more and so, I took a detour from the main answer and answered the following questions in an effort to better utilize the data:

1. What does the Distribution of the Word Count of the Lord of the Rings books look like?
2. Who are the Top 3 Most Important Characters in the Books?
3. Who are the Top 3 Most Important Characters in the Movies?
4. Which is the most Dominant Race in the Movies?

Like the blogposts before this one, I include a section on how I acquired the data at the very end.

Let’s get started!

# Setting up

The result of the Data Acquisition process was two files: one was the JSONized version of the Lord of the Rings Book Series data and the second was a CSV file of the Scene-by-Scene Character Interactions from the Movie series that I acquired from the internets.

## Book Data Extraction

Our domain model consists of a Discriminated Union of all the Book Names and the BookData Record Type that we used to serialize the data in the Data Acquisition process.

Additionally, we include the ChapterBasedCharacterInteraction Record Type that links each chapter to all the characters mentioned in it that’ll be used later in our analysis.

Reading the data in was an easy step. Since we used the same BookData Record Type as in the Data Acquisition step, all we do in this step is deserialize the book data from raw string form to an array of the BookData record type using Newtonsoft.Json.

The next step was to split the entire Lord of the Rings series into the 3 books based on the Discriminated Union of the Book Names we defined before.
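As a rough sketch, the domain model and the per-book split could look like the following. Only the type names and the `ChapterData` field come from this post; every other field name, the `splitByBook` helper and the sample values are assumptions made for illustration, and the deserialization step is left out.

```fsharp
// Hypothetical shape of the domain model; apart from ChapterData,
// the field names are assumptions.
type BookName =
    | TheFellowshipOfTheRing
    | TheTwoTowers
    | TheReturnOfTheKing

type BookData =
    { BookName    : BookName
      ChapterName : string
      ChapterData : string }

type ChapterBasedCharacterInteraction =
    { Chapter    : string
      Characters : Set<string> }

// Split the whole series into one array of chapters per book.
let splitByBook (allChapters: BookData[]) : Map<BookName, BookData[]> =
    allChapters
    |> Array.groupBy (fun chapter -> chapter.BookName)
    |> Map.ofArray

// Placeholder chapter text; the real ChapterData holds the full chapter.
let sample =
    [| { BookName = TheFellowshipOfTheRing
         ChapterName = "A Long-expected Party"
         ChapterData = "placeholder text" }
       { BookName = TheTwoTowers
         ChapterName = "The Departure of Boromir"
         ChapterData = "placeholder text" } |]

let byBook = splitByBook sample
printfn "%d" byBook.[TheFellowshipOfTheRing].Length
```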

Here is what the JSONized Book Data looks like:

## Scene-by-Scene Movie Data Extraction

Since we are dealing with a CSV file, we get the benefit of using the CsvProvider from FSharp.Data which is nothing short of awesome!

From here on, I figured we could benefit from some non-functional collections defined in the .NET Base Class Library; I did this to demonstrate that despite the fact that F# is a functional-first language, it can also be an equally effective Imperative and / or Object Oriented language.

Therefore, we are taking the contents of the Scene-by-Scene Movie data and creating a dictionary out of it with the Scene Name as the Key and a Hash Set of the characters in that scene as the value.
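A minimal sketch of that dictionary-building step is shown below. To keep it self-contained, the rows are hard-coded (scene, character) pairs rather than rows read through the type provider; the `buildSceneIndex` name and the sample rows are assumptions.

```fsharp
open System.Collections.Generic

// Each row is (scene name, character). In the post the rows come from the
// CSV file; here they are made-up values.
let buildSceneIndex (rows: seq<string * string>) : Dictionary<string, HashSet<string>> =
    let index = Dictionary<string, HashSet<string>>()
    for (scene, character) in rows do
        match index.TryGetValue(scene) with
        | true, characters ->
            // Scene already known: add the character (duplicates are ignored).
            characters.Add(character) |> ignore
        | false, _ ->
            let characters = HashSet<string>()
            characters.Add(character) |> ignore
            index.[scene] <- characters
    index

let sceneIndex =
    buildSceneIndex
        [ "Scene 1", "Frodo"
          "Scene 1", "Sam"
          "Scene 2", "Aragorn"
          "Scene 1", "Frodo" ]

printfn "%d characters in Scene 1" sceneIndex.["Scene 1"].Count
```

The HashSet takes care of repeated rows for the same character in the same scene, so no explicit de-duplication is needed.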

The raw data looks like:

# Detours

We plan to take a bunch of detours from answering the main question of this blogpost simply because the data we have at this point is in good shape to conduct some supplementary analysis.

## Detour #1: What does the Distribution of the Word Count of the Lord of the Rings books look like?

To answer this first question, we’ll need a collection of individual words per Book. For this we start off by taking the ChapterData field in the BookData record type and cleaning it by splitting the data on some appropriate delimiters and then removing some non-alphabetical characters.

Once we have an ugly looking function in place that handles all the splitting and converting the huge string of ChapterData into nicely split up chunks of individual words, we can proceed to get the word counts on a book by book basis.
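Here is a small sketch of what such a splitting function might look like. The delimiter list and the names `delimiters`, `splitIntoWords` and `wordCount` are assumptions, since the exact set of delimiters isn't reproduced in this post.

```fsharp
open System

// Delimiters are an assumption; the original list is not shown in this post.
let delimiters = [| ' '; '\n'; '\r'; '\t'; ','; '.'; ';'; ':'; '!'; '?'; '"'; '('; ')' |]

// Split raw chapter text into chunks, strip non-alphabetical characters
// from each chunk, and drop anything that ends up empty.
let splitIntoWords (text: string) : string[] =
    text.Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
    |> Array.map (fun word ->
        let letters = word |> Seq.filter (fun c -> Char.IsLetter c) |> Seq.toArray
        System.String(letters))
    |> Array.filter (fun word -> word.Length > 0)

// Word count for a piece of text.
let wordCount (text: string) : int =
    splitIntoWords text |> Array.length

printfn "%d" (wordCount "Frodo, Sam and Gandalf -- left!")
```

Note that this simple approach glues together hyphenated or apostrophised words (for example, "Long-expected" becomes "Longexpected"), which may or may not be what you want.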

We then proceed to create a Deedle Data Frame of the Book and Word Counts in an effort to chart the distribution.

The chart looks like:

And can be interacted with here.

### Statistics from the Book-Based Word Counts

Next, let’s use the Stats portion of the Deedle library to get the following Descriptive Measures:

1. Mean of Word Counts
2. Standard Deviation of Word Counts
3. Book with Max Word Counts
4. Book with Min Word Counts

And the results are:

1. Mean of Word Counts: 156,220 words
2. Standard Deviation of Word Counts: 22,427.095 words
3. Book with Max Word Counts: 180,076 for The Fellowship Of The Ring
4. Book with Min Word Counts: 135,566 for The Return Of The King
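These measures can be sanity-checked by hand. The sketch below deliberately swaps Deedle for plain F# list functions so it runs without any packages. One loud assumption: the word count for The Two Towers isn't quoted above, so I back it out from the reported mean (3 × 156,220 minus the other two counts).

```fsharp
// Word counts per book. The first and last values are quoted above; the
// Two Towers value is inferred from the reported mean and is an assumption.
let wordCounts =
    [ ("The Fellowship Of The Ring", 180076.0)
      ("The Two Towers",             153018.0)
      ("The Return Of The King",     135566.0) ]

let values = wordCounts |> List.map snd
let mean = List.average values

// Sample standard deviation (divides by n - 1).
let stdDev =
    let sumOfSquares = values |> List.sumBy (fun v -> (v - mean) ** 2.0)
    sqrt (sumOfSquares / float (List.length values - 1))

let maxBook = wordCounts |> List.maxBy snd
let minBook = wordCounts |> List.minBy snd

printfn "Mean: %.0f, Std Dev: %.1f" mean stdDev
printfn "Max: %s, Min: %s" (fst maxBook) (fst minBook)
```

With the inferred middle value, the sample standard deviation comes out at roughly 22,427.1, which lines up with the figure reported above.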

## Detour #2: Who are the Top 3 Most Important Characters in the Books?

Our next task is to extract Character Mentions from the books, for which we define a new Record Type called “CharacterMentions” that gives us the Character Name and the number of times that character has been mentioned in the chapters.

This process involves using a list with preselected central characters and their respective aliases. We use that list to search for the existence of the character mentions in the entire book and get the count of all the chapters that do contain the character names via a function called characterMentions.

We make use of the characterMentions function to go through all our preselected Characters in all chapters of the Lord of the Rings Book Series and pick out the ones with the highest counts.
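A hedged sketch of this step follows. The record shape matches the output shown in this section, but the body of `characterMentions`, the `topCharacters` helper and the sample words are assumptions; in particular, this version counts word occurrences of a name, which is only one possible reading of the description above.

```fsharp
open System

type CharacterMentions =
    { CharacterName     : string
      CharacterMentions : int }

// Count how often a name occurs in an array of already split words.
let characterMentions (words: string[]) (name: string) : CharacterMentions =
    let count =
        words
        |> Array.filter (fun w -> w.Equals(name, StringComparison.OrdinalIgnoreCase))
        |> Array.length
    { CharacterName = name; CharacterMentions = count }

// Rank the preselected characters and keep the top few.
let topCharacters (howMany: int) (words: string[]) (names: string list) =
    names
    |> List.map (characterMentions words)
    |> List.sortByDescending (fun c -> c.CharacterMentions)
    |> List.truncate howMany

// Made-up words, only to show the shape of the result.
let sampleWords =
    [| "Frodo"; "and"; "Sam"; "met"; "Gandalf"; "Frodo"; "smiled"; "Sam"; "Frodo" |]

let top2 = topCharacters 2 sampleWords [ "Gandalf"; "Sam"; "Frodo" ]
printfn "%A" top2
```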

`[{CharacterName = "Frodo"; CharacterMentions = 1980;};   {CharacterName = "Sam"; CharacterMentions = 1321;}; {CharacterName = "Gandalf"; CharacterMentions = 1117;}; ... ]`

Note, none of the aliased names showed up even close to the top 3 so I didn’t take the time to write some aggregation logic.

The top answer of this question was a fairly obvious one. From the very beginning Frodo has been tasked to destroy the One Ring and the entire book involves different stages in aiding his journey.

## Detour #3: Who are the Top 3 Most Important Characters in the Movies?

Now that we have got the answer from the perspective of the Book Data, let’s take a look at what the Movie data gets us.

From our previously created Dictionary keyed by the Scene Name and with Value of a HashSet of Characters in that scene, we extract a list of all the scenes for a specific character and then get the size of that filtered collection to get the number of scenes with that character.
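Roughly, the counting could be sketched like this. The helper `sceneCountFor` and the tiny sample dictionary are assumptions for illustration; only the name `getAllCharacterCounts` appears in the output of this section.

```fsharp
open System.Collections.Generic

// Number of scenes whose character set contains the given character.
let sceneCountFor (scenes: Dictionary<string, HashSet<string>>) (character: string) : int =
    scenes
    |> Seq.filter (fun entry -> entry.Value.Contains(character))
    |> Seq.length

// Scene counts for a list of characters, highest first.
let getAllCharacterCounts
        (scenes: Dictionary<string, HashSet<string>>)
        (characters: string list) : (string * int) list =
    characters
    |> List.map (fun c -> (c, sceneCountFor scenes c))
    |> List.sortByDescending snd

// Made-up scenes, only to show the shape of the result.
let sampleScenes = Dictionary<string, HashSet<string>>()
sampleScenes.["Scene 1"] <- HashSet<string>([ "Frodo"; "Sam" ])
sampleScenes.["Scene 2"] <- HashSet<string>([ "Aragorn"; "Frodo" ])
sampleScenes.["Scene 3"] <- HashSet<string>([ "Aragorn"; "Frodo" ])

let counts = getAllCharacterCounts sampleScenes [ "Sam"; "Aragorn"; "Frodo" ]
printfn "%A" counts
```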

`val getAllCharacterCounts : (string * int) list =[("Aragorn", 61); ("Frodo", 57); ("Sam", 55); ("Gandalf", 51);("Pippin", 47); ("Gimli", 46); ("Merry", 44); ("Legolas", 34);("Gollum", 21); ("Boromir", 15); ("Faramir", 15); ("Saruman", 10)]`

Seems like the data for the Movies gives us different results than the Books. It turns out that Aragorn is the main character of the Movie Series, with Frodo and Sam coming 2nd and 3rd respectively in terms of occurrence.

## Detour #4: Which is the most Dominant Race in the Movies?

Since we have a column of Race of the Characters for all Scenes as a part of the Scene-by-Scene Movie data, let’s find the most dominant race, i.e. the race with the most representation in the Movies.

We do this by considering the major races: Hobbit, Men, Dwarf and Elf and then adding their counts.
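One way to sketch this aggregation is with `Seq.countBy`. The rows below are made-up (character, race) pairs standing in for the CSV rows, and the function body is an assumption about how the counting might be done.

```fsharp
// Each row is (character, race) for one scene appearance; rows are made up.
let mostDominantRace (rows: seq<string * string>) : string * int =
    let majorRaces = set [ "Hobbit"; "Men"; "Dwarf"; "Elf" ]
    rows
    |> Seq.map snd
    |> Seq.filter (fun race -> Set.contains race majorRaces)   // keep major races only
    |> Seq.countBy id                                          // (race, count) pairs
    |> Seq.maxBy snd                                           // race with the highest count

let sampleRows =
    [ ("Frodo", "Hobbit")
      ("Aragorn", "Men")
      ("Boromir", "Men")
      ("Gimli", "Dwarf")
      ("Someone", "Other") ]

let dominant = mostDominantRace sampleRows
printfn "%A" dominant
```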

`val mostDominantRace : string * int = ("Men", 240) `

# Best Relationship Score Calculation

Alright, it’s finally time to get an answer for the main question of this blogpost. Let’s first start by defining all the relationships we care about and a new type called “RelationshipScore” that is a tuple of the two characters in the relationship and the relationship score.

Let’s compute the two components of the score defined in the Introduction, one at a time.

1. The number of chapters in which all members of a relationship are mentioned together, divided by the Total Number of Chapters.

We first start off by populating a list of the previously defined ChapterBasedCharacterInteractions Record Type for all chapters in the book.

We do this by splitting each chapter into words and then checking for the existence of any of the members of the character list in that chapter. We accumulate an array of the ChapterBasedCharacterInteractions based on the Book data given, so each item in that collection holds the characters mentioned in one chapter.

Once we have this ChapterBasedCharacterInteractions collection, we check for the existence of both members of a relationship for all relationships. We then proceed to compute the relationship score by dividing the count of common chapter mentions of a relationship by the total number of chapters.
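A minimal sketch of this calculation, assuming each chapter's characters are stored in a set, could look like the following. The record fields, the `chapterScore` name and the sample chapters are all assumptions.

```fsharp
// Hypothetical record shape; the field names are assumptions.
type ChapterBasedCharacterInteraction =
    { Chapter    : string
      Characters : Set<string> }

// Fraction of chapters in which both members of a relationship are mentioned.
let chapterScore
        (chapters: ChapterBasedCharacterInteraction list)
        (first: string, second: string) : string * string * float =
    let together =
        chapters
        |> List.filter (fun c -> c.Characters.Contains(first) && c.Characters.Contains(second))
        |> List.length
    (first, second, float together / float (List.length chapters))

// Made-up chapters, only to show the shape of the result.
let sampleChapters =
    [ { Chapter = "Chapter A"; Characters = set [ "Frodo"; "Sam"; "Gandalf" ] }
      { Chapter = "Chapter B"; Characters = set [ "Merry"; "Pippin" ] }
      { Chapter = "Chapter C"; Characters = set [ "Frodo" ] }
      { Chapter = "Chapter D"; Characters = set [ "Gimli"; "Legolas"; "Sam" ] } ]

let frodoSamChapters = chapterScore sampleChapters ("Frodo", "Sam")
printfn "%A" frodoSamChapters
```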

`[("Gimli", "Legolas", 0.1329787234);  ("Merry", "Pippin", 0.2234042553); ("Frodo", "Sam", 0.25)]`

Awesome! We have one piece of our two piece puzzle.

2. The number of Movie Scenes in which all members of a relationship are present, divided by the Total Number of Movie Scenes.

Similar to the previous calculation, we find all the scenes in which both characters of a relationship are present, count them, and divide that count by the total number of scenes.
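The movie-side calculation can be sketched against the scene dictionary in much the same way. The `sceneScore` name and the sample scenes are assumptions for illustration.

```fsharp
open System.Collections.Generic

// Fraction of scenes in which both members of a relationship are present.
let sceneScore
        (scenes: Dictionary<string, HashSet<string>>)
        (first: string, second: string) : string * string * float =
    let together =
        scenes.Values
        |> Seq.filter (fun cast -> cast.Contains(first) && cast.Contains(second))
        |> Seq.length
    (first, second, float together / float scenes.Count)

// Made-up scenes, only to show the shape of the result.
let movieScenes = Dictionary<string, HashSet<string>>()
movieScenes.["Scene 1"] <- HashSet<string>([ "Frodo"; "Sam" ])
movieScenes.["Scene 2"] <- HashSet<string>([ "Merry"; "Pippin"; "Frodo" ])
movieScenes.["Scene 3"] <- HashSet<string>([ "Gimli"; "Legolas" ])
movieScenes.["Scene 4"] <- HashSet<string>([ "Frodo"; "Sam"; "Gollum" ])

let frodoSamScenes = sceneScore movieScenes ("Frodo", "Sam")
printfn "%A" frodoSamScenes
```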

`[("Gimli", "Legolas", 0.1595744681);  ("Merry", "Pippin", 0.170212766);  ("Frodo", "Sam", 0.2446808511)]`

Finally, we compute the sum of the results from 1. and 2. and multiply it by 100 to get our final scores.
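Using the two lists of ratios shown above as inputs, the final combination can be sketched as follows. The way the lists are joined (a Map keyed by the pair of names) is my own assumption and not necessarily how the original code does it.

```fsharp
type RelationshipScore = string * string * float

// Ratios copied from the two intermediate results above.
let bookScores =
    [ ("Gimli", "Legolas", 0.1329787234)
      ("Merry", "Pippin", 0.2234042553)
      ("Frodo", "Sam", 0.25) ]

let movieScores =
    [ ("Gimli", "Legolas", 0.1595744681)
      ("Merry", "Pippin", 0.170212766)
      ("Frodo", "Sam", 0.2446808511) ]

// Add the two ratios per relationship, scale by 100, and sort highest first.
let relationshipScores : RelationshipScore list =
    let movieLookup =
        movieScores
        |> List.map (fun (a, b, score) -> ((a, b), score))
        |> Map.ofList
    bookScores
    |> List.map (fun (a, b, bookScore) ->
        (a, b, 100.0 * (bookScore + movieLookup.[(a, b)])))
    |> List.sortByDescending (fun (_, _, score) -> score)

let bestRelationship : RelationshipScore = List.head relationshipScores
printfn "%A" bestRelationship
```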

`val relationshipScores : RelationshipScore list = [("Frodo", "Sam", 49.46808511); ("Merry", "Pippin", 39.36170213); ("Gimli", "Legolas", 29.25531915)]`

`val bestRelationship : RelationshipScore = ("Frodo", "Sam", 49.46808511)`

The relationship between Frodo and Sam is undoubtedly the best one, and we quantitatively proved it! The sheer loyalty Sam showed Frodo from the very beginning of their peril-induced journey to destroy the Ring till the very end of both the book and the movie series cannot be matched by any other character anywhere. It was this true friendship, teamwork and trust that eventually saved Middle Earth and, in my opinion, Sam is the real hero of the series and not Frodo.

# Data Acquisition

The data acquisition process to get the Book data involved getting my hands on a text friendly version of the books and then adding some identifiable characters as Chapter delimiters. In this case, it was < > to denote chapter names and a ~ after the name of the Book to specify the chapter name like:

`< The Fellowship Of The Ring ~ A Long-expected Party >`
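For illustration, a delimiter line of this shape could be parsed with a small helper like the one below. The function name `tryParseChapterHeader` is an assumption; the original code may well handle this differently.

```fsharp
// Parses a delimiter line such as "< Book ~ Chapter >" into (book, chapter).
// Returns None for lines that do not match that shape.
let tryParseChapterHeader (line: string) : (string * string) option =
    let trimmed = line.Trim()
    if trimmed.StartsWith("<") && trimmed.EndsWith(">") then
        let inner = trimmed.Substring(1, trimmed.Length - 2)
        match inner.Split('~') with
        | [| book; chapter |] -> Some (book.Trim(), chapter.Trim())
        | _ -> None
    else
        None

let header = tryParseChapterHeader "< The Fellowship Of The Ring ~ A Long-expected Party >"
printfn "%A" header
```

Returning an option keeps ordinary lines of chapter text (which yield None) cleanly separated from the header lines.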

Once I had cleaned up all the book and chapter names, the next step was to define the record type and create the BookData array to serialize and save to a JSON file using the Newtonsoft.Json library.

As mentioned before, I acquired the Movie data from the interwebs; I believe the original link was from here but I got the data based on this discussion on GitHub. Either way, thank you to the person / people who put this data out!

# Conclusion

As always, writing up this blogpost was a tremendous amount of fun! The code for this blogpost can be found here. As always, my data is your data and can be found here. I have also added a JSON version of the Hobbit book data.

Please let me know if you have any questions. I realize that some of the ways I did the aggregation weren’t functionally pure and there could have been a better way of doing the computations; I’d greatly appreciate any suggestions / feedback!

This is my last blogpost for my contribution and I want to thank you for your attention this far! This has been a great experience and I have learnt a lot more things in both the world of F# and Data Science by doing this project. Happy Holidays!


C#, Python, F# and Data Science Fanatic. Data Science @ UWisconsin. ECE @ Carnegie Mellon Grad. Formerly in FinTech.