[ The Lord of the Rings: An F# Approach ] The Path of the Hobbits
I am extremely excited to present my contribution to 2017’s FSharp Advent Calendar. This blogpost is the first in a series of three that involves acquiring and analyzing data from the Lord of the Rings mythos.
In this blogpost, I walk through how I acquired, explored and analyzed data on the Lord of the Rings and the Hobbit movie series using F#. The goal of this post is to offer a completely unbiased answer to the following question:
Which Movie Series is better: The Lord of the Rings or The Hobbit Series?
According to a plethora of online forums and my experience speaking to other ardent fans, the answer to this question is extremely clear: The Lord of the Rings was, undoubtedly, the better movie series. As an aspiring Data Scientist, I want to test quantitatively whether this conjecture holds.
The three features I’ll be using to answer the question:
1. Average Score of Each of the Movie Series from Rotten Tomatoes.
2. The Return on Investment Calculated on the Basis of the Box Office Revenue and Film Production Cost.
3. Percentage of Academy Award Wins based on the Nominations.
For those interested in the Data Acquisition Process, I have a section at the very end going over the herculean task of acquiring and cleaning the data for the analysis [ as in most cases, this step was where most of the time was spent ].
Now, without further ado:
To conduct this analysis, we’ll be using FsLab, a conglomerate of Data Science libraries that makes data access, analysis and visualization using F# extremely simple.
The result of the Data Acquisition process was a CSV file with all the movie data, which I load into a Deedle Data Frame, akin to a data frame in the R and Python worlds. Deedle is an exceptional exploratory data analysis library developed by the good people of BlueMountain Capital and other contributors.
The Structure of the CSV Data is:
Using Deedle, in fewer than 20 lines, we are able to:
- Load all the data in the CSV file into the Deedle Data Frame [Line 9]
- Set the index of the data frame to that of the Name [Line 12]
- Extract the pertinent series namely The Lord of the Rings and The Hobbit Series [Line 16]
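The three steps above can be sketched roughly as follows. This is a minimal sketch, not the post's original script: the file name, the row names, the extra column names and every number are placeholders written to a temporary file so that the snippet runs on its own. Only the `Name` index column comes from the description above.

```fsharp
#r "nuget: Deedle"

open System.IO
open Deedle

// Placeholder CSV so the sketch is self-contained. The rows, the numbers and
// the two numeric column names are made up; they are not the post's data.
let csvPath = Path.Combine(Path.GetTempPath(), "movies-sketch.csv")

let sampleLines =
    [ "Name,BudgetInMillions,BoxOfficeRevenueInMillions"
      "Series A,100,300"
      "Series B,200,400"
      "Some Single Movie,50,75" ]

File.WriteAllLines(csvPath, sampleLines)

// Load the CSV and key the rows by the Name column.
let allMovies : Frame<string, string> =
    Frame.ReadCsv(csvPath)
    |> Frame.indexRowsString "Name"

// Keep only the two rows of interest (the two series, in the real analysis).
let movieSeries =
    allMovies |> Frame.sliceRows [ "Series A"; "Series B" ]

printfn "Rows kept: %d" movieSeries.RowCount
```

Indexing by name first means later steps can refer to a series by its row key instead of by position.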
The Data Frame looks like the following, similar to the CSV but with a lot more magic associated with it. This data frame can be viewed here.
Feature 1: Average Score from Rotten Tomatoes
Rotten Tomatoes is an American review aggregation website for film and television reviews. Scores are determined separately by top critics and by the audience, and are out of 100.
Let’s first create a Data Frame with the average Rotten Tomatoes critic scores for both series. From the Movie Series Data Frame we created during the setup phase, we slice the RottenTomatoesScore column and get back a new data frame with only that column.
We can clearly observe that the Lord of the Rings Series has a higher average Rotten Tomatoes score. That’s one point to team Lord of the Rings and zilch for Team Hobbit.
The Lord of the Rings: 1
The Hobbit: 0
Feature 2: Return on Investment
Next, let’s compare the Return on Investment of the two Film Series based on the production budget and the box office revenue.
As before, let’s start off by creating a new Deedle data frame with just the Production Budget and the Box Office Revenue by slicing those columns from the Movie Series Data frame.
We then compute and add the profit series, which is the Box Office Revenue minus the Production Budget, in an effort to chart out the Revenue, Cost and Profit. At this point, we also define the ROI series before we clean up the column names from their raw form.
The ROI series is defined as the Profit divided by the cost, multiplied by 100 to convert it to a percentage.
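The profit and ROI definitions can be written out as two small functions. This is a sketch in plain F# rather than the Deedle series arithmetic used in the analysis; the function names and the sample figures are made up for illustration.

```fsharp
// Profit is revenue minus cost; ROI is profit over cost, as a percentage.
let profit (revenue: float) (cost: float) = revenue - cost

let roiPercent (revenue: float) (cost: float) =
    profit revenue cost / cost * 100.0

// Made-up figures in millions, not the post's data.
let exampleRevenue = 300.0
let exampleCost = 100.0

printfn "Profit: %.1f, ROI: %.1f%%"
    (profit exampleRevenue exampleCost)
    (roiPercent exampleRevenue exampleCost)
```

An ROI of 0% means the films only broke even, so anything in the hundreds of percent is a very large return.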
F# has an awesome Charting library that we take advantage of by creating a Column based chart of the Revenue, Cost and Profit.
You can interact with the chart here.
And then, generate a chart of the Return on Investment.
And the interactive chart is available here.
Another win for the Lord of the Rings Series. A point worth noting: regardless of which movie series won in this category, those Returns on Investment are ridiculously high, a testament to the amazing acumen of Peter Jackson’s film crew. As a comparative benchmark, the S&P 500 has returned roughly 10% per year on average over the past 90 years.
The Lord of the Rings: 2
The Hobbit: 0
Feature 3: Academy Award Wins to Nominations
Finally, our last feature is the Percentage of Academy Award Wins based on the Nominations. As before, we create a new data frame with the Nominations and Wins for both film series. The losses are the Nominations minus the Wins, which we add as a new column to the Academy Award Data Frame.
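The losses and win percentage can be sketched in plain F# as below. The type and function names are assumptions made for illustration, and the analysis itself does this with Deedle columns. The only figures used are the 11 nominations and 11 wins for The Return of the King, which is noted further down in this post.

```fsharp
type AwardTally = { Nominations: int; Wins: int }

// Losses are simply the nominations that did not turn into wins.
let losses tally = tally.Nominations - tally.Wins

// Percentage of nominations that were won; 0 when there were no nominations.
let winPercent tally =
    if tally.Nominations = 0 then 0.0
    else float tally.Wins / float tally.Nominations * 100.0

// The Return of the King: nominated in 11 categories, won all 11.
let returnOfTheKing = { Nominations = 11; Wins = 11 }

printfn "Losses: %d, win rate: %.1f%%"
    (losses returnOfTheKing)
    (winPercent returnOfTheKing)
```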
We’ll be visualizing values for this feature via a Pie Chart so we’ll have to generate these for the two movie series separately.
Lord of the Rings
This involves extracting the Lord of the Rings Wins and Losses from the data frame and then creating a pie chart with this information.
The interactive Pie Chart can be found here.
The Hobbit
As with the Lord of the Rings, we first get the Wins and Losses from the data frame and then create a pie chart from the results.
The interactive Pie Chart for the Hobbit Movie Series can be found here.
Combining the Results
The interactive chart can be found here.
Lord of the Rings takes the cake for the 3rd time with the higher percentage of wins relative to nominations. A point worth noting is that The Return of the King won in all 11 categories it was nominated for.
The Lord of the Rings: 3
The Hobbit: 0
Result of Analysis
Here are the final scores:
The Lord of the Rings: 3
The Hobbit: 0
In conclusion, there was some truth to the banter in favor of the Lord of the Rings Series in the multitude of forums comparing the two movie series. In all three feature comparisons, The Lord of the Rings movie series turned out to be the winner.
Now, this result in no way means that the Hobbit film series was a bad one; in my opinion, it was an awesome series but couldn’t live up to the stature of the Lord of the Rings in any way.
Another common complaint was that the Hobbit movies were blown out of proportion relative to the book, with the inclusion of subplots such as the unnecessary love story between Tauriel and Kili [ Why was that subplot necessary?! ].
The Hobbit book is around 300 pages; if the movies had followed the plot of the book exactly, each movie would cover around 100 pages.
The Lord of the Rings, by contrast, runs to around 1000 pages, so 3 movies means each one covers around 300 pages. That’s a lot more material per movie and less need to blow out the plot, which implies that, all else being equal, it won’t tick off Tolkien purists as much.
As mentioned before [ and will be mentioned again ], the majority of the time was spent on the Data Acquisition process. What made the process relatively easy was the use of the Html Provider from the FSharp.Data library and the data being easily available from the series’ Wikipedia pages that can be found here and here.
The Html Provider takes the Html Document Object Model and, on the fly, generates types based on the data present. Type providers were, in fact, the reason I started dabbling with F# and now zealously advocate its usage.
The schema for the data was the following:
- Names of Movie / Series
- Budget In Millions
- Box Office Revenue in Millions
- Academy Award Nominations
- Academy Award Wins
- Rotten Tomatoes Scores
And therefore, I created a new record type that represented this information and a bit more.
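A record along these lines would cover the schema listed above. The type name, field names and field types are my assumptions, and the "bit more" that the original record carried is not reproduced here. The helper that turns a record into a CSV row is likewise hypothetical, and the sample values are placeholders.

```fsharp
// One row of the final data set: a movie or a whole series.
type MovieSeriesData =
    { Name: string
      BudgetInMillions: float
      BoxOfficeRevenueInMillions: float
      AcademyAwardNominations: int
      AcademyAwardWins: int
      RottenTomatoesScore: float }

// Hypothetical helper: flatten a record into one comma-separated line.
// Assumes the name itself contains no commas.
let toCsvRow (data: MovieSeriesData) =
    [ data.Name
      string data.BudgetInMillions
      string data.BoxOfficeRevenueInMillions
      string data.AcademyAwardNominations
      string data.AcademyAwardWins
      string data.RottenTomatoesScore ]
    |> String.concat ","

// Placeholder values, not the post's data.
let placeholder =
    { Name = "Placeholder Series"
      BudgetInMillions = 100.0
      BoxOfficeRevenueInMillions = 300.0
      AcademyAwardNominations = 4
      AcademyAwardWins = 1
      RottenTomatoesScore = 80.0 }

printfn "%s" (toCsvRow placeholder)
```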
Luckily, all this data for the film series was available on each of the Wikipedia pages in the form of Html Tables that the Html Provider recognizes as a valid type based on the data. I used the example of the Lord of the Rings series below to get all the data.
Budget and Box Office Revenue
We extract the Budget and Box Office Revenue from the first table of the page that was aptly named ‘Table 1’ from the Type Provider.
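The table names the Html Provider generates depend on the live page, so the sketch below swaps in a different technique: FSharp.Data's runtime `HtmlDocument` parser, run over a made-up inline table. It only illustrates the general idea of pulling budget and revenue cells out of an Html table and cleaning them into numbers. The markup, the film names, the figures and the `parseMillions` helper are all assumptions, not the real Wikipedia content or the original code.

```fsharp
#r "nuget: FSharp.Data"

open FSharp.Data

// Made-up stand-in for a budget / box office table.
let sampleHtml = """
<html><body>
<table>
  <tr><th>Film</th><th>Budget</th><th>Box office</th></tr>
  <tr><td>Film A</td><td>$100 million</td><td>$300 million</td></tr>
  <tr><td>Film B</td><td>$200 million</td><td>$400 million</td></tr>
</table>
</body></html>"""

// Hypothetical cleaner: strip the currency formatting, keep the number.
let parseMillions (cell: string) =
    cell.Replace("$", "").Replace("million", "").Trim() |> float

let document = HtmlDocument.Parse sampleHtml

// Collect the text of the td cells in each row; the header row has only
// th cells, so it yields an empty list and is filtered out.
let budgetRows =
    document.Descendants [ "tr" ]
    |> Seq.map (fun row ->
        row.Descendants [ "td" ]
        |> Seq.map (fun cell -> cell.InnerText())
        |> List.ofSeq)
    |> Seq.filter (fun cells -> cells.Length = 3)
    |> Seq.map (fun cells -> cells.[0], parseMillions cells.[1], parseMillions cells.[2])
    |> List.ofSeq

for (name, budget, revenue) in budgetRows do
    printfn "%s: budget %.0f, revenue %.0f" name budget revenue
```

With the type provider the same data arrives as typed rows, so this kind of manual cell handling mostly disappears, although the string clean-up step is still needed.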
Academy Award Nominations and Wins
Next, we extract the Academy Award Nominations and Wins from the List “Academy Awards”
Rotten Tomatoes Score
And finally, the data for the Rotten Tomatoes Score is extracted from another table called “Public and critical response”. For the film series Rotten Tomatoes score, I took the average of the individual movies.
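Averaging the individual movie scores into one series score is a one-liner. The function name and the scores below are placeholders rather than the real Rotten Tomatoes values.

```fsharp
// Series score = mean of the individual movie scores.
// List.average throws on an empty list, so at least one score is required.
let seriesScore (movieScores: float list) = List.average movieScores

// Made-up scores out of 100 for three movies.
let placeholderScores = [ 90.0; 80.0; 70.0 ]

printfn "Series score: %.1f" (seriesScore placeholderScores)
```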
Great, we have all the data for this part of the analysis. Let’s pack the data into the previously defined record type. The entire data acquisition code can be found at https://github.com/MokoSan/FSharpAdvent/tree/master/FSharpAdvent.DataAcquistion in the HobbitMovieData.fsx and LotrMovieData.fsx scripts.
I hope you enjoyed my comparative analysis between the Hobbit and Lord of the Rings! Please let me know if you have any questions or feedback. Also, feel free to use the data in any way. I purposely wrote out the result to a CSV to make the data language and platform agnostic.
The code for this blog post and the others in the series is available here.
A big thanks to the friends and mentors who helped me out with this blogpost, specifically Nathaniel Benzaquen, Jack Pappas and Ernst Henle, who was the professor for my Data Science class in the University of Washington’s Data Science program.