[ The Lord of the Rings: An F# Approach ] The Path of the Hobbits

Introduction

The Path of the Hobbits: Frodo and Sam on their Journey to Mordor to destroy the One Ring

I am extremely excited to present my contribution to 2017’s FSharp Advent Calendar. This blogpost is the first in a series of 3 that involves acquiring and analyzing data from the Lord of the Rings mythos.

In this blogpost, I walk through how I acquired, explored and analyzed data on the Lord of the Rings and the Hobbit Movie Series using F#. The goal of this post is to offer a completely unbiased answer to the following question:

Which Movie Series is better: The Lord of the Rings or The Hobbit Series?

The Lord of the Rings vs. The Hobbit Series

According to a plethora of online forums and my experience speaking to other ardent fans, the answer to the aforementioned question is extremely clear: The Lord of the Rings was, undoubtedly, the better movie series. As an aspiring Data Scientist, I want to test quantitatively whether this conjecture holds.

The three features I’ll be using to answer the question:

1. Average Score of Each of the Movie Series from Rotten Tomatoes.

2. The Return on Investment Calculated on the Basis of the Box Office Revenue and Film Production Cost.

3. Percentage of Academy Award Wins based on the Nominations.

For those interested in the Data Acquisition Process, I have a section at the very end going over the herculean task of acquiring and cleaning the data for the analysis [ as in most cases, this step was where most of the time was spent ].

Now, without further ado:

Setting up

To conduct this analysis, we’ll be using FsLab, a collection of Data Science libraries that makes data access, analysis and visualization using F# extremely simple.

The result of the Data Acquisition process was a CSV file with all the movie data, which I load into a Deedle Data Frame, akin to a data frame in the R and Python world. Deedle is an exceptional exploratory data analysis library developed by the good people of Blue Mountain Capital and other contributors.

The Structure of the CSV Data is:

Result of the Data Acquisition

Using Deedle, in less than 20 lines, we are able to:

  1. Load all the data in the CSV file into the Deedle Data Frame [Line 9]
  2. Set the index of the data frame to that of the Name [Line 12]
  3. Extract the pertinent series, namely The Lord of the Rings and The Hobbit Series [Line 16]

Extracting the Data into a Deedle Data Frame
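The three steps above can be sketched roughly as follows. This is a minimal sketch rather than the original script: it assumes a recent `dotnet fsi` that supports `#r "nuget: ..."` (the original post used FsLab), the column names other than Name and RottenTomatoesScore are guesses based on the schema described later in the post, and the rows and numbers are placeholders, not the real movie data.

```fsharp
#r "nuget: Deedle"

open System.IO
open Deedle

// Placeholder CSV contents: hypothetical column names and made-up numbers.
let csvText =
    String.concat "\n" [
        "Name,BudgetInMillions,BoxOfficeRevenueInMillions,AcademyAwardNominations,AcademyAwardWins,RottenTomatoesScore"
        "Series A,100,350,30,17,90"
        "Series B,200,400,7,1,60"
        "Some Other Movie,50,75,0,0,50"
    ]

// Write the placeholder data to a temp file so the sketch is self-contained.
let csvPath = Path.Combine(Path.GetTempPath(), "movies-sketch.csv")
File.WriteAllText(csvPath, csvText)

// 1. Load the CSV into a data frame (rows are keyed by position at first).
let rawFrame : Frame<int, string> = Frame.ReadCsv(csvPath)

// 2. Re-index the rows by the Name column.
let byName : Frame<string, string> = rawFrame |> Frame.indexRowsString "Name"

// 3. Keep only the rows for the two series of interest.
let seriesFrame = byName |> Frame.sliceRows [ "Series A"; "Series B" ]
```

Indexing by name is what makes the later steps readable: once the rows are keyed by series name, each series can be looked up by its key instead of by row position.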

The Data Frame looks like the following, similar to the CSV but with a lot more magic associated with it. This data frame can be viewed here.

Brief Sneak Peek of what this Table looks like from the Link. For the entire table, the link will provide the best view.

Feature 1: Average Score from Rotten Tomatoes

Rotten Tomatoes

Rotten Tomatoes is an American review aggregation website for film and television reviews. It reports two scores, one determined by top critics and one by the audience, each out of 100.

Let’s first create a Data Frame with the Average Rotten Tomatoes critic scores for both series. From the Movie Series Data Frame we created during the setup phase, we slice the RottenTomatoesScore column and get back a new data frame with just that column.

Creating a Data Frame with just the Rotten Tomatoes Score of the two Movie Series and creating a table.
Table with the Rotten Tomatoes Scores

Result

We can clearly observe that the Lord of the Rings Series has a higher average Rotten Tomatoes score. That’s one point to team Lord of the Rings and zilch for Team Hobbit.

Scores

The Lord of the Rings: 1

The Hobbit: 0

Feature 2: Return on Investment

In this case, Total Revenue is the Box Office Revenue and Total Cost is the Production Budget

Next, let’s compare the Return on Investment of the two Film Series based on the production budget and the box office revenue.

As before, let’s start off by creating a new Deedle data frame with just the Production Budget and the Box Office Revenue by slicing those columns from the Movie Series Data frame.

We then compute and add the profit series, which is the Box Office Revenue minus the Production Budget, in an effort to chart out the Revenue, Cost and Profit. At this point, we also define the ROI series before we clean up the column names from their raw form.

The ROI series is defined as the Profit divided by the Cost, with the result multiplied by 100 to convert it to a percentage.

Compute the ROI and Profit Series and then Clean up the Data Frame for better visualization
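As a plain F# illustration of the arithmetic (without Deedle, so this is not the original data frame code), profit and ROI could be written as below. The function names are hypothetical and the figures are made up purely to show the calculation; they are not the actual series numbers.

```fsharp
/// Profit is the box office revenue minus the production budget.
let profit (revenue: float) (cost: float) = revenue - cost

/// ROI as a percentage: profit divided by cost, multiplied by 100.
let roiPercent (revenue: float) (cost: float) =
    profit revenue cost / cost * 100.0

// Made-up figures in millions: revenue of 350 against a budget of 100.
let exampleProfit = profit 350.0 100.0      // 350 - 100 = 250
let exampleRoi = roiPercent 350.0 100.0     // 250 / 100 * 100 = 250 percent
```

Note that an ROI of 100% means the profit equals the budget, that is, the revenue is double the cost.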

F# has an awesome Charting library that we take advantage of by creating a Column based chart of the Revenue, Cost and Profit.

Charting the Budget, Revenue and Profit
Budget, Revenue and Profit Comparison. Clearly, the Profit for the Lord of the Rings Series seems higher than that of The Hobbit Series

You can interact with the chart here.

We then generate a chart of the Return on Investment.

Charting the ROI
ROI % Between the two series. Like the previous case, the Lord of the Rings ROI % is higher.

And the interactive chart is available here.

Result

Another win for the Lord of the Rings Series. One point worth noting: regardless of which movie series won in this category, those Returns on Investment are ridiculously high, a testament to the amazing acumen of Peter Jackson’s film crew. As a comparative benchmark, the S&P 500 has returned around 10% per year on average over the past 90 years.

Scores

The Lord of the Rings: 2

The Hobbit: 0

Feature 3: Academy Award Wins to Nominations

Finally, our last feature is the Percentage of Academy Award Wins based on the Nominations. As before, we create a new data frame with the Nominations and Wins for both film series. The losses are the Nominations minus the Wins, which we add as a new column to the Academy Award Data Frame.
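The calculation itself is simple enough to sketch in plain F# (again, not the original Deedle code). The type and function names are hypothetical. The nomination counts are the ones quoted in this post, and 17 wins for The Lord of the Rings is what 56.7% of 30 nominations works out to.

```fsharp
/// Hypothetical record holding the Academy Award tally for one series.
type AwardTally = { Nominations: int; Wins: int }

/// Losses are the nominations that did not turn into wins.
let losses (tally: AwardTally) = tally.Nominations - tally.Wins

/// Wins as a percentage of nominations.
let winPercent (tally: AwardTally) =
    float tally.Wins / float tally.Nominations * 100.0

let lotr = { Nominations = 30; Wins = 17 }
let hobbit = { Nominations = 7; Wins = 1 }

// Round to one decimal place for display.
let lotrWinPercent = System.Math.Round(winPercent lotr, 1)
let hobbitWinPercent = System.Math.Round(winPercent hobbit, 1)
```

The wins and losses for each series are the two slices of the pie charts that follow.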

We’ll be visualizing values for this feature via a Pie Chart so we’ll have to generate these for the two movie series separately.

Lord of the Rings

This involves extracting the Lord of the Rings Wins and Losses from the data frame and then creating a pie chart with this information.

56.7% Wins of all Nominations isn’t bad at all. There were a total of 30 Nominations.

The interactive Pie Chart can be found here.

The Hobbit

Similar to the previous case, we first get the Wins and Losses from the data frame and then create a pie chart from the results.

Just 1 win out of 7 Nominations. :(

The interactive Pie Chart for the Hobbit Movie Series can be found here.

Combining the Results

The Lord of the Rings Series Wins Per Nominations % is significantly higher.

The interactive chart can be found here.

Result

Lord of the Rings takes the cake for the 3rd time with the higher Wins based on the Nominations percentage rate. A point worth noting is that The Return of the King won in all 11 categories it was nominated for.

Scores

The Lord of the Rings: 3

The Hobbit: 0

Result of Analysis

Here are the final scores:

The Lord of the Rings: 3

The Hobbit: 0

In conclusion, there was some truth in the banter in favor of the Lord of the Rings Series in the multitude of forums comparing the two movie series. In all three feature comparisons, The Lord of the Rings movie series turned out to be the winner.

Now, this result in no way means that the Hobbit film series was a bad one; in my opinion, it was an awesome series but couldn’t live up to the stature of the Lord of the Rings in any way.

Another common complaint was that the Hobbit movies were blown out of proportion with respect to the book through the inclusion of subplots such as the unnecessary love story between Tauriel and Kili [ Why was that subplot necessary?! ].

Grumpy Denethor is Grumpy

The Hobbit book is around 300 pages; if the movies had followed the plot of the book exactly, each of the 3 movies would encompass around 100 pages.

Alpha is defined as the number of pages per movie

Unlike The Hobbit, the Lord of the Rings had around 1000 pages, so 3 movies imply that each movie encompasses around 300 pages. That’s a lot more material to cover per movie and leaves less need to blow up the plot, implying that, all else kept constant, it won’t tick off Tolkien purists as much.
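Written out with the approximate page counts quoted above, the comparison looks like this. The subscripts are just shorthand introduced here for the Alpha of each series, and 333 is the figure the text rounds to "around 300".

```latex
\alpha = \frac{\text{pages in the book(s)}}{\text{number of movies}}, \qquad
\alpha_{\text{Hobbit}} \approx \frac{300}{3} = 100, \qquad
\alpha_{\text{LotR}} \approx \frac{1000}{3} \approx 333
```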

Higher Alpha for Lord of the Rings implies a lot more information to base a movie off
Yeah, Dude! Gandalf wins either way.

Data Acquisition

As mentioned before [ and will be mentioned again ], the majority of the time was spent on the Data Acquisition process. What made the process relatively easy was the use of the Html Provider from the FSharp.Data library and the fact that the data was readily available on each series’ Wikipedia page, which can be found here and here.

The Html Provider takes the Html Document Object Model and, on the fly, generates types based on the data present. Type providers were, in fact, the reason I started dabbling with F# and why I zealously advocate its usage.

Creating the Html Provider based on the Lord of the Rings film series
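To give a feel for pulling table data out of Html with FSharp.Data, here is a small sketch. To be clear about the swap: it uses the library's runtime parser (`HtmlDocument.Parse`) on a made-up inline table instead of the Html Provider pointed at Wikipedia, so that it runs offline and does not depend on the live page. The table contents and the `#r "nuget: ..."` reference (which needs a recent `dotnet fsi`) are assumptions for illustration only.

```fsharp
#r "nuget: FSharp.Data"

open FSharp.Data

// Made-up Html standing in for a Wikipedia table; not the real page content.
let html =
    "<html><body><table>" +
    "<tr><th>Film</th><th>Budget</th></tr>" +
    "<tr><td>Film A</td><td>100</td></tr>" +
    "<tr><td>Film B</td><td>200</td></tr>" +
    "</table></body></html>"

let doc = HtmlDocument.Parse(html)

// Collect the text of every data cell, one list of strings per table row.
// The header row contains only <th> cells, so it yields an empty list and is dropped.
let rows =
    doc.Descendants ["tr"]
    |> Seq.map (fun row ->
        row.Descendants ["td"]
        |> Seq.map (fun cell -> cell.InnerText())
        |> List.ofSeq)
    |> Seq.filter (fun cells -> not (List.isEmpty cells))
    |> List.ofSeq
```

The Html Provider goes a step further than this: instead of lists of strings, it exposes each table it finds as a typed property with typed columns, which is what makes the exploration in the next sections possible.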

The schema for the data was the following:

Columns:

  1. Names of Movie / Series
  2. Budget In Millions
  3. Box Office Revenue in Millions
  4. Academy Award Nominations
  5. Academy Award Wins
  6. Rotten Tomatoes Scores

And therefore, I created a new record type that represented this information and a bit more.

Record representing the Movie Info
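A record along these lines would cover the six columns of the schema. This is a sketch only: the field names are guesses derived from the column list above, the original record held "a bit more" that is not reproduced here, and the `toCsvRow` helper and the example values are hypothetical.

```fsharp
/// One row of the movie data, following the six-column schema above.
type MovieInfo =
    { Name: string
      BudgetInMillions: float
      BoxOfficeRevenueInMillions: float
      AcademyAwardNominations: int
      AcademyAwardWins: int
      RottenTomatoesScore: float }

/// Hypothetical helper: format a record as one comma-separated CSV line.
let toCsvRow (movie: MovieInfo) =
    sprintf "%s,%g,%g,%d,%d,%g"
        movie.Name
        movie.BudgetInMillions
        movie.BoxOfficeRevenueInMillions
        movie.AcademyAwardNominations
        movie.AcademyAwardWins
        movie.RottenTomatoesScore

// Placeholder values, not the real movie data.
let example =
    { Name = "Example Series"
      BudgetInMillions = 100.0
      BoxOfficeRevenueInMillions = 350.0
      AcademyAwardNominations = 30
      AcademyAwardWins = 17
      RottenTomatoesScore = 90.0 }

let exampleRow = toCsvRow example
```

Keeping everything in one record makes it straightforward to write each movie or series out as a single CSV line once the scraping is done.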

Luckily, all this data for the film series was available on each of the Wikipedia pages in the form of Html Tables that the Html Provider recognizes as a valid type based on the data. I used the example of the Lord of the Rings series below to get all the data.

Budget and Box Office Revenue

We extract the Budget and Box Office Revenue from the first table of the page, which the Type Provider aptly named ‘Table 1’.

Data from Table 1
Mumbo Jumbo Involved with Extracting the Data from Table1

Academy Award Nominations and Wins

Next, we extract the Academy Award Nominations and Wins from the List “Academy Awards”.

Data from List Academy Awards
Data Extraction from the list
Movie Helpers

Rotten Tomatoes Score

And finally, the data for the Rotten Tomatoes Score is extracted from another table called “Public and critical response”. For the film series Rotten Tomatoes score, I took the average of the individual movies.

Data from the Public and Critical Response Table
Here lies code I am not too proud of: Data Acquisition isn’t a bed of roses.
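The averaging step for the series score is the easy part and can be sketched in plain F#. The function name is hypothetical and the scores below are placeholders chosen to make the arithmetic obvious, not the actual Rotten Tomatoes scores.

```fsharp
/// Series score: the mean of the individual movies' Rotten Tomatoes scores.
let seriesScore (movieScores: float list) = List.average movieScores

// Placeholder scores for a three-movie series.
let exampleSeriesScore = seriesScore [ 90.0; 80.0; 70.0 ]   // (90 + 80 + 70) / 3 = 80
```

`List.average` throws on an empty list, which is acceptable here since every series has three movies.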

Great, we have all the data for this part of the analysis. Let’s gather the data into the previously defined record type. The entire data acquisition code can be found at https://github.com/MokoSan/FSharpAdvent/tree/master/FSharpAdvent.DataAcquistion in the HobbitMovieData.fsx and LotrMovieData.fsx scripts.

Overall MovieInfo

Conclusion

I hope you enjoyed my comparative analysis between the Hobbit and Lord of the Rings! Please let me know if you have any questions or feedback. Also, feel free to use the data in any way. I purposely wrote out the result to a CSV to make the data language and platform agnostic.

The code for this blog post and the others in the series is available here.

A big thanks to the friends and mentors who helped me out with this blogpost, specifically Nathaniel Benzaquen, Jack Pappas and Ernst Henle, who was the professor for my Data Science class in the University of Washington’s Data Science program.
