Oscar Movie and Rotten Tomatoes Scores: Part 1
As the Oscars are approaching, I’m increasingly interested in looking at all sorts of data breakdowns: reviewer scores, budgets, box office numbers, and more.
For now I want to start simple, with just Rotten Tomatoes scores. I like them because they’re a useful measure of how good a movie has been reviewed to be, since each score is aggregated over a lot of reviewers. I also like that it’s a consistent numerical scale, even if it can sometimes be misleading.
To collect the data, I made a spreadsheet of every movie nominated for each award, along with its score. This means some data is duplicated: for example, The Revenant has a row in the table for every one of its nominations. I know this is redundant, but I would rather keep the extra data and filter it out later with R.
This is an example of what the data looked like:
It goes on for over 100 lines, all of which you can view here: https://github.com/robinsturm/moviedata/blob/master/oscars.csv
I did a quick graph of all the scores in Excel (where I first entered all the data), but I switched to R for the rest of the analysis for a variety of reasons, mostly because R is just more sophisticated.
While this is an alright graph, I don’t like that it left in all of the 0 values (which belong to short films that don’t have RT scores), and I like even less that you can’t see all the labels. I probably could have kept tweaking it in Excel, but I just didn’t want to.
It took a lot of time for me to get my bearings using R again, but I ended up finding enough tutorials and hints to accomplish what I wanted with ggplot2.
First of all, this graph is so much prettier! And I was also able to drop the 0 values, and sort the movies by descending score. Now we can get some sense of what most of the scores are. You can see a big version of the chart here: https://raw.githubusercontent.com/robinsturm/moviedata/master/RTBarChart_big.png
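The cleanup behind that chart is simple enough to sketch. I did it in R, so what follows is only a rough Python equivalent of the idea, not my actual script. The rows are placeholders rather than the real spreadsheet, and the function name `scores_for_chart` is made up for this example. It also collapses repeated nominations into one entry per movie, which is an assumption about how the filtering would be done.

```python
# Placeholder rows shaped like the spreadsheet: (movie, category, score).
# Titles and scores here are illustrative, not the real data.
rows = [
    ("Movie A", "Best Picture", 82),
    ("Movie A", "Directing", 82),          # same movie, second nomination
    ("Movie B", "Best Picture", 93),
    ("Short C", "Live Action Short", 0),   # 0 stands in for "no RT score"
    ("Movie D", "Original Song", 25),
]


def scores_for_chart(rows):
    """Return one (movie, score) pair per movie, skipping 0 scores, highest first."""
    by_movie = {}
    for movie, _category, score in rows:
        if score > 0:
            # Repeated nominations of the same movie collapse into one entry.
            by_movie[movie] = score
    return sorted(by_movie.items(), key=lambda item: item[1], reverse=True)


for movie, score in scores_for_chart(rows):
    print(f"{movie}: {score}")
```

With the placeholder rows, the short film disappears, Movie A shows up once, and the list runs from the highest score to the lowest.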
We can start to see that most of the scores are in the 90–100 range (which is really good!), and then there’s a bit of a drop-off for the last 15 or so movies.
We can also see the statistics of the entire data set in R by using the summary() function:
For the most part this isn’t that interesting. You don’t need statistics to know that The Revenant was nominated for 12 awards when that was CNN’s headline when the nominations first came out. Nor do you need statistics to know the number of nominated movies in each category.
However, I think the Score statistics are where it gets more interesting. They give some basic statistics for all the movie scores, which we can also visualize with a box-and-whisker plot.
This shows what’s not terribly hard to believe: most Oscar-nominated movies are considered good movies! The ones that aren’t are almost more interesting, namely the ones that show up as outliers: Fifty Shades of Grey, Spectre, and The 100-Year-Old Man Who Climbed out the Window and Disappeared. Two of those movies are nominated for Original Song and nothing else, which makes me think that categories less related to the movie as a whole probably have lower scores.
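If you’re wondering how a box plot decides what counts as an outlier, the usual rule (and ggplot2’s default for whiskers) is that anything more than 1.5 times the interquartile range beyond a quartile gets drawn as a separate point. Here is a minimal Python sketch of the low end of that rule. The scores are placeholders, not my data, and `low_outliers` is a name I made up. I’m assuming the "inclusive" quartile method, which as far as I know matches R’s default, but treat the exact quartile values as approximate.

```python
from statistics import quantiles


def low_outliers(scores, k=1.5):
    """Return (cutoff, outliers) using the k * IQR rule below the lower quartile."""
    # method="inclusive" interpolates between data points (assumed to match R's default).
    q1, _median, q3 = quantiles(scores, n=4, method="inclusive")
    cutoff = q1 - k * (q3 - q1)
    return cutoff, sorted(s for s in scores if s < cutoff)


# Placeholder scores, mostly high with a couple of stragglers.
scores = [98, 97, 96, 95, 94, 93, 92, 90, 88, 85, 82, 64, 25]

cutoff, outliers = low_outliers(scores)
print(cutoff, outliers)
```

For these placeholder scores the quartiles are 85 and 95, so the cutoff lands at 70 and the two lowest scores are flagged.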
So then I wanted to break down the box-and-whisker plots by category, which was actually really simple with ggplot and R.
I made three versions of this: one that ordered the categories alphabetically (which wasn’t much use), one that ordered them by median, and one that ordered them by mean.
It’s up to you which one you like more. Personally, I prefer the one that sorts by mean. Since there are so few movies in each category, I think the order should reflect the mean of all the scores rather than just the median.
There are a lot of interesting things that I noticed from looking at these graphs. Unsurprisingly, Original Song is the worst-scoring category; go ahead and blame Fifty Shades of Grey for that. I was a little surprised by how poorly the movies in Leading Actor score. Whereas most of the categories sit mostly in the 90s, Leading Actor is much lower, with only one of the nominees in the 90s (The Martian, at 93). I find it especially interesting because all of the movies nominated in this category are about (and titled after!) the character the nominee plays.
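To see why the two orderings can disagree, here is a small Python sketch (again, the real charts were made in R with ggplot2, and in R this kind of ordering is usually done with reorder()). The categories and scores are invented for illustration, and `order_categories` is a hypothetical helper, not something from my script.

```python
from statistics import mean, median

# Invented categories and scores, chosen so one low score drags a mean down.
scores_by_category = {
    "Category A": [95, 94, 40],
    "Category B": [90, 88, 86],
    "Category C": [97, 80, 78],
}


def order_categories(scores_by_category, stat):
    """Return category names sorted from lowest to highest value of stat."""
    return sorted(scores_by_category, key=lambda name: stat(scores_by_category[name]))


by_median = order_categories(scores_by_category, median)
by_mean = order_categories(scores_by_category, mean)

print("by median:", by_median)
print("by mean:  ", by_mean)
```

Category A has the highest median but the lowest mean, because its single score of 40 barely moves the median while pulling the mean down a lot. With only a handful of movies per category, that is exactly the kind of difference that decides the order.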
I do think it’s interesting that the on-average best category isn’t Best Picture. That’s pretty counter-intuitive. And there are so many other interesting things you can pluck out, it’s hard to name them all.
I’m admittedly not the biggest fan of The Revenant. And I’d like to think that the stats somewhat back me up on this. The Revenant has a (current) score of 82, and you can easily see where it stacks up if you draw a line at 82.
All the data points that sit directly on the green line are The Revenant. In three cases The Revenant is an outlier on the low end, in four cases it is the minimum value, in two cases it is the lower quartile, and in only two cases is it the median.
That’s all I’ve got for now. I’m going to post the R script and all of the charts to GitHub later.
Next time I’ll probably look at release dates and movie studios.