IMDb vs Rotten Tomatoes: The Wisdom of the Crowd Goes to the Movies

YS Chng
Nov 24, 2018 · 8 min read


It is once again the end of the year, and movies are flooding the theatres. Spoilt for choice, my friends and I had to rely on movie review sites to decide which movie to catch. After checking that “Venom” scored a decent 7.0/10 on IMDb, I suggested that we go see that film. One of my friends immediately protested, citing Rotten Tomatoes’ 29% rating.

Why would there be such a huge discrepancy? Are the scores from both sites even comparable? But wait, how do they even come up with these scores?

If you’re an avid movie-goer, you should be quite familiar with this dilemma. But have you ever wondered what a 7.0/10 on IMDb and a 29% on Rotten Tomatoes really mean? To cut a long story short, the two review sites use vastly different scoring systems, and their scores cannot be compared on the same scale.

IMDb rating distribution of “Venom (2018)” as of 18 Nov 2018.

IMDb allows all its registered users to cast a vote (from 1 to 10) for any publicly screened film in its database. A weighted average of all the votes (122k+ users in the case of “Venom”) is then calculated to give the single rating that we see on the film’s main page. IMDb explains that a weighted mean is used instead of the arithmetic mean, in order to reflect the vote average more accurately by accounting for vote stuffing. IMDb does not explain how this weighted average is formulated, but assures that the same calculation is used for all films.
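IMDb does not reveal its weighting formula, so the best we can do is sketch the general idea. The snippet below (Python, with made-up votes and weights, since the real data and formula are not public) shows how down-weighting votes that look like vote stuffing shifts the published rating away from the plain arithmetic mean.

```python
import numpy as np

# Hypothetical votes for a film on a 1-10 scale. IMDb's actual data and
# weighting formula are not public, so everything here is illustrative.
votes   = np.array([9, 8, 7, 10, 6, 10, 10, 10, 3, 8])

# Assumed per-vote weights, e.g. half weight for votes flagged as
# possible vote stuffing (here, an arbitrary block of 10s).
weights = np.array([1, 1, 1, 1, 1, 0.5, 0.5, 0.5, 1, 1])

arithmetic_mean = votes.mean()                        # every vote counts equally
weighted_mean   = np.average(votes, weights=weights)  # suspicious votes discounted

print(f"Arithmetic mean: {arithmetic_mean:.1f}")   # 8.1
print(f"Weighted mean:   {weighted_mean:.1f}")     # 7.8
```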

The Tomatometer and Audience Score for “Venom (2018)” on Rotten Tomatoes as of 18 Nov 2018.

On the other hand, the widely quoted Tomatometer from Rotten Tomatoes is a percentage figure. Is this percentage an average score out of 100? This might be shocking to some, but the answer is no. The percentage is simply the proportion of approved critics who have given the movie a positive review. In the case of “Venom”, the 29% is calculated by dividing the 81 ‘Fresh’ reviews by the total of 282 reviews counted. The movie is then further dichotomised into ‘Fresh’ or ‘Rotten’ using a threshold of 60% (see image below).

Threshold of 60% deciding whether a movie is ‘Fresh’ or ‘Rotten’.
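For the arithmetic itself, here is a minimal sketch in Python using the “Venom” counts quoted above:

```python
# The Tomatometer is just a proportion of positive ("Fresh") reviews.
fresh_reviews = 81
total_reviews = 282

tomatometer = fresh_reviews / total_reviews             # ~0.287
status = "Fresh" if tomatometer >= 0.60 else "Rotten"   # 60% threshold

print(f"Tomatometer: {tomatometer:.0%} -> {status}")    # Tomatometer: 29% -> Rotten
```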

There are three points to note about the Tomatometer:

  1. It only takes into account the reviews of a small group of approved critics.
  2. A percentage figure based on a sample of 100 critics and one based on a sample of 1,000 critics are not differentiated by their reliability.
  3. A movie with an average rating of 6.0/10 and another with an average rating of 9.0/10 can end up with the same percentage figure, so the score says nothing about how strongly the critics liked each film (see the sketch after this list).
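To make the last two points concrete, here is a small sketch with two hypothetical films, assuming for simplicity that any review of 6/10 or above counts as positive (in reality, the ‘Fresh’ designation is made per review by the critic):

```python
import numpy as np

def tomatometer(scores, cutoff=6.0):
    """Proportion of reviews at or above the assumed positive cutoff."""
    return (scores >= cutoff).mean()

film_a = np.array([6.0] * 100)    # 100 critics, all lukewarm 6/10 reviews
film_b = np.array([9.0] * 1000)   # 1,000 critics, all glowing 9/10 reviews

# Both films get a 100% score, despite very different average ratings
# and very different sample sizes.
print(tomatometer(film_a), film_a.mean(), len(film_a))   # 1.0 6.0 100
print(tomatometer(film_b), film_b.mean(), len(film_b))   # 1.0 9.0 1000
```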

But besides the Tomatometer, Rotten Tomatoes also displays an Audience Score, which is also a percentage figure. In this case, the percentage reflects the proportion of Rotten Tomatoes users who have given the movie at least 3.5 stars out of 5. The same 60% threshold determines whether the popcorn bucket is shown full or tipped over (see image below).

Threshold of 60% deciding whether popcorn bucket is full or tipped over.

You should be able to see by now that the IMDb rating cannot be directly compared against the Tomatometer, or even against the Audience Score on Rotten Tomatoes. If any comparison were to be made, it would at least require Rotten Tomatoes’ average ratings to be used. In the case of “Venom”, the IMDb rating of 7.0/10 should then be compared against the Rotten Tomatoes average critic rating of 4.4/10 or the average user rating of 8.6/10 (4.3/5 multiplied by 2 to put them on the same scale). Even then, bear in mind that the IMDb rating for “Venom” is based on 122k+ users, while the Rotten Tomatoes average critic rating is based on just 282 critics.
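As a quick sanity check, the rescaling used above is just a change of scale, nothing more:

```python
# Putting the three numbers quoted above on a common 10-point scale.
imdb_rating         = 7.0       # IMDb weighted user average, out of 10
rt_critic_average   = 4.4       # Rotten Tomatoes average critic rating, out of 10
rt_audience_average = 4.3 * 2   # 4.3/5 user average rescaled to 10 -> 8.6

print(imdb_rating, rt_critic_average, rt_audience_average)  # 7.0 4.4 8.6
```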

What does this have to do with the Wisdom of the Crowd?

Before explaining how movie reviews are related to the wisdom of the crowd, let’s first revisit the history of the wisdom of the crowd effect and find out how it actually works.

The wisdom of the crowd effect was first demonstrated in 1906 by the statistician Francis Galton (1907), who examined the estimates of an ox’s weight in a weight-judging competition held at the West of England Fat Stock and Poultry Exhibition [1].

Since communicating with one another would not have helped the participants win, Galton reasoned that the 800 estimates collected were as independent and unbiased as possible. After discarding 13 estimates that were considered invalid, he ranked the remaining estimates and found the median to be 1207 lb., which was 9 lb., or 0.8%, more than the true weight of the ox.

Kenneth F. Wallis (2014) of the University of Warwick revisited Galton’s drafts and noted that the actual median and true weight should have been 1208 lb. and 1197 lb. respectively [2]. Nonetheless, the central tendency was still remarkably close to the true weight of the ox. In fact, the statistician Reginald H. Hooker (1907) calculated the mean of the estimates and found that it matched the true weight of the ox almost exactly [3].

How does the central tendency achieve such an accurate estimate of the true value? From a Thurstonian perspective, a true value can be derived by removing any systematic and random errors in one’s estimate (Harries & Harvey, 2000) [4].

Averaging removes random errors from over- and underestimations.

It is possible to cancel out random errors in different estimates through averaging, especially if the estimates were made independently. This happens through a mechanism known as bracketing (Soll & Larrick, 2009), where the true value falls in between the estimates, and the absolute deviation of the average does better than the mean absolute deviation of the estimates [5]. Extending this idea to a situation where there are many more estimates, it becomes more likely for the estimates to fall on both sides of the true value, such that the deviations cancel out one another and the resulting average has a small absolute deviation.
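A quick simulation makes the bracketing effect visible. The numbers below are simulated, not Galton’s actual data; the only assumptions are a true weight of 1198 lb (as implied by the figures above) and 800 independent, unbiased guesses scattered around it.

```python
import numpy as np

rng = np.random.default_rng(42)

true_weight = 1198                                      # true value being estimated
estimates = true_weight + rng.normal(0, 50, size=800)   # independent, unbiased guesses

error_of_average  = abs(estimates.mean() - true_weight)     # error of the crowd average
average_of_errors = np.abs(estimates - true_weight).mean()  # typical individual error

print(f"Absolute error of the crowd average:   {error_of_average:.1f} lb")
print(f"Average absolute error of individuals: {average_of_errors:.1f} lb")
# The crowd's average lands far closer to the true value than the typical
# individual guess, because over- and underestimates cancel out.
```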

However, the resulting average will still not converge on the true value unless any remaining systematic errors are also removed. This brings us back to the comparison between IMDb and Rotten Tomatoes.

Systematic errors in Rotten Tomatoes?

As mentioned earlier, IMDb calculates its movie ratings as a weighted average over all of its registered users’ votes, which is in essence using the wisdom of the crowd to determine how good or bad a movie is. You might ask: why use a weighted average instead of a simple average? If you refer back to the rating distribution of “Venom”, you will notice that the votes are concentrated between scores of 6 and 10. This is what is known as an asymmetrical, or skewed, distribution.

Different positions of the mode, median and mean in an asymmetrical distribution.

If the distribution is symmetrical, using the mode, median or mean as the central tendency does not make much of a difference. This was likely the case in Galton’s ox-weight estimation, as the median and mean were not far apart. In an asymmetrical distribution, however, the simple mean is pulled towards extreme values and does not reflect the concentration of votes on one side of the distribution. IMDb most likely uses its weighted average to account for this asymmetry, adjusting the simple average to give a more accurate reflection of where the bulk of the votes lie.
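The effect of skewness on the simple mean is easy to see with a made-up vote distribution concentrated at the high end, plus a small tail of 1-star votes (purely illustrative, not IMDb’s actual data):

```python
import numpy as np

# 100 made-up votes: mostly 6-10, plus a tail of 1s (e.g. vote stuffing).
votes = np.array([10]*30 + [9]*25 + [8]*20 + [7]*10 + [6]*5 + [1]*10)

print("mean:  ", votes.mean())      # 7.95 - dragged down by the tail of 1s
print("median:", np.median(votes))  # 9.0  - stays with the bulk of the votes
```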

This should establish how IMDb’s scoring system attempts to remove as much random error as possible. It is hard to say whether systematic errors are still present, but with 122k+ voters, the sample should hopefully take care of any bias in representativeness. Where does that leave Rotten Tomatoes with only 282 approved critics? First, we already know that the percentage figure used in the Tomatometer is not an average rating, but simply the proportion of critics who gave the film a positive review. Second, even if we were to use the average critic rating on Rotten Tomatoes, we have to ask ourselves: are the approved critics on Rotten Tomatoes representative of the average movie-goer?

Systematic errors in Rotten Tomatoes may still be present even after removing random errors.

The most likely reason for the discrepancy between IMDb’s and Rotten Tomatoes’ ratings is that one reflects the general audience’s views while the other reflects the preferences of a very select group. Indeed, the average user rating on Rotten Tomatoes is closer to IMDb’s rating than the average critic rating is. In other words, the discrepancy is probably a manifestation of systematic bias in the critic reviews, which explains why they are not a consistent indicator of how well a movie will do at the box office. After all, despite Rotten Tomatoes’ negative critic reviews, “Venom” still outperformed many other popular superhero movies such as “Deadpool” and “Logan”.

* * * * * * * * * *

Does that mean the ratings from Rotten Tomatoes are not useful? Well… not exactly. If you identify with the views of the critics on Rotten Tomatoes, it just means that you share the same systematic bias as them, so their ratings will probably be more useful to you. But the main takeaway from this discussion is that the wisdom of the crowd is still very relevant today. The bracketing mechanism that worked for estimating the ox’s weight in 1906 can still produce a good estimate of a film’s true reception. That should serve as food for thought the next time you are looking for advice on which movie to catch.

References:

  1. Galton, F. (1907). Vox populi (The wisdom of crowds). Nature, 75(7), 450–451.
  2. Wallis, K. F. (2014). Revisiting Francis Galton’s forecasting competition. Statistical Science, 29(3), 420–424.
  3. Hooker, R. H. (1907). Mean or median. Nature, 75, 487–488.
  4. Harries, C., & Harvey, N. (2000). Taking advice, using information and knowing what you are doing. Acta Psychologica, 104(3), 399–416.
  5. Soll, J. B., & Larrick, R. P. (2009). Strategies for revising judgment: How (and how well) people use others’ opinions. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35(3), 780.


YS Chng

A curious learner sharing knowledge on science, social science and data science. (learncuriously.wordpress.com)