Computing a Bayesian Estimate of Star Rating Means

December 17, 2017

By Benjamin Bengfort

Consumers rely on the collective intelligence of other consumers to protect themselves from coffee pots that break at the first sign of water, bad food at the wrong restaurant, and stunning flops at the theater. Although occasionally there are metrics like Rotten Tomatoes, we primarily prejudge products we would like to consume through a simple 5 star rating. This methodology is powerful because it not only provides a simple, easily understandable metric, but people are also generally willing to reliably cast a vote by clicking a star rating without too much angst.

In aggregate, this is wonderful. Star ratings can minimize individual preference because the scale is not large enough to be too nuanced. After enough ratings, it becomes pretty clear whether or not something is a hit or a flop. The problem is, however, that many ratings are needed to make this system work because the quality of a 5 star rating depends not only on the average number of stars but also on the number of reviews.

Consider the following two items:

  1. Presto Coffee Pot — average rating of 5 (1 review).
  2. Cuisinart Brew Central — average rating of 4.1 (78 reviews).

Which item should be listed first in a list of items sorted by rating? It would seem clear that the Brew Central has more reviewers and that many of them are happy with the product. Therefore, it should probably appear before the Presto even though its average rating is lower.
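To make the problem concrete, here is a quick sketch of what sorting by the raw mean does to these two items (the numbers are the ones above; the tuple structure is just for illustration):

# (name, average rating, number of reviews) for the two items above
items = [
    ("Presto Coffee Pot", 5.0, 1),
    ("Cuisinart Brew Central", 4.1, 78),
]

# Sorting by the raw mean alone puts the single-review item first.
for name, mean, count in sorted(items, key=lambda item: item[1], reverse=True):
    print("{}: {} ({} reviews)".format(name, mean, count))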

We could brainstorm a couple of strategies that would give us this outcome and would help us deal with items that have an insufficient number of ratings when we come across them in the future:

  • We could simply list the number of reviews in addition to the average rating. However, this doesn’t help with sorting the products.
  • We could set a floor of n number of ratings before a product is shown, but this decreases the likelihood that a product will be reviewed because it remains unseen. Not only that, but how do we choose n?

Instead, we need some empirical metric for the average rating that embeds the number of reviewers into the score. Most methodologies in use today use Bayesian estimation, which takes into account the fact that we don’t have enough data to make an estimation via the mean and also incorporates all the data we have about other observations.

In this post, we will consider how to implement a Bayesian estimation of the mean of star reviews with a limited number of observations. The estimation is useful for recommender services and other predictive algorithms that use preference space measures like star reviews. It is also a good crash course in Bayesian estimation in general for those that are interested.

Before we get going, I would like to unashamedly cite Paul Masurel’s article Of Bayesian Average and Star Ratings as the inspiration for this post. This post will review some of his work and look more specifically at the Movielens data set of star reviews. If you’re interested in more math, please read his post!

Our Movie Ratings Dataset

In this post, we will be working with the MovieLens data set, which contains 100,000 ratings from 1,000 users on 1,700 movies. This data set is freely available for download from GroupLens.org, the official site of Social Computing Research at the University of Minnesota. A simple download is all that’s required for ingestion, but some wrangling needs to occur.

The first thing we need to do is join the u.data and the u.item data sets into a single, denormalized data set that contains the user_id, movie_id, rating, and title fields. We can use pandas easily enough to do this for us. We won't need to do this more than once, so we can simply use an interpreter to perform our data wrangling. In the same directory as your data, type python to open up the interpreter.

>>> import pandas as pd
>>> udata = pd.read_csv('u.data', names=('user_id', 'movie_id', 'rating'), sep='\t', usecols=[0, 1, 2])
>>> uitem = pd.read_csv('u.item', names=('movie_id', 'title'), sep='|', usecols=[0, 1], encoding='latin-1')
>>> ratings = pd.merge(udata, uitem, on='movie_id')
>>> ratings.to_csv('ratings.csv', index=False)

This block of code loads two pandas DataFrames: one from the u.data TSV (tab separated values) file, which contains user and ratings information, and one from the u.item file, which contains movie information, is pipe delimited, and uses latin-1 encoding. Using the pd.merge function we join the two data sets on movie_id (an inner join, the pandas default) to produce a denormalized data set that we write out to another CSV file.

We will use the resulting ratings.csv file for the rest of our analyses. To ensure that we can create a quality, extensible analysis of the movie ratings, create a file to hold all of your Python code. We will create a class to wrap all the work we expect to do on these ratings. In a file called stars.py, add the following code:

import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from collections import Counter
from operator import itemgetter

RATINGS = os.path.join(os.path.dirname(__file__), 'ratings.csv')


class Ratings(object):

    def __init__(self, path=RATINGS):
        self.path = path
        self.load()

    def load(self):
        self.data = pd.read_csv(self.path)

    def __str__(self):
        return str(self.data.head())


if __name__ == "__main__":
    ratings = Ratings()
    print(ratings)

After importing all the needed libraries (the repository for this post contains a requirements.txt file with all of the external dependencies), this code creates a relative path to the CSV file, presuming that the CSV and the Python script are in the same directory. Our class initially does two things: it loads the data into a DataFrame, and when printed, it reports the first several rows of that file. Using the ifmain block, we can test our work: execute the file by typing python stars.py on the command line. If everything is working, you should see the first few rows from the ratings CSV.

Some Initial Data Exploration

With the logistics out of the way it’s time to begin our analysis. Let’s find the top 10 movies by average rating. Add the following methods to the class:

class Ratings(object):

    ...

    @property
    def movies(self):
        """
        Returns the data grouped by movie title.
        """
        return self.data.groupby('title')

    def get_means(self):
        return self.movies['rating'].mean()

    def get_counts(self):
        return self.movies['rating'].count()

    def top_movies(self, n=10):
        grid = pd.DataFrame({
            'mean': self.get_means(),
            'count': self.get_counts()
        })
        # Sort ascending by mean and take the last n rows
        # (.ix is deprecated in modern pandas).
        return grid.sort_values('mean').tail(n)


if __name__ == "__main__":
    ratings = Ratings()
    print(ratings.top_movies())

When you execute stars.py again, a problem with our current approach quickly becomes apparent: movies with high average ratings but very few reviewers are listed first. The top movies (Aiqing wansui (1994), Star Kid (1997), Santa with Muscles (1996), and more) all have a mean rating of 5 stars but also far fewer than 10 reviewers each.

                                                   count  mean
title
Aiqing wansui (1994)                                   1     5
They Made Me a Criminal (1939)                         1     5
Great Day in Harlem, A (1994)                          1     5
Saint of Fort Washington, The (1993)                   2     5
Entertaining Angels: The Dorothy Day Story (1996)      1     5
Someone Else's America (1995)                          1     5
Star Kid (1997)                                        3     5
Santa with Muscles (1996)                              2     5
Prefontaine (1997)                                     3     5
Marlene Dietrich: Shadow and Light (1996)              1     5

In order to get a feel for our data, let’s visually compare the mean rating to the counts. Add the following code to your class and adapt the ifmain statement accordingly:

class Ratings(object):

    ...

    def plot_mean_frequency(self):
        grid = pd.DataFrame({
            'Mean Rating': self.movies['rating'].mean(),
            'Number of Reviewers': self.movies['rating'].count()
        })

        grid.plot(x='Number of Reviewers', y='Mean Rating', kind='hexbin',
                  xscale='log', cmap='YlGnBu', gridsize=12, mincnt=1,
                  title="Star Ratings by Simple Mean")

        plt.show()

When you plot the simple mean against the number of reviews for the entire movie ratings data set, a pattern clearly begins to emerge. The result of the plot should appear as follows:

[Figure: hexbin plot of mean rating versus number of reviewers, "Star Ratings by Simple Mean"]

This figure shows that the average rating of a movie is actually slightly higher than 3. There is a strong cluster of movies with approximately 100 reviewers whose simple mean rating surrounds 3.5, not 3. This contrasts with the left side of the figure, where you can clearly see that the mean is 3 when there are very few reviewers.

The figure visually calls our attention to a problem. When there are fewer than 10 reviewers, there are dramatic tails at both the high and low ends of the rating scale. With fewer than 10 reviews, the probability is much higher that a movie will have an extreme average rating like 1.0 or 5.0, whereas after 10 reviews, the figure shows that it is extremely unlikely to see a movie with an average rating less than 1.5 or greater than 4.5.
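We can check this in the data directly. The following is a quick sketch using the get_means and get_counts methods we added above, run after ratings = Ratings() as in our ifmain block; the cutoff of 10 is just the eyeballed threshold from the figure:

# Compare the spread of mean ratings below and above a 10-review cutoff.
grid = pd.DataFrame({
    'mean': ratings.get_means(),
    'count': ratings.get_counts()
})

few = grid[grid['count'] < 10]
many = grid[grid['count'] >= 10]

print("under 10 reviews: min={:.2f} max={:.2f}".format(few['mean'].min(), few['mean'].max()))
print("10+ reviews:      min={:.2f} max={:.2f}".format(many['mean'].min(), many['mean'].max()))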

We can hypothesize two things from this plot that allow us to make decisions for our specific data set. First, we become confident that a mean rating is good so long as it has between 50 and 120 reviews. Secondly, we know that the average review is between 3.0 and 3.5 when a movie has that number of reviews. Using these two hypotheses, we can come up with a rule that will allow us to rank movies according to both the mean rating the movie has received as well as the number of reviewers that have reviewed it. We can formalize these hypotheses with Bayesian estimation.

Bayesian Estimation

Bayesian estimation attempts to codify our hypotheses by allowing us to forgo computing a direct value from a limited number of observations and instead creates a probability distribution that describes the observable space more completely. We can then use this probability distribution to come up with a better estimate for our computation. That probability distribution can also be used to assign confidences or bounds that can lead to better decision making.

As you might (hopefully) remember from your statistics classes, or from this DDL post, in order to estimate probabilities you generate observations.

Let’s look at a simple example. Suppose we want to estimate the likelihood that someone will give a movie a thumbs up. We would model this with a probability X, the probability that a person gives the movie a thumbs up, and initialize it to 0.5. We would then go stand outside a theater where that movie was playing and ask people leaving the theater whether they would give the movie a thumbs up or a thumbs down. The result would be a collection of observations that we would use to update our X, making it more accurate. The updated distribution of X based on O, the sequence of observations after N tests, is called the posterior distribution.

Bayes’ formula tells us how to compute the probability of X given O:

P(X|O) = P(O|X) · P(X) / P(O)

P(O) is the probability of the observations, and our prior, P(X), is what we believe the distribution of the probabilities X is before seeing any data. When you don't know anything about the probabilities, you can start with the assumption that there is an underlying uniform distribution where all possible values of the random variable X are equally likely (other choices are possible, but that's for another day). Therefore, we are only interested in the proportionality relationship between P(X|O) (read "the probability of X given O") and P(O|X) (read "the probability of O given X"). The probability of O given X (P(O|X)) is called the likelihood and can be computed by taking the product of the probability of each observation (assuming they are independent):

P(O|X) = P(O_1|X) · P(O_2|X) · … · P(O_N|X)

For each observation, the probability of a thumbs up is X and the probability of anything other than a thumbs up is 1−X, so if we observe K thumbs up in N observations, we can derive the posterior probability as follows:

P(X|O) ∝ X^K · (1−X)^(N−K)

This is called the binomial distribution. As we collect more observations, the distribution becomes increasingly refined and, with that refinement, we can compute an interval (usually called the credible interval) within which the value probably lies.
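As a minimal sketch of how this posterior can be evaluated, consider a simple grid approximation of the unnormalized form above (this is not the code behind the figures referenced below, which is linked in the note):

import numpy as np

def binomial_posterior(k, n, points=1001):
    """
    Evaluate the unnormalized posterior X^K * (1 - X)^(N - K) on a grid
    of candidate probabilities, then normalize it to sum to one.
    """
    x = np.linspace(0, 1, points)
    unnormalized = (x ** k) * ((1 - x) ** (n - k))
    return x, unnormalized / unnormalized.sum()

# With 4 thumbs up out of 5 observations, the posterior peaks at 0.8
# but remains wide; more observations will sharpen it.
x, posterior = binomial_posterior(k=4, n=5)
print(x[posterior.argmax()])  # 0.8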

Back to our thumbs-up and thumbs-down example: we’re standing outside of the movie theater. As our first person comes out, we get a thumbs up! Our posterior distribution is based on K=1, N=1, which is plotted in the first pane of the figure below. After 5 observations we have 4 thumbs up and one thumbs down, after 10 observations we have 7 thumbs up, after 25 we have 18, after 50 we have 37, and after 100 we have 72 (this is a really big theater)! The distributions at each point of the observation are plotted in the panes below:

[Figure: posterior distributions after 1, 5, 10, 25, 50, and 100 observations]

As you can see, the more observations we get, the more refined our posterior distribution becomes, and the more confident we are in the interval surrounding the value with the maximum probability of the distribution. In fact, for this task, I created an observation data set with a true thumbs-up probability of 0.68; given the vagaries of random number generation, the observed frequency after 100 observations was 0.72, and the posterior assigns high probability to values near it.

Note: The code to generate these posterior distributions given some number of observations is available here.

Joint Distributions

To use Bayesian estimation to compute the posterior probability for star ratings, we must use a joint distribution. We are not estimating the distribution of some scalar value X but, rather, the joint distribution of the probabilities that a reviewer gives the movie a 1, 2, 3, 4, or 5 star rating (not just a simple thumbs up or down). In this case, the random variable follows a categorical distribution: it takes some value in {1, 2, 3, 4, 5} with probabilities as follows:

P(star = i) = p_i for i in {1, 2, 3, 4, 5}, where p_1 + p_2 + p_3 + p_4 + p_5 = 1

Our reasoning for the binomial distribution still applies, however, and we can compute the proportionality relationship between our probabilities and observations in a similar way, replacing X with p_1, p_2, p_3, p_4, p_5. Our likelihood is still the product of the probabilities of each observation, and each individual observation is given by its associated probability. We can compute our posterior probability with N observations across counts K_1, K_2, K_3, K_4, K_5 for the five categories as follows:

P(p_1, …, p_5 | O) ∝ p_1^K_1 · p_2^K_2 · p_3^K_3 · p_4^K_4 · p_5^K_5

This is not a binomial distribution, but rather a multinomial distribution, with a parameter that can be expressed as follows:

(K_1, K_2, K_3, K_4, K_5) ~ Multinomial(N; p_1, p_2, p_3, p_4, p_5), where K_1 + K_2 + K_3 + K_4 + K_5 = N

If we include our prior as a distribution of the exact same form in the proportionality equation (e.g. a Dirichlet distribution with parameter ⍺0), we can factorize the distribution as follows:

P(p_1, …, p_5 | O) ∝ p_1^(K_1 + ⍺0_1 − 1) · p_2^(K_2 + ⍺0_2 − 1) · p_3^(K_3 + ⍺0_3 − 1) · p_4^(K_4 + ⍺0_4 − 1) · p_5^(K_5 + ⍺0_5 − 1)

This is another Dirichlet distribution with another parameter, ⍺1, that can be characterized as follows:

⍺1_i = ⍺0_i + K_i for each star value i from 1 to 5

And this is the distribution which we can use to estimate the posterior likelihood of our joint probability given K_i categorical observations, as we did previously using the binomial distribution.
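In code, this conjugate update is just element-wise addition of the observed counts to the prior parameter. Here is a sketch in which the prior and the counts are made up for illustration:

import numpy as np

alpha0 = np.array([1, 1, 1, 1, 1])   # uniform Dirichlet prior over the 5 stars
K = np.array([0, 1, 3, 7, 4])        # hypothetical counts of 1-5 star votes

alpha1 = alpha0 + K                  # the posterior Dirichlet parameter
print(alpha1)                        # [1 2 4 8 5]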

Expected Averages

Because we are primarily interested in the estimate of the average star rating, we can rephrase our problem as, “What is the expected value of the average rating given a posterior in the shape of our Dirichlet distribution?”

The average value of a categorical random variable is the weighted average of its values, weighted by their respective probabilities. In other words, it is the sum, over each star value from 1 to 5, of that star’s value multiplied by the probability of receiving that star. So, for the categorical variable of star ratings, the average value would be:

avg = 1·p_1 + 2·p_2 + 3·p_3 + 4·p_4 + 5·p_5

The expected value of the average rating based on the posterior is then computed for our star ratings as follows:

E[avg | O] = 1·P(star = 1 | O) + 2·P(star = 2 | O) + … + 5·P(star = 5 | O)

Using our Dirichlet distribution, we can compute the probability of a star value given our observations as the ratio of the Dirichlet parameter for that star to the sum of the Dirichlet parameters:

P(star = i | O) = ⍺1_i / (⍺1_1 + ⍺1_2 + ⍺1_3 + ⍺1_4 + ⍺1_5)

This probability can be plugged into our expected value with a little simplification as follows:

E[avg | O] = Σ_i i·(⍺0_i + K_i) / Σ_j (⍺0_j + K_j)
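In code, this expected value is a one-liner. Here is a sketch reusing the hypothetical posterior parameter from the update example above:

import numpy as np

alpha1 = np.array([1, 2, 4, 8, 5])   # posterior parameter from the sketch above
stars = np.arange(1, 6)

# Star values weighted by alpha1_i / sum(alpha1), summed over the 5 stars.
expected_average = (stars * alpha1).sum() / alpha1.sum()
print(expected_average)  # 3.7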

Take a minute to digest this formula, and expand out the summations if you need to. There are only 5 things you can sum, one for each star. In his post, Paul Masurel goes into a bit more detail and summarizes the Bayesian average as follows:

  1. Select some m that represents a prior for the average of stars.
  2. Select a C that represents our confidence in the prior.

This fits in very well with the context of our hypothesis: m is the average rating we can expect for a movie, given some minimum number of observations, C. Simply eyeballing our hexagonal histogram, we could estimate m to be 3.25 and C to be 50. The Bayesian average would then be:

bayesian_average = (C·m + sum of the N observed star ratings) / (C + N)

where N is the number of reviews and C and m are our confidence and prior, respectively. Let's update our Ratings class to reflect these new insights:

class Ratings(object):

    def __init__(self, path=RATINGS, m=None, C=None):
        self.path = path
        self.prior = m
        self.confidence = C
        self.load()

    def bayesian_mean(self, arr):
        if not self.prior or not self.confidence:
            raise TypeError("Bayesian mean must be computed with m and C")

        return ((self.confidence * self.prior + arr.sum()) /
                (self.confidence + arr.count()))

    ...

    def get_bayesian_estimates(self):
        return self.movies['rating'].agg(self.bayesian_mean)

    def top_movies(self, n=10):
        grid = pd.DataFrame({
            'mean': self.get_means(),
            'count': self.get_counts(),
            'bayes': self.get_bayesian_estimates()
        })
        return grid.sort_values('bayes').tail(n)
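To exercise the new estimator, the ifmain block can be updated along these lines, passing in the m and C values we eyeballed from the hexbin plot:

if __name__ == "__main__":
    ratings = Ratings(m=3.25, C=50)
    print(ratings.top_movies())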

If you run the updated code, you should get the following top 10 ratings:

                                           bayes  count      mean
title
One Flew Over the Cuckoo's Nest (1975)  4.125796    264  4.291667
Raiders of the Lost Ark (1981)          4.145745    420  4.252381
Rear Window (1954)                      4.167954    209  4.387560
Silence of the Lambs, The (1991)        4.171591    390  4.289744
Godfather, The (1972)                   4.171706    413  4.283293
Usual Suspects, The (1995)              4.206625    267  4.385768
Casablanca (1942)                       4.250853    243  4.456790
Shawshank Redemption, The (1994)        4.265766    283  4.445230
Star Wars (1977)                        4.270932    583  4.358491
Schindler's List (1993)                 4.291667    298  4.466443

Now that’s much better! This list definitely looks like a list of the most popular movies, and we have significant counts for each one. Note that the Bayesian estimate is pretty close to the computed mean, but they are not exactly the same. The Bayesian estimate combines a prior with a likelihood (from observations) to estimate the mean. This means that for movies with fewer than C ratings, the estimate is strongly influenced by the prior. For movies with more than C ratings, the observed mean has more influence than the prior, but the prior still has a little pull. As you can see, each of the estimates is slightly less than the mean, pulled toward our prior, m.
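A quick worked example makes that pull concrete. With m = 3.25 and C = 50, a hypothetical movie with a perfect 5.0 average over just 10 reviews scores (50 × 3.25 + 10 × 5) / (50 + 10) ≈ 3.54, dropping it well out of the top ten, while a movie averaging 4.3 over 400 reviews barely moves: (50 × 3.25 + 400 × 4.3) / (50 + 400) ≈ 4.18.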

Computing Dirichlets

If you begin to play around with the parameters of our estimation, m and C, you will notice that they change the top 10 list slightly. These parameters are actually very important, especially as we get to the middle of our cluster! Instead of just eyeballing them, we can compute the Dirichlet posterior directly from our observations by using a prior that specifies the number of expected votes for each category.

PRIOR = [2, 2, 2, 2, 2]


class Ratings(object):

    ...

    def dirichlet_mean(self, arr, prior=PRIOR):
        """
        Computes the Dirichlet mean with a prior.
        """
        counter = Counter(arr)
        votes = [counter.get(n, 0) for n in range(1, 6)]
        posterior = [v + p for v, p in zip(votes, prior)]
        N = sum(posterior)
        weights = [(i + 1) * p for i, p in enumerate(posterior)]

        return float(sum(weights)) / N

    def get_dirichlet_estimates(self):
        return self.movies['rating'].agg(self.dirichlet_mean)

    def top_movies(self, n=10):
        grid = pd.DataFrame({
            'mean': self.get_means(),
            'count': self.get_counts(),
            'bayes': self.get_bayesian_estimates(),
            'dirichlet': self.get_dirichlet_estimates()
        })
        return grid.sort_values('dirichlet').tail(n)

This methodology uses a uniform prior of 2 votes for each star rating, represented as a list of length 5 (one entry per star). We then collect the counts of each vote (i.e., the number of votes for each star rating) and add the prior values to the counts, which gives us our posterior parameters. We compute the weights as the products of each star value and its posterior count, then return a mean by dividing the weighted sum by the total number of votes.

Note: Sorry about the dense Python code; all the zip and enumerate calls do is walk multiple lists in lockstep, combining values as necessary. If you're unsure what is going on here, a great strategy is to print out each intermediary step to see what the counter looks like, what the votes and posterior look like, and so on.
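For example, tracing the intermediate values for a hypothetical handful of votes:

from collections import Counter

arr = [5, 4, 4, 5, 3]                                     # five made-up votes
prior = [2, 2, 2, 2, 2]

counter = Counter(arr)                                    # Counter({5: 2, 4: 2, 3: 1})
votes = [counter.get(n, 0) for n in range(1, 6)]          # [0, 0, 1, 2, 2]
posterior = [v + p for v, p in zip(votes, prior)]         # [2, 2, 3, 4, 4]
N = sum(posterior)                                        # 15
weights = [(i + 1) * p for i, p in enumerate(posterior)]  # [2, 4, 9, 16, 20]
print(float(sum(weights)) / N)                            # 3.4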

The top 10 movies if you run this code again now look as follows:

                                           bayes  count  dirichlet      mean
title
One Flew Over the Cuckoo's Nest (1975)  4.125796    264   4.244526  4.291667
Godfather, The (1972)                   4.171706    413   4.252955  4.283293
Silence of the Lambs, The (1991)        4.171591    390   4.257500  4.289744
Star Wars (1977)                        4.270932    583   4.335582  4.358491
Usual Suspects, The (1995)              4.206625    267   4.335740  4.385768
Wrong Trousers, The (1993)              4.104167    118   4.351562  4.466102
Close Shave, A (1995)                   4.108025    112   4.368852  4.491071
Shawshank Redemption, The (1994)        4.265766    283   4.395904  4.445230
Casablanca (1942)                       4.250853    243   4.399209  4.456790
Schindler's List (1993)                 4.291667    298   4.418831  4.466443

Clearly we have achieved the same effect as before: only movies with a significant number of votes are ranked in the top ten. Using this methodology, however, the estimates are significantly closer to the computed means, which has reordered the top ten list, and we no longer have to guess at a prior mean and confidence; instead we can simply use a prior that casts a uniform number of votes across all ratings.

Conclusion

Congratulations on making it through this long post! Hopefully you enjoyed the refresher of Bayesian statistics, and were able to dig into the Python code to analyze the MovieLens data set. At the very least, this post shows a data exploration and analysis technique using class-based Python and a hypothesis that a Bayesian model can be used to estimate or refine star rating predictions. Sometimes there is a disconnect in data science between writing code and doing some sort of statistical modeling, and we hoped to bridge the gap in this post.

Specifically, in this post we discussed the application of Bayesian techniques to estimate and rank preferences by 5 star rating. We discovered that using a simple mean is not enough because the number of observations contributes to the quality of a 5 star rating. Our hypothesis was that we could use a Bayesian posterior, specifically a Dirichlet posterior on our joint distribution of categorical ratings to refine our estimates given increasingly refined observations. We wrote a lot of code to view the top rated movies in the MovieLens data set according to the techniques we discussed.

Certainly these techniques can be improved further from here. For example, your ranking could implement a cooling factor, where newer votes contribute more significantly to the score of the movie than older votes. This would help restrain summer blockbusters and allow classics to continue to circulate to the top over time because people will forget about the blockbuster, but perhaps continue to rate the classics!
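As a rough sketch of what such a cooling factor might look like (the exponential decay and the one-year half-life are arbitrary assumptions, and you would need the rating timestamps from u.data, which our wrangling step dropped):

import numpy as np

def cooled_bayesian_mean(stars, ages_in_days, m=3.25, C=50, half_life=365.0):
    """
    Bayesian mean in which each vote's weight decays exponentially with
    its age, so newer votes pull the estimate harder than older ones.
    """
    weights = 0.5 ** (np.asarray(ages_in_days) / half_life)
    weighted_sum = (weights * np.asarray(stars, dtype=float)).sum()
    return (C * m + weighted_sum) / (C + weights.sum())

# Two 5-star votes: one from yesterday, one from three years ago.
print(cooled_bayesian_mean([5, 5], [1, 3 * 365]))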

Helpful Links

I hope you’ve enjoyed the post! I mentioned a lot of links and folks throughout the post, but have compiled them here for you to review at your leisure.

The code used in this post can be found on Github at the District Data Labs account, as well as the requirements.txt file that specifies all of the required dependencies:

  1. The stars.py code on Github
  2. Requirements
  3. Code for Thumbs Up Distribution

The data comes from GroupLens:

  1. MovieLens 100k

The inspiration for this post was Paul Masurel, and further reading can be found in his post (listed first) as well as the other posts listed below:

  1. Of Bayesian Average and Star Ratings
  2. How Not to Sort by Average Rating
  3. Hacker News
  4. How to Rank Products Based on User Input
  5. How to Sort by Average Rating

District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!
