How an Economist Thinks about Yelp Reviews

Brad Chattergoon
The Renaissance Economist
24 min read · Jan 8, 2024

The retail world is increasingly driven by online reviews for products, restaurants, and services. These can often be in-depth expert reviews by bloggers or specialty publications like TechRadar, Wired, or AppleInsider, but they are also increasingly reviews by crowds, sourced and managed via platforms like Yelp, Google Reviews, Amazon, and more.

In fact, these crowdsourced reviews can feel more relevant because they are more likely to be from people like us describing what our average Joe or Jane experience of the product might be like. Consider TechRadar’s “audio performance review” of the Apple AirPods Max. The summary at the top of the section reads: “A wide, well-balanced soundstage”, and “Support for Hi-Res Audio files is limited”.

Compare this to the most helpful review on Amazon for the audio: “The sound quality on these is an absolute beast, and it’s by far the best pair of headphones I’ve ever used. The audio is clear, and you can hear highs, lows, and everything in between. However, I will be the first to admit, while I am musically inclined and love listening to audio, I’m by no means an Audiophile or claim to be able to hear what types of sounds are being emitted cleanly and loudly, and which aren’t.”

If you’re a modal music enjoyer like myself, then the crowdsourced review answers the question you have as a customer: “Will I enjoy the sound of my music with these?”, while the expert review likely has you asking: “So, should I buy these or not?”

Crowdsourced reviews are, therefore, a major input into consumer decision making, and the following meme should resonate with almost everyone who has ever looked at an online crowdsourced review site before making a consumption decision.

Why is that?

For the image on the left you can probably verbalize a sentiment along the lines of “it could be manipulated” or “what if it’s just a few people who are friends of the business?”, while the image on the right gives you a sense that a lot of people have tried the product or service and there is some reliability in the rating.

It turns out that mathematics is on your side! There is a precise way to mathematically represent this intuitive feeling that you have about the reliability of the reviews. Importantly, it can also help answer a related question: at what number of reviews between 19 and 2287 should you switch from “not reliable” to “reliable”?

In order to ground the discussion I’m going to focus the ideas on one of the most impactful leaders in online reviews, Yelp.com.

Some background on Yelp.com

According to a paper by Michael Luca of Harvard Business School and Georgios Zervas of Boston University Questrom School of Business published in 2016, “Yelp alone contains more than 70 million reviews of restaurants, barbers, mechanics, and other services”, and at the time of the study “receives approximately 130 million unique visitors per month”.

In another paper by Luca, he finds the following effects from Yelp ratings based on data from restaurants in Seattle:

  • A one-star increase in a Yelp rating leads to a 5–9 percent increase in revenue.
  • The increase in revenue is primarily for independent restaurants with little to no effect on chains.
  • Even further, he finds that chain restaurants have declined in market share as Yelp has increased its restaurant listings.

These findings tell us the following story: as Yelp has grown its repository of reviews, consumers have become empowered to find information about the quality of restaurants that cannot rely on a larger brand reputation, such as the reputation possessed by chain restaurants, and this has reduced consumers’ risk aversion to trying a restaurant of unknown quality.

In other words, we’re more willing to try the niche Thai restaurant downtown instead of Chili’s when we can rely on others’ positive account of their experience at that Thai restaurant.

Clearly reviews and ratings have a meaningful impact on which businesses consumers choose to patronize, but how do consumers make these choices and are they optimal?

The Mathematics Behind the Average Rating

For this part of the discussion, let us make a couple of simplifying assumptions: all reviews are real and represent the reviewer’s true experience with a business, and consumers are just as likely to leave a Yelp review whether they have a good, bad, or neutral experience. We will discuss some details to consider if this is not the case later.

Yelp uses a familiar 5-star rating system for its listed businesses with 1 being the lowest rating possible and 5 being the highest. They have a suggested coding as follows:

  • 1 star — Not Good
  • 2 stars — Could’ve been better
  • 3 stars — OK
  • 4 stars — Good
  • 5 stars — Great

They then report an average review, and we again make a simplifying assumption that this is a straightforward simple average of all reviews ever reported on the Yelp business page.

How More Reviews Increase Reliability

Let us assume there is some “true” average rating for any given restaurant. In some sense, it is the average rating that we would find if we could open up the universe and peek at the underlying characteristics for that restaurant, or if every person in the world experienced and reviewed the restaurant. Those familiar with probability and statistics will know this as the “expected value”.

The first mathematical tool we need is called the Law of Large Numbers. The theorem states that as we observe values for a particular phenomenon, the simple average of these observed values, called the “sample average” or more commonly the “sample mean”, approximates the true mean when the size of the sample is very large and the approximation improves as the size of the sample increases.

A good example of where this is applied is in election polling. Let us say there is a hypothetical democratic state with 1 million voting age persons and with two political parties, the Lions and the Pandas. In this democracy everyone of voting age exercises their right to vote (as they should!) and so we will have 1 million votes in the upcoming election which will need to be counted once the voting period ends. Therefore there is indeed a “true” mean which is the fraction of the population that votes for the Pandas (or equivalently we can frame it as the fraction that votes for the Lions).

Imagine we work for a polling organization that wishes to generate a prediction for the election outcome ahead of time. The way to guarantee we are 100% correct in our prediction is to go talk to all 1 million people ourselves, effectively conducting a simulated election, but we do not have the necessary time or people to do that. Instead we talk to a sample of the population.

Let’s say we talk to 4 people (we will refer to this as 4 “observations”) and 3 of them say they are voting Panda while 1 says Lion. That gives us a sample mean of 75% of the votes for the Pandas. Now let’s say in one state of the world we pack it up and say “ok that’s enough”, then wait until election night and observe that the true vote allocation to the Pandas is only 40%. Time to start looking for a new job.

However, let’s say that instead of packing it up at 4 people, we kept on interviewing until we had talked to 100 people (i.e. got 100 observations). We find that 48 people are Panda supporters while 52 people are Lion supporters. We’re still not exactly on target, but we are much closer! In technical terms, the sampling error is smaller: we’re only off by 8 percentage points instead of a whopping 35.

Here is how a random sample of voters’ responses might approximate the true mean when the true mean is 0.4, 0.5, and 0.6.

As you can see, there is some distance from the true mean in each case at low sample sizes, but the approximation eventually becomes, and stays, very close to the true mean. The lingering gap in the 0.4 case is simply due to chance in the simulated data.

The sample mean would change if we happened to interview a different set of people, which means that our predicted result can vary dramatically when our sample is small. Here’s how the above diagram would look if we added in 3 more samples for each true mean.

As you can see, there is some disagreement across samples for each true mean at small sample sizes, but as the sample size increases, the sample means of all samples converge to the true value. This tells us that as the sample size becomes large, our sampling error reliably becomes small.
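If you want to see this convergence for yourself, here is a minimal simulation sketch along the lines of the figures above; numpy and matplotlib are assumed to be installed, and the seed and maximum sample size are illustrative choices of my own.

```python
# A minimal sketch of the kind of simulation behind the figures above.
# numpy and matplotlib are assumed to be installed; the seed, maximum
# sample size, and styling are illustrative choices, not the originals.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
max_sample_size = 1000

fig, ax = plt.subplots()
for true_mean in (0.4, 0.5, 0.6):
    votes = rng.binomial(1, true_mean, size=max_sample_size)  # 1 = Panda vote
    running_mean = np.cumsum(votes) / np.arange(1, max_sample_size + 1)
    ax.plot(running_mean, label=f"true mean = {true_mean}")
    ax.axhline(true_mean, linestyle="--", linewidth=0.5)

ax.set_xlabel("sample size")
ax.set_ylabel("sample mean of votes for the Pandas")
ax.legend()
plt.show()
```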

“But”, you may ask, “how small does it become?”

The second mathematical tool we use answers exactly this question.

How Reliable?

One of the most important, if not *the* most important, tools in statistics and quantitative science more generally is called the Central Limit Theorem (CLT).

In short, it says that under certain (fairly common) conditions the sample mean (i.e. sample average) follows a normal distribution, also known more commonly as “the bell-curve”.

In other words, in our hypothetical polling exercise above the sample mean, which is the fraction of people who are voting for the Pandas, follows a normal distribution (see Technical Note below).

The Normal Distribution isn’t just for Normies.

Arguably one of the most popularly known statistical distributions is the “bell-curve” whose official name is the Normal Distribution. It is a distribution that has a peak at the average value, which we call the “mean”, is symmetric about the mean, and has a smoothly decreasing value as we move away from the mean on either side. We refer to this decreasing property of the distribution as its “decay” and we refer to how quickly it decreases as its “rate of decay”.

The area under the curve represents the probability of a certain outcome. For instance consider the following Normal Distribution.

The total area under the curve is 1, since the total probability across all possible outcomes is scaled to equal 1. Therefore the area under the left half of the curve is 0.5, as the curve is symmetric around the mean, and for this normal distribution we have set the mean to be 0.

Next compare the difference between the previous normal distribution and the one below.

Having some trouble identifying the difference? I’ll overlay them on top of one another which should help.

The “spread” of the curves is different! In the first normal distribution, the value of the curve gets very close to 0 by the time we reach -3 and 3, while in the second normal distribution the value of the curve does not get close to 0 until around -6 and 6.

In the language we introduced above, the rate of decay differs between the two distributions even though the means and the overall shapes of the curves are the same. We have a specific measurement for this rate of decay, which we call the “standard deviation from the mean”, i.e. how far the distribution spreads out from the mean of the data. We usually shorten it to just “standard deviation”.

The mean and the standard deviation are such commonly used terms in probability and statistics that we have specific symbols to represent them. We borrow letters from the Greek alphabet and represent the mean by μ, pronounced “mew”, and the standard deviation by σ, pronounced “sigma”.

The standard deviation of the first normal distribution is smaller than in the second; in fact, the standard deviation of the second is exactly 2 times the standard deviation of the first. Specifically, for the first distribution the standard deviation is equal to 1 while for the second the standard deviation is 2. Every normal distribution is described fully by its mean and standard deviation and we can specify a normal distribution by writing N(μ, σ). Note that the usual convention is to specify a normal distribution by the mean and the squared standard deviation σ² (pronounced “sigma-squared”) and denoted as N(μ, σ²). We call the squared standard deviation the variance.

The standard deviation (σ) of the normal distribution has some “standard” *wink wink* properties as well. If we look at the area under the curve within 1 standard deviation of the mean on either side, this area is just over 0.68, which means about 68% of the probability identified by the distribution lies within 1 standard deviation (1σ) of the mean. This information is demonstrated visually below.

Recall that the standard deviation of the first normal distribution is 1 and for the second it is 2.

There are also corresponding probability values at 2σ and 3σ distance. The probability within 2σ of the mean on each side is just over 95%, while within 3σ on each side about 99.7% of the probability is covered by the curve (so a 99% interval corresponds to slightly less than 3σ, roughly 2.6σ, on either side). You can use this information to understand why the first distribution’s curve gets very close to 0 at -3 and 3 while the second distribution’s curve gets very close to 0 at -6 and 6.
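As a quick check of these coverage figures, here is a small snippet (standard library only) that computes the probability of landing within 1, 2, and 3 standard deviations of the mean:

```python
# Coverage of the normal distribution within 1, 2, and 3 standard
# deviations of the mean, using only the standard library.
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

for k in (1, 2, 3):
    coverage = normal_cdf(k) - normal_cdf(-k)
    print(f"within {k} sigma of the mean: {coverage:.4f}")
# within 1 sigma of the mean: 0.6827
# within 2 sigma of the mean: 0.9545
# within 3 sigma of the mean: 0.9973
```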

Technical Note: Strictly speaking, the distribution of the sample mean tends towards a normal distribution as the size of the sample goes to infinity. I will continue to say that the distribution of the sample mean “follows” a normal distribution for brevity and simplicity, but be aware of this distinction. Further, this result assumes that the standard deviation of the phenomenon being measured is known, which is usually not the case. These technical conditions lead to the use of a similar distribution called the Student-t distribution, which itself tends towards a normal distribution. As this article is meant to be conceptual, for the rest of the article I will simply use the normal distribution for the discussion, but the reader should be aware that there is a bit more going on here.

Now we return to the Central Limit Theorem.

The distribution of the sample mean follows a normal distribution with mean equal to the true mean as illustrated in our example from the Law of Large Numbers, but the question then becomes: What is the standard deviation/variance? This captures the same idea as asking, “how reliably does the sample mean approximate the true mean?”

When determining the standard deviation we first consider the variance and then take the square root. The variance of the sample mean is the variance of the underlying distribution divided by the size of the sample. This may be most easily understood by an example.

In our polling example, let us assume that the true mean is 0.5, i.e. 50% of the population will vote for the Pandas; then the probability that any randomly sampled individual person is voting for the Pandas is 0.5. This type of distribution is called a Bernoulli distribution, where a person is either voting for the Pandas or not, and it has a variance equal to the probability of someone voting for the Pandas multiplied by the probability of someone not voting for the Pandas. In this case that is 0.25, so the standard deviation is also 0.5. If the true mean were 0.4 then the variance would be (0.4 * 0.6 =) 0.24 and the standard deviation would be approximately 0.49.

To get the variance of the sample mean we divide this variance by the size of the sample, so if we have 4 observations in the sample then the variance is 0.25 divided by 4 which is 0.0625 or one sixteenth and the standard deviation of the sample mean is the square root of 0.0625 which is 0.25 or one quarter. If we have 100 observations then the variance is 0.25 divided by 100 which is 0.0025 and the standard deviation is 0.05.

Since we are dividing by the size of the sample and then taking the square root to obtain the standard deviation, the reader may have already concluded that the standard deviation of the sample mean shrinks by a factor of the square root of the sample size; 4 observations halve the underlying distribution’s standard deviation, while 100 observations scale it down by a factor of 10. This is one of the main inputs into understanding how reliable the average rating is: the first observations increase the reliability of the rating very quickly, but as more reviews come in, each additional review has a smaller and smaller impact on reliability. In other words, the 1000th review is not as valuable from a reliability perspective as the 20th review.
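As a quick illustration of this square-root scaling, here is the arithmetic for our polling example, where a single vote has a standard deviation of 0.5:

```python
# Standard deviation of the sample mean in the polling example, where a
# single Bernoulli(0.5) vote has standard deviation 0.5.
import math

sigma = 0.5
for n in (4, 100, 1000):
    standard_error = sigma / math.sqrt(n)
    print(f"n = {n:4d}: standard deviation of the sample mean = {standard_error:.4f}")
# n =    4: 0.2500 (half of 0.5)
# n =  100: 0.0500 (a tenth of 0.5)
# n = 1000: 0.0158
```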

Here is what the distribution of the sample mean looks like for various sample sizes of the polling exercise when the true mean is 0.5. Recall that the distribution will be normally distributed with mean of 0.5 and standard deviation of 0.5 scaled by the square root of the size of the sample.

There is some nuance here. The distributions pictured show how the sample mean would be distributed for each indicated sample size; the probability that any randomly sampled voter is voting for the Pandas is still 0.5, and that underlying distribution still has only 2 values, yes or no. Any given sample of the indicated size would have a sample mean distributed according to the corresponding plotted distribution.

To help illustrate this, let us assume the true mean in our polling exercise is 0.5; here is what the simulated data looks like if I take 200 samples of size 25 and plot the sample mean for each of those 200 samples.

Here is what the data looks like if I do the same thing but also add 200 samples of size 100.

You should notice the similarity between the trends for sample sizes of 25 and 100 in this figure and the trends in the figure of theoretical distributions above; more of the samples of size 100 have sample means close to the true mean than samples of size 25 do.
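Here is a rough sketch of how you could reproduce this repeated-sampling exercise yourself; it assumes numpy is installed, and the seed is arbitrary:

```python
# A sketch of the repeated-sampling exercise: 200 samples each of size 25
# and size 100 from a population whose true mean is 0.5.
import numpy as np

rng = np.random.default_rng(seed=1)
true_mean, n_samples = 0.5, 200

for sample_size in (25, 100):
    votes = rng.binomial(1, true_mean, size=(n_samples, sample_size))
    sample_means = votes.mean(axis=1)  # one sample mean per sample
    theory = (true_mean * (1 - true_mean) / sample_size) ** 0.5
    print(f"sample size {sample_size:3d}: "
          f"spread of sample means = {sample_means.std(ddof=1):.3f} "
          f"(theory: {theory:.3f})")
```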

We’re now ready to discuss the Average Rating!

The Average Rating

The average rating is, well, an average (simple average based on our assumptions). In other words if we have 5 ratings as follows: 5 stars, 4 stars, 5 stars, 3 stars, 3 stars, then the average is the sum, 20, divided by the number of ratings, 5, to get a 4 star average rating. This is the same as the “mean” of these ratings. Do these ratings also have a standard deviation? Yes they do!

As you might expect from a measurement whose full name is the “standard deviation from the mean”, the formula to calculate it involves measuring the distance from the mean. The procedure first calculates the squared standard deviation, i.e. the variance, and then takes the square root of the variance to get the standard deviation.

When we explored the standard deviation in the discussion of the CLT, we used a known property of the Bernoulli distribution that models the polling exercise, which is that the variance is the product of the probability of voting for the Pandas and the probability of not voting for the Pandas, and even further we spoke from the point of view of knowing exactly what that probability is.

In a real world situation where we are trying to determine the true mean by looking at observed data, we have to approach it a bit differently. Much like we have the “sample mean” which is the average of the ratings, we also have a “sample standard deviation” which we calculate starting with the “sample variance”. To calculate the sample variance we take the sum of the squared differences between the observations and the sample mean and then scale it by one less than the number of observations, as follows:

sample variance = s² = [ (x₁ - x̄)² + (x₂ - x̄)² + … + (xₙ - x̄)² ] / (n - 1), where x̄ is the sample mean and n is the number of observations; the sample standard deviation is then s = √(s²).

For this example set of ratings, the deviations from the mean of 4 are 1, 0, 1, -1, and -1; the squared deviations sum to 4, and dividing by (5 - 1) gives a sample variance of 1, so the sample standard deviation is 1.
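You can verify this with a couple of lines of Python; note that statistics.stdev uses the (n - 1) divisor described above:

```python
# Verifying the worked example of five ratings: 5, 4, 5, 3, 3 stars.
import statistics

ratings = [5, 4, 5, 3, 3]
print(statistics.mean(ratings))   # 4
print(statistics.stdev(ratings))  # 1.0
```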

Technical Note: For this discussion I am going to equate the sample standard deviation to the true standard deviation so that we can directly use the CLT to look at normal distributions with mean equal to the sample mean and standard deviation equal to the sample standard deviation, but this is not technically correct. However, since this article is meant to be conceptual, I proceed with this flaw as the concept is the same if we adhere to all the technical details.

Now we’re ready to examine how to think about a Yelp rating.

Consider the rating distribution for a restaurant on Yelp as shown below.

There are 1379 total reviews, and the distribution is very loosely:

  • 5 stars: 45%
  • 4 stars: 25%
  • 3 stars: 15%
  • 2 stars: 5%
  • 1 star: 10%

And the average review is 4 stars. We can therefore calculate the sample mean and the sample standard deviation: sample mean = 3.90, sample standard deviation = 1.30. Yelp rounds the sample mean to the nearest half star for easier reporting, which is 4 stars in this case, aligning with the rating shown.
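Here is a small sketch of that calculation using the approximate percentages above (my rough read of the histogram, not exact review counts):

```python
# Mean and standard deviation implied by the approximate rating shares
# quoted above; the shares are an eyeballed approximation, not exact counts.
star_shares = {5: 0.45, 4: 0.25, 3: 0.15, 2: 0.05, 1: 0.10}

mean = sum(stars * share for stars, share in star_shares.items())
variance = sum(share * (stars - mean) ** 2 for stars, share in star_shares.items())
std_dev = variance ** 0.5

print(round(mean, 2), round(std_dev, 2))  # 3.9 1.3
```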

Now let’s hold the distribution fixed but manipulate the number of reviews. This won’t change the sample mean, it will only change the sample standard deviation. Here’s what the distributions of the sample mean will look like at different sample sizes for this restaurant’s ratings distribution.

As you can see, they are all centered at the mean of 3.90, but the rate of decay increases with the number of observations so that more and more of the probability gets concentrated at the mean. This is most easily observed by looking at the points where each curve goes to 0, which get closer to the center as the sample size increases. We’ve already discussed the intuition behind why the normal distributions change in this way, but now we need to develop an intuition for asserting at which point the number of reviews goes from unreliable to reliable; in other words, at which point can we say we are confident in the reported average rating of 4 stars?

Since the ratings that we see are only a sample of the “true” distribution of ratings, there is uncertainty in what the “true” average rating is but via the Law of Large Numbers and the Central Limit Theorem we can assert that the average rating follows the normal distribution as discussed. This allows us to describe what we call the “confidence interval” for the true mean, which is the range of values for which we can be confident to some level the true mean falls into.

To develop this intuition let us consider just the curve that represents the distribution of the average rating (sample mean) in the case where we have only 10 reviews. You will recall from our discussion of the normal distribution that 95% of the probability is covered by the area within 2 standard deviations (2σ) of the mean on either side. Let’s show the 95% confidence interval below in the shaded area.

As you can see, the shaded area extends far enough from the mean (shown as a black line) to include values at 3.5 and 4.5, and even goes beyond them. This means that we have a reasonable likelihood that the “true mean” could be 3.5 stars or 4.5 stars, even though we would want to report 4 stars based on the sample mean. If we wanted to be 99% confident about the possible values of the true mean we would have an even larger range (roughly 2.6σ on either side) that also includes 3 stars.

This is how we can tell that for this sample mean, distribution of ratings, and sample size, the 4 star average rating is not reliable; we can’t say with 95% (or, even better, 99%) confidence that the only possible half star average rating is 4 stars. In other words, because there is a reasonable likelihood that the true mean could be 3.5, 4.0, or 4.5 stars, we cannot reliably say that it is 4.0 stars.

Let us see what the 95% confidence interval for the sample mean with 200 observations looks like.

It is very easy to see that the only half star rating in the 95% confidence interval is 4 stars and so we can reliably say that the average rating (in half star intervals) is 4 stars. Since 10 ratings clearly is too few and 200 ratings is clearly well above the needed number, there must be some number of ratings in-between at which point we transition from not reliable to reliable.

If we look at the plot of all the distributions for various sample sizes, we see that, among those sample sizes, the most likely candidate for the transition point is the distribution based on 50 ratings. Let us plot that distribution with its 95% confidence interval.

By looking at where the shaded confidence interval ends we observe that it does not include either 3.5 or 4.5 and so we can be 95% confident that the true rating to the nearest half star is 4.0. Can we be 99% confident of this rating at 50 observations? This is what the 99% confidence interval looks like at this sample size.

The shaded region overlaps with a 3.5 star rating so our 99% confidence does not eliminate the possibility of a 3.5 star rating for this sample size and we therefore cannot consider the 4.0 star rating to be 99% reliable for 50 observations.

When can we have 99% confidence? Consider a sample size of 75 observations.

In this case the 99% confidence interval shown does not cover 3.5, so we are 99% sure at 75 observations that the “true” rating is not 3.5 or 4.5 (or any other half star rating), and therefore, if we wish to use half star ratings as our scale, we can trust that the true rating is 4.0 with 99% certainty. In more technical terms, at this confidence level the data cannot statistically distinguish the true rating from 4.0 stars.

This finally gets us to a rule of thumb: if we want to be 95% sure that the rating shown on Yelp is accurate, then we want to see at least 50 reviews, but if we want to be 99% sure then we want to see at least 75 reviews. This identifies the point at which the average rating switches from unreliable to reliable depending on the specific level of reliability you want to have.

Note that this is just a heuristic, not a precise statement. To be more precise we would need to calculate the specific standard deviation implied by the rating distribution and then redo the calculations in the exercise above. Even further, 50 and 75 reviews are easy round numbers to use in a heuristic, but for this particular distribution of ratings we actually reach 95% confidence at 41 reviews and 99% confidence at 71 reviews. There may be other rating distributions for which 75 reviews is insufficient for 99% confidence, although 75 will be close to the specific cutoff that gives 99% confidence.
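For readers who want to reproduce the exact cutoffs, here is a sketch of the calculation. It uses the usual 1.96 and 2.576 multipliers for 95% and 99% confidence, and the fact that the nearest competing half-star value, 3.5, sits 0.4 stars below the 3.90 sample mean:

```python
# Reproducing the exact cutoffs: the 4-star label is safe once the
# confidence interval around the 3.90 sample mean excludes the nearest
# competing half-star value, 3.5, which is 0.4 stars away.
import math

std_dev = 1.30   # sample standard deviation of the ratings
margin = 0.40    # distance from 3.90 to the nearest half star, 3.5

for label, z in (("95% confidence", 1.96), ("99% confidence", 2.576)):
    # need z * std_dev / sqrt(n) < margin, i.e. n > (z * std_dev / margin)^2
    n_required = math.ceil((z * std_dev / margin) ** 2)
    print(f"{label}: at least {n_required} reviews")
# 95% confidence: at least 41 reviews
# 99% confidence: at least 71 reviews
```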

What about the simplifying assumptions?

As a reminder, here were the simplifying assumptions:

  1. All reviews are real and represent the reviewer’s true experience with a business
  2. Consumers are just as likely to leave a Yelp review whether they have a good, bad, or neutral experience
  3. The average rating is a straightforward simple average of all reviews ever reported on the Yelp business page

On (1): The main issue this introduces into the analysis of reliability is that the reported average rating itself is very likely to be incorrect, inflated by positive fake reviews or deflated by negative fake reviews, as we have seen in some recent cases (e.g. review bombing).

The part of the analysis that breaks down here is the assumption that as we get more reviews we tend toward the true average rating; the Law of Large Numbers fails us. This is because we are no longer “randomly sampling” from all possible reviews but rather over-sampling from the positive (or negative) part of the distribution of ratings. This is called a “biased sample”.

The type of statistical analysis explored in this article cannot help us solve this problem as it does not have a method for accounting for reviews which are fake vs real. However, other statistical methods can help us identify which reviews are likely to be real or fake.

Amazon is one of the pioneers in trying to assess whether reviews are real or fake using machine learning tools that can identify trends in fake or suspicious reviews. For example, if there is a flood of positive reviews on a new product or business page that are similarly worded, all written within the same small span of time, and all from brand new accounts, this might indicate they are fake, and online review sites can leverage these details to filter them out of the reviews used in calculating the average rating.

Once these fakes are removed, our original analysis will apply.

On (2): It is a well documented fact that consumers tend to be more likely to leave a review if they have an extreme experience, either positive or negative, than if they have a moderate or neutral experience. This is another type of bias introduced into the sample of reviews. It also has an effect on the standard deviation of reviews.

Since the standard deviation calculation takes into account the distance from the mean, even if the sample mean is accurate in the sense that it obeys the Law of Large Numbers and trends toward the true mean as the number of reviews increases, the distance from the sample mean will be artificially large because we are sampling reviews at the extremes of the distribution.

There is also very little we can do in our analysis of average rating reliability to address this issue. In an HBR article, Nadav Klein of INSEAD, Ioana Marinescu of the University of Pennsylvania, and Andrew Chamberlain and Morgan Smart of Glassdoor explore incentives to encourage a more random sample of reviewers and find that this approach reduces review bias. There are other studies that similarly explore debiasing by inviting a more random sample of consumers to leave reviews and find positive outcomes.

On (3): The main issue here is that reviews can, in a sense, go stale. Consider a restaurant. Over time, the wait staff may change, management may change, and even the menu or chef may switch. In circumstances like these, the average rating may not be informative because the majority of reviews for the restaurant may in fact be for a “different” restaurant, depending on how severe an effect the change has on the experience.

There is a way to adjust for this within our analysis; we can calculate a weighted average instead of a simple average. This means that more recent reviews will count more for the average rating than older reviews. The standard deviation calculation would change to incorporate the weighting, but the overall concept would be the same.
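As a purely illustrative sketch, and not Yelp’s actual method, a recency weighting could look something like the following, with each review discounted exponentially using a hypothetical one-year half-life:

```python
# A purely illustrative recency weighting (not Yelp's actual method): each
# review is discounted exponentially with a hypothetical one-year half-life.
def weighted_rating(reviews, half_life_days=365):
    """reviews: list of (stars, age_in_days) tuples."""
    weights = [0.5 ** (age / half_life_days) for _, age in reviews]
    weighted_sum = sum(w * stars for w, (stars, _) in zip(weights, reviews))
    return weighted_sum / sum(weights)

# A recent 5-star review counts more than an old 2-star review.
print(weighted_rating([(5, 30), (4, 400), (2, 1200)]))  # ~4.5
```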

Do consumers really internalize this calculation?

All this math and statistics is a great intellectual exercise but does this actually matter for how people actually behave in the real world?

Thanks to Michael Luca we have an answer! In one of his papers, Luca explores how revenue changes based on the average rating and the number of reviews. He first establishes a baseline effect from ratings, measured as the number of standard deviations from the mean rating of all restaurants in a sample from Washington state. He finds that a 1 standard deviation increase in the rating leads to an increase of 5.4% in revenue.

He then calculates how much more over this baseline a 1 standard deviation increase in the rating affects revenue depending on how many reviews the restaurant has. Here is what he finds:

Notice the big jump when we cross the threshold of 50 reviews. Unfortunately, this data does not help us pin down the effect at exactly 50 reviews because the category includes review counts greater than 50 as well, so it only allows us to say something about below 50 versus above 50. If the true threshold is above 50 it might be obscured by the aggregation of these reviews into a single category. Based on our analysis, if consumers internalize the statistical analysis, consciously or otherwise, at a 95% confidence level, we would expect a jump at 50 reviews, but we can’t say precisely whether this is happening. It could be that the trend would continue at 51–60 reviews, but this is the best data I could find.

Conclusion

We started off with a simple question: why do you agree with this meme? In order to answer it we’ve looked at two of the most important concepts in statistics, the Law of Large Numbers, and the Central Limit Theorem. These help identify why we don’t trust a small number of reviews and why we trust a large number, but they also help us understand at what point we might want to switch from not trusting to trusting.

Therefore the next time you see this very popular meme floating around online, you will have a better understanding of why so many people agree with it!

They also help us understand why polling people works. Perhaps the tweet below is no longer mysterious!

Thanks for reading! You can find me @bradchattergoon on Twitter and LinkedIn.
