Aggregating user ratings

Part I: Introduction

Bartek Borkowski
Fresha Engineering
3 min read · Jul 20, 2020



Many modern services allow end users to rate the goods or services they’ve bought. Apart from a text review (a free-form text box for arbitrary feedback), there is usually some kind of numerical or boolean rating. Text reviews are valuable on their own, but ratings are most convenient to aggregate. From a business-analysis perspective, visualizing the distribution might be interesting, but for sorting, recommending, or presentation purposes ratings are most useful when condensed to a single numerical value.

Gathering feedback using a k-star or yes/no system is a simplification. It forces people to project their full experience, which is a large set of random variables, onto a short, one-dimensional scale. The projection is purely subjective, so two people reviewing the same thing might give different answers. This transformation is lossy and complicated enough to be non-deterministic, which is a source of variance.

At Fresha, users review services booked through our system. Since every interaction between a service supplier and a customer is unique, variance would still be present even if the review process were fully deterministic. We’re dealing with data that has many sources of uncertainty. There is no “real” quantity that we’re estimating; the score is a convenient value that can be compared and allows ordering of the reviewed venues.

A range of statistical methods can be applied to aggregate a set of gathered ratings into a single value that best represents the actual quality of the services provided.

A simple arithmetic mean tends to be unstable until enough reviews are gathered. We can alleviate this instability by tracking the standard deviation along with the mean and approximating the rating with μ − aσ, where μ is the mean value, σ is the standard deviation of the mean (its standard error), and a is an arbitrary parameter related to the probability that the actual rating is better than our estimate. The latter approach favors sets that have more ratings and less dispersion.
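A minimal sketch of this idea in Python. The function name, the fallbacks for empty and single-element sets, and the use of the standard error of the mean (which is what makes larger sets score higher) are our illustrative choices, not a prescribed implementation:

```python
import math
import statistics

def lower_bound_score(ratings, a=1.0):
    """Score a rating set as mean minus `a` deviations.

    Sigma here is the standard error of the mean (stdev / sqrt(n)),
    so larger, more consistent sets score higher. `a` is the tunable
    parameter from the text: raising it penalizes small or dispersed
    sets more strongly.
    """
    n = len(ratings)
    if n == 0:
        return 0.0
    if n == 1:
        # A sample standard deviation needs at least two points;
        # falling back to the single rating is a judgment call.
        return float(ratings[0])
    mu = statistics.mean(ratings)
    sigma = statistics.stdev(ratings) / math.sqrt(n)
    return mu - a * sigma
```

Note that this also reproduces the caveat discussed below: with a = 2, the set [5, 1] scores −1, below its lowest recorded rating.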

Such an approach provides a value that can be used for sorting but probably shouldn’t be directly presented. A low rating count might produce a score lower than the lowest of the recorded responses. It’s fine not to promote a service with a couple of reviews, but users seeing a single 5-star rating and the resulting score of 3 might get confused and lose trust in the scoring system.

Scoring with μ − aσ assumes that ratings follow a normal distribution. Inspect your data to verify whether that’s the case (it isn’t at Fresha). The arithmetic mean will yield quite meaningless results if the distribution follows a less common pattern (e.g. polarized results: a similar number of very high and very low ratings).

More complex models can also assume that ratings follow a distribution and calculate the most probable parameters that describe it with a certain credibility level (see: https://www.evanmiller.org/ranking-items-with-star-ratings.html).
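One simple member of that family (a sketch of a Bayesian average with a pseudo-count prior, not necessarily the method from the linked article; the parameter names and defaults are made up for illustration):

```python
def bayesian_average(ratings, prior_mean=3.0, prior_weight=5):
    """Pull the observed mean toward a prior belief.

    The score behaves as if `prior_weight` extra ratings of value
    `prior_mean` had been observed: small sets stay close to the
    prior, and large sets converge to their own arithmetic mean.
    """
    total = sum(ratings) + prior_mean * prior_weight
    count = len(ratings) + prior_weight
    return total / count
```

A single 5-star rating then scores about 3.33 rather than a full 5, which avoids the trust problem described above while still rewarding consistent high ratings as the set grows.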

Analysis of the existing solutions pointed us at the functional requirements we decided to satisfy while still being able to process the data in a streaming manner. The score should:

  • gradually “forget” older data points (since the value that we’re trying to estimate is not constant in time)
  • assign lower values to sets with a low element count, especially when the recent element count is low

Very roughly, our calculation could be represented as a time-decayed average multiplied by a popularity-based decay factor.

Some methods assign higher values based on the strength of the central tendency, but we’ve decided not to take variance into account due to the characteristics of the observed data.

Most definitions of averaging and popularity decay functions are parameterized. These parameters are knobs that allow tuning the behavior of our rating algorithm. Gather some production data and inspect the resulting sort orders for different parameter sets. If the chosen aggregates don’t provide enough flexibility, additional values can be collected to determine spread, confidence, and popularity.
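To make the “knobs” concrete, here is a toy streaming aggregate with a simple exponential decay and a saturating popularity factor. The class name, the update rule, and both parameter defaults are our assumptions for illustration, not the production algorithm:

```python
class DecayingRating:
    """Streaming rating aggregate with exponential forgetting.

    `decay` controls how fast old ratings are forgotten;
    `popularity_weight` controls how hard small or stale sets
    are penalized. Both are tuning knobs, not fixed constants.
    """

    def __init__(self, decay=0.95, popularity_weight=5.0):
        self.decay = decay
        self.popularity_weight = popularity_weight
        self.weighted_sum = 0.0  # decayed sum of ratings
        self.weight = 0.0        # decayed count of ratings

    def add(self, rating):
        # Every new rating shrinks the influence of older ones.
        self.weighted_sum = self.weighted_sum * self.decay + rating
        self.weight = self.weight * self.decay + 1.0

    def score(self):
        if self.weight == 0.0:
            return 0.0
        mean = self.weighted_sum / self.weight
        # Popularity factor in [0, 1): grows with the decayed count.
        popularity = self.weight / (self.weight + self.popularity_weight)
        return mean * popularity
```

With these defaults, a venue that keeps receiving 5-star ratings sees its decayed mean stay at 5 while the popularity factor climbs toward a decay-dependent ceiling, so the score rises with sustained volume but never reaches the raw mean.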

In the consecutive parts, we’ll go into the details of building a mechanism that satisfies our functional requirements:

  • part II introduces the idea of stream processing and the decaying-window algorithm,
  • part III extends the stream processor to take variable time intervals into account,
  • part IV adds the popularity-based decay function component to the equation.

In this mini-series, I’d like to show our research at Fresha (in mostly charted waters) on building streaming review accumulators.
