Five Stars or Failure: How Ratings Mislead

Published in Thinking Product, Jul 22, 2016

In the Sunday New York Times of May 24, 2015, Maureen Dowd laments her Uber passenger rating of 4.2, which is apparently low enough that she sometimes has trouble getting a car. This makes it clear that Uber’s 1-through-5 star system doesn’t really have five possible ratings. It has two: 5, and everything else.

I first came across a system like this when taking my car to the dealership for service. When I picked it up the technician mentioned that I would likely get a phone call asking me to rate the service I had received on a scale of 1–5 stars, and, as he cheerfully pointed out, “Anything less than five stars is considered failure, so I hope I’ve given you five-star service today.”

No Room for Nuance

My wife doesn’t like to give every Uber driver a 5. She considers a 4 to be indicative of Uber’s generally very good service, and likes to reserve a 5 for drivers who are really excellent (because they have water bottles in the back, or get the door for her, or are in other ways outstanding). But because of the way Uber’s system works, it’s really not fair for her to give a very-good-but-not-excellent driver a 4 — that’s failure in Uber’s book. In a system with only one “good” rating, there’s no place for “great.”
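To make the arithmetic concrete, here is a minimal Python sketch of how honest 4s add up. The cutoff of 4.6 is an assumption I picked purely for illustration, not a figure Uber has published, and the two riders are hypothetical.

```python
from statistics import mean

# Hypothetical cutoff, chosen only for illustration; not a published figure.
ASSUMED_CUTOFF = 4.6


def summarize(ratings):
    """Return (average, share of 5-star ratings) for a list of 1-5 star ratings."""
    five_star_share = sum(1 for r in ratings if r == 5) / len(ratings)
    return mean(ratings), five_star_share


# A rider who treats 4 as "very good" and saves 5 for "excellent".
nuanced_rider = [4, 4, 4, 4, 4, 4, 4, 4, 5, 5]
# A rider who gives 5 whenever nothing went wrong.
five_or_nothing_rider = [5] * 10

for label, ratings in [
    ("nuanced", nuanced_rider),
    ("five-or-nothing", five_or_nothing_rider),
]:
    average, share = summarize(ratings)
    verdict = "passes" if average >= ASSUMED_CUTOFF else "fails"
    print(f"{label}: average {average:.1f}, five-star share {share:.0%}, {verdict}")
```

Both sets of rides could have been equally good. Under the assumed cutoff, the only thing that separates a passing driver from a failing one is which kind of rider happened to get in the car.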

Noisy Results

At least with a 5-or-nothing system, it’s pretty clear what everyone is expecting. If Uber declared that 3 is “acceptable,” 4 is “above average,” and 5 is “amazing,” there would still be some users who give a 5 when they are happy, and 4 or less when they are unhappy, and that would mess up the scale. The attempt to collect a more nuanced rating runs the risk of confusing everyone and giving a noisy, less informative result.
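A small Python sketch shows where the noise comes from. It assumes two hypothetical kinds of riders: those who follow the declared labels, and those who give a 5 whenever they are satisfied. The function names and the rider mixes are made up for illustration.

```python
from statistics import mean

# The declared scale from the paragraph above.
DECLARED_SCALE = {"acceptable": 3, "above average": 4, "amazing": 5}


def label_follower(ride):
    """A rider who uses the scale as declared."""
    return DECLARED_SCALE[ride]


def habit_rater(ride):
    """A rider who gives 5 whenever they are satisfied.

    Every ride on the declared scale is at least acceptable, so this is always 5.
    """
    return 5


def average_rating(ride, followers, habit_raters):
    """Average stars for identical rides, given the mix of the two rider types."""
    stars = [label_follower(ride)] * followers + [habit_rater(ride)] * habit_raters
    return mean(stars)


# The same "acceptable" ride, rated by three different mixes of ten riders.
for followers in (10, 5, 0):
    habit_raters = 10 - followers
    print(followers, habit_raters, average_rating("acceptable", followers, habit_raters))
```

The ride never changes, yet the average moves from 3 to 5 depending only on who is rating. That spread is noise, and it is as large as the entire range the declared scale was supposed to measure.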

Lack of Clarity

Even where a 5-star system works more or less as intended, the scale can be unclear. Netflix’s 5-star system is an example: it’s easy to find films rated near 5 stars, and near 1 star. But that doesn’t make the meaning of a rating clear. Does 3 stars mean average quality, or is the scale centered closer to 4 stars? Does the star rating assess the film on an absolute scale, or within its genre? Roger Ebert said that he awarded stars based on how well a film fulfilled its role within its genre: The Blair Witch Project clearly isn’t the same sort of film as Apocalypse Now, but he gave them both 4 stars.

Solution: use color or other visual aspects to indicate the middle of the scale

This Happy or Not feedback collector I found at Keflavik airport in Iceland uses a range of colors and happy/sad faces to cue the user to what the scale is. (Yes, I take photos of user interfaces I find interesting.)

Solution: break the review into two steps

This works much better on the web or in an app. First, just ask if the user is satisfied, yes or no. Then follow up with a rating of how happy or unhappy they are, or with a request for clarification. This could clarify the unclear scale of Netflix’s star ratings.
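As a rough sketch of what the two steps could collect, here is one way to model the answer in Python. The class name, the three intensity levels, and the signed score are all my own assumptions for illustration, not an existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoStepReview:
    """One answer from a hypothetical two-step review form."""

    satisfied: bool    # step 1: "Were you satisfied?" yes or no
    intensity: int     # step 2: 1 = slightly, 2 = moderately, 3 = very (assumed levels)
    comment: str = ""  # optional clarification from the follow-up

    def __post_init__(self):
        if self.intensity not in (1, 2, 3):
            raise ValueError("intensity must be 1, 2, or 3")

    def signed_score(self):
        """Positive for satisfied users, negative for unsatisfied ones."""
        return self.intensity if self.satisfied else -self.intensity


reviews = [
    TwoStepReview(satisfied=True, intensity=3),
    TwoStepReview(satisfied=True, intensity=1),
    TwoStepReview(satisfied=False, intensity=2, comment="Driver was late"),
]
print([review.signed_score() for review in reviews])
```

Because the first question fixes which side of the line the user is on, there is no ambiguous middle value: the sign carries the yes-or-no answer and the magnitude carries the nuance.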

Solution: text descriptions

On a five-step scale, something like:

Very unhappy
Unhappy
Neutral
Happy
Very happy

This barely works with five steps (is “neutral” the default, or is “happy”?), and adding more steps makes it worse. The often-used ten-step scale where only the very bottom and top are given descriptions is a complete failure: what’s the difference between a 6 and a 7 on a ten-step scale from “Completely Dissatisfied” to “Completely Satisfied”?

Solution: a minus-plus system

This is a simple alternative to the star system, similar to the Happy or Not system pictured above. Present the user with a simple — — + + like so:

This would have to be tested to see how well it works, and if people understand it, but it seems likely to parse even more quickly than the smiley faces used by Happy or Not.

Conclusion: don’t use stars

It’s clear that a simple five-star scale is problematic in multiple ways. Part of that is simply because the task itself is challenging: getting uniform responses from a variety of people is tough. But the star system also has fundamental issues of its own, and a better user interface can likely improve on them.

PS: I asked an Uber driver what my rating is. It’s 4.9, which means I can generally get a car, but also that somewhere along the way I came up short as a passenger.