How to rate complicated content?

One of the earliest and most persistent problems in digital media is how to rate a piece of content, to help a reader or viewer figure out if it’s worth their time. It’s especially interesting if this rating is public.

In practice, a rating is attached to a single piece of content. The rating needs to be clear on what it is measuring in a self-contained way — in other words, rather than ranking the content against other pieces of content in a comparison chart, say, a rating needs to stand on its own and make sense quickly with not a lot of context.

A simple rating system everyone understands. Except blue guy. What’s going on, blue guy?

There are lots of sites whose main job is to rate the quality of content they don’t own — Pitchfork, Rotten Tomatoes, etc. Their rating system is public because it is the core of their business. By comparison, sites that create their own content rarely offer public ratings that show the user the relative editorial quality of individual pieces of their own content (as in: this article we wrote is better than that article we wrote). Many sites use viewcount as a backhanded way to signal quality and relevance, but that’s unsatisfying and sometimes not even accurate.

So it’s an interesting challenge to imagine how a content creator could openly rate their own content in a way that accomplishes several goals:

• drives users to the content that is objectively the best

• drives users to the most relevant content for them — which might not be the objectively best thing

Behind these two goals are questions like:

• how do we rate a wide breadth of content? Could one rating style work equally well for a movie review, a long news story, and a video short?

• what behavior, what decisions will our rating help the user make? For example: 1. to watch vs. to leave the site. 2. to watch vs. to further explore the site. 3. to trust vs. mistrust the content

• do we personify the rater? In other words, should our rating come from an aggregation of “users like you” or from an editorial/curatorial authority?

• and the biggie: How do we use ratings to show what we value?

Review of options

Ratings come in many styles, from basic to complex.

1. A binary thumbs-up/down. This works for sites that host lots of content they don’t create. A simple “good versus bad” rating, aggregated from user votes, is applied to every video in YouTube’s library. YouTube tested several other rating systems before reaching this point.

A couple notes here: YouTube’s thumbs up/down ratings can be effective because they are generated by viewers who care about the content — you don’t go on YouTube and rate videos you don’t care about. It’s also fair to assume that the people who rate a Jake Paul video are using completely different rating criteria from people who rate elevator videos. A Jake Paul fan is giving a thumbs-up to a video that’s rude, funny, transgressive and cleverly filmed, while to the elevator fan a thumbs-up might indicate appreciation for access to a rare piece of equipment, even if the filming itself is cellphone quality. These ratings leverage a fan’s expertise and expectations within niche markets, and the numbers don’t transfer across niches — if you love a 93% upvoted Jake Paul video, it doesn’t mean you’re going to love a 93% upvoted elevator video.

A simple binary rating will create a “watch / not watch” decision. In other words, a binary rating works on a content site that contains some content that’s not worth watching.
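To make that watch/not-watch signal trustworthy, raw up/down counts usually get smoothed before display: a 93% score from 15 votes is less reliable than a 93% score from 10,000. As a minimal sketch, here is the standard Wilson lower bound, a common ranking trick for exactly this problem, and not a claim about what YouTube actually runs:

```python
import math

def wilson_lower_bound(ups: int, downs: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the upvote ratio.

    Ranks binary-rated content conservatively: 9 ups / 1 down scores
    lower than 900 ups / 100 downs, because the small sample is less
    trustworthy, even though both are "90% liked".
    """
    n = ups + downs
    if n == 0:
        return 0.0
    phat = ups / n                       # raw upvote fraction
    denom = 1 + z * z / n
    centre = phat + z * z / (2 * n)
    margin = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n))
    return (centre - margin) / denom
```

Sorting by this lower bound instead of the raw percentage keeps a handful of early upvotes from outranking a well-tested favorite.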

2. Numerical / star rating. A rating of only one factor (like “good versus bad”), but with more gradations. We’re used to this from Uber, where passengers rate drivers from one to five stars. Yes, I’m thinking of drivers as the content to be rated. This system requires a good explanation of what, exactly, you are rating, or the numerical scale can be baffling and even hard to use — what’s the difference between a 2 and a 3, really?

Note that Uber’s app, once you’ve rated the driver, then has a second step where the user can add more information to the driver’s star rating, the “compliment” — factors like “excellent service” and “expert navigation.” Square uses this system too, with a basic smiley/frowny-face rating that you can then add specifics to. In practice, a simple star rating benefits from a chance to add more detail.

Read more about Uber’s compliments system in this brilliant post from The Rideshare Guy.
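As a data-model sketch, that two-step pattern is just a single-factor score plus optional detail tags that aggregate separately. The field names here are mine, not Uber’s:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Review:
    stars: int                                        # the single-factor rating, 1-5
    compliments: list[str] = field(default_factory=list)  # optional detail tags

def summarize(reviews: list[Review]):
    """Roll reviews up into an average star rating plus a tally of
    compliment tags, so the detail enriches rather than replaces
    the headline number."""
    avg = sum(r.stars for r in reviews) / len(reviews)
    tags = Counter(tag for r in reviews for tag in r.compliments)
    return round(avg, 2), tags
```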

3. Matrix. A rating system with two values instead of one: say, “good versus bad” AND “short versus long.” Matrixes (or matrices, whatever) allow for a richer story about the content. The two values can interact in clever ways, as in the “Approval Matrix” from New York magazine. While matrixes are most often used to compare lots of content in groups, a small matrix might also make an interesting graphical rating on an individual piece of content.
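Here is a tiny sketch of what a per-item matrix rating could look like; the two axes (quality and length) are illustrative stand-ins, not a proposal for the definitive pair:

```python
def quadrant(quality: float, length: float) -> str:
    """Place one piece of content in a two-axis matrix.

    quality runs from -1 (bad) to +1 (good); length runs from -1
    (short) to +1 (long). The label could be rendered as a dot on a
    small Approval-Matrix-style graphic next to the content.
    """
    q = "good" if quality >= 0 else "bad"
    duration = "long" if length >= 0 else "short"
    return f"{q}/{duration}"
```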

4. More complex matrixes like Altmetric’s, which creates a badge or donut that can show many kinds of measurements, color-coded to show at a glance where the content sits in a wider context. Altmetric uses social data to show where a piece of content is being shared and discussed. I also skimmed this interesting paper on a recipe-rating system that used six different types of content information to create a single rating. While it seems this work is leading to a recommendation algorithm hidden from the user, a public-facing rating might also be created. This Trello proposes a color-coding system (that requires a fair amount of expertise to understand). But the key is that all these factors roll up into a single unit, even if it’s graphically complicated, like the Altmetric donut.

An example of the Altmetric ‘donut’ badge for an Energy & Environmental Science article, borrowed from this press release from the Royal Society of Chemistry.
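The roll-up step itself can be as simple as a weighted average over normalized signals. The factor names and weights below are invented for illustration and have nothing to do with Altmetric’s actual formula:

```python
def rollup(factors: dict[str, float], weights: dict[str, float]) -> float:
    """Collapse several factor scores (each normalized to 0-1) into one
    number. The graphic on top (donut, badge, color bands) can still
    break that single unit back out into its parts."""
    total = sum(weights.values())
    return sum(factors[name] * weights[name] for name in weights) / total
```

For example, `rollup({"shares": 0.8, "citations": 0.5, "comments": 1.0}, {"shares": 1, "citations": 2, "comments": 1})` weights citations double and comes out to 0.7.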

5. A collection of separate individual ratings. Online recipes are commonly rated by several factors, including the time it takes to prepare, whether the techniques required are easy or complicated, and the number of comments from people who’ve tried the recipe. It’s assumed the user will look at several factors and make their decision using the factors that matter to them. For a piece of content, a collection of ratings might include both an editorial rating and user-created, passively collected ratings like viewcount and time-on-page. (Viewcount is often displayed, while time-on-page is usually not publicly displayed. Why not?)

6. Hidden ratings and algorithms. In which you don’t see the rating itself, just the result in what you end up being shown or served. Here’s one you could learn to write, I guess. It’s worth asking what aspects of these algorithmic ratings could be exposed as editorial content, to help users make choices and understand the choices being made for them. (With or without the behavioral-tracking data baked in; would this make the user uncomfortable, or more confident?) Some systems, like Netflix, combine a hidden ranking with a public one — the user sees the star rating for a movie, but the choices they’re served are determined by a hidden algorithm.
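The split is easy to see in code: the field that decides the order is not the field the user sees. A hypothetical sketch (nothing here reflects Netflix’s real system):

```python
def rank_for_user(catalog: list[dict], user: str) -> list[str]:
    """Serve items ordered by a hidden per-user score, while each item
    still displays its public star rating. The stars the user sees
    play no part in the ordering."""
    ranked = sorted(catalog,
                    key=lambda item: item["hidden_score"].get(user, 0.0),
                    reverse=True)
    return [f'{item["title"]} ({item["stars"]} stars)' for item in ranked]
```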

Who is thinking about this?

This is just from a fast review — eager to hear what is missing.

There’s interesting thinking on metrics in academic communities (read the Altmetrics manifesto; read the Metric Tide landing page) as a complement to traditional peer review or a way to measure effectiveness of research.

Frederic Filloux and his News Quality Scoring Project (NQS), “aimed at finding and quantifying signals that convey content quality.” His “Signals of Editorial Quality” is a useful read.

Of course, people who write algorithms think about ratings and metrics all the time, to power the hidden choices made by their tools; some of their ratings could be made public.