Designing Interfaces for Recommender Systems
I used to build recommendation systems and now I build user interfaces. Your user interface is everything that sits between what is going on in a user’s mind and what numbers you can put in your user-item matrices that get processed into recommendations.
So obviously the UI is critical in collecting signal. I have a framework for designing these interfaces: my goal is usually to collect signal that is predictive, unambiguous, and plentiful.
First, you want to make sure you’re collecting the right signal. A rating on a movie or the time spent reading an article is predictive of a user’s affinity for that item; whether or not they opened the comments section on a YouTube video isn’t a clear signal that they liked the video.
This is an oversimplified example, but in general your most predictive features will map to interfaces that clearly capture a user’s affinity for an item. Sometimes it’s explicit signal, such as a “like” button but at other times it’s implicit, such as time spent watching a video.
Designing the right interaction can help you collect cleaner signal. For example, if you expand text inline you’ll have a less clear idea of reading time than if you were to open it in a modal. The interface can go a long way in removing noise from the signal when it’s well designed.
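The modal example above can be sketched as a small aggregation over open/close events. This is only an illustration; the event shape (`user_id`, `story_id`, `action`, `timestamp`) is a hypothetical logging schema, not anything from a specific product:

```python
from collections import defaultdict

def reading_time_seconds(events):
    """Estimate per-(user, story) reading time from modal open/close events.

    `events` is a list of (user_id, story_id, action, timestamp) tuples,
    where action is "open" or "close". A modal gives clean boundaries:
    exactly one story is on screen between its open and its close, so
    the elapsed time is a far less noisy reading-time signal than
    inline expansion would give you.
    """
    open_at = {}                 # (user, story) -> timestamp of last open
    totals = defaultdict(float)  # (user, story) -> accumulated seconds
    for user, story, action, ts in sorted(events, key=lambda e: e[3]):
        key = (user, story)
        if action == "open":
            open_at[key] = ts
        elif action == "close" and key in open_at:
            totals[key] += ts - open_at.pop(key)
    return dict(totals)
```

Two separate reading sessions of the same story simply accumulate, and a stray "close" with no matching "open" is ignored rather than corrupting the total.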
One rule of thumb I have for designing UIs for signal collection is “can I explain user A’s interaction with the UI as a reason for suggesting this item to user B?” Some examples:
- Really good: “People who watched movie X watched Y”
  - “Watching” is the signal, and it’s really predictive on long-form content like movies.
- Decent: “Read by people like you”
  - “Reading time” is generally a good signal but can be hard to capture in some products, because people leave the viewport open, have two stories in the viewport, etc.
- Bad: “People like you clicked through on this story”
  - CTR can be a bad long-term predictor of affinity because it’s subject to clickbait that people regret. Facebook’s News Feed, for example, compensates for this by also using time on site as a signal.
The worst signals are ones that mean different things to different people.
A bad translation of the word “Upvote” into Italian could make liking a photo a lighter- or heavier-weight action for your friends in Italy than for your friends in the US. That means the signal coming from Italy is different from the US signal, and making predictions for US users based on Italian signals can end badly.
This problem isn’t just a cross-cultural thing. Often, content actions are confusingly named, so different people interpret them differently and thus use them differently. Say you “star” a story because you want to share it with people. Someone else uses “star” to save stories to read later. Someone else uses it for both, because it’s the only action they can find. This is a bad signal to collect because it’s ambiguous: when making predictions you know it’s intent A or intent B, but it’s really hard to tell which is which.
Twitter’s old “star” (favorite) signal was a good example of this. It was originally built as a “save” but most people used it as a “like” in practice, thus conflating the signal. Since then, they’ve moved to a Like/♥.
If both intents are genuinely valid then sometimes it makes sense to have multiple content actions, e.g. Medium’s separate Recommend and Save actions.
This is a tricky balance to strike, though. You can get more orthogonal signal by simply providing more content actions, e.g. Facebook Reactions, but this comes with increased product complexity, which often leads to more user decision paralysis and can then result in lower signal overall. So be careful when trying to ramp up the volume of signal you collect, and always have a preference for implicit signal (where the user doesn’t realize they’re feeding signal to the system).
The thing about collecting signal is that you want a lot of it; sparse data leads to overfitting. One of the first things you learn about in Machine Learning is the learning curve: the relationship between the amount of training data and model performance, largely irrespective of algorithm.
Test error gets better with training data (until a certain point) so you want to have user-item interactions that are as lightweight as possible (while keeping them predictive and unambiguous). Ideally, the interactions are so lightweight the user doesn’t even realize they’re giving signal to the system.
Some of the best signal comes from implicit actions, things that a user does that they don’t even realize is signal for the system, e.g. time spent watching a video or reading a piece of text, tapping a photo to expand it or clicking through to an article.
For example, this YouTube video only has 14,000 explicit actions but over 424k views, each with signal on how long it was watched, what parts were watched and where the dropoffs were. In this case, there’s a lot more to be learnt from the implicit signal than the explicit.
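The kind of per-view implicit signal described above can be summarized with a small helper. This is an illustrative sketch: the input format (a watch fraction per viewer) and the 10% early-dropoff threshold are assumptions of mine, not figures from any real player:

```python
def completion_stats(watch_fractions):
    """Summarize implicit watch signal for one video.

    `watch_fractions` is a list with one entry per view: the fraction
    of the video that viewer watched (0.0-1.0), e.g. derived from
    player heartbeat events. Returns (mean completion, share of views
    that dropped off in the first 10%) -- the latter being a rough
    clickbait indicator: a thumbnail that draws clicks but loses
    viewers almost immediately.
    """
    n = len(watch_fractions)
    mean = sum(watch_fractions) / n
    early_drop = sum(1 for f in watch_fractions if f < 0.1) / n
    return mean, early_drop
```

Even this crude summary extracts more than a raw view count does, which is the point of the 424k-views-vs-14k-explicit-actions comparison above.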
Be wary of implicit signals, though: you want to make sure you’re not collecting signals that are too superficial to indicate long-term engagement. For example, clickthrough rate may be misleading and optimize for clickbaity headlines on bad content, which increases engagement in the short term but burns out the user in the long term. If you rely too much on these signals, you’ll find yourself having to run really long-term holdout experiments to catch the damage.
Even with explicit content actions like ratings and like/dislike, you want to be careful about the language (e.g. people will click “like” more than something heavier like “endorse”) and the UI (5-star systems often end up bimodal and skewed upward, which isn’t that helpful in practice). Also see John Ciancutti’s answer to “Is there a better alternative to the 5-star rating system?”
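One standard mitigation for skewed star ratings is to mean-center each user’s ratings before modeling, which is a common first preprocessing step in collaborative filtering. A minimal sketch, with a made-up data shape:

```python
def center_ratings(user_ratings):
    """Mean-center each user's ratings to correct for skew.

    `user_ratings` maps user_id -> {item_id: raw star rating}. Raw
    5-star ratings pile up at the top of the scale, and different
    users anchor the scale differently (one person's 3 is another's
    5). Subtracting each user's own mean turns absolute stars into
    relative preference, which is usually what you actually want.
    """
    centered = {}
    for user, ratings in user_ratings.items():
        mean = sum(ratings.values()) / len(ratings)
        centered[user] = {item: r - mean for item, r in ratings.items()}
    return centered
```

After centering, a generous rater and a harsh rater who rank the same two movies the same way produce comparable signal.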
Besides signal collection, the user interface used to present recommendations is also important because of human factors. Humans value being able to understand “why” they’re seeing a recommendation. They also disproportionately value social proof (because your friends liked this) and other signals that emotionally resonate (e.g. status, brand, “because this critic you respect liked this”) that are not always rationally optimal.
To that end, you want to make results more interpretable to humans and make them feel like they can “understand” why they’re seeing a recommendation, and this is another place where your user interface has a role to play. There’s a lot more I can go into regarding presenting the output of a recommender system, but let’s keep this article mostly to the input and save that for another time.
Feedback is super-welcome! If you enjoyed reading this, feel free to share and recommend it below. This post is based on an answer to a question on Quora.