Recommendation Engines Need Fairness Too!

How algorithmic fairness applies to recommendation systems beyond the classic “binary classifier” case.

Liz O'Sullivan
Arthur AI
9 min read · Jul 16, 2020


A graphic of Facebook Ads.
Facebook’s civil rights audit shows they have a long way to go.

As we turn to digital sources for news and entertainment, recommendation engines are increasingly influencing the daily experience of life, especially in a world where folks are encouraged to stay indoors. These systems are not just responsible for suggesting what we read or watch for fun, but also for doling out news and political content, and for surfacing potential connections with other people online. When we talk about bias in AI systems, we often read about unintentional discrimination in ways that apply only to simple binary classifiers (e.g. in the question “Should we let this prisoner out on parole?”, there are only two potential predictions: yes, or no). Thinking about mitigating bias in recommendation engines is much more complex. In this post, we’ll briefly describe how these systems work, then surface some examples of how they can go wrong, before offering suggestions on how to detect bias and improve your users’ experience online, in a fair and thoughtful way.

Part 1: The Anatomy of a Recommender System

If you’re coming to this article as someone who regularly builds or works on recommender systems, feel free to skip this part. For those of you needing a refresher or primer on the topic, read on!

A GIF showing the tables that get filled in using estimates of whether a user will like a certain kind of content.

Recommender engines help companies predict what they think you'll like to see. For Netflix, YouTube and other content providers, this might take the form of choosing which video is queued up next in auto-play. For a retailer like Amazon, it could be picking which items to suggest in a promotional email. At their core, recommender systems take as input two "sides" of a problem — users and items. In the case of Netflix, each user is an account, and each item is a movie. For Amazon, users are shoppers, and items are things you can buy. For YouTube, users are viewers, items are videos, and a third component is the set of users who create the content. You can imagine analogues with newspapers and other media sources such as the New York Times and the Wall Street Journal, music streaming services such as Spotify and Pandora, as well as social networking services such as Twitter and Facebook.

Users rate some items, but not all of them. For example, even if you binge watch shows on Netflix, it's unlikely that you have rated even a small fraction of Netflix's vast content catalogue, much less so when it comes to YouTube's library, where over 300 hours of content are uploaded every minute. A recommender system's goal is, given a user, to find the item or items that will be of greatest interest to that user, under the assumption that most items have not been rated by most users. How is this done? By learning from other, similar items, similar users, and combinations of the two.

Recommender systems recommend content based on inductive biases. One common inductive bias is that users who seemed similar in the past will continue to seem similar in the present and future. In the context of recommender systems, this means that users who have, for example, rated videos similarly on YouTube in the past will probably rate videos similarly moving forward. Recommendations based on this intuition might try to find similar users to a particular user, and similar pieces of content to a particular piece of content, and then combine learnings from those two neighborhoods into an individual score for that particular pairing of a user and item. By doing this for every user-content pair, the recommender system can "fill in all the blanks", that is, predict a rating for each combination of user and piece of content. After that, it is simply a matter of picking the most highly rated pieces of content for that customer, and serving those up as you might see in a sidebar on YouTube or a "view next" carousel on Amazon Shopping.
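To make that intuition concrete, here is a minimal sketch of the "fill in the blanks" step using user-user similarity on a toy ratings matrix. The numbers and the simple cosine-weighted average are illustrative assumptions; production systems typically rely on matrix factorization or learned embeddings at far larger scale.

```python
import numpy as np

# Toy user-item ratings matrix (rows = users, columns = videos).
# 0 means "not yet rated" -- the blanks the recommender tries to fill in.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two users' rating vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return a @ b / denom if denom else 0.0

def predict(user, item, ratings):
    """Predict a rating as a similarity-weighted average of other users
    who *have* rated this item (user-user collaborative filtering)."""
    sims, vals = [], []
    for other in range(ratings.shape[0]):
        if other != user and ratings[other, item] > 0:
            sims.append(cosine_sim(ratings[user], ratings[other]))
            vals.append(ratings[other, item])
    if not sims:
        return 0.0  # cold start: no similar raters to learn from
    sims = np.array(sims)
    return float(sims @ np.array(vals) / sims.sum())

# Fill in one blank: what would user 0 think of item 2?
print(predict(0, 2, ratings))
```

Running the prediction for every empty cell fills in the whole matrix, and the highest predicted scores per user become the recommendations served up in the sidebar or carousel.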

Part 2: What Could Go Wrong?

As we've discussed above, recommender engines attempt to "fill in the blanks" for a particular user by guessing at their level of interest in other topics when we only know how they feel about things they've already seen or read. Most recommender engines are a blend of "nearest neighbor" calculations and active rating elicitation, using a combination of supervised and unsupervised learning alongside deterministic rules that modify which candidate content is eligible to be recommended. To discuss some of the issues that often arise around bias in recommender engines, we'll look at a couple of examples from industry that illustrate the nuance and complexity involved.

Popularity and the Gangnam Style Problem

One of the more common issues we see in industry can be illustrated by YouTube's spectacularly named "Gangnam Style Problem". The problem is this: no matter what content you recommend to a user, the potential pathways they could take from one recommendation to the next all lead back to whatever happens to be the most popular video that day. While this may be good news for PSY and K-pop stans worldwide, gaining traction within a recommender engine can make or break the experience for someone creating content on these platforms, where creators need their content to be seen in order to survive.

More and more every day, we hear complaints from within the YouTube creator community claiming that their channels suffer from this disparity, and that YouTube is biased against emerging artists. Thinking this through from a business perspective, it's easy to see why this might be the case: YouTube wants to keep users on the page, and they're more likely to do that if they can show you content that they know you'll actually like. In fact, the less YouTube knows about how users will interact with your particular brand of content, the riskier it becomes to promote it.

One alternative that you'll often see in industry is to combat this imbalance using explicit rules that promote newer, emerging content while also decreasing the likelihood that the "most popular" video will get recommended next. This gives a pre-set bonus to new content producers to help them grow their audience, giving YouTube some time to learn more about both the quality of the content and the nature of the users who interact with it. However, there is evidence that in doing so, YouTube may be giving an unfair advantage to content that is more extreme or radical. This is unlikely to be intentional; AI does a very bad job of predicting things when it doesn't have much data to go on. One thing is clear: if you aren't watching for this problem by actively considering the newness of a creator when recommending content, you won't know how severely your users and creators are affected.
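As a hedged illustration of this kind of rule-based re-ranking (not YouTube's actual logic), the sketch below applies a fixed bonus for new creators and a penalty for the currently most popular video before sorting model scores. The constants, fields, and threshold are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    video_id: str
    predicted_rating: float   # score from the recommender model
    creator_age_days: int     # how long the creator has been on the platform
    is_top_trending: bool     # is this the current "most popular" video?

# Illustrative constants -- a real system would tune these carefully.
NEW_CREATOR_BOOST = 0.5           # pre-set bonus for emerging creators
TRENDING_PENALTY = 0.3            # dampens the pull toward the most popular video
NEW_CREATOR_THRESHOLD_DAYS = 90

def rerank(candidates):
    """Re-rank model scores with deterministic rules that promote new
    creators and reduce the 'everything leads back to the top video' effect."""
    def adjusted(c):
        score = c.predicted_rating
        if c.creator_age_days < NEW_CREATOR_THRESHOLD_DAYS:
            score += NEW_CREATOR_BOOST
        if c.is_top_trending:
            score -= TRENDING_PENALTY
        return score
    return sorted(candidates, key=adjusted, reverse=True)

recs = rerank([
    Candidate("top_trending_hit", 4.8, 3000, True),
    Candidate("new_creator_vlog", 4.3, 30, False),
    Candidate("established_channel", 4.6, 2000, False),
])
print([c.video_id for c in recs])
```

The trade-off described above shows up directly in constants like these: set the new-creator boost too high and you promote content you know little about, including content that may be extreme; set it too low and emerging creators never get seen.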

Political Bias, Ad Targeting, and Facebook’s Civil Rights Audit

Another kind of bias that we often see in industry is a perceived bias (interestingly, alleged by both major parties) for or against certain types of political content. While this section will focus on ad targeting in particular, the same issue applies to organic posts among users and to recommending connections or friends on the social networks hosting the debate.

As the ACLU pointed out in its historic litigation against Facebook, which resulted in a 2019 settlement, the ads that people see on Facebook can have an impact on their lives in ways that are significant, even when difficult to quantify. Ads can surface politically charged information, but they can also highlight opportunities for career advancement, education, financial literacy, housing, and capital/credit. If one segment of the population receives these potentially life-improving ads more than another, this will exacerbate existing inequalities, leaving our society less fair than it was before. On the opposite side, ads can be predatory, offering, for example, misinformation or outrageous interest rates that trap people in a cycle of poverty, making it very hard to advance.

Political ads present perhaps the simplest way to think about auditing recommenders for bias: it’s easy to track whether you’re presenting users with an even amount of information from the Democratic and Republican parties. (This of course ignores the fact that there are multitudes of political stances, which exist on a spectrum and are not easily or cleanly defined. In the example of ad targeting, at least, we can call this “simple” mainly because it’s clear who is buying the ad to promote on Facebook, while performing the same kind of audit on organic content will be much more ambiguous and challenging.)
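To make this kind of audit concrete, here is a hedged sketch that tallies political ad impressions per user and reports each party's share; the log format and labels are hypothetical assumptions, not any platform's actual instrumentation.

```python
from collections import Counter, defaultdict

# Hypothetical impression log: (user_id, party_of_ad_buyer)
impressions = [
    ("u1", "democratic"), ("u1", "republican"), ("u1", "democratic"),
    ("u2", "republican"), ("u2", "republican"), ("u2", "republican"),
]

def party_share_per_user(impressions):
    """For each user, what fraction of political ad impressions came
    from each party? A balanced feed would sit near 50/50."""
    per_user = defaultdict(Counter)
    for user, party in impressions:
        per_user[user][party] += 1
    return {
        user: {party: count / sum(counts.values())
               for party, count in counts.items()}
        for user, counts in per_user.items()
    }

print(party_share_per_user(impressions))
# u1 is roughly balanced; u2 sees only one party -- a pattern worth investigating.
```

This works for paid ads because the buyer is known; auditing organic content the same way requires labeling the political leaning of each post, which is exactly the ambiguity noted above.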

But what about the opportunities? The most challenging part of assessing bias in cases of "positive" vs. "negative" impact on quality of life may very well be the definition of what constitutes "positive" and "negative". It's not enough to simply label a particular ad as "financial", for instance, because a financial ad can be either beneficial, when it recommends refinancing a student loan or mortgage from a reputable lender, or harmful, in the case of payday loans and other predatory financial instruments. In order to truly track whether your recommender is breaking discrimination laws by behaving in ways that impact protected classes differently, a qualitative assessment of each ad is needed, which is difficult to achieve at scale.

This need for qualitative assessment and clearly defined categorization is most evident when we think about how Facebook enables the spread of misinformation. While it seems as though defining "truth" should be easy, take it from a philosopher that this is often an impossible task. This is precisely why Facebook, when faced with its own self-imposed civil rights audit, was asked to step up efforts to identify and remove this misleading content, a task that leaves it open to partisan attacks from both sides.

Part 3: What Can We Do?

There's no magic bullet for mitigating bias in recommender systems, but the first step to solving any problem is to quantify it. In industry, it's been shocking at times to see the degree to which some enterprises want to keep their heads in the proverbial sand on issues of algorithmic discrimination. The old days of "fairness through unawareness" as a tactic to address bias are clearly coming to an end. In at least two cases, with more on the horizon, federal prosecutors have opened inquiries into companies over unintentional race and gender discrimination.

In what may seem counterintuitive, a necessary first step towards addressing algorithmic discrimination must be to collect protected class information like race, gender, disability, etc. from users. For years, the adage has been that it's impossible to discriminate if you're unaware of the subject's protected class status. Unintentional algorithmic discrimination proves that this is no longer a viable strategy, and the automated systems that govern our daily experiences are exacerbating existing inequalities through the exploitation of features that serve as proxies for protected categories like race. In recommender systems, the very content that you like and have enjoyed in the past can, in many cases, most certainly be a proxy for your race.
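One way to see why this data matters is a proxy check: if a model can predict a protected attribute from watch-history features alone, those features are acting as a proxy for it. The sketch below is a hypothetical illustration on synthetic data using scikit-learn; the features, labels, and setup are assumptions for demonstration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical data: rows are users, columns are watch-history features
# (e.g., fraction of time spent in each content category).
n_users, n_features = 1000, 20
watch_history = rng.random((n_users, n_features))

# Self-reported protected attribute, collected with consent (binary here
# only to keep the sketch short; synthetically correlated with feature 0).
protected_attr = (watch_history[:, 0] + 0.3 * rng.random(n_users) > 0.65).astype(int)

# If watch history predicts the protected attribute well above chance,
# the recommender's inputs are acting as a proxy for that attribute.
proxy_score = cross_val_score(
    LogisticRegression(max_iter=1000), watch_history, protected_attr, cv=5
).mean()
print(f"proxy predictability (accuracy): {proxy_score:.2f}")
```

Without the collected labels, this check is impossible, which is the practical argument for gathering protected class data in the first place.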

A second, important task will be to create categories for content and ads based on their usefulness or harmfulness. While this is a challenge, and one that will be honed over hours of careful discussion, it is not impossible to sort certain things into buckets that can then be tracked. For instance, ads offering higher education from accredited universities can be differentiated from ads promoting for-profit certifications. While there may not be clear consensus on every item (as we see from attempts to define deepfakes or other forms of misinformation), these are debates that must be had early, often, and with transparency to the public, lest these issues be swept under the rug for being "too hard to scale". Once you know users' protected classes and the rates at which they are recommended "positive" versus "negative" content, you can calculate a metric that gets at the disparity in how your platform influences their lives.
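Once ads are bucketed as beneficial or harmful and protected-class information is available, a first-pass disparity metric can be as simple as comparing the rate of beneficial recommendations across groups. The sketch below computes a disparate impact ratio on hypothetical counts; the group labels, numbers, and the 0.8 rule-of-thumb threshold are illustrative assumptions.

```python
# Hypothetical impression counts after manual categorization of ads.
# {group: {"beneficial": count, "harmful": count}}
impressions_by_group = {
    "group_a": {"beneficial": 800, "harmful": 200},
    "group_b": {"beneficial": 500, "harmful": 500},
}

def beneficial_rate(counts):
    """Fraction of a group's ad impressions that were categorized as beneficial."""
    return counts["beneficial"] / (counts["beneficial"] + counts["harmful"])

rates = {group: beneficial_rate(c) for group, c in impressions_by_group.items()}

# Disparate impact ratio: beneficial-ad rate for the least-favored group
# divided by the rate for the most-favored group. The "80% rule" of thumb
# flags ratios below 0.8 as potential adverse impact worth investigating.
ratio = min(rates.values()) / max(rates.values())
print(rates, round(ratio, 2), "flag" if ratio < 0.8 else "ok")
```

A single number like this is only a starting point, but it turns "are we treating groups differently?" into something you can monitor over time rather than debate in the abstract.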

To make one final point, mitigating bias must be a never-ending pursuit. Algorithms can be tested ad nauseam in the lab only to become biased as cultural shifts occur. Alternatively, these changes can happen when platforms take on new segments of the population and user personas that they may not understand as well as those that have been previously defined. New categories of ads and content will appear and fade away as swiftly as our lives are influenced by the events of the day. We must ensure that, as practitioners, we bring an ethos of continual improvement to issues of algorithmic discrimination, never calling what's been done already "good enough".
